Deep Reinforcement Learning from Human Preferences

Learning from human preferences is a major breakthrough in reinforcement learning (RL). The algorithm was proposed by researchers at OpenAI and DeepMind, and the idea behind it is to make the agent learn according to human feedback. Initially, the agent acts randomly, and then two video clips of the agent performing an action are shown to a human. The human inspects the video clips and tells the agent which clip is better, that is, in which clip the agent is performing the task better and making progress toward the goal. Given this feedback, the agent tries to perform the actions preferred by the human and sets its reward accordingly. Designing reward functions is one of the major challenges in RL, so having a human interact with the agent directly helps us overcome that challenge and also minimizes the need to write complex goal functions.

The training process is shown in the following diagram:

Deep Reinforcement Learning from Human Preferences

Let’s have a look at the following steps:

  1. First, our agent interacts with the environment through a random policy.
  2. The agent's interactions with the environment are captured as a pair of two-to-three-second video clips and shown to a human.
  3. The human inspects the video clips and judges in which clip the agent is performing better. They then send this preference to the reward predictor.
  4. The agent receives these signals from the reward predictor and sets its goal and reward functions in line with the human's feedback.
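The steps above can be sketched as a toy feedback loop. Everything here is a hypothetical stand-in: `random_policy`, the counter "environment", and the scripted `human_preference` function simply illustrate the flow of data (clips in, preference labels out); the real system uses deep networks, a game or robotics environment, and an actual human rater.

```python
import random

def random_policy(observation):
    # Step 1: the agent starts out acting randomly.
    return random.choice([0, 1])

def collect_segment(length=3):
    # Step 2: record a short (observation, action) segment, standing in
    # for a two-to-three-second video clip. Toy env: obs is a counter.
    segment = []
    obs = 0
    for _ in range(length):
        action = random_policy(obs)
        segment.append((obs, action))
        obs += action  # toy dynamics
    return segment

def human_preference(seg1, seg2):
    # Step 3: a scripted stand-in for the human rater; here we pretend
    # the human prefers the segment whose actions moved the counter more.
    total1 = sum(a for _, a in seg1)
    total2 = sum(a for _, a in seg2)
    if total1 > total2:
        return (1.0, 0.0)   # human prefers seg1
    if total2 > total1:
        return (0.0, 1.0)   # human prefers seg2
    return (0.5, 0.5)       # equally preferable

# Step 4: the (seg1, seg2, preference) triple is stored and later used
# to train the reward predictor.
database = []
seg1, seg2 = collect_segment(), collect_segment()
database.append((seg1, seg2, human_preference(seg1, seg2)))
print(len(database))  # 1 stored comparison
```

In the real algorithm this loop runs continuously: the policy improves against the predicted reward while new clip pairs keep being sent to the human.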

A trajectory is a sequence of observations and actions. We can denote a trajectory segment as \sigma , so \sigma=\left(\left(o_{0}, a_{0}\right),\left(o_{1}, a_{1}\right),\left(o_{2}, a_{2}\right), \ldots,\left(o_{k-1}, a_{k-1}\right)\right) , where o is an observation and a is an action. At each step, the agent receives an observation from the environment and performs some action.
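In code, a trajectory segment is just a sequence of (observation, action) pairs; the placeholder strings below stand in for whatever observation and action types the environment actually uses.

```python
# A trajectory segment sigma: ((o_0, a_0), (o_1, a_1), ..., (o_{k-1}, a_{k-1})).
# The string entries are toy placeholders for real observations and actions.
sigma = [("o0", "a0"), ("o1", "a1"), ("o2", "a2")]

k = len(sigma)                  # segment length k
first_obs, first_act = sigma[0]  # (o_0, a_0)
print(k, first_obs, first_act)   # 3 o0 a0
```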

Let’s say we store this sequence of interactions as two trajectory segments, \sigma_1 and \sigma_2 . Now, these two trajectories are shown to the human. If the human prefers \sigma_2 to \sigma_1 , then the agent’s goal is to produce the trajectories preferred by the human, and the reward function will be set accordingly. These trajectory segments are stored in a database as \left(\sigma_{1}, \sigma_{2}, \mu\right) , where \mu is a distribution indicating which segment the human preferred.

If the human prefers \sigma_2 to \sigma_1 , then \mu puts all of its mass on \sigma_2 (and vice versa). If neither trajectory is preferred, then the comparison is not stored in the database. If both are judged equally preferable, then \mu is set to a uniform distribution over the two segments.
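The stored \left(\sigma_{1}, \sigma_{2}, \mu\right) triples are what train the reward predictor: the predictor sums its estimated reward over each segment and is fit so that the segment with the higher total is the one the human preferred (a Bradley-Terry-style model). The sketch below shows the idea; `r_hat` is a hypothetical toy reward function standing in for the actual neural network.

```python
import math

def r_hat(obs, act):
    # Stand-in reward predictor; in practice this is a trained network.
    return 0.1 * act

def preference_prob(sigma1, sigma2):
    # Probability that sigma1 is preferred, from summed predicted rewards.
    s1 = sum(r_hat(o, a) for o, a in sigma1)
    s2 = sum(r_hat(o, a) for o, a in sigma2)
    return math.exp(s1) / (math.exp(s1) + math.exp(s2))

def loss(sigma1, sigma2, mu):
    # Cross-entropy between the human label mu and the predicted preference.
    p1 = preference_prob(sigma1, sigma2)
    mu1, mu2 = mu
    return -(mu1 * math.log(p1) + mu2 * math.log(1.0 - p1))

sigma1 = [(0, 1), (1, 1)]   # higher actions -> higher predicted reward
sigma2 = [(0, 0), (0, 0)]
print(preference_prob(sigma1, sigma2))  # > 0.5: predictor favors sigma1
```

Minimizing this loss over the database nudges `r_hat` to assign higher reward to the behavior humans prefer, which is the reward signal the agent then optimizes.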

You can check out the video at to see how the algorithm works.

Got any questions? Feel free to ask me in the comments section below.
