In this post, we will look into the very popular off-policy TD control reinforcement learning algorithm called Q learning. Q learning is a very simple and widely used TD (temporal difference) algorithm. In control algorithms, we don’t care about state value; here, in Q learning, our concern is the state-action value pair—the effect of performing an action A in the state S.

We will update the Q value based on the following equation:

**Q(s, a)=Q(s, a)+\alpha\left(r+\gamma \max Q\left(s^{\prime} a^{\prime}\right)-Q(s, a)\right)**

The steps involved in Q learning are as follows:

- First, we initialize the Q function to some arbitrary values
- We take an action from a state using epsilon-greedy policy (\epsilon>0) and move it to the new state
- We update the Q value of a previous state by following the update rule: Q(s, a)=Q(s, a)+\alpha\left(r+\gamma \max Q\left(s^{\prime} a^{\prime}\right)-Q(s, a)\right)
- We repeat the steps 2 and 3 till we reach the terminal state.

Now, we will understand the algorithm using different steps. Consider the same frozen lake example. Let us say we are in a state (3,2) and have two actions (left and right). Now let us refer to the figure and compare it with epsilon-greedy policy:

In Q Learning, we select an action using the epsilon-greedy policy. We either explore a new action with the probability epsilon or we select the best action with a probability 1- epsilon. Let us say we select a probability epsilon and explore a new action Down and we select that action:

Now that we have performed a downward action in the sate (3,2) and reached a new state (4,2) using the epsilon-greedy policy, how do we update the value of the previous state (3,2) using our update rule? It is very simple. Look at the Q table shown as following:

Let us consider alpha as 0.1 and the discount factor as 1:

**Q(s, a)=Q(s, a)+\alpha\left(r+\gamma \max Q\left(s^{\prime} a^{\prime}\right)-Q(s, a)\right)**

**Q((3,2) , \text { down })=Q((3,2), \text { down })+0.1(0.3+1 \max [Q((4,2) , \text { action })]-Q((3,2), \text { down })**

We can say the value of a state (3,2) with a downward action, as in Q( (3,2), down), is 0.8 in the Q table.

What is max Q ( (4,2), action) for the state (4,2)? We have explored only three actions (up, down, and right) so we will take the maximum value only based on these actions. (Here, we will not perform epsilon greedy policy; we simply select the action that has the maximum value.)

So, based on the previous Q table, we can substitute the values as:

**Q( (3,2), down) = 0.8 + 0.1 ( 0.3 + 1 max [0.3, 0.5, 0.8] – 0.8 ) = 0.8 + 0.1 ( 0.3 + 1 (0.8) – 0.8) = 0.83**

So, we update the value of Q ((3,2), down) to 0.83.

Remember that while choosing what action to take, we perform the epsilon-greedy policy: we either explore for new actions with a probability epsilon or take an action which has a maximum value with a probability 1-epsilon. While updating the Q value, we don’t perform the epsilon-greedy policy, we simply select the action that has a maximum value

Now that we are in a state (4,2), we have to perform an action. What action should we perform? We decide that based on the epsilon-greedy policy, we either explore a new action with a probability epsilon or select the best action with a probability 1-epsilon. Let us say we select a probability 1-epsilon and select the best action. So, in the (4,2) the action right has a maximum value. So we will select the right action:

Now we are in a state (4,3) as we took a right action on the state (4,2). How do we update the value of the previous state? Like so:

**Q( (4,2), right) = Q( (4,2), right ) + 0.1 ( 0.3 + 1 max [Q( (4,3) action) ]- Q( (4,2), right)**

If you look at the Q table that follows, for the state (4,3) we have explored only two actions (up and down) so we will take a maximum value only based on these actions. (Here, we will not perform an epsilon-greedy policy; we simply select the action which has maximum value):

**Q ( (4,2), right) = Q((4,2),right) + 0.1 (0.3 + 1 max [ (Q (4,3), up) , ( Q(4,3),down) ] – Q ((4,2), right )**

** Q ( (4,2), right) = 0.8 + 0.1 (0.3 + 1 max [ 0.1,0.3] – 0.8) = 0.8 + 0.1 (0.3 + 1(0.3) – 0.8) = 0.78**

Look at the following Q table:

Now we update the value of the state Q((4,2), right) as 0.78. So, this is how we get the state-action values in Q learning. To decide what action to take, we use the epsilon-greedy policy and while updating the Q value we simply pick up the maximum action; here’s a flowchart: