Before understanding Trust Region Policy Optimization (TRPO), we need to understand constrained policy optimization. We know that in reinforcement learning, agents learn by trial and error to maximize the reward. To find the best policy, our agents will explore all different actions and choose the one that gives a good reward. While exploring different actions there is a very good chance that our agents will explore bad actions as well. But the biggest challenge is when we allow our agents to learn in the real world and when the reward functions are not properly designed.

For example, consider an agent learning to walk without hitting any obstacles. The agent will receive a negative reward if it gets hit by any obstacle and a positive reward for not getting hit by any obstacle. To figure out the best policy, the agent explores different actions. The agent also takes action, such as hitting an obstacle to check whether it gives a good reward. But that is not safe for our agent; it is particularly unsafe when an agent is learning in a real-world environment. So we introduce constraint-based learning. We set a threshold and if the probability of hitting the obstacle is less than this threshold, then we consider our agent safe, or else we consider our agent unsafe. A constraint is added to make sure that our agent is in a safe region.

In TRPO, we iteratively improve the policy and we impose a constraint such that the Kullback–Leibler (KL) divergence between an old policy and a new policy is to be less than some constant. This constraint is called the trust region constraint.

So what is KL divergence? KL divergence tells us how two probability distributions are different from each other. Since our policies are probability distribution over actions, KL divergence tells us how far a new policy is from the old policy. Why do we have to keep the distance between the old policy and new policy less than some constant? Because we don’t want our new policy to drift apart from the old policy. So we impose a constraint to keep the new policy near to the old policy.

Again, why do we have to stay near the old policy? When the new policy is far away from the old policy, then it will affect our agent’s learning performance and also lead to a completely different learning behavior. In a nutshell, in TRPO, we take a step toward the direction that improves our policy, that is, maximizes the reward, but we should also be sure that the trust region constraint is satisfied. It uses conjugate gradient descent to optimize the network parameter while satisfying the constraint. The algorithm guarantees monotonic policy improvement and has also achieved excellent results in various continuous environments.

Now we will see how TRPO works mathematically;

Let ‘s specify the total expected discounted reward \eta(\pi) , as follows:

** \eta(\pi)=\mathbf{E}_{s_{0}, a_{0}, \ldots}\left[\sum_{t-0}^{\infty} \gamma^{\iota} r\left(s_{t}\right)\right] **

Now let’s consider the new policy as \pi ; it can be defined as the expected return of policy \pi' in terms of advantages over our old policy \pi , as follows:

** \eta\left(\pi^{\prime}\right)=\eta(\pi)+\mathbf{E}_{s_{0}, a_{0}, \dots \pi^{\prime}}\left[\sum_{t=0}^{\infty} \gamma^{t} A_{\pi}\left(s_{t}, a_{t}\right)\right]**

Okay, why are we using the advantages of the old policy? Because we are measuring how good the new policy \pi' is with respect to the average performance of the old policy \pi . We can rewrite the preceding equation with a sum over states instead of timesteps as follows:

\rho is the discounted visitation frequencies, that is:

** \rho_{\pi}(s)=P\left(s_{0}=s\right)+\gamma P\left(s_{1}=s\right)+\gamma^{2} P\left(s_{2}=s\right)+\ldots **

If you see the preceding equation \eta(\pi') there is a complex dependency of \rho_{\pi^{\prime}}(s) on \pi' and so it is difficult to optimize the equation. So we will introduce the local approximation L_{\pi}\left(\pi^{\prime}\right) to \eta(\pi') as follows:

** L_{\pi}\left(\pi^{\prime}\right)=\eta(\pi)+\sum_{s} \rho_{\pi}(s) \sum_{a} \pi^{\prime}(a | s) A_{\pi}(s, a) **

L_{\pi} uses the visitation frequency \rho_{\pi} rather than \rho_{\pi'} , that is, we ignore the changes in state visitation frequency due to the change in policy. To put it in simple terms, we assume that the state visitation frequency is not different for both the new and old policy. When we are calculating the gradient of L_{\pi} , which will also improve \eta with respect to some parameter\theta we can’t be sure how much big of a step to take.

Kakade and Langford proposed a new policy update method called conservative policy iteration, shown as follows:

** \pi_{n e w}(a | s)=(1-\alpha) \pi_{o l d}(a | s)+\alpha \pi^{\prime}(a | s) ** —— (1)

- \pi_{new} is the new policy.
- \pi_{old} is the old policy.
- \pi^{\prime}=\operatorname{argmax}{\pi^{\prime}} L{\pi_{o l d}}\left(\pi^{\prime}\right) , that is, \pi' , is the policy which maximizes L_{\pi_{o l d}}

Kakade and Langford derived the following equation from (1) as follows:

** \eta\left(\pi^{\prime}\right) \geq L_{\pi}\left(\pi^{\prime}\right)-C D_{K L}^{\max }\left(\pi, \pi^{\prime}\right) ——-(2) **

C is the penalty coefficient and it is equal to \frac{4 \epsilon \gamma}{(1-\alpha)^{2}} ,and D_{K L}^{\max }\left(\pi, \pi^{\prime}\right) denotes the KL divergence between the old policy and the new policy.

If we look at the preceding equation (2) closely, we notice that our expected long-term reward \eta increases monotonically as long as the right-hand side is maximized. Let’s define this right-hand side term as , as follows:

Let’s define this right-hand side term as M_{i}(\pi) , as follows:

** M_{i}(\pi)=L_{\pi_{i}}(\pi)-C D_{K L}^{\max }\left(\pi_{i}, \pi\right) —– (3) **

Substituting equation (3) in (2), we get:

** \eta\left(\pi_{i}+1\right) \geq M_{i}\left(\pi_{i}+1\right) —— (4) **

Since we know that the KL divergence between the two same policies will be 0, we can write:

** \eta\left(\pi_{i+1}\right)-\eta(\pi) \geq M_{i}\left(\pi_{i+1}\right)-M\left(\pi_{i}\right) **

Combining equations (4) and (5), we can write:

** \eta\left(\pi_{i+1}\right)-\eta(\pi) \geq M_{i}\left(\pi_{i+1}\right)-M\left(\pi_{i}\right) —— (5) **

In the preceding equation, we can understand that maximizing M_i guarantees the maximization of our expected reward. So now our goal is to maximize M_i which in turn maximizes our expected reward. Since we use parameterized policies, we replace \pi with \theta in our previous equation and we use \theta_{old} to represent a policy that we want to improve, as shown next:

** \operatorname{maximize}{\theta} \qquad\left[L{\theta_{old}}(\theta)-C D_{K L}^{\max }\left(\theta_{o l d}, \theta\right)\right] **

But having a penalty coefficient C in the preceding equation will cause the step size to be very small, which in turn slows down the updates. So, we impose a constraint on the KL divergence’s old policy and new policy, which is the trust region constraint, and it will help us to find the optimal step size:

Now, the problem is KL divergence is imposed on every point in the state space and it is really not feasible to solve when we have a high dimensional state space. So we use a heuristic approximation which takes the average KL divergence as:

** \overline{D}{K L}^{\rho}\left(\theta_{o l d}, \theta\right) :=\mathbf{E}{s \sim \rho}\left[D{K L}\left(\pi_{\theta_{1}}( . | s) | \pi_{\theta_{2}}( . | s)\right)\right] **

So now, we can rewrite our preceding objective function with the average KL divergence constraint as:

Expanding the value of **L**, we get the following:

In the preceding equation, we replace sum over states \sum_{s} \rho \theta_{o l d} as expectation E_{s \sim \rho \theta_{o l d}} and we replace sum over actions by importance sampling estimator as:

** \sum_{a} \pi_{\theta}\left(a | s_{n}\right) A_{\theta_{o d d}}\left(s_{n}, a\right)=E_{a \sim q}\left[\frac{\pi_{\theta}\left(a | s_{n}\right)}{q\left(a | s_{n}\right)} A_{\theta_{o d d}}\left(s_{n}, a\right)\right] **

Then, we substitute advantage target values A_{\theta_{o l d}} with Q values Q_{\theta_{o l d}} . So, our final objective function will become:

Optimizing the preceding mentioned objective function, which has a constraint, is called constrained optimization. Our constraint is to keep the average KL divergence between the old policy and new policy less than \rho We use conjugate gradient descent for optimizing the preceding function.