Imagination-Augmented Agents for Deep Reinforcement Learning

Are you a fan of the game of chess? If I asked you to play chess, how would you play? Before moving any piece on the chessboard, you would imagine the consequences of each possible move and play the one you think would help you win. So, basically, before taking an action, you imagine its consequence and, if it is favorable, you proceed with that action; otherwise, you refrain from performing it.

Similarly, imagination-augmented agents (I2A) are augmented with imagination: before taking any action in an environment, they imagine the consequences of taking that action and, if they think the action will provide a good reward, they perform it. They also imagine the consequences of taking different actions. Augmenting agents with imagination is the next big step towards general artificial intelligence.

Now we will briefly look at how imagination-augmented agents work; I2A takes advantage of both model-based and model-free learning.

The architecture of I2A is as follows:

The action the agent takes is the result of both the model-based and the model-free path. In the model-based path, we have something called rollout encoders; these rollout encoders are where the agent performs its imagination. Let's take a closer look at them. A rollout encoder is shown as follows:

Rollout encoders have two components: imagine future and the encoder. Imagine future is where the imagination happens; as the preceding diagram shows, it consists of a chain of imagination cores.

When we feed the state $o_t$ into the imagination core, we get the imagined next state $\hat{o}_{t+1}$ and reward $\hat{r}_{t+1}$, and when we feed this new state $\hat{o}_{t+1}$ into the next imagination core, we get the next state $\hat{o}_{t+2}$ and reward $\hat{r}_{t+2}$.

When we repeat this for some n steps, we get a rollout, which is a sequence of imagined states and rewards; we then use an encoder, such as an LSTM, to encode this rollout. As a result, we get a rollout encoding. These rollout encodings are embeddings describing an imagined future path. We will have multiple rollout encoders for different imagined future paths, and we use an aggregator to combine their rollout encodings.
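The imagine-then-encode loop can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the random linear map `W_env` is a hypothetical stand-in for the learned environment model, and `encode` is a simple recurrent fold standing in for the LSTM encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_STEPS = 4, 5

# Hypothetical environment model: maps a state to (next state, reward).
W_env = 0.1 * rng.normal(size=(STATE_DIM + 1, STATE_DIM))

def imagine_step(state):
    out = W_env @ state
    return out[:STATE_DIM], out[STATE_DIM]  # imagined next state, reward

def rollout(state, n=N_STEPS):
    """Imagine n steps ahead, collecting the (state, reward) sequence."""
    traj = []
    for _ in range(n):
        state, reward = imagine_step(state)
        traj.append((state, reward))
    return traj

def encode(traj):
    """Stand-in for the LSTM: fold the imagined trajectory into one
    fixed-length embedding, processing it step by step."""
    h = np.zeros(STATE_DIM)
    for state, reward in reversed(traj):
        h = np.tanh(h + state + reward)
    return h

embedding = encode(rollout(rng.normal(size=STATE_DIM)))
print(embedding.shape)  # (4,)
```

The embedding has a fixed size regardless of the rollout length, which is what lets the aggregator combine encodings from several imagined paths.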

Wait. How does the imagination happen in the imagination core? What is actually in the imagination core? A single imagination core is shown in the following diagram:

The imagination core consists of a policy network and an environment model. The environment model is actually where everything happens: it is learned from all the transitions the agent has experienced so far. Given the state $\hat{o}_t$, the policy network proposes an action $\hat{a}_t$ it expects to give a high reward, and the environment model uses its experience to imagine the consequence of that action.
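A single imagination core step can be sketched as follows. Both "networks" here are hypothetical stand-ins (single random linear layers) for the neural networks in the actual architecture; the point is only the data flow: the policy proposes an action, and the environment model predicts the imagined next state and reward.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 3

# Hypothetical policy network: state -> action scores.
W_pi = rng.normal(size=(N_ACTIONS, STATE_DIM))
# Hypothetical environment model: (state, one-hot action) -> (next state, reward).
W_env = rng.normal(size=(STATE_DIM + 1, STATE_DIM + N_ACTIONS))

def imagination_core(state):
    """One imagination step: propose an action, imagine its consequence."""
    action = int(np.argmax(W_pi @ state))           # policy proposes an action
    one_hot = np.eye(N_ACTIONS)[action]
    out = W_env @ np.concatenate([state, one_hot])  # environment model predicts
    next_state, reward = out[:STATE_DIM], float(out[STATE_DIM])
    return next_state, reward, action

s = rng.normal(size=STATE_DIM)
next_s, r, a = imagination_core(s)
```

Chaining this function, feeding each imagined `next_s` back in as the new state, produces the rollout described earlier.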

The architecture of I2A with all components expanded is shown as follows:
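The way the two paths come together can be sketched as a single forward pass. Everything below is a toy stand-in: `rollout_encoding` abbreviates a full imagine-and-encode rollout, `model_free_features` abbreviates the model-free network, and the weights and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, ENC_DIM = 4, 3, 4

def rollout_encoding(state, first_action):
    # Stand-in for imagining a trajectory that begins with `first_action`
    # and encoding it into a fixed-length vector.
    return np.tanh(state + first_action)

def model_free_features(state):
    # Stand-in for the model-free path (a plain neural network).
    return np.tanh(state)

# Final policy head over the combined features (hypothetical weights).
W_policy = rng.normal(size=(N_ACTIONS, N_ACTIONS * ENC_DIM + STATE_DIM))

def i2a_forward(state):
    # Model-based path: one rollout encoding per candidate first action,
    # aggregated here by simple concatenation.
    aggregated = np.concatenate(
        [rollout_encoding(state, a) for a in range(N_ACTIONS)])
    # Join with the model-free path, then score each action.
    features = np.concatenate([aggregated, model_free_features(state)])
    return W_policy @ features  # action scores

scores = i2a_forward(rng.normal(size=STATE_DIM))
print(scores.shape)  # (3,)
```

The agent's final action is chosen from these scores, so the imagined futures and the model-free features jointly determine what it does.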

Have you played Sokoban before? Sokoban is a classic puzzle game in which the player has to push boxes to target locations. The rules of the game are very simple: boxes can only be pushed, never pulled. If we push a box in the wrong direction, the puzzle becomes unsolvable:

If we were asked to play Sokoban, we would imagine and plan before making any move, since a bad move can end the game. The I2A architecture provides good results in these kinds of environments, where the agent has to plan in advance before taking any action. The authors of the I2A paper tested its performance on Sokoban and achieved significant results.