Meta Imitation Learning (MIL)

In this post, we will learn how imitation learning combined with meta learning, that is, meta imitation learning (MIL), works. If we want our robot to be a generalist and perform various tasks, it should be able to learn quickly. But how can we enable our robots to learn quickly? Well, how do we humans learn quickly? Don’t we easily pick up new skills just by watching other people?

Similarly, if we enable our robot to learn just by watching our actions, then it can learn complex goals efficiently and we don’t have to engineer complicated goal and reward functions. This type of learning, i.e. learning from human actions, is called imitation learning, where the robot tries to mimic the human’s actions. The robot doesn’t have to learn only from human actions; it can also learn from another robot performing a task, or from a video of a human or robot performing a task.

But imitation learning is not as simple as it sounds.

A robot needs a lot of time and many demonstrations to learn the goal and identify the right policy. So we augment the robot with prior experience in the form of demonstrations (training data), so that it does not have to learn each skill completely from scratch. Augmenting the robot with prior experience helps it learn quickly. But to learn several skills this way, we need to collect demonstrations for each of those skills, i.e. we need to augment the robot with task-specific demonstration data.

But how can we enable our robot to learn quickly from a single demonstration of a task? Can we use meta learning here? Can we reuse demonstration data from several related tasks to learn a new task quickly? Yes: we combine meta learning and imitation learning, which gives us meta imitation learning (MIL). With meta imitation learning, we can make use of demonstration data from a variety of other tasks to learn a new task with just one demonstration. That is, we can find the right policy for a new task from a single demonstration of that task.

We can use any of the meta learning algorithms we have seen so far. Here we use MAML as our meta learning algorithm, which is compatible with any model trained with gradient descent. We use policy gradients as our algorithm for finding the right policy. In policy gradients, we directly optimize a parameterized policy \pi_{\theta} with some parameter \theta. We use MAML to find the right initial parameter \theta so that our robot can learn quickly from just a few demonstrations.
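To make the idea of a parameterized policy concrete, here is a minimal sketch in Python, assuming a simple linear policy that maps observations to continuous actions (the dimensions and the linear form are illustrative choices, not something fixed by MIL):

```python
import numpy as np

obs_dim, act_dim = 4, 2                           # hypothetical dimensions
theta = np.random.randn(obs_dim, act_dim) * 0.1   # policy parameters theta

def policy(theta, obs):
    """Parameterized policy f_theta: maps an observation to a continuous action."""
    return obs @ theta
```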

Our goal is to learn a policy that can quickly adapt to a new task from a single demonstration of that task. By doing so, we remove our dependency on a large amount of demonstration data for each task. What exactly is a task here? Each task consists of trajectories. A trajectory is a sequence of observations and actions from the expert policy, i.e. the demonstration. Wait, what is an expert policy? Since we are performing imitation learning, we are learning from an expert (human actions), so we call that policy the expert policy, denoted \pi^* (\pi_i^* for task T_i):

\tau = \{o_1, a_1, \ldots, o_T, a_T\} \sim \pi_i^*
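In code, a demonstration trajectory could simply be stored as a sequence of (observation, action) pairs; the random values below just stand in for data collected from an expert:

```python
# One expert demonstration: a sequence of (o_t, a_t) pairs.
# Random placeholders stand in for real expert data.
T = 10  # trajectory length
trajectory = [(np.random.randn(obs_dim),    # observation o_t
               np.random.randn(act_dim))    # expert action a_t
              for _ in range(T)]
```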

Okay, what should our loss function be? The loss function measures how much our robot’s actions differ from the expert’s actions. We can use mean squared error as the loss function for continuous actions and cross-entropy for discrete actions. Let us say we have continuous actions; then the mean squared error loss can be written as:

L_{T_i}(f_{\theta}) = \sum_{\tau^{(j)} \sim T_i} \sum_t \| f_{\theta}(o_t^{(j)}) - a_t^{(j)} \|_2^2
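A direct translation of this loss into code might look like the sketch below, reusing the hypothetical policy and trajectory format from above and summing the squared error over every timestep of every sampled demonstration:

```python
def imitation_loss(theta, demos):
    """Mean squared error between the policy's actions and the expert's actions."""
    loss = 0.0
    for demo in demos:                      # sum over trajectories tau^(j) ~ T_i
        for obs, expert_action in demo:     # sum over timesteps t
            diff = policy(theta, obs) - expert_action
            loss += np.sum(diff ** 2)       # squared L2 norm
    return loss
```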

Say we have a distribution over tasks p(T). We sample a batch of tasks, and for each task T_i we sample some demonstration data, train the network by minimizing the loss, and find the task-specific parameter \theta'. Next, we perform meta optimization by computing meta gradients and find the optimal initial parameter \theta. We will see exactly how this works in the upcoming blog posts.
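As a rough preview, here is a hedged sketch of that loop for the linear policy above. It uses an analytic gradient of the MSE loss and a first-order approximation of the MAML meta-gradient (full MAML differentiates through the inner update as well); random_demos and sample_task_batch are hypothetical helpers that stand in for real per-task expert demonstrations:

```python
def loss_grad(theta, demos):
    """Analytic gradient of the MSE imitation loss for the linear policy above."""
    grad = np.zeros_like(theta)
    for demo in demos:
        for obs, expert_action in demo:
            diff = policy(theta, obs) - expert_action   # shape (act_dim,)
            grad += 2.0 * np.outer(obs, diff)           # gradient of ||diff||^2 w.r.t. theta
    return grad

def random_demos(num_demos=1):
    """Hypothetical stand-in for the expert demonstrations of one task."""
    return [[(np.random.randn(obs_dim), np.random.randn(act_dim)) for _ in range(T)]
            for _ in range(num_demos)]

def sample_task_batch(num_tasks=5):
    """Hypothetical task sampler: (train demos, validation demos) per task."""
    return [(random_demos(), random_demos()) for _ in range(num_tasks)]

alpha, beta = 0.01, 0.001                # inner and meta step sizes (illustrative)
for iteration in range(1000):
    meta_grad = np.zeros_like(theta)
    for train_demos, val_demos in sample_task_batch():
        theta_prime = theta - alpha * loss_grad(theta, train_demos)   # adapt: theta -> theta'
        meta_grad += loss_grad(theta_prime, val_demos)                # first-order meta-gradient
    theta = theta - beta * meta_grad                                  # update the initialization
```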
