Now that we have a basic understanding of model-agnostic meta-learning (MAML) from the previous post, let us explore it in more detail.

Let us say we have a model f parameterized by \theta, i.e. f_{\theta}, and a distribution over tasks p(T). First, we initialize the parameter \theta with random values. Next, we sample a batch of tasks T_i from the distribution over tasks, i.e. T_i \sim p(T). Let us say we have sampled five tasks, T = \{T_1, T_2, T_3, T_4, T_5\}. Then, for each task T_i, we sample K data points and train the model on them: we compute the loss L_{T_i}(f_{\theta}) and minimize it with one or more steps of gradient descent to obtain the task-specific optimal parameters, i.e.,

**\theta'_i = \theta - \alpha \nabla_{\theta} L_{T_i}(f_{\theta})**

where,

- \theta'_i is the optimal (task-adapted) parameter for task T_i
- \theta is the initial parameter
- \alpha is the learning rate (step size) of this inner update, a hyperparameter
- \nabla_{\theta} L_{T_i}(f_{\theta}) is the gradient of the loss of task T_i with respect to \theta.
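
The inner update above can be sketched in a few lines of plain Python. This is only a toy illustration, not a reference implementation: it assumes a hypothetical one-parameter model f_{\theta}(x) = \theta x with a mean squared error loss, so that the gradient can be written by hand. The names `loss`, `grad` and `inner_update`, as well as the data values, are made up for this example.

```python
# Toy assumption: f_theta(x) = theta * x with a mean squared error loss.

def loss(theta, xs, ys):
    """Mean squared error of f_theta(x) = theta * x on one task's samples."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(theta, xs, ys):
    """Hand-derived gradient of the loss with respect to theta."""
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def inner_update(theta, xs, ys, alpha):
    """One gradient step: theta'_i = theta - alpha * grad L_{T_i}(f_theta)."""
    return theta - alpha * grad(theta, xs, ys)

# K = 3 data points from a hypothetical task whose true slope is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0 * x for x in xs]

theta = 0.0  # initial parameter (fixed here instead of random, for clarity)
theta_prime = inner_update(theta, xs, ys, alpha=0.05)
```

After this single step, `loss(theta_prime, xs, ys)` is lower than `loss(theta, xs, ys)`, which is all the inner update is meant to achieve for one task.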

So after the above gradient update, we will have optimal parameters for each of the five tasks we sampled, i.e. \theta' = \{\theta'_1, \theta'_2, \theta'_3, \theta'_4, \theta'_5\}.

Now, before sampling the next batch of tasks, we perform the meta update, or meta optimization. In the previous step, we found the optimal parameter \theta'_i by training on each task T_i. Now we evaluate the loss of each task at its optimal parameter \theta'_i on a new set of data points sampled from that task, compute the gradient of this loss with respect to \theta, and use it to update our randomly initialized parameter \theta. This moves \theta to a position from which we do not have to take many gradient steps when training on the next batch of tasks. This step is called the meta step, meta update, meta optimization, or meta training. It can be expressed as,

**\theta = \theta - \beta \nabla_{\theta} \sum_{T_i \sim p(T)} L_{T_i}(f_{\theta'_i})**

where,

- \theta is our initial parameter
- \beta is the learning rate (step size) of the meta update, a hyperparameter
- \nabla_{\theta} \sum_{T_i \sim p(T)} L_{T_i}(f_{\theta'_i}) is the gradient, with respect to \theta, of the sum of the task losses, each evaluated at its optimal parameter \theta'_i.
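
To make the meta update concrete, here is a small sketch under the same kind of toy assumption: a hypothetical one-parameter model f_{\theta}(x) = \theta x with a mean squared error loss. Because \theta'_i itself depends on \theta, the chain rule gives the meta gradient as the loss gradient at \theta'_i multiplied by d\theta'_i / d\theta, which for this quadratic loss can be written by hand. The helper names (`adapt`, `meta_gradient`, `meta_update`, `make_task`) and all numbers are illustrative, and the split into "support" points for the inner step and "query" points for the meta step is an assumption of this sketch.

```python
def loss(theta, xs, ys):
    """Mean squared error of f_theta(x) = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(theta, xs, ys):
    """Hand-derived gradient of the loss with respect to theta."""
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def adapt(theta, support, alpha):
    """Inner step: theta'_i = theta - alpha * grad L_{T_i}(f_theta)."""
    xs, ys = support
    return theta - alpha * grad(theta, xs, ys)

def meta_gradient(theta, support, query, alpha):
    """Gradient of L_{T_i}(f_{theta'_i}) with respect to the initial theta."""
    xs_s, _ = support
    xs_q, ys_q = query
    theta_prime = adapt(theta, support, alpha)
    # Chain rule: d theta'_i / d theta = 1 - alpha * (second derivative of
    # the support loss), which is 2 * mean(x^2) for this toy model.
    d_theta_prime = 1.0 - alpha * sum(2 * x * x for x in xs_s) / len(xs_s)
    return grad(theta_prime, xs_q, ys_q) * d_theta_prime

def meta_update(theta, tasks, alpha, beta):
    """Outer step: theta = theta - beta * sum of per-task meta gradients."""
    total = sum(meta_gradient(theta, s, q, alpha) for s, q in tasks)
    return theta - beta * total

def make_task(slope):
    """Hypothetical task: points on y = slope * x, split into two sets."""
    support_x, query_x = [1.0, 2.0], [3.0, 4.0]
    return ((support_x, [slope * x for x in support_x]),
            (query_x, [slope * x for x in query_x]))

tasks = [make_task(2.0), make_task(-1.0)]
alpha, beta = 0.05, 0.01
theta = 0.0
theta_new = meta_update(theta, tasks, alpha, beta)
```

In a real model the factor d\theta'_i / d\theta involves second derivatives of the loss and would normally be handled by an automatic differentiation library rather than derived by hand.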

If you look at the meta update equation closely, you will notice that we update the model parameter \theta simply by summing (or, up to a constant factor, averaging) the gradients of the task losses, each computed with its optimal parameter \theta'_i.

The overall MAML algorithm is shown in the figure below. It consists of two loops: an inner loop, where we find the optimal parameter \theta'_i for each task T_i, and an outer loop, where we update our randomly initialized model parameter \theta by calculating gradients of the losses at the optimal parameters \theta'_i on a new set of data points from each task T_i. **We should always remember not to reuse the data points with which we found the optimal parameter \theta'_i when updating the model parameter \theta in the outer loop.**

So, in a nutshell, in MAML we sample a batch of tasks and, for each task T_i in the batch, we minimize the loss using gradient descent to get the optimal parameter \theta'_i. Then, before sampling another batch of tasks, we update our randomly initialized model parameter \theta by calculating gradients of the losses at the optimal parameters \theta'_i on a new set of data points from each task T_i.
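
Putting the two loops together, the whole procedure can be sketched as below. This is a toy illustration under stated assumptions, not a definitive implementation: the model is a hypothetical f_{\theta}(x) = \theta x with a mean squared error loss, a "task" is simply a slope drawn uniformly from [1, 3], and, following the original MAML formulation, the meta step uses fresh data points drawn from each sampled task. The function names and hyperparameter values are made up for the example.

```python
import random

def grad(theta, xs, ys):
    """Gradient of the mean squared error of f_theta(x) = theta * x."""
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def sample_points(slope, k, rng):
    """Draw K noise-free data points (x, slope * x) for one task."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(k)]
    return xs, [slope * x for x in xs]

def maml_train(theta, iterations, batch_size, k, alpha, beta, rng):
    for _ in range(iterations):
        meta_grad = 0.0
        # Sample a batch of tasks T_i ~ p(T); here a task is just a slope.
        for _ in range(batch_size):
            slope = rng.uniform(1.0, 3.0)
            xs_s, ys_s = sample_points(slope, k, rng)  # data for the inner step
            xs_q, ys_q = sample_points(slope, k, rng)  # fresh data for the meta step
            # Inner loop: theta'_i = theta - alpha * grad L_{T_i}(f_theta)
            theta_prime = theta - alpha * grad(theta, xs_s, ys_s)
            # Chain-rule factor d theta'_i / d theta for this quadratic loss.
            d_theta_prime = 1.0 - alpha * sum(2 * x * x for x in xs_s) / k
            meta_grad += grad(theta_prime, xs_q, ys_q) * d_theta_prime
        # Outer loop: theta = theta - beta * sum of per-task meta gradients
        theta = theta - beta * meta_grad
    return theta

theta_meta = maml_train(theta=-5.0, iterations=300, batch_size=5, k=5,
                        alpha=0.1, beta=0.05, rng=random.Random(0))
```

In this toy setting the learned `theta_meta` ends up inside the range of task slopes, that is, at a starting point from which a single inner step gets close to any task drawn from the distribution.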

In the next post, we will learn how to use MAML in supervised learning.

**Got any questions? Ask me in the comments section below.**