# Part 2: MAML in simple terms

Now that we have a basic understanding of model-agnostic meta-learning (MAML) from the previous post, let us explore it in more detail.

Let us say we have a model $f$ parameterized by $\theta$, i.e. $f_{\theta}()$, and a distribution over tasks $p(T)$. First, we initialize our parameter $\theta$ with random values. Next, we sample a batch of tasks $T_i$ from the distribution over tasks, i.e. $T_i \sim p(T)$. Let us say we have sampled five tasks, $T = \{T_1, T_2, T_3, T_4, T_5\}$. Then, for each task $T_i$, we sample K data points and train the model: we compute the loss $L_{T_i}(f_{\theta})$ and minimize it using gradient descent to find the set of parameters that minimizes the loss for that task, i.e.,

$\theta'_i = \theta - \alpha \nabla_{\theta} L_{T_i}(f_{\theta})$

where,

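• $\theta'_i$ is the optimal parameter for task $T_i$, i.e. the parameter obtained after the gradient step on that task
• $\theta$ is the initial parameter
• $\alpha$ is the learning rate (step size) hyperparameter of this update
• $\nabla_{\theta} L_{T_i}(f_{\theta})$ is the gradient of the loss of task $T_i$ with respect to $\theta$.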
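
As a minimal sketch of this update, the snippet below applies one gradient step to a single toy task using NumPy. The linear model, the mean squared error loss, the helper names (`mse_loss`, `mse_grad`, `inner_update`), the synthetic data, and the value of `alpha` are all illustrative assumptions and are not part of the description above.

```python
import numpy as np

def mse_loss(theta, x, y):
    """Mean squared error of the assumed linear model f_theta(x) = x @ theta."""
    residual = x @ theta - y
    return float(np.mean(residual ** 2))

def mse_grad(theta, x, y):
    """Analytic gradient of mse_loss with respect to theta."""
    residual = x @ theta - y
    return 2.0 * x.T @ residual / len(y)

def inner_update(theta, x, y, alpha):
    """One gradient step: theta'_i = theta - alpha * grad_theta L_{T_i}(f_theta)."""
    return theta - alpha * mse_grad(theta, x, y)

rng = np.random.default_rng(0)
theta = rng.normal(size=2)           # randomly initialized parameter
x = rng.normal(size=(10, 2))         # K = 10 data points sampled for one task
y = x @ np.array([1.5, -0.5])        # hypothetical task: a fixed linear map

theta_prime = inner_update(theta, x, y, alpha=0.05)
print(mse_loss(theta, x, y), mse_loss(theta_prime, x, y))
```

With a sufficiently small step size, the loss at `theta_prime` is lower than at `theta`, but a single step does not in general reach the exact minimizer for the task.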

So, after the above gradient update, we will have optimal parameters $\theta'$ for all five of the tasks we sampled, i.e. $\theta' = \{\theta'_1, \theta'_2, \theta'_3, \theta'_4, \theta'_5\}$.

Now, before sampling the next batch of tasks, we perform the meta update, or meta optimization. In the previous step, we found the optimal parameter $\theta'_i$ by training on each of the tasks $T_i$. Now we calculate the gradient of the loss evaluated at these optimal parameters $\theta'_i$ on a new set of tasks $T_i$ and use it to update our randomly initialized parameter $\theta$. This moves our randomly initialized parameter $\theta$ to a position from which we do not have to take many gradient steps while training on the next batch of tasks. This step is called the meta step, meta update, meta optimization, or meta training. It can be expressed as,

$\theta = \theta - \beta \nabla_{\theta} \sum_{T_i \sim p(T)} L_{T_i}(f_{\theta'_i})$

where,

• $\theta$ is our initial parameter
• $\beta$ is the learning rate (step size) hyperparameter of the meta update
• $\nabla_{\theta} \sum_{T_i \sim p(T)} L_{T_i}(f_{\theta'_i})$ is the gradient of the summed loss over the new tasks $T_i$, each evaluated at its optimal parameter $\theta_i'$.
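
The meta update can be sketched for the same kind of toy setting. This example assumes a linear model with mean squared error, for which the derivative of $\theta'_i$ with respect to $\theta$ is available in closed form. It also assumes that the "new set" used in the meta update consists of freshly drawn data points for each sampled task, which is one common reading. All function and variable names here are hypothetical.

```python
import numpy as np

def loss(theta, x, y):
    """Mean squared error of the assumed linear model f_theta(x) = x @ theta."""
    return float(np.mean((x @ theta - y) ** 2))

def grad(theta, x, y):
    """Analytic gradient of `loss` with respect to theta."""
    return 2.0 * x.T @ (x @ theta - y) / len(y)

def adapt(theta, x, y, alpha):
    """Inner step: theta'_i = theta - alpha * grad_theta L_{T_i}(f_theta)."""
    return theta - alpha * grad(theta, x, y)

def meta_loss(theta, tasks, alpha):
    """Sum over tasks of the loss on the new data, evaluated at theta'_i."""
    return sum(loss(adapt(theta, xs, ys, alpha), xq, yq)
               for xs, ys, xq, yq in tasks)

def meta_gradient(theta, tasks, alpha):
    """Gradient of `meta_loss` with respect to theta, via the chain rule."""
    total = np.zeros_like(theta)
    for xs, ys, xq, yq in tasks:
        theta_i = adapt(theta, xs, ys, alpha)
        # d theta'_i / d theta = I - alpha * Hessian of the inner loss
        # (the Hessian of MSE for a linear model is 2 * X^T X / K).
        jac = np.eye(theta.size) - alpha * 2.0 * xs.T @ xs / len(ys)
        total += jac.T @ grad(theta_i, xq, yq)
    return total

def meta_update(theta, tasks, alpha, beta):
    """Outer step: theta <- theta - beta * grad_theta sum_i L_{T_i}(f_{theta'_i})."""
    return theta - beta * meta_gradient(theta, tasks, alpha)

rng = np.random.default_rng(0)
tasks = []
for _ in range(5):                              # five sampled tasks
    w = rng.normal(size=2)                      # hypothetical per-task target weights
    xs, xq = rng.normal(size=(10, 2)), rng.normal(size=(10, 2))
    tasks.append((xs, xs @ w, xq, xq @ w))      # (inner-step data, new data)

theta = rng.normal(size=2)
alpha, beta = 0.05, 0.01
new_theta = meta_update(theta, tasks, alpha, beta)
print(meta_loss(theta, tasks, alpha), meta_loss(new_theta, tasks, alpha))
```

For models without a closed-form Hessian, an automatic differentiation library would normally compute this gradient. The explicit Jacobian here is only meant to show where the dependence on $\theta$ enters.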

If you look closely at the meta update equation above, you will notice that we update our model parameter $\theta$ by simply summing the gradients of each new task $T_i$, evaluated at its optimal parameter $\theta_i'$. This is the same as averaging them, up to a constant factor that can be absorbed into $\beta$.
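
To make the dependence on $\theta$ explicit, the gradient of a single task's term can be expanded with the chain rule. This is a sketch that assumes the one-step update $\theta'_i = \theta - \alpha \nabla_{\theta} L_{T_i}(f_{\theta})$ given earlier and a twice-differentiable loss:

$\nabla_{\theta} L_{T_i}(f_{\theta'_i}) = \left(I - \alpha \nabla^2_{\theta} L_{T_i}(f_{\theta})\right) \nabla_{\theta'_i} L_{T_i}(f_{\theta'_i})$

Here $I$ is the identity matrix and $\nabla^2_{\theta} L_{T_i}(f_{\theta})$ is the Hessian of the loss used in the first gradient step. If the Hessian term is dropped, or $\alpha$ is small, the factor in parentheses is close to $I$ and each term reduces to the gradient evaluated at $\theta'_i$.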

The overall algorithm of MAML is shown in the figure below. The algorithm consists of two loops: an inner loop, where we find the optimal parameter $\theta_i'$ for each task $T_i$, and an outer loop, where we update our randomly initialized model parameter $\theta$ by calculating gradients with respect to the optimal parameters $\theta_i'$ on a new set of tasks $T_i$. __We should always remember not to use the same set of tasks $T_i$ with which we found the optimal parameter $\theta_i'$ when updating the model parameter $\theta$ in the outer loop.__
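
The two loops can be put together in a short, hedged sketch. It assumes a toy family of linear regression tasks whose target weights cluster around a hypothetical `center`, a mean squared error loss, and fresh query samples per task for the outer update. None of these choices, nor the names `sample_task` and `maml`, come from the description above.

```python
import numpy as np

def grad(theta, x, y):
    """Gradient of the mean squared error of f_theta(x) = x @ theta."""
    return 2.0 * x.T @ (x @ theta - y) / len(y)

def sample_task(rng, center, k=10):
    """Hypothetical task: a linear map near `center`, with separate
    data for the inner step (xs, ys) and the outer step (xq, yq)."""
    w = center + 0.1 * rng.normal(size=center.shape)
    xs = rng.normal(size=(k, center.size))
    xq = rng.normal(size=(k, center.size))
    return xs, xs @ w, xq, xq @ w

def maml(theta, rng, center, alpha=0.05, beta=0.01, batch_size=5, meta_steps=200):
    for _ in range(meta_steps):                  # outer loop: meta updates of theta
        meta_grad = np.zeros_like(theta)
        for _ in range(batch_size):              # inner loop: adapt to each sampled task
            xs, ys, xq, yq = sample_task(rng, center)
            theta_i = theta - alpha * grad(theta, xs, ys)
            # Chain rule through the inner step (exact for this quadratic loss).
            jac = np.eye(theta.size) - alpha * 2.0 * xs.T @ xs / len(ys)
            meta_grad += jac.T @ grad(theta_i, xq, yq)
        theta = theta - beta * meta_grad         # meta update before the next batch
    return theta

rng = np.random.default_rng(0)
center = np.array([1.5, -0.5])
theta0 = rng.normal(size=2)                      # randomly initialized parameter
theta = maml(theta0, rng, center)
print(np.linalg.norm(theta0 - center), np.linalg.norm(theta - center))
```

In this toy family, the learned `theta` ends up near `center`, so a single inner step from it adapts well to any newly sampled task. Real task distributions will not usually have such a simple structure.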

So, in a nutshell, in MAML we sample a batch of tasks and, for each task $T_i$ in the batch, minimize the loss using gradient descent to get the optimal parameter $\theta_i'$. Then, before sampling another batch of tasks, we update our randomly initialized model parameter $\theta$ by calculating gradients with respect to the optimal parameters $\theta_i'$ on a new set of tasks $T_i$.

In the next post, we will learn how to use MAML in supervised learning.