Model-Agnostic Meta-Learning Algorithm — Case Study

Sharifdeen Ashshak
7 min read · Nov 24, 2023
Fig 01: Meta-learning algorithm (MAML). Extracted from [1] section 2.1.

Meta-learning is an approach in which a model learns about learning itself: how to rapidly adapt to and make sense of new tasks. Few-shot learning, in turn, is a technique for training models with only a minimal amount of data.

The primary objective of few-shot meta-learning is to train a model that can quickly adapt to a new task using only a few data points and training iterations. To achieve this, the model must be trained rapidly with a limited dataset. During meta-learning, the model is trained so that it adapts well not only to the diverse range of tasks seen during training, but also to new, previously unseen tasks.

Let's first take a brief look at how few-shot learning works and how it produces an output.

The initial phase involves training a network on a large dataset drawn from entirely different classes. When a new image, referred to as the query image, is introduced, the network compares it with a support set of images to measure similarity. This comparison is the foundational step that allows the network to discern and generalize patterns across tasks.

Fig 02: Support set, query set, and training set data samples

In few-shot learning, the support set comprises K classes (K-way), each with N samples per class (N-shot). A CNN or a Siamese network is commonly used to extract features from this support set. For instance, for a 3-way 2-shot support set (three classes with two samples each), a CNN can serve as the feature extractor, capturing distinctive features from those six images.

These extracted features are then compared with the features extracted from the query image by the same pre-trained CNN. The comparison is done by measuring the distances between the feature vectors: the support-set class at the smallest distance from the query is predicted as the query image's class.

It's important to note that during this process, the network's weights and biases are trained on the support set, using a limited number of tasks rather than an extensive dataset. This allows the network to generalize and adapt more effectively to new tasks by leveraging representations learned from a variety of task-based inputs.


Fig 03: Support Set which uses 4-way 2-shot
Fig 04: Similarity Score between Support set and Query Image
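As a rough sketch of this comparison step, here is what nearest-neighbour matching over the support set could look like. The embed function is only a placeholder for whatever pre-trained CNN or Siamese network produces the feature vectors; it is an assumption, not part of the original setup.

import numpy as np

def predict_class(embed, support_set, query_image):
    # embed: placeholder for a pre-trained feature extractor that maps an image to a 1-D feature vector
    # support_set: list of (image, class_label) pairs, i.e. the K-way N-shot support set
    # query_image: the image whose class we want to predict
    query_feat = embed(query_image)
    best_label, best_dist = None, float("inf")
    for image, label in support_set:
        # Euclidean distance between the support feature and the query feature
        dist = np.linalg.norm(embed(image) - query_feat)
        if dist < best_dist:
            # the smallest distance determines the predicted class
            best_label, best_dist = label, dist
    return best_label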

In this article, we will discuss the following:

  • Introduction to MAML
  • Basic Code Implementation of MAML

A Model-Agnostic Meta-Learning Algorithm:

Today, we’re exploring a method that helps models quickly learn and adapt. Imagine if a model could pick up useful tricks that apply to lots of different tasks instead of just one specific job. That’s the idea here.

We're talking about a distribution of different tasks (in practice, batches of tasks), let's call it p(T), that our model needs to handle. We want this model to get better at new tasks without getting locked into one way of doing things. So we try to find the settings in the model that really make a big difference when it's faced with a new task from p(T).

Now, the technical part: the model is like a machine we can adjust using settings called parameters θ. When it faces a new task Ti, it tweaks these settings a bit to task-specific values θi′. We work out these new settings with gradient descent, which makes the model better at the new task without forgetting what it already knows. It's like adjusting a few knobs to make the machine work better for a new job.

This approach aims to find just the right settings in the model, ones that are highly responsive to different tasks, so that even small adjustments to these settings make a big improvement when the model deals with any task from p(T).

Let’s break down how this algorithm works with a simple example.

Our model f is defined by parameters θ, and we are dealing with a distribution of tasks p(T). To start, we set the step-size hyperparameters α and β and randomly initialize θ.

The goal of this method is to optimize the model’s parameters so that when faced with a new task, just a few steps using gradients will make the model perform at its best for that task. From the distribution p(T), we gather batches of tasks. Within each batch, there are several tasks (T1,T2,T3,…,Tn), and for each task Ti, there are K data points to use for training our model.
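Formally, in the notation of [1], this goal can be written as the meta-objective:

\min_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\left(f_{\theta_i'}\right) = \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\left(f_{\theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})}\right)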

For a classification task, we use cross-entropy as the loss function:

Fig 05: Equation for Loss Function. Extracted from [1] Section 3.1
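In standard notation, this (binary) cross-entropy loss for a task Ti, written with the leading minus sign as implemented in the code below, is:

\mathcal{L}_{\mathcal{T}_i}(f_{\theta}) = -\sum_{(\mathbf{x}^{(j)}, \mathbf{y}^{(j)}) \sim \mathcal{T}_i} \left[ \mathbf{y}^{(j)} \log f_{\theta}\big(\mathbf{x}^{(j)}\big) + \big(1 - \mathbf{y}^{(j)}\big) \log \big(1 - f_{\theta}(\mathbf{x}^{(j)})\big) \right]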

We then take a gradient step on that loss using the formula below:

Fig 06: Task Based Gradient update. Extracted from [1] Section 2.2
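The task-specific (inner-loop) update in Fig 06 is a single gradient-descent step with step size α:

\theta_i' = \theta - \alpha \, \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta}\big)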

This updates the parameter θ to a task-specific θi′ for each task, so at the end of each batch of tasks we have the adapted parameters θ1′, …, θn′. Before sampling the next batch of tasks, we perform the meta update using the formula below:

Fig 07: Meta update of parameter θ for each batch. Extracted from [1] Section 2.2
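The meta update in Fig 07 moves θ itself with step size β, using the losses evaluated at the adapted parameters θi′:

\theta \leftarrow \theta - \beta \, \nabla_{\theta} \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta_i'}\big)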

Here, the adapted θi′ for each task is fed back into the model f, the loss is evaluated at those parameters, and the resulting gradient updates θ itself. Then the next batch of tasks is sampled and the process continues. The picture below illustrates how the algorithm works.

Fig 08: Workflow of MAML. Extracted from [2].

Below is a code snippet demonstrating the working principle of the MAML Algorithm:

import numpy as np


def sample_points(k):
    #NOTE: illustrative stand-in for a real task sampler (an assumption added so the snippet runs end to end) --
    #it draws k random 50-dimensional inputs with random binary labels
    x = np.random.rand(k, 50)
    y = np.random.choice([0, 1], size=k, p=[0.5, 0.5]).reshape([-1, 1])
    return x, y


class MAML(object):
    def __init__(self):

        #number of tasks we need in each batch of tasks
        self.num_tasks = 10

        #number of samples i.e. number of shots - the number of data points (k) we need in each task
        self.num_samples = 10

        #number of epochs i.e. training iterations
        self.epochs = 10000

        #hyperparameter for the inner loop (inner gradient update)
        self.alpha = 0.0001

        #hyperparameter for the outer loop (outer gradient update) i.e. meta optimization
        self.beta = 0.0001

        #randomly initialize our model parameter theta
        self.theta = np.random.normal(size=50).reshape(50, 1)

    #define our sigmoid activation function
    def sigmoid(self, a):
        return 1.0 / (1 + np.exp(-a))

    #now let us get to the interesting part i.e. training
    def train(self):

        #for the number of epochs,
        for e in range(self.epochs):

            self.theta_ = []

            #inner loop: for task i in the batch of tasks
            for i in range(self.num_tasks):

                #sample k data points and prepare our train (support) set
                XTrain, YTrain = sample_points(self.num_samples)

                a = np.matmul(XTrain, self.theta)
                YHat = self.sigmoid(a)

                #since we are performing classification, we use cross entropy loss as our loss function
                loss = ((np.matmul(-YTrain.T, np.log(YHat)) -
                         np.matmul((1 - YTrain.T), np.log(1 - YHat))) / self.num_samples)[0][0]

                #minimize the loss by calculating gradients
                gradient = np.matmul(XTrain.T, (YHat - YTrain)) / self.num_samples

                #one gradient step gives the task-specific parameter theta'
                self.theta_.append(self.theta - self.alpha * gradient)

            #initialize meta gradients
            meta_gradient = np.zeros(self.theta.shape)

            #outer loop: evaluate each adapted theta' on a fresh (query) set
            for i in range(self.num_tasks):

                #sample k data points and prepare our test set for meta training
                XTest, YTest = sample_points(self.num_samples)

                #predict the value of y using the task-specific parameter theta'
                a = np.matmul(XTest, self.theta_[i])
                YPred = self.sigmoid(a)

                #compute and accumulate meta gradients
                meta_gradient += np.matmul(XTest.T, (YPred - YTest)) / self.num_samples

            #update our model parameter theta with the averaged meta gradients
            self.theta = self.theta - self.beta * meta_gradient / self.num_tasks

            if e % 1000 == 0:
                print("Epoch {}: Loss {}\n".format(e, loss))
                print("Updated Model Parameter Theta\n")
                print("Sampling Next Batch of Tasks\n")
                print("---------------------------------\n")
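A minimal usage sketch (relying on the illustrative sample_points stand-in defined above): create the model and call train(); the loop prints the most recent task loss every 1,000 epochs.

model = MAML()
model.train()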

I've attached my GitHub repository here for your reference. It showcases few-shot classification on the Mini-ImageNet dataset using the MAML algorithm. I plan to provide a detailed explanation of this code in my upcoming article.

And that's a wrap! I would greatly appreciate your valuable feedback on my content. This marks my debut in tackling advanced Deep Learning concepts.


References

[1] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML, 2017.

[2] Sudharsan Ravichandiran. Hands-On Meta Learning with Python. Packt Publishing, 2018.

[3] Stanford CS330: Deep Multi-Task and Meta Learning.
