Understanding Siamese Networks

Siamese networks are a special type of neural network and one of the simplest and most popular one-shot learning algorithms. One-shot learning is a technique where we learn from only one training example per class. So, siamese networks are predominantly used in applications where we don't have many data points for each class.

For instance, let's say we want to build a face recognition model for our organization, and about 500 people work there. If we build the face recognition model using a convolutional neural network (CNN) from scratch, we need many images of all 500 people to train the network and attain good accuracy. But typically we will not have many images of each of these 500 people, so it is not feasible to build a model using a CNN or any other deep learning algorithm unless we have sufficient data points. In these scenarios, we can resort to a one-shot learning algorithm such as a siamese network, which can learn from fewer data points.

But how do siamese networks work?

Siamese networks basically consist of two symmetrical neural networks, both sharing the same weights and architecture, joined together at the end by an energy function, E. The objective of our siamese network is to learn whether two inputs are similar or dissimilar. Let's say we have two images, X_1 and X_2, and we want to learn whether the two images are similar or dissimilar.

As shown in the following diagram, we feed the image X_1 to Network A and the image X_2 to Network B. The role of both of these networks is to generate embeddings (feature vectors) for their input images, so we can use any network that gives us embeddings. Since our input is an image, we can use a convolutional network to generate the embeddings, that is, to extract features. Remember that the role of the CNN here is only to extract features, not to classify. Because these networks must have the same weights and architecture, if Network A is a three-layer CNN, then Network B must also be a three-layer CNN, and we have to use the same set of weights for both networks. So Network A and Network B give us the embeddings for the input images X_1 and X_2, respectively. We then feed these embeddings to the energy function, which tells us how similar the two input images are. The energy function can be any similarity measure, such as Euclidean distance or cosine similarity.
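The flow above can be sketched with a toy shared-weight network. This is a minimal sketch in NumPy: a hypothetical single linear layer stands in for the twin CNNs, and Euclidean distance serves as the energy function.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single shared weight matrix stands in for the twin networks:
# both inputs pass through the SAME weights, so f_w(X1) and f_w(X2)
# live in the same embedding space.
W = rng.standard_normal((4, 3))

def f_w(x):
    """Toy embedding network: one linear layer followed by tanh."""
    return np.tanh(x @ W)

def energy(x1, x2):
    """Energy function E: Euclidean distance between the two embeddings."""
    return np.linalg.norm(f_w(x1) - f_w(x2))

x1 = rng.standard_normal(4)
x2 = x1 + 0.01 * rng.standard_normal(4)  # near-duplicate of x1
x3 = rng.standard_normal(4)              # unrelated input

print(energy(x1, x2))  # small: the inputs are similar
print(energy(x1, x3))  # larger: the inputs are dissimilar
```

Because the weights are shared, similar inputs land close together in embedding space and the energy is low; unrelated inputs land far apart and the energy is high.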

Siamese networks are not only used for face recognition; they are also used extensively in applications where we don't have many data points, and in tasks where we need to learn a similarity between two inputs. Applications of siamese networks include signature verification, similar question retrieval, object tracking, and more.


Architecture of Siamese Network

Now that we have a basic understanding of siamese networks, we will explore them in detail. The architecture of a siamese network is shown in the following diagram:

Architecture of Siamese Network

As you can see in the preceding diagram, a siamese network consists of two identical networks, both sharing the same weights and architecture. Let's say we have two inputs, X_1 and X_2. We feed our input X_1 to Network A, that is, f_W(X_1), and we feed our input X_2 to Network B, that is, f_W(X_2). As you will notice, both of these networks have the same weights, W, and they will generate embeddings for our inputs, X_1 and X_2. Then, we feed these embeddings to the energy function, E, which will give us the similarity between the two inputs.

It can be expressed as follows:

E_{W}\left(X_{1}, X_{2}\right)=\left\|f_{W}\left(X_{1}\right)-f_{W}\left(X_{2}\right)\right\|

Let's say we use Euclidean distance as our energy function. Then the value of E will be small if X_1 and X_2 are similar, and large if the inputs are dissimilar.
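As a concrete check of this behavior, here are hypothetical 3-dimensional embeddings (invented values, not the output of a trained network):

```python
import numpy as np

# Hypothetical embeddings produced by the shared network f_W
emb_anchor     = np.array([0.2, 0.9, -0.4])
emb_similar    = np.array([0.25, 0.85, -0.35])  # close to the anchor
emb_dissimilar = np.array([-0.8, 0.1, 0.7])     # far from the anchor

# Euclidean distance as the energy function E
E_similar = np.linalg.norm(emb_anchor - emb_similar)
E_dissimilar = np.linalg.norm(emb_anchor - emb_dissimilar)

print(round(E_similar, 3))     # 0.087 -> similar pair, low energy
print(round(E_dissimilar, 3))  # 1.688 -> dissimilar pair, high energy
```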

Assume that you have two sentences, sentence 1 and sentence 2. We feed sentence 1 to Network A and sentence 2 to Network B. Let's say both Network A and Network B are LSTM networks sharing the same weights. Network A and Network B will then generate embeddings for sentence 1 and sentence 2, respectively.

Then, we feed these embeddings to the energy function, which gives us the similarity score between the two sentences. But how can we train our siamese network? What should the data look like? What are the features and labels? What is our objective function? The input to a siamese network should be in pairs, (X_1, X_2), along with a binary label, Y ∈ {0, 1}, stating whether the input pair is a genuine pair (similar) or an impostor pair (dissimilar). As you can see in the following table, we have sentence pairs, and the label implies whether the pairs are genuine (1) or impostor (0):

data for Siamese Network
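A minimal sketch of what such paired training data could look like in code (the example sentences here are invented, not taken from the table above):

```python
# Hypothetical (sentence_1, sentence_2, label) triples:
# label 1 = genuine pair (similar meaning), 0 = impostor pair (different)
pairs = [
    ("When will the heat subside?", "When will it cool down?", 1),
    ("When will the heat subside?", "Where can I renew my passport?", 0),
]

for s1, s2, y in pairs:
    kind = "genuine" if y == 1 else "impostor"
    print(f"{kind}: ({s1!r}, {s2!r})")
```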

So, what is the loss function of our siamese network? Since the goal of the siamese network is not to perform a classification task but to understand the similarity between the two inputs, we use the contrastive loss function.

It can be expressed as follows:

\operatorname{Contrastive\ Loss}=Y(E)^{2}+(1-Y) \max (\operatorname{margin}-E, 0)^{2}

In the preceding equation, Y is the true label, which is 1 when the two inputs are similar and 0 when they are dissimilar, and E is our energy function, which can be any distance measure. The margin term enforces the constraint that when two inputs are dissimilar and their distance is already greater than the margin, they incur no loss.
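The equation above can be sketched directly in NumPy. This is a minimal sketch; the margin value of 1.0 and the example energies are assumed for illustration.

```python
import numpy as np

def contrastive_loss(Y, E, margin=1.0):
    """Contrastive loss over a batch of pairs.

    Y: array of labels (1 = genuine pair, 0 = impostor pair)
    E: array of energies (distances between pair embeddings)
    """
    similar_term = Y * E**2                                    # pulls genuine pairs together
    dissimilar_term = (1 - Y) * np.maximum(margin - E, 0)**2   # pushes impostors beyond the margin
    return np.mean(similar_term + dissimilar_term)

Y = np.array([1.0, 1.0, 0.0, 0.0])
E = np.array([0.1, 0.9, 0.2, 1.5])  # energies from the energy function

print(round(contrastive_loss(Y, E), 3))  # 0.365
```

Note how the last pair (an impostor with E = 1.5, already beyond the margin of 1.0) contributes zero loss, exactly as the margin constraint in the text describes.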
