Generating Song Lyrics Using RNN Step by Step in TensorFlow

In this post, we will look at how to generate song lyrics using Recurrent Neural Networks (RNNs) in TensorFlow. To do this, we simply build a character-level RNN, meaning that on every time step, we predict a new character.

Let’s consider a small sentence: “What a beautiful d.”

At the first time step, the RNN predicts the next character as a, and the sentence is updated to "What a beautiful da."

At the next time step, it predicts the next character as y, and the sentence becomes "What a beautiful day."

In this manner, we predict a new character at each time step and generate a song. Instead of predicting a new character every time, we can also predict a new word every time, which is called a word-level RNN. For simplicity, let's start with a character-level RNN.

How do RNNs predict a new character at each time step? Suppose that at time step t=0, we feed an input character, say x. The RNN now predicts the next character based on this input. To do so, it computes the probability of every character in our vocabulary being the next character. Once we have this probability distribution, we randomly select the next character according to it. Confusing? Let us understand this better with an example.

For instance, as shown in the following figure, let's suppose that our vocabulary contains four characters: L, O, V, and E. When we feed the character L as input, the RNN computes the probability of every character in the vocabulary being the next character:


So, we have the probabilities as [0.0, 0.9, 0.0, 0.1], corresponding to the characters in the vocabulary [L,O,V,E]. With this probability distribution, we select O as the next character 90% of the time, and E as the next character 10% of the time. Predicting the next character by sampling from this probability distribution adds some randomness to the output.

On the next time step, we feed the predicted character from the previous time step and the previous hidden state as an input to predict the next character, as shown in the following figure:


So, on each time step, we feed the predicted character from the previous time step and the previous hidden state as input and predict the next character:


As you can see in the above figure, at time step t=2, V is passed as an input, and the RNN predicts the next character as E. But this does not mean that every time the character V is fed as input, the RNN will always return E as the output. Since we pass the input along with the previous hidden state, the RNN has a memory of all the characters it has seen so far.

So, the previous hidden state captures the essence of the previous input characters, which are L and O. Now, with this previous hidden state and the input V, the RNN predicts the next character as E.


Implementing RNN in TensorFlow

Now, we will look at how to build the RNN model in TensorFlow to generate song lyrics. The dataset used in this section can be downloaded from here. Unzip the dataset, place it in the data folder, and then run the code below.

First, let us import the required libraries:
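A minimal set of imports for the sketches that follow (TensorFlow 1.x style, since we will use placeholders and sessions later on):

```python
import warnings
warnings.filterwarnings('ignore')

import random
import numpy as np
import pandas as pd
import tensorflow as tf
```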

Data Preparation

Read the downloaded input dataset:
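For example, assuming the unzipped CSV is placed at data/songdata.csv (the file name is an assumption; adjust it to match your download):

```python
# read the lyrics dataset into a pandas DataFrame
df = pd.read_csv('data/songdata.csv')
```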

Let's look at a few rows of our data:
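For example:

```python
df.head()
```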


Our dataset consists of 57,650 song lyrics:
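For example, counting the rows of the DataFrame:

```python
len(df)
```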

>> 57650

We have song lyrics from 643 artists:
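For example, assuming the DataFrame has an artist column (as in the lyrics dataset used here):

```python
len(df['artist'].unique())
```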

>> 643

The number of songs from each artist is shown as follows:
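For example, showing the top ten artists by song count:

```python
df['artist'].value_counts()[:10]
```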


On average, we have about 89 songs per artist:
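For example:

```python
df['artist'].value_counts().values.mean()
```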

>> 89

We have the song lyrics in the column text, so we combine all the rows of that column into a single string and store it in a variable called data, as follows:
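A minimal sketch (joining the songs with a separator such as ', ' is one reasonable choice):

```python
# combine the lyrics of all the songs into one long string
data = ', '.join(df['text'])
```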

Let’s see a few lines of a song:
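For example, printing the first few hundred characters:

```python
data[:369]
```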

>> “Look at her face, it’s a wonderful face \nAnd it means something special to me \nLook at the way that she smiles when she sees me \nHow lucky can one fellow be? \n \nShe’s just my kind of girl, she makes me feel fine \nWho could ever believe that she could be mine? \nShe’s just my kind of girl, without her I’m blue \nAnd if she ever leaves me what could I do, what co”

Since we are building a character-level RNN, we will store all the unique characters in our dataset in a variable called chars; this is basically our vocabulary:
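One way to build the vocabulary:

```python
chars = sorted(list(set(data)))
```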

Store the vocabulary size in a variable called vocab_size:
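For example:

```python
vocab_size = len(chars)
```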

Since neural networks only accept numeric input, we need to convert every character in the vocabulary to a number.

We map each character in the vocabulary to a unique index. We define a char_to_ix dictionary, which maps every character to its index. To map an index back to its character, we also define the ix_to_char dictionary, which maps every index to its respective character:
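A minimal sketch of the two mappings:

```python
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}
```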

As you can see in the following code snippet, the character 's' is mapped to the index 68 in the char_to_ix dictionary:
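For example (the exact index depends on the characters present in your dataset):

```python
print(char_to_ix['s'])
```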

>> 68

Similarly, if we give 68 as an input to the ix_to_char, then we get the corresponding character, which is ‘s’:
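For example:

```python
print(ix_to_char[68])
```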

>> ‘s’

Once we obtain the character-to-integer mapping, we use one-hot encoding to represent the input and output in vector form. A one-hot encoded vector is basically a vector full of 0s, except for a 1 at the position corresponding to the character's index.

For example, let's suppose that vocab_size is 7, and the character z is at index 4 in the vocabulary. Then, the one-hot encoded representation for the character z looks as follows:
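A quick way to illustrate this, using an identity matrix to pick out the one-hot row:

```python
np.eye(7)[4]
```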

>> array([0., 0., 0., 0., 1., 0., 0.])

As you can see, we have a 1 at the corresponding index of the character, and the rest of the values are 0s. This is how we convert each character into a one-hot encoded vector.

In the following code, we define a function called one_hot_encoder, which will return the one-hot encoded vectors, given an index of the character:
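A minimal sketch of such a function, using an identity matrix of size vocab_size:

```python
def one_hot_encoder(index):
    # return the one-hot encoded rows for the given character indices
    return np.eye(vocab_size)[index]
```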

Defining the Network Parameters

We need to define all the network parameters.
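A reasonable starting point (the exact values are assumptions; feel free to tune them): the number of units in the hidden layer, the length of the input/output sequences, and the learning rate.

```python
hidden_size = 100     # number of units in the hidden layer
seq_length = 25       # length of the input and output sequences
learning_rate = 1e-3  # learning rate for the optimizer

seed_value = 42
tf.set_random_seed(seed_value)
random.seed(seed_value)
```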

Defining Placeholders

Now, we will define the TensorFlow placeholders. The placeholders for the input and output are as follows:
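A sketch (the names inputs and targets are my own): each placeholder holds a sequence of one-hot vectors of shape [sequence length, vocab_size].

```python
inputs = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name="inputs")
targets = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name="targets")
```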

Define the placeholder for the initial hidden state:
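For example (init_state is an assumed name):

```python
init_state = tf.placeholder(shape=[1, hidden_size], dtype=tf.float32, name="init_state")
```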

Define an initializer for initializing the weights of the RNN:
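For example, a small random-normal initializer:

```python
initializer = tf.random_normal_initializer(stddev=0.1)
```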

Defining Forward Propagation

Let’s define the forward propagation involved in the RNN, which is mathematically given as follows:

h_t = \operatorname{tanh}(U x_t + W h_{t-1} + b_h)
\hat{y}_t = \operatorname{softmax}(V h_t + b_y)
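A minimal sketch of these equations in TensorFlow 1.x, unrolling the RNN for seq_length steps and sharing the weights U, W, V and the biases across the steps (the variable names are my own):

```python
with tf.variable_scope("RNN") as scope:
    h_t = init_state
    y_hat = []
    # split the input into seq_length one-hot vectors and unroll the RNN
    for t, x_t in enumerate(tf.split(inputs, seq_length, axis=0)):
        if t > 0:
            scope.reuse_variables()  # share the weights across time steps
        U = tf.get_variable("U", [vocab_size, hidden_size], initializer=initializer)
        W = tf.get_variable("W", [hidden_size, hidden_size], initializer=initializer)
        V = tf.get_variable("V", [hidden_size, vocab_size], initializer=initializer)
        bh = tf.get_variable("bh", [1, hidden_size], initializer=initializer)
        by = tf.get_variable("by", [1, vocab_size], initializer=initializer)

        # h_t = tanh(U x_t + W h_{t-1} + b_h)
        h_t = tf.tanh(tf.matmul(x_t, U) + tf.matmul(h_t, W) + bh)
        # unnormalized prediction (logits) for this time step
        y_hat_t = tf.matmul(h_t, V) + by
        y_hat.append(y_hat_t)
```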

Apply softmax on the output and get the probabilities:
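Continuing the sketch: the probabilities for the last time step (used later for sampling) and the stacked logits for all the time steps.

```python
output_softmax = tf.nn.softmax(y_hat[-1])  # probabilities for the last time step
outputs = tf.concat(y_hat, axis=0)         # logits for all the time steps
```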

Compute the cross-entropy loss:
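For example, summing the softmax cross-entropy over all the time steps:

```python
loss = tf.reduce_sum(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=targets, logits=outputs))
```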

Store the final hidden state of the RNN in hprev. We use this final hidden state for making predictions:
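For example:

```python
hprev = h_t  # final hidden state of the RNN after the last time step
```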

Defining Backpropagation Through Time

Now, we will perform backpropagation through time (BPTT), with Adam as our optimizer. We will also perform gradient clipping to avoid the exploding gradient problem.

Initialize the Adam optimizer:
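For example (minimizer is an assumed name; it uses the learning_rate defined earlier):

```python
minimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
```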

Compute the gradients of the loss with the Adam optimizer:
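For example:

```python
gradients = minimizer.compute_gradients(loss)
```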

Set the threshold for the gradient clipping:
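For example, a clipping threshold of 5.0 (the value is an assumption):

```python
threshold = tf.constant(5.0, name="grad_clipping")
```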

Clip the gradients that exceed the threshold and bring them into the range:
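A sketch of clipping each gradient to the range [-threshold, threshold]:

```python
clipped_gradients = []
for grad, var in gradients:
    clipped_grad = tf.clip_by_value(grad, -threshold, threshold)
    clipped_gradients.append((clipped_grad, var))
```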

Update the weights with the clipped gradients:
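For example:

```python
updated_gradients = minimizer.apply_gradients(clipped_gradients)
```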

Start Generating Songs

Start the TensorFlow session and initialize all the variables:
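For example:

```python
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
```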

Now, we will look at how to generate the song lyrics using an RNN. What should the input and output to the RNN be? How does it learn? What is the training data? Let’s see the explanation, along with the code, step by step.

We know that in RNNs, the output predicted at a time step t will be sent as the input to the next time step; that is, on every time step, we need to feed the predicted character from the previous time step as input. So, we prepare our dataset in the same way.

For instance, look at the following table. Let's suppose that each row is a different time step; at time step t=0, the RNN predicted a new character, g, as the output. This will be sent as the input at the next time step, t=1.

However, if you look at the input at time step t=1, we removed the first character, o, from the input and added the newly predicted character, g, at the end of our sequence. Why do we remove the first character from the input? Because we need to maintain the sequence length.

Let's suppose that our sequence length is eight; adding a newly predicted character to our sequence increases the sequence length to nine. To avoid this, we remove the first character from the input, while adding a newly predicted character from the previous time step.

Similarly, in the output data, we also remove the first character on each time step, because once it predicts the new character, the sequence length increases. To avoid this, we remove the first character from the output on each time step, as shown in the following table:


Now, we will look at how we can prepare our input and output sequence similar to the preceding table. Instead of looking at the complete code, we will see the code step by step. The complete code block is given at the end.

Define a variable called pointer, which points to our current position in the dataset. We will set the pointer to 0, which means it points to the first character:
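For example:

```python
pointer = 0
```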

Define the input data:
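For example (input_sentence is an assumed name):

```python
input_sentence = data[pointer:pointer + seq_length]
```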

What does this mean? With the pointer and the sequence length, we slice the data. Consider that the seq_length is 25 and the pointer is 0. It will return the first 25 characters as input. So, data[pointer:pointer + seq_length] returns the following output:

"Look at her face, it's a "

Define the output, as follows:
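For example (output_sentence is an assumed name):

```python
output_sentence = data[pointer + 1:pointer + seq_length + 1]
```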

We slice the output data shifted one character ahead of the input data. So, data[pointer + 1:pointer + seq_length + 1] returns the following:

"ook at her face, it's a w"

As you can see, we added the next character in the sentence and removed the first character. So, on every iteration, we increment the pointer and traverse the entire dataset. This is how we obtain the input and output sentence for training the RNN.

As you have learned, an RNN only accepts numbers as input. Thus, once we have sliced the input and output sequences, we convert the characters into their indices using the char_to_ix dictionary that we defined earlier:
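For example (input_indices and target_indices are assumed names):

```python
input_indices = [char_to_ix[ch] for ch in input_sentence]
target_indices = [char_to_ix[ch] for ch in output_sentence]
```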

Convert the indices into one-hot encoded vectors by using the one_hot_encoder function we defined previously:
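For example:

```python
input_vector = one_hot_encoder(input_indices)
target_vector = one_hot_encoder(target_indices)
```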

These input_vector and target_vector arrays become the input and output for training the RNN. Let's start training.

The hprev_val variable stores the last hidden state of our trained RNN model. We use this for making predictions, and we store the loss in loss_val:
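A sketch of a single training step, using the placeholder and op names from the sketches above; at the very first iteration the previous hidden state is simply a vector of zeros.

```python
# initialize the previous hidden state to zeros at the start of training
hprev_val = np.zeros([1, hidden_size])

# one training step: feed the input, target, and previous hidden state,
# fetch the final hidden state and the loss, and run the update op
hprev_val, loss_val, _ = sess.run(
    [hprev, loss, updated_gradients],
    feed_dict={inputs: input_vector,
               targets: target_vector,
               init_state: hprev_val})
```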

We train the model for n iterations. After training, we start making predictions. Now, we will look at how to make predictions and generate song lyrics using our trained RNN. Set the sample_length, that is, the length of the sentence (song) we want to generate:
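For example, generating 500 characters (the value is an assumption):

```python
sample_length = 500
```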

Randomly select the starting index of the input sequence:
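For example (random_index is an assumed name):

```python
random_index = random.randint(0, len(data) - seq_length)
```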

Select the input sentence with the randomly selected index:
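For example (sample_input_sent is an assumed name):

```python
sample_input_sent = data[random_index:random_index + seq_length]
```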

As we know, we need to feed the input as numbers; convert the selected input sentence to indices:
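For example:

```python
sample_input_indices = [char_to_ix[ch] for ch in sample_input_sent]
```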

Remember, we stored the last hidden state of the RNN in hprev_val. We use that for making predictions. Now, we will create a new variable called sample_prev_state_val by copying the values from hprev_val.

The sample_prev_state_val variable is used as an initial hidden state for making predictions:
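For example:

```python
sample_prev_state_val = np.copy(hprev_val)
```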

Initialize the list for storing the predicted output indices:
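For example:

```python
predicted_indices = []
```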

Now, for t in range(sample_length), we perform the following steps and generate a song of the defined sample_length. A consolidated sketch of the whole sampling loop is given after the step descriptions below.

Convert the sampled_input_indices to the one-hot encoded vectors:

Feed the sample_input_vector, and also the hidden state sample_prev_state_val, as the initial hidden state to the RNN, and get the predictions. We store the output probability distribution in probs_dist:

Randomly select the index of the next character with the probability distribution generated by the RNN:

Add this newly predicted index, ix, to the sample_input_indices, and also remove the first index from sample_input_indices to maintain the sequence length. This will form the input for the next time step:

Store all the predicted character indices in the predicted_indices list:
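Putting the preceding steps together, a sketch of the sampling loop (output_softmax is the softmax op defined earlier on the last time step's output):

```python
for t in range(sample_length):
    # convert the current input indices into one-hot encoded vectors
    sample_input_vector = one_hot_encoder(sample_input_indices)

    # feed the input and the previous hidden state, and get the
    # probability distribution over the next character
    probs_dist, sample_prev_state_val = sess.run(
        [output_softmax, hprev],
        feed_dict={inputs: sample_input_vector,
                   init_state: sample_prev_state_val})

    # randomly sample the index of the next character from the distribution
    ix = np.random.choice(np.arange(vocab_size), p=probs_dist.ravel())

    # slide the window: drop the first index and append the new one
    sample_input_indices = sample_input_indices[1:] + [ix]

    # store the predicted index
    predicted_indices.append(ix)
```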

Convert all the predicted_indices to their characters:
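For example:

```python
predicted_chars = [ix_to_char[ix] for ix in predicted_indices]
```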

Combine all the predicted_chars and save them as text:
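For example:

```python
text = ''.join(predicted_chars)
```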

Print the predicted text on every 50,000th iteration:
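For example, assuming an iteration counter (incremented in the next step):

```python
if iteration % 50000 == 0:
    print('After %d iterations' % iteration)
    print(text)
    print('-' * 70)
```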

Increment the pointer and iteration:
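For example:

```python
pointer += seq_length
iteration += 1
```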

Complete Code Block for Generating Songs
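The linked notebook contains the full program; the following is a sketch that stitches the preceding snippets into one training loop. It assumes the graph ops (hprev, loss, updated_gradients, output_softmax), the placeholders, the one_hot_encoder function, and the char_to_ix/ix_to_char mappings defined above.

```python
sess = tf.Session()
sess.run(tf.global_variables_initializer())

pointer = 0
iteration = 0

while True:
    # reset the hidden state at the start and whenever we wrap around the data
    if iteration == 0 or pointer + seq_length + 1 >= len(data):
        hprev_val = np.zeros([1, hidden_size])
        pointer = 0

    # prepare the input and target sequences
    input_sentence = data[pointer:pointer + seq_length]
    output_sentence = data[pointer + 1:pointer + seq_length + 1]
    input_vector = one_hot_encoder([char_to_ix[ch] for ch in input_sentence])
    target_vector = one_hot_encoder([char_to_ix[ch] for ch in output_sentence])

    # one step of training
    hprev_val, loss_val, _ = sess.run(
        [hprev, loss, updated_gradients],
        feed_dict={inputs: input_vector,
                   targets: target_vector,
                   init_state: hprev_val})

    # periodically sample a song from the model
    if iteration % 50000 == 0:
        sample_length = 500
        random_index = random.randint(0, len(data) - seq_length)
        sample_input_sent = data[random_index:random_index + seq_length]
        sample_input_indices = [char_to_ix[ch] for ch in sample_input_sent]
        sample_prev_state_val = np.copy(hprev_val)
        predicted_indices = []

        for t in range(sample_length):
            sample_input_vector = one_hot_encoder(sample_input_indices)
            probs_dist, sample_prev_state_val = sess.run(
                [output_softmax, hprev],
                feed_dict={inputs: sample_input_vector,
                           init_state: sample_prev_state_val})
            ix = np.random.choice(np.arange(vocab_size), p=probs_dist.ravel())
            sample_input_indices = sample_input_indices[1:] + [ix]
            predicted_indices.append(ix)

        text = ''.join(ix_to_char[ix] for ix in predicted_indices)
        print('After %d iterations' % iteration)
        print(text)
        print('-' * 70)

    pointer += seq_length
    iteration += 1
```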

>>

After 0 iterations

uhiq iUYo9ra)1FZTezk3FGF CoG)i,uzWKrmpmqJuo(94rKty5"y-7t,1])zfwR2 FhFHcqdP2qy[!mw(1R[?xS?n(O-"x5?"k!efK MCnHSNA0h!SovpSppQ-(m,KBfn9"j.95p86F?Mun0[qdJ-L7F.Wv!W.GunT9CnwfGobu"WA?qAtfmREbZGCjDYvl:jN"D7?iMisuv2hgH1((Z0XepXA7G.Z28znn'SaEzDzCUuM7Fr0)ahqaX7!sJf-B1a5gO!iW8LbnJO1q" kjB jI2ZNd F-AX-hHhS-rM?RMy73ON[(Xu,N3T[AYD[?anzGVaBn4A Un0fOt!aDdNt7C)dSYapFz[!79"B9Z5Y!KJLD (zQDZmjpnY7-546t?T ?shi8W BA8Yp.ghO7up[:p?zM7::tuU "QDy97iWi:B:YBCVZ4)7NJnO[OW3zMvzc4liEs'CY,6-M,B[hBZ ?wtxdCRi-GlXtc[TmaB07L2Ui!hkk

---------------------------------------------------------------------

After 50000 iterations

you say that you love me in the way i do in my town of red pain the sky blue arms around you

.

.

After training for many iterations, the RNN will learn to generate better songs. To get better results, you can train the network on a larger dataset for more iterations.

The complete code used in this section is available as a Jupyter notebook here.

Got any queries or doubts? Ask me in the comments.
