Deep down, RNNs are feedforward neural networks, but with a catch: they have memory.
Definition:
RNNs are a class of neural networks built for modelling sequential data. They use reasoning from previous inputs, held in an internal memory, to inform upcoming predictions.
Why RNNs?
Before answering that, let me give an intuition for why memory matters when predicting a sequence.
Let’s take a sentence:
Usain Bolt, an eight-time Olympic gold medallist, is from ___
If I ask you to fill in the blank, your answer would probably be ‘Jamaica’ or ‘North America’. But what made you come up with those answers?
It’s the phrase ‘Usain Bolt’. This example illustrates the importance of memory in predicting a word.
Imagine using an ANN to fill in the blank. It would predict the next word using only ‘from’, so the prediction could be gibberish. But an RNN remembers what it saw earlier, which makes it capable of producing a reasonably accurate answer.
Well, how does a Neural Network have memory?
An RNN calculates its hidden state from the current input and the previous timestep’s hidden state. In simple words, whatever happened in the previous timestep affects the current output. This is essentially remembering something and using it when needed.
Feedforward Network (ANN)
Recurrent Neural Network (RNN)
RNN unrolled in time
An RNN is NOT multiple layers of an ANN stacked on top of each other; rather, it is a single ANN applied repeatedly across timesteps.
Working
Forward Pass-
At time t, the hidden state depends on the current input and the previous timestep’s hidden state, and can be written out as follows.
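The original equations were shown as a figure; a standard way to write the update (the weight names $W_{xh}$, $W_{hh}$, $W_{hy}$ are my notation, and biases are omitted for brevity) is:

$$h_t = F\big(W_{hh}\,h_{t-1} + W_{xh}\,x_t\big)$$

$$\hat{y}_t = W_{hy}\,h_t$$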
At each time step t, only the Hidden State (h) keeps updating while the Function (F) and Weights (W) stay the same
Common choices for the activation function F are Tanh, ReLU, and Sigmoid.
Loss-
The loss function for an RNN is defined as the sum of individual loss values of each timestep
Forward Pass and Loss
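Written out, with $L_t$ denoting the loss at timestep $t$ (e.g. a cross-entropy term; the symbol is my notation, not from the original figure), the total loss is simply:

$$L = \sum_{t} L_t$$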
Backpropagation Through Time (BPTT)-
Backpropagation is done at each point in time. At timestep T, the derivative of the loss L with respect to the weight matrix W is expressed as follows:
Backpropagation Through Time
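The equation itself was shown as an image; the standard chain-rule form it takes (my reconstruction, using the notation above) is:

$$\frac{\partial L_T}{\partial W} = \sum_{t=0}^{T} \frac{\partial L_T}{\partial \hat{y}_T}\,\frac{\partial \hat{y}_T}{\partial h_T}\,\frac{\partial h_T}{\partial h_t}\,\frac{\partial h_t}{\partial W}, \quad \text{where} \quad \frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}$$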
In a simple feedforward system, the gradient of the loss is calculated for each parameter.
The same happens here, but it happens for each timestep as all timesteps together contribute to the total loss.
And as each timestep depends on the previous one, calculating the gradient at a timestep, say t, requires the gradients of every timestep from 0 up to t-1.
If you’re able to follow till here, YOU’RE DOING AMAZING.
But we have two new villains here: exploding and vanishing gradients. During BPTT, we have to multiply many gradient terms together. For example, calculating the gradient at t=15 would require multiplying 15 gradient values.
Exploding Gradients:
Happen when many of the gradient values are > 1
Multiplying large numbers results in an even bigger output, which leads to drastic value updates
E.g. 10 × 5 × 10 × 7 = 3500
Vanishing Gradients:
Happen when many of the gradient values are < 1
Multiplying small numbers results in an even smaller output, which has no significant effect while updating values
E.g. 0.01 × 0.5 × 0.1 × 0.01 = 0.000005
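To see how quickly this compounds over a long sequence, here is a tiny Python sketch (the factors 1.5 and 0.9 are made-up illustrative values, not from any real network):

```python
# Repeatedly multiplying per-timestep gradient factors, as BPTT does:
# factors slightly above 1 blow up, factors slightly below 1 shrink towards 0.
steps = 50
exploding, vanishing = 1.0, 1.0
for _ in range(steps):
    exploding *= 1.5   # gradient factor > 1
    vanishing *= 0.9   # gradient factor < 1

print(f"1.5^{steps} = {exploding:.3e}")  # ~6.4e+08 -> drastic updates
print(f"0.9^{steps} = {vanishing:.3e}")  # ~5.2e-03 -> barely any update
```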
So it is evident that RNNs are useful for handling short sequences, but they fail when they encounter long sequences. That is, they have short-term memory.
Here is where gated cells come in.
LSTMs and GRUs were introduced to tackle the short-term memory problem. These networks are more complex and have a long-term memory, making them much more effective on long sequences.
Implementing RNN:
I will be implementing an RNN using PyTorch, as it makes it easier to understand everything that’s happening under the hood.
1. Define a class and the constructor
2. Create a function to generate the hidden state for the first pass
3. Create a ‘forward’ function that defines the flow of data.
- Create an initial hidden state
- Pass the input and the hidden state to the RNN
- Flatten the output
- Pass the output to the dense layer to get the prediction
All Together -
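The original code was embedded as a snippet; below is a minimal sketch following the steps above (class, method, and variable names such as RNNModel and init_hidden are my own choices):

```python
import torch
from torch import nn

class RNNModel(nn.Module):
    # 1. Define the class and the constructor
    def __init__(self, input_size, hidden_size, output_size, n_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.rnn = nn.RNN(input_size, hidden_size, n_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)  # dense layer for the prediction

    # 2. Generate a hidden state for the first pass
    def init_hidden(self, batch_size):
        return torch.zeros(self.n_layers, batch_size, self.hidden_size)

    # 3. Define the flow of data
    def forward(self, x):
        batch_size = x.size(0)
        hidden = self.init_hidden(batch_size)               # create an initial hidden state
        out, hidden = self.rnn(x, hidden)                   # pass input and hidden state to the RNN
        out = out.contiguous().view(-1, self.hidden_size)   # flatten the output
        out = self.fc(out)                                  # dense layer gives the prediction
        return out, hidden
```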
Training (Illustration without actual data) -
This model takes a 28-dimensional input (28 unique words in the dictionary), has a 12-dimensional hidden layer, and produces a 10-dimensional output. After applying softmax, taking the class with the highest probability gives us the predicted word.
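For illustration only, here is what a training loop could look like with random tensors in place of real data (the sizes match the description above, but the loop itself is my sketch, not the original code):

```python
# Hypothetical setup: 28-dim one-hot inputs, 12-dim hidden state, 10 output classes.
model = RNNModel(input_size=28, hidden_size=12, output_size=10)
criterion = nn.CrossEntropyLoss()                  # applies log-softmax internally
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

x = torch.rand(1, 5, 28)                           # batch of 1, sequence length 5
targets = torch.randint(0, 10, (5,))               # one target class per timestep

for epoch in range(100):
    optimizer.zero_grad()
    output, hidden = model(x)                      # output shape: (5, 10) after flattening
    loss = criterion(output, targets)              # combines the per-timestep losses
    loss.backward()                                # backpropagation through time
    optimizer.step()

# Prediction: the class with the highest probability after softmax
predicted = torch.softmax(output, dim=1).argmax(dim=1)
```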
I hope this article was able to provide what you were looking for and help you in your Deep Learning journey. Do drop down your reviews and share this along :)
Further Reading
Illustrated guide to RNNs - Video
The Unreasonable Effectiveness of Recurrent Neural Networks - Blog
See ya with another article. Thank you for giving this a read ❤