Sequence Models

Recurrent Neural Networks

Recurrent Neural Network Model

A Recurrent Neural Network (RNN) is a type of neural network designed for processing sequences of data. Unlike feedforward neural networks, which process data in a strictly linear fashion, RNNs have connections that loop back on themselves, allowing them to maintain a hidden state that captures information about previous elements in a sequence. This makes RNNs particularly well-suited for tasks involving sequences, such as time series analysis, natural language processing, and speech recognition.

Here's a basic explanation of how an RNN works and an example:

  1. Structure of an RNN: An RNN consists of three main components:

    • Input Layer: It takes input sequences, where each element in the sequence is represented as a vector. For example, in natural language processing, each word in a sentence can be represented as a one-hot encoded vector or word embeddings.

    • Hidden Layer: This layer maintains a hidden state that captures information about the past elements in the sequence. The hidden state is updated at each time step and is used to carry information from one step to the next.

    • Output Layer: This layer produces the output based on the hidden state. The output can be a prediction, classification, or any other task-specific result.

  2. Recurrent Connections: The key idea behind RNNs is that they have recurrent connections that allow information to flow from one time step to the next. The hidden state at each time step is a function of the current input and the previous hidden state, which forms a recurrent loop.

  3. Example: Language Modeling with an RNN: Let's consider a simple example of an RNN for language modeling, where the goal is to predict the next word in a sentence given the previous words. Suppose you have the following sentence: "The quick brown fox."

    • Input Encoding: Each word is encoded as a vector (e.g., using word embeddings), and the input sequence is fed into the RNN one word at a time.

    • Hidden State: At each time step, the RNN updates its hidden state based on the current input and the previous hidden state. This hidden state captures information about the words seen so far.

    • Output Prediction: After processing all words, the RNN can generate the next word in the sentence as the output.

    For instance, if the input is "The quick brown," the RNN may predict "fox" as the next word, as it has learned from the training data that "fox" often follows these words in a sentence.

RNNs have some limitations, such as difficulty in capturing long-term dependencies (the vanishing gradient problem) and not being able to handle sequences of variable lengths efficiently. To address these issues, more advanced RNN architectures like Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) have been developed. These architectures are designed to better capture long-range dependencies and have become popular choices for many sequence-related tasks.

Simplified RNN notation

a^{<t>} = g(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)

\hat{y}^{<t>} = g(W_{ya}a^{<t>} + b_y)
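
To make this notation concrete, here is a minimal NumPy sketch of one forward step under these equations. The dimensions, the tanh choice for g in the hidden update, and the softmax choice for the output are illustrative assumptions.

import numpy as np

def rnn_step(a_prev, x_t, W_aa, W_ax, b_a, W_ya, b_y):
    # a<t> = tanh(W_aa a<t-1> + W_ax x<t> + b_a)
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    # y_hat<t> = softmax(W_ya a<t> + b_y)
    z = W_ya @ a_t + b_y
    e = np.exp(z - z.max())          # numerically stable softmax
    return a_t, e / e.sum()

# Toy dimensions: hidden size 4, input size 3, vocabulary size 5
rng = np.random.default_rng(0)
W_aa, W_ax = rng.normal(size=(4, 4)), rng.normal(size=(4, 3))
W_ya = rng.normal(size=(5, 4))
b_a, b_y = np.zeros(4), np.zeros(5)

a = np.zeros(4)                      # a<0> is conventionally all zeros
for x in rng.normal(size=(6, 3)):    # a sequence of six input vectors
    a, y_hat = rnn_step(a, x, W_aa, W_ax, b_a, W_ya, b_y)
print(y_hat.sum())                   # the softmax output sums to 1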

Different Types of RNNs

  • One to one

  • One to many

  • Many to one

  • Many to many

From Medium

What is a Recurrent Neural Network?

Training a typical neural network involves the following steps:

  1. Input an example from a dataset.

  2. The network will take that example and apply some complex computations to it using randomly initialised variables (called weights and biases).

  3. A predicted result will be produced.

  4. Comparing that result to the expected value will give us an error.

  5. Propagating the error back through the same path will adjust the variables.

  6. Steps 1–5 are repeated until we are confident that our variables are well-tuned.

  7. A prediction is made by applying these variables to a new, unseen input.

Of course, that is a quite naive explanation of a neural network, but, at least, gives a good overview and might be useful for someone completely new to the field.

Recurrent neural networks work similarly but, in order to get a clear understanding of the difference, we will go through the simplest model using the task of predicting the next word in a sequence based on the previous ones.

First, we need to train the network using a large dataset. For this purpose, we can choose any large text (“War and Peace” by Leo Tolstoy is a good choice). Once training is done, we can input the sentence “Napoleon was the Emperor of…” and expect a reasonable prediction based on the knowledge from the book.

So, how do we start? As explained above, we input one example at a time and produce one result, both of which are single words. The difference with a feedforward network comes in the fact that we also need to be informed about the previous inputs before evaluating the result. So you can view RNNs as multiple feedforward neural networks, passing information from one to the other.

Let’s examine the following schema of the network unrolled over time:

Here x_1, x_2, x_3, …, x_t represent the input words from the text, y_1, y_2, y_3, …, y_t represent the predicted next words and h_0, h_1, h_2, h_3, …, h_t hold the information for the previous input words.

Since plain text cannot be used in a neural network, we need to encode the words into vectors. The best approach is to use word embeddings (word2vec or GloVe) but, for the purpose of this article, we will go for one-hot encoded vectors. These are (V,1) vectors (V is the number of words in our vocabulary) where all the values are 0, except for a 1 at the position of the word in the vocabulary. For example, if our vocabulary is apple, apricot, banana, …, king, …, zebra and the word is banana, then the vector is [0, 0, 1, …, 0, …, 0].

Typically, the vocabulary contains all English words, so these one-hot vectors are enormous and sparse. That is why it is necessary to use word embeddings in practice.

Let’s define the equations needed for training:

  • 1) h_t = f(W_hh h_(t-1) + W_hx x_t): holds information about the previous words in the sequence. As you can see, h_t is calculated using the previous h_(t-1) vector and the current word vector x_t. We also apply a non-linear activation function f (usually tanh or sigmoid) to the final summation. It is acceptable to assume that h_0 is a vector of zeros.

  • 2) y_t = softmax(W_S h_t): calculates the predicted word vector at a given time step t. We use the softmax function to produce a (V,1) vector with all elements summing up to 1. This probability distribution gives us the index of the most likely next word from the vocabulary.

  • 3) L_t = -Σ_j d_{t,j} log y_{t,j} (with d_t the one-hot encoding of the actual word): uses the cross-entropy loss function at each time step t to calculate the error between the predicted and actual word.

If you are wondering what these W’s are, each of them represents the weights of the network at a certain stage. As mentioned above, the weights are matrices initialised with random elements, adjusted using the error from the loss function. We do this adjustment using the backpropagation algorithm, which updates the weights. I will leave the explanation of that process for a later article but, if you are curious how it works, Michael Nielsen’s book is a must-read.

Once we have obtained the correct weights, predicting the next word in the sentence “Napoleon was the Emperor of…” is quite straightforward. Plugging each word at a different time step of the RNN would produce h_1, h_2, h_3, h_4. We can derive y_5 using h_4 and x_5 (vector of the word “of”). If our training was successful, we should expect that the index of the largest number in y_5 is the same as the index of the word “France” in our vocabulary.

Problems with a standard RNN

Unfortunately, if you implement the above steps, you won’t be so delighted with the results. That is because the simplest RNN model has a major drawback, called the vanishing gradient problem, which prevents it from being accurate.

In a nutshell, the problem comes from the fact that at each time step during training we are using the same weights to calculate y_t. That same multiplication is repeated during back-propagation, so the further we move backwards, the exponentially bigger or smaller our error signal becomes. This means that the network has difficulty memorising words from far away in the sequence and makes predictions based only on the most recent ones.

That is why more powerful models like LSTM and GRU come in handy. By solving the above issue, they have become the accepted way of implementing recurrent neural networks.

Backpropagation in RNN Model

Backpropagation in Recurrent Neural Networks (RNNs) is a fundamental algorithm for training these networks to make predictions on sequential data. I'll walk you through the backpropagation process with an example to illustrate how it works in an RNN.

Suppose we have an RNN with a single hidden layer and we want to train it to predict the next character in a sequence of text. The RNN processes the text character by character, and at each time step, it takes the current character and hidden state as input and produces an output character prediction. Let's go through a simplified example.

  1. Initialization:

    • Initialize the RNN's weights and biases, including the weights for the input-to-hidden connections (W_in), hidden-to-hidden connections (W_hh), hidden-to-output connections (W_out), and biases.

    • Initialize the hidden state (h) as a vector of zeros.

  2. Forward Pass:

    • The RNN processes the input sequence character by character, one time step at a time.

    • At each time step t, the RNN takes the current input character (x_t) and the previous hidden state (h_{t-1}) as inputs.

    • It computes the current hidden state (h_t) using the following formula: h_t = tanh(W_in * x_t + W_hh * h_{t-1})

    • The tanh function is a common activation function used in RNNs. It squashes the values to the range [-1, 1].

    • The RNN uses the current hidden state to make a prediction for the next character (y_t): y_t = softmax(W_out * h_t)

  3. Calculate Error:

    • Compare the predicted character (y_t) with the actual target character (y_t_actual) at each time step.

    • Calculate the cross-entropy loss for each time step: L_t = -log(y_t[y_t_actual])

    • The overall loss for the entire sequence is the sum of individual time step losses.

  4. Backward Pass:

    • Start the backpropagation process by calculating the gradients of the loss with respect to the output layer weights (W_out): ∂L/∂W_out = (y_t - y_t_actual) * h_t^T (an outer product, accumulated over time steps).

    • Update the output layer weights using an optimization algorithm like stochastic gradient descent (SGD).

    • Next, calculate the gradients of the loss with respect to the hidden state at each time step: ∂L/∂h_t = W_out^T * (y_t - y_t_actual) + ∂L/∂h_{t+1}, where ∂L/∂h_{t+1} is the gradient propagated from the next time step.

    • Propagate the gradients backward through time, first passing through the tanh nonlinearity: with δ_t = (1 - h_t^2) ⊙ ∂L/∂h_t, we have ∂L/∂h_{t-1} = W_hh^T * δ_t.

    • Compute the gradients for the hidden layer weights: ∂L/∂W_in = δ_t * x_t^T and ∂L/∂W_hh = δ_t * h_{t-1}^T (outer products, accumulated over time steps).

    • Perform weight updates for W_in and W_hh using an optimization algorithm.

  5. Repeat:

    • Continue the forward and backward passes for multiple iterations (epochs) until the loss converges to a minimum value.

This process iterates over the entire training dataset and gradually adjusts the network's weights and biases to minimize the error. The RNN learns to capture patterns in the sequential data, allowing it to make better predictions over time. This example simplifies the process for clarity, and in practice, more complex RNN architectures and optimization techniques are used for training.
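
Below is a minimal, self-contained NumPy sketch of the procedure just described for a character-level model. The toy string, hidden size, learning rate, and gradient clipping are illustrative assumptions (biases are omitted, matching the formulas above).

import numpy as np

text = "hello world hello world"
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
V, H = len(chars), 16                       # vocabulary and hidden sizes

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(H, V))   # input-to-hidden
W_hh = rng.normal(scale=0.1, size=(H, H))   # hidden-to-hidden
W_out = rng.normal(scale=0.1, size=(V, H))  # hidden-to-output
lr = 0.1

for epoch in range(200):
    # Forward pass: h_t = tanh(W_in x_t + W_hh h_{t-1}), y_t = softmax(W_out h_t)
    xs, hs, ps, loss = [], [np.zeros(H)], [], 0.0
    for t in range(len(text) - 1):
        x = np.zeros(V); x[idx[text[t]]] = 1.0
        h = np.tanh(W_in @ x + W_hh @ hs[-1])
        z = W_out @ h
        p = np.exp(z - z.max()); p /= p.sum()
        loss += -np.log(p[idx[text[t + 1]]])            # cross-entropy L_t
        xs.append(x); hs.append(h); ps.append(p)
    # Backward pass (backpropagation through time)
    dW_in, dW_hh, dW_out = np.zeros_like(W_in), np.zeros_like(W_hh), np.zeros_like(W_out)
    dh_next = np.zeros(H)
    for t in reversed(range(len(text) - 1)):
        dy = ps[t].copy(); dy[idx[text[t + 1]]] -= 1.0  # (y_t - y_t_actual)
        dW_out += np.outer(dy, hs[t + 1])
        dh = W_out.T @ dy + dh_next                     # dL/dh_t
        dz = (1.0 - hs[t + 1] ** 2) * dh                # through the tanh
        dW_in += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, hs[t])
        dh_next = W_hh.T @ dz                           # propagate to h_{t-1}
    for W, dW in ((W_in, dW_in), (W_hh, dW_hh), (W_out, dW_out)):
        W -= lr * np.clip(dW, -5, 5)                    # SGD step with clipping
print(f"final loss: {loss:.3f}")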

Language Model and Sequence Generation

Language modeling is a crucial concept in natural language processing (NLP) and artificial intelligence. It involves building statistical models that predict the likelihood of a sequence of words or characters in a given language. These models are used in a wide range of NLP tasks, including machine translation, speech recognition, text generation, and more.

A language model estimates the probability of a sequence of words or characters based on the context provided by the preceding words. One common approach to language modeling is using n-grams, where you estimate the probability of a word based on the previous n-1 words. However, more advanced models, such as neural language models, have become prominent due to their ability to capture long-range dependencies and understand context better.

Here's a simple example of language modeling using a neural network-based approach:

  1. Word-Level Language Model:

    Let's say you want to build a language model that predicts the next word in a sentence. You can use a recurrent neural network (RNN) or transformer-based architecture. Consider the following sentence: "The cat sat on the ____."

    • You tokenize the sentence into words: ["The", "cat", "sat", "on", "the", "____"].

    • The language model takes the first five words as input and predicts the probability distribution over possible next words for the blank space.

    • For instance, it might predict the word "mat" with high probability: P("mat" | "The", "cat", "sat", "on", "the").

    The model is trained on a large corpus of text, and during training, it learns the statistical patterns and relationships between words, enabling it to make predictions like this.

  2. Character-Level Language Model:

    Character-level language models predict the next character in a sequence of characters. For example, you might be trying to generate text letter by letter. Consider the following incomplete word: "Wor__".

    • The model takes the input "Wor" and predicts the next character, which might be "k" with high probability: P("k" | "Wor").

    Character-level language models can be used for text generation, including tasks like auto-completion, text generation, or even generating source code.

  3. Text Generation:

    Language models are used for text generation tasks, where the model generates coherent and contextually relevant text. For example, you can prompt a language model with an initial phrase, and it will continue generating text in the same style and context. Here's an example using a prompt and a fictional model's output:

    Prompt: "Once upon a time in a land far, far away,"

    Model Output: "there lived a brave knight named Sir Arthur. He embarked on a quest to rescue the princess from the clutches of the wicked dragon that terrorized the kingdom."

Language models have been used to generate creative text, assist in content creation, and even simulate conversations in chatbots and virtual assistants.

Language models have evolved over the years, with state-of-the-art models like GPT-3 and GPT-4 capable of generating highly coherent and contextually relevant text across various languages and domains. These models have a wide range of applications, including content generation, language translation, summarization, and more.
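
As a concrete illustration of the n-gram approach mentioned earlier, here is a tiny bigram language model: it estimates P(next word | current word) from raw counts and samples text. The corpus and seed word are toy assumptions.

import numpy as np
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()
counts = defaultdict(Counter)        # counts[w][w'] = how often w' follows w
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1

rng = np.random.default_rng(0)
out = ["the"]
for _ in range(8):
    if out[-1] not in counts:        # dead end: no observed successor
        break
    words, freqs = zip(*counts[out[-1]].items())
    probs = np.array(freqs, dtype=float) / sum(freqs)
    out.append(str(rng.choice(words, p=probs)))
print(" ".join(out))

Neural language models replace the count table with a learned network, but the interface is the same: a probability distribution over the vocabulary, conditioned on the preceding context.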

Vanishing Gradients with RNN

In deep learning, especially in the training of neural networks, the vanishing gradient problem occurs when the gradients of the loss function with respect to the weights become extremely small. This is particularly problematic in the context of RNNs.

Here's a brief breakdown:

  1. Recurrent Neural Networks (RNNs):

    • RNNs are designed to work with sequential data, processing inputs one step at a time while maintaining a hidden state that captures information from previous steps. This hidden state is updated at each time step based on the current input and the previous hidden state.

  2. Backpropagation Through Time (BPTT):

    • Training an RNN involves using backpropagation through time (BPTT). It's essentially an extension of backpropagation to handle sequences. The gradients are calculated with respect to the weights over multiple time steps.

  3. The Vanishing Gradient Problem:

    • The issue arises when the gradients calculated during backpropagation become extremely small as they are propagated back through time. This is especially problematic in long sequences. When the gradients become too small, the weights may not be updated effectively, leading to slow or stalled learning.

  4. Cause of Vanishing Gradients in RNNs:

    • In RNNs, the vanishing gradient problem is often caused by the repeated multiplication of small values during the backpropagation process. The gradients can diminish exponentially as they are backpropagated through time steps, especially if the weights in the network are small.

  5. Impact on Learning:

    • If the gradients vanish, the RNN struggles to capture long-term dependencies in the data. It may only effectively learn short-term dependencies, limiting its ability to understand and utilize information from distant time steps.

  6. Solutions:

    • Several techniques have been proposed to address the vanishing gradient problem in RNNs. Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) are two types of specialized RNN architectures that have been designed to mitigate this issue. These architectures include mechanisms to selectively retain or update information over time, allowing for better learning of long-range dependencies.

In summary, the vanishing gradient problem in RNNs arises when gradients become extremely small during backpropagation through time, hindering the effective learning of long-term dependencies in sequential data. Specialized RNN architectures, such as LSTMs and GRUs, have been developed to address this challenge.

Example: Predicting the Next Word in a Sentence

Imagine you have an RNN tasked with predicting the next word in a sentence. The network takes words as inputs and updates its hidden state at each time step. The goal is to learn the patterns and relationships between words.

  1. Problematic Scenario:

    • Consider a long sentence where the relevant information for predicting the next word is far back in the sequence. As the gradients are backpropagated through time, they get multiplied at each time step.

  2. Multiplicative Effect:

    • Let's say the weights in the network are relatively small (e.g., between 0 and 1). As the gradients are multiplied during backpropagation, they could become exponentially smaller as they go back in time.

  3. Vanishing Gradients:

    • By the time the gradients reach the earlier time steps, they could be so close to zero that the weights associated with those time steps don't get updated effectively.

  4. Impact on Learning:

    • The RNN may struggle to capture the relationships between words that are far apart in the sequence. It might focus more on short-term dependencies and fail to learn the long-range dependencies crucial for understanding the context of the sentence.

Now, let's contrast this with an example of how a specialized architecture, such as an LSTM, addresses the vanishing gradient problem:

Example with LSTM:

  • LSTMs have mechanisms like input, forget, and output gates that control the flow of information. These gates help the network decide what information to retain, update, or forget, mitigating the vanishing gradient problem.

In the context of our sentence prediction task, an LSTM would be better equipped to capture long-range dependencies, allowing it to learn and utilize information from earlier time steps more effectively than a standard RNN.
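
The multiplicative effect described above is easy to demonstrate numerically. The sketch below, with an assumed hidden size and small random weights, shows the norm of a backpropagated gradient shrinking as it travels further back in time; the tanh derivative, which is at most 1, would only shrink it further.

import numpy as np

rng = np.random.default_rng(0)
H = 8
W_hh = rng.normal(scale=0.1, size=(H, H))   # small recurrent weights
grad = np.ones(H)                           # gradient arriving at the last step
for k in range(1, 21):
    grad = W_hh.T @ grad                    # one step further back in time
    if k % 5 == 0:
        print(f"{k} steps back: |grad| = {np.linalg.norm(grad):.2e}")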

Gated Recurrent Unit (GRU)

Gated Recurrent Unit (GRU) is a type of recurrent neural network (RNN) architecture designed to address the vanishing gradient problem and improve the learning of long-term dependencies. It's similar to Long Short-Term Memory (LSTM) networks but has a slightly simpler structure.

Let's break down the main components of a GRU with an example:

Components of a GRU:

  1. Update Gate (z_t):

    • Determines how much of the previous state (h_{t-1}) to keep and how much of the new candidate state (h'_t) to incorporate.

  2. Reset Gate (r_t):

    • Controls how much of the previous state to forget.

  3. Current Memory Content (h'_t):

    • A candidate new state that is computed based on the current input and the reset gate.

  4. Hidden State (h_t):

    • The updated hidden state, a combination of the previous state and the new candidate state.

Now, let's consider an example of using a GRU for sentiment analysis.

Example: Sentiment Analysis with GRU

  1. Input Sequences:

    • Input sequences are sentences of variable length, representing movie reviews.

  2. Tokenization and Embedding:

    • Each word in the sentences is tokenized and embedded into a vector space.

  3. GRU Architecture:

    • The GRU processes the embedded words in a sequential manner, maintaining a hidden state (h_t).

  4. Sentiment Prediction:

    • The final hidden state (h_t) after processing the entire sequence is used for sentiment prediction.

  5. Training:

    • During training, the model learns to adjust the weights associated with the update gate, reset gate, and candidate state to capture the sentiment information in the input sequences.

  6. Prediction:

    • Given a new movie review, the trained GRU can predict the sentiment of the review.

    The use of GRU in this example allows the model to capture dependencies between words in the input sequences, considering both short-term and long-term information. The update and reset gates enable the GRU to selectively retain or discard information, improving its ability to understand the context and sentiment of the reviews.

From Medium

In this article, I will try to give a fairly simple and understandable explanation of one really fascinating type of neural network. Introduced by Cho, et al. in 2014, GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem which comes with a standard recurrent neural network. GRU can also be considered as a variation on the LSTM because both are designed similarly and, in some cases, produce equally excellent results. If you are not familiar with Recurrent Neural Networks, I recommend reading my brief introduction. For a better understanding of LSTM, many people recommend Christopher Olah’s article. I would also add this paper which gives a clear distinction between GRU and LSTM.

How do GRUs work?

As mentioned above, GRUs are an improved version of the standard recurrent neural network. But what makes them so special and effective?

To solve the vanishing gradient problem of a standard RNN, a GRU uses two so-called gates: an update gate and a reset gate. Basically, these are two vectors which decide what information should be passed to the output. The special thing about them is that they can be trained to keep information from long ago, without washing it out through time, or to remove information which is irrelevant to the prediction.

To explain the mathematics behind that process, we will examine a single GRU unit from the recurrent neural network in detail.

First, let’s introduce the notation: σ denotes the sigmoid function, tanh the hyperbolic tangent, and ⊙ the Hadamard (element-wise) product. If you are not familiar with this terminology, I recommend watching these tutorials about the “sigmoid” and “tanh” functions and the “Hadamard product” operation.

#1. Update gate

We start with calculating the update gate z_t for time step t using the formula: z_t = σ(W(z) x_t + U(z) h_(t-1)).

When x_t is plugged into the network unit, it is multiplied by its own weight W(z). The same goes for h_(t-1), which holds the information for the previous t-1 units and is multiplied by its own weight U(z). Both results are added together and a sigmoid activation function is applied to squash the result between 0 and 1.

The update gate helps the model to determine how much of the past information (from previous time steps) needs to be passed along to the future. That is really powerful because the model can decide to copy all the information from the past and eliminate the risk of vanishing gradient problem. We will see the usage of the update gate later on. For now remember the formula for z_t.

#2. Reset gate

Essentially, this gate is used by the model to decide how much of the past information to forget. To calculate it, we use: r_t = σ(W(r) x_t + U(r) h_(t-1)).

This formula is the same as the one for the update gate. The difference comes in the weights and in the gate’s usage, which we will see in a bit.

As before, we plug in h_(t-1) — blue line and x_t — purple line, multiply them with their corresponding weights, sum the results and apply the sigmoid function.

#3. Current memory content

Let’s see how exactly the gates affect the final output. First, we start with the usage of the reset gate. We introduce a new memory content which uses the reset gate to store the relevant information from the past. It is calculated as follows: h'_t = tanh(W x_t + r_t ⊙ U h_(t-1)).

  1. Multiply the input x_t with a weight W and h_(t-1) with a weight U.

  2. Calculate the Hadamard (element-wise) product between the reset gate r_t and Uh_(t-1). That determines what to remove from the previous time steps. Say we have a sentiment analysis problem: determining someone’s opinion of a book from a review they wrote. The text starts with “This is a fantasy book which illustrates…” and, after a couple of paragraphs, ends with “I didn’t quite enjoy the book because I think it captures too many details.” To determine the overall level of satisfaction with the book, we only need the last part of the review. In that case, as the neural network approaches the end of the text, it will learn to assign an r_t vector close to 0, washing out the past and focusing only on the last sentences.

  3. Sum up the results of step 1 and 2.

  4. Apply the nonlinear activation function tanh.

You can clearly see the steps here:

We do an element-wise multiplication of h_(t-1) — blue line and r_t — orange line and then sum the result — pink line with the input x_t — purple line. Finally, tanh is used to produce h’_t — bright green line.

#4. Final memory at current time step

As the last step, the network needs to calculate h_t, the vector which holds information for the current unit and passes it down the network. In order to do that, the update gate is needed. It determines what to collect from the current memory content h'_t and what from the previous steps h_(t-1), yielding h_t = z_t ⊙ h_(t-1) + (1 - z_t) ⊙ h'_t. That is done as follows:

  1. Apply element-wise multiplication to the update gate z_t and h_(t-1).

  2. Apply element-wise multiplication to (1-z_t) and h’_t.

  3. Sum the results from step 1 and 2.

Let’s bring up the example about the book review. This time, the most relevant information is positioned at the beginning of the text. The model can learn to set the vector z_t close to 1 and keep the majority of the previous information. Since z_t will be close to 1 at this time step, 1-z_t will be close to 0, which ignores a big portion of the current content (in this case, the last part of the review, which explains the book plot) that is irrelevant for our prediction.

Here is an illustration which emphasises the above equation:

Following through, you can see how z_t — green line is used to calculate 1-z_t which, combined with h’_t — bright green line, produces a result in the dark red line. z_t is also used with h_(t-1) — blue line in an element-wise multiplication. Finally, h_t — blue line is a result of the summation of the outputs corresponding to the bright and dark red lines.
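
Putting the four equations together, here is a minimal NumPy sketch of a single GRU step. The dimensions are illustrative, and biases are omitted, as in the formulas above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, W_z, U_z, W_r, U_r, W, U):
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)          # update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)          # reset gate
    h_cand = np.tanh(W @ x_t + r_t * (U @ h_prev))   # current memory content h'_t
    return z_t * h_prev + (1.0 - z_t) * h_cand       # final memory h_t

# Toy dimensions: input size 3, hidden size 4
rng = np.random.default_rng(0)
D, H = 3, 4
params = [rng.normal(scale=0.1, size=(H, D if i % 2 == 0 else H)) for i in range(6)]
h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(h, x, *params)
print(h)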

Long Short Term Memory (LSTM)

From ChatGPT

Long Short-Term Memory, or LSTM, is a type of recurrent neural network (RNN) architecture designed to capture long-term dependencies in data. It's particularly useful in tasks where context from earlier inputs is crucial for understanding the current input.

Imagine you're reading a book, and a character is introduced on page 10. In a regular neural network, the information about that character might get lost as you progress through the book. But with LSTM, it's like having a really good bookmark that helps you remember details about the character even as the story unfolds.

LSTM achieves this by maintaining a memory cell that can store and retrieve information over long sequences. The cell has three main components: an input gate, a forget gate, and an output gate.

  1. Input Gate: Decides what information to store in the cell. For example, if you're reading a sentence, the input gate determines which words are essential to remember.

  2. Forget Gate: Decides what information to discard from the cell. Going back to our book example, the forget gate helps filter out less important details as the plot progresses.

  3. Output Gate: Decides what information to output from the cell. It's like the brain deciding what to express or use from its accumulated knowledge.

This way, LSTMs can selectively remember or forget information, allowing them to capture long-range dependencies in data. They've proven quite effective in various applications, such as natural language processing, speech recognition, and even in predicting stock prices.

From Medium

What is LSTM? You might have heard this term in the last interview you gave for a Machine Learning Engineer position, or some of your friends might have mentioned using LSTM in their predictive modelling projects. So the big question that may arise here is what is LSTM, the purpose of using LSTM in your projects, what type of projects can be made using the LSTM algorithm, etc. Do not worry. This article will cover the in-depth architecture of LSTM networks, how an LSTM works, and its application in the real world.

In the coming sections of this article, you will understand:

  • The architecture of RNNs and problems involved with time series forecasting using RNNs

  • Standard LSTM architecture and how to build a network of LSTM cells

What is Time-Series Forecasting?

Time series forecasting is the technique of using historical time series data values to make predictions about future values.

Time series forecasting has many applications, in fields such as medical health (for preventing disease), finance (for predicting future stock prices), weather forecasting (for predicting future weather), etc.

Let our time series data vector be:

T = [t_1, t_2, t_3, …, t_n]

Our task is to predict or forecast the future values [t_{n+1}, t_{n+2}, …] based on the historical data, i.e. our time series data vector.

Note: Time series forecasting is a machine learning technique that analyzes an ordered sequence of values over time to predict future outcomes. As of now, no algorithm achieves human-like performance, and machine-learning forecasts have limitations and drawbacks of their own, but those are beyond the scope of this article; the aim here is to show a technique that can achieve good accuracy for time-series predictions.

Understanding the structure of RNN

Let us assume a sequence of data containing vectors:

x = [x(1), x(2), ….., x(t)] where each element x(i) is a vector.

When we train a simple generic Neural Network on the sequence of data, we generally pass all the information about the sequence of data in one go, i.e.:

σ(w(1)x(1) + w(2)x(2) + …… + w(t)x(t) + b)

[where w(1), …, w(t) are weights, σ is an activation function, and b is a bias value]

but this approach ignores any hidden patterns present in the sequence.

RNN stands for Recurrent Neural Network. RNNs are designed to process any hidden pattern present in the data by considering the sequential nature of the data. An RNN does not feed all the information to the network at once. Unlike a traditional neural network, an RNN has loops in it. It can be viewed as multiple copies of the same network, each passing a message to the next in order. If we unroll the loop, it forms a chain-like structure that allows one element to be passed at a time: process it, feed in the second element in the sequence, and so on.

An RNN takes an input x(t) and outputs a value h(t) at each point in time. Because the loops in the network pass information from one step to another, the RNN can remember past information and use it in the present.

Multiple recurrent units form a chain-like structure.

Long-Term Dependencies problems in using RNN

An RNN usually doesn’t face any problems connecting recent past information to the present task, because of the chain-like structure formed by the loops in the network. Still, the gap between the relevant information in the past and the point in the present where it is needed can become very large. In such cases, it becomes challenging for an RNN to learn to connect the information and find patterns in the data sequence. This is because of the vanishing gradient problem.

What is the Vanishing Gradient Problem?

In backpropagation, each weight of the neural network is updated in proportion to the partial derivative of the error function with respect to that weight, in each iteration of the training process.

But a problem arises when, in some cases, the gradients become so vanishingly small that the weights barely change at all, which may cause the neural network to stop training altogether.

This led to the invention of the so-called LSTM.

Structure of a single LSTM cell

A simple Recurrent Neural Network has a straightforward structure: a chain of repeating modules, each containing just a single neural network layer such as a tanh layer. LSTM also has a chain-like structure with repeating modules, but instead of the single layer in an RNN, each LSTM module has four layers interacting in a particular way, each performing its unique function in the network.

Each repeating module in an LSTM cell has a cell state. The LSTM cell can add or remove information to or from the cell state by using different gates. Gates either allow information into the cell state or stop it from entering. They do this using a sigmoid neural network layer and an element-wise multiplication operation.

A sigmoid layer outputs a number between 0 and 1, determining how much of the information should be let through the gate. An output value close to 0 lets nothing through the gate, whereas a value close to 1 lets the information through.

A standard LSTM cell has three gates that control the amount of information flowing into and out of the cell state and protect the cell state.

Understanding each of the gates of LSTM Cell

Let U = [0, 1] represent the unit interval and ±U = [-1, 1]. Let c be the cell state and h be the hidden state, respectively, and let L be a mathematical function that takes three inputs and produces two outputs.

The outputs of the function L are h(t) and c(t) [the hidden state and cell state at time t], and its inputs are h(t-1), c(t-1), and x(t) [the hidden state and cell state at time t-1, and the feature vector at time t].

Both outputs leave the cell at time t and are then fed back into the cell at time t+1, along with the next element of the input sequence.

Inside the cell, the input vector x(t) and the hidden state are fed to three gates, each of which produces a value in the range U with the help of the sigmoid function, which squashes values to be between 0 and 1.

Our first gate is the Forget Gate Layer, which decides how much of the current cell state we should forget. The sigmoid layer outputs a number between 0 and 1: a value of 0 means forget everything, whereas a value of 1 means forget nothing.

Next is the Input Gate Layer, which controls what new information we add to our cell state. This gate works in two parts: first, a sigmoid layer decides which values we want to update in the cell state, and then a tanh layer creates a vector of new candidate feature values that can be added to the cell state.

Finally, we have the Output Gate Layer, which decides how much of the updated cell state should be given as output. First, a sigmoid layer decides what part of the cell state to output; then the cell state is passed through a tanh layer to squash values between -1 and 1. We multiply the output of the sigmoid layer by the output of the tanh layer to get the final output, which becomes the hidden state passed to the next time step.

The three gates are computed as f(t) = σ(w(f,x) x(t) + w(f,h) h(t-1) + b(f)), i(t) = σ(w(i,x) x(t) + w(i,h) h(t-1) + b(i)), and o(t) = σ(w(o,x) x(t) + w(o,h) h(t-1) + b(o)), where w(f,x), w(i,x), w(o,x), b(f), b(i), b(o) are the weight vectors and biases for the forget, input and output gates, respectively, and w(f,h), w(i,h), w(o,h) are the corresponding weights on the hidden state.

Another function for the new feature values is constructed as a single neuron layer with a tanh activation and is added to the cell state: c'(t) = tanh(w(x) x(t) + w(h) h(t-1) + b), where w(x) and w(h) are additional weights to be learned during training and b is the bias value.

The final cell state and hidden state produced by the function L are then: c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ c'(t) and h(t) = o(t) ⊙ tanh(c(t)).
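
For concreteness, here is a minimal NumPy sketch of one LSTM step implementing the gate equations above; the dimensions and random initial weights are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t, w, b):
    # Gates: forget (f), input (i), output (o), plus candidate values (c)
    f_t = sigmoid(w["f,x"] @ x_t + w["f,h"] @ h_prev + b["f"])
    i_t = sigmoid(w["i,x"] @ x_t + w["i,h"] @ h_prev + b["i"])
    o_t = sigmoid(w["o,x"] @ x_t + w["o,h"] @ h_prev + b["o"])
    c_cand = np.tanh(w["c,x"] @ x_t + w["c,h"] @ h_prev + b["c"])
    c_t = f_t * c_prev + i_t * c_cand       # new cell state
    h_t = o_t * np.tanh(c_t)                # new hidden state
    return h_t, c_t

# Toy dimensions: input size 3, hidden size 4
rng = np.random.default_rng(0)
D, H = 3, 4
w = {f"{g},x": rng.normal(scale=0.1, size=(H, D)) for g in "fioc"}
w.update({f"{g},h": rng.normal(scale=0.1, size=(H, H)) for g in "fioc"})
b = {g: np.zeros(H) for g in "fioc"}

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, c = lstm_step(h, c, x, w, b)
print(h)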

How to train the LSTM model and predict the future?

For time series forecasting, our training dataset will usually comprise single-column dataframe values, i.e. A = [a_1, a_2, a_3, …, a_n]. Suppose the input window length is l = 5, so Input = [x(1), x(2), x(3), x(4), x(5)], and we want the output sequence to be of length one. Since the LSTM model is recurrent in nature, the function L will be applied five times, once per element of the input window.

After feeding in the inputs, the error is calculated via a loss function. It is then backpropagated through the network to update the weights for the remaining iterations with the help of some gradient descent type scheme.

Step-1: Pre-processing

Step-2: Dividing data into train and test

Step-3: How to choose the size of the sliding window

Enough of theory, right?

LSTM Model in Python using TensorFlow and Keras

Now let us see how to code an LSTM Model in Python using TensorFlow and Keras, taking a straightforward example.

Steps:

  • Prepare the data

  • Feature Scaling (Preprocessing of data)

  • Split the dataset for train and test

  • Converting features into NumPy array and reshaping the array into a shape accepted by the LSTM model

  • Build the architecture for the LSTM network

  • Compile and fit the model (Training)

  • Evaluate the performance of the model(Test)

Import all the required python packages and libraries


import math
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

Output: Using TensorFlow backend.

Create a 2-D feature NumPy array with random integers

features = (np.random.randint(10, size=(100, 1)))
print(features.shape)

Output: (100, 1)

Split the dataset into 75/25 for train and test.

training_dataset_length = math.ceil(len(features) * .75)
print(training_dataset_length)

Output: 75

Preprocess the data, i.e. apply feature scaling so that the values lie between 0 and 1; it is good practice to scale data before feeding it into a neural network, for optimal performance.

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(features)

Here we predict the 11th value using the values [1, 2, …, 10]. Here N = 100 and the sliding window size is l = 10, so x_train will contain sliding windows of length l = 10, and y_train will contain, for each window, the next value that we want to predict.

train_data = scaled_data[0:training_dataset_length , : ]
#Splitting the data
x_train=[]
y_train = []
for i in range(10, len(train_data)):
    x_train.append(train_data[i-10:i,0])
    y_train.append(train_data[i,0])

Then convert x_train and y_train into NumPy arrays and reshape them into a 3-D array, the shape accepted by the LSTM model.

x_train, y_train = np.array(x_train), np.array(y_train)
#Reshape the data into 3-D array
x_train = np.reshape(x_train, (x_train.shape[0],x_train.shape[1],1))

Build the architecture

  • Make an object of the sequential model. Then add the LSTM layer with parameters (units: the dimension of the output space, input_shape: the shape of the training set, return_sequences: True or False, which determines whether to return only the final output or the entire output sequence).

  • We add 4 LSTM layers, each followed by a Dropout layer with rate 0.2. {The Dropout layer is a regularization technique used to prevent overfitting, but it may also increase training time in some cases.}

  • The final layer is the output layer, a fully connected Dense layer (units = 1, as we are predicting only one value, i.e. l+1). {The Dense layer performs an operation on its inputs and returns the output. Every neuron in the previous layer is connected to the neurons in the next layer; hence it is called a fully connected Dense layer.}

from keras.layers import Dropout
# Initialising the RNN
model = Sequential()
model.add(LSTM(units = 50, return_sequences = True, input_shape = (x_train.shape[1], 1)))
model.add(Dropout(0.2))
# Adding a second LSTM layer and Dropout layer
model.add(LSTM(units = 50, return_sequences = True))
model.add(Dropout(0.2))
# Adding a third LSTM layer and Dropout layer
model.add(LSTM(units = 50, return_sequences = True))
model.add(Dropout(0.2))
# Adding a fourth LSTM layer and and Dropout layer
model.add(LSTM(units = 50))
model.add(Dropout(0.2))
# Adding the output layer
# For Full connection layer we use dense
# As the output is 1D so we use unit=1
model.add(Dense(units = 1))

Compile the model using the 'adam' optimizer (an adaptive learning rate optimization algorithm used while training DNN models). The error is calculated with the 'mean squared error' loss function (as this is a regression problem, we use the mean squared error loss).

Then fit the model for 30 epochs (an epoch is one full pass of the data through the neural network) with a batch size of 50 (we pass the data in batches, segmenting it into smaller parts so the network can process it part by part).

model.compile(optimizer = 'adam', loss = 'mean_squared_error')
model.fit(x_train, y_train, epochs = 30, batch_size = 50)

Create test data similar to the train data, convert it to a NumPy array and reshape the array into a 3-D shape.

#Test data set
test_data = scaled_data[training_dataset_length - 10: , : ]
#splitting the x_test and y_test data sets
x_test = []
y_test = features[training_dataset_length : , : ]
for i in range(10,len(test_data)):
    x_test.append(test_data[i-10:i,0])
#Convert x_test to a numpy array
x_test = np.array(x_test)
#Reshape the data into 3-D array
x_test = np.reshape(x_test, (x_test.shape[0],x_test.shape[1],1))

Make the predictions and calculate the RMSE score (the smaller the RMSE score, the better the model has performed).

#check predicted values
predictions = model.predict(x_test)
#Undo scaling
predictions = scaler.inverse_transform(predictions)
#Calculate RMSE score
rmse=np.sqrt(np.mean(((predictions- y_test)**2)))
rmse

Output: 2.85533785512203

Conclusion

So in today’s article, you have learned many things. Let us go through each of them quickly one last time:

  • We now know what time-series forecasting is and how to deal with time-series data.

  • We now understand the structure of recurrent neural networks, how they differ from generic neural networks, and the long-term dependency problem in RNNs.

  • We avoid plain RNNs for time-series forecasting because of the vanishing gradient problem in RNNs.

  • Understanding the LSTM structure: Structure of a single LSTM cell.

  • How each of the gates of the LSTM works and how to train the LSTM model.

  • Implementing all of the above using TensorFlow and Keras in Python.

Bidirectional RNN (BRNN)

From ChatGPT

Let's look at a simple architecture of a Bidirectional Long Short-Term Memory (BiLSTM) network, which is a specific type of Bidirectional Recurrent Neural Network (BiRNN). BiLSTM is often used for sequential data tasks due to its ability to capture long-term dependencies.

Input Sequence: "The cat is on the mat."

                     Forward LSTM
                    /              \
Input -> LSTM Cell -> LSTM Cell -> ... -> LSTM Cell
                    \              /
                     Backward LSTM

                    Concatenation
                           |
                      Combined Output
                           |
                      Output Layer
                           |
                       Predictions

Here's a breakdown of the components:

  1. Input Sequence:

    • Each word in the sequence is represented as an input vector.

  2. Forward LSTM:

    • The input sequence is processed in the forward direction by a series of LSTM cells. Each LSTM cell maintains a hidden state, capturing information from the past.

  3. Backward LSTM:

    • Simultaneously, the same input sequence is processed in the backward direction by another set of LSTM cells. These cells capture information from the future.

  4. Concatenation:

    • The outputs of the forward and backward LSTMs at each time step are concatenated. This creates a combined representation that incorporates information from both directions.

  5. Output Layer:

    • The concatenated outputs are passed to an output layer, which can be a dense layer for classification tasks or a regression layer for numerical predictions.

  6. Predictions:

    • The final layer produces predictions based on the combined information from both the forward and backward passes.

This architecture allows the network to capture dependencies from both the past and future, enhancing its ability to understand and process sequential data effectively.

  1. Forward Pass:

    • The input sequence is processed in a forward direction, just like in a regular RNN. Each time step receives input from the previous time step and produces an output.

    • At each time step, the network maintains hidden states that capture the information learned from the past.

  2. Backward Pass:

    • Simultaneously, the input sequence is processed in a backward direction. The network now receives input from the future time steps, allowing it to capture information from the "future" context.

    • Similar to the forward pass, hidden states are updated, but this time they capture information from the future.

  3. Combining Information:

    • The key innovation in a BiRNN is the combination of information from both the forward and backward passes. At each time step, the outputs from both passes are concatenated or combined in some way.

    • This combined information is then used to make predictions or can be passed to subsequent layers in the neural network.

  4. Training:

    • During training, the network learns to adjust its parameters (weights and biases) through backpropagation. The gradients from both the forward and backward passes are combined to update the model.

Let's take a simplified example:

  • Input Sequence: "The cat is on the mat."

  • Forward Pass: "The" -> "cat" -> "is" -> "on" -> "the" -> "mat."

  • Backward Pass: "mat." -> "the" -> "on" -> "is" -> "cat" -> "The"

  • Combined Information: At each time step, information from both directions is combined, providing a holistic understanding of the sequence.

This bidirectional processing allows the network to capture dependencies from both past and future contexts, enabling better performance in tasks where a comprehensive understanding of the input sequence is essential.
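
In code, frameworks wrap this pattern into a single layer. Here is a minimal Keras sketch of a BiLSTM classifier in the style of the earlier examples; the vocabulary size, embedding dimension, sequence length, and binary output are illustrative assumptions.

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dense

vocab_size, embedding_dim, max_len = 10000, 64, 50

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_len))
# Forward and backward LSTMs; their outputs are concatenated by default
model.add(Bidirectional(LSTM(units=32)))
# Output layer producing the predictions
model.add(Dense(units=1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])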

Natural Language Processing & Word Embeddings

Properties of Word Embeddings

Word embeddings are vector representations of words that capture semantic relationships between words based on their context in a given dataset. Here are some key properties of word embeddings:

  1. Semantic Similarity:

    • Words with similar meanings have similar vector representations. This allows word embeddings to capture semantic relationships between words. For example, in a well-trained embedding model, the vectors for "king" and "queen" should be closer to each other than to words with different meanings.

  2. Vector Space Structure:

    • Word embeddings exist in a continuous vector space where distances and directions have meaning. Operations like vector addition and subtraction can be meaningful; for instance, the vector for "king" - "man" + "woman" might be close to the vector for "queen."

  3. Contextual Information:

    • Word embeddings are trained by considering the context in which words appear. This means that the meaning of a word is influenced by the words that surround it. This helps capture polysemy (multiple meanings) and context-dependent meanings.

  4. Dimensionality:

    • Word embeddings are typically represented as high-dimensional vectors (e.g., 50, 100, 300 dimensions). The choice of dimensionality depends on the specific application and the size of the training dataset.

  5. Word Algebra:

    • Word embeddings often exhibit interesting algebraic properties. For example, the vector representation of "king" minus "man" plus "woman" might be close to the vector representation of "queen." This allows for semantic relationships to be expressed as linear relationships within the vector space.

  6. Compositionality:

    • The vector representation of a phrase or sentence can be composed by combining the embeddings of individual words. This property allows word embeddings to be used for tasks beyond single-word similarity, such as sentence similarity and sentiment analysis.

  7. Transferability:

    • Pre-trained word embeddings can be transferred to downstream tasks with limited labeled data. This transferability is particularly useful in scenarios where training a model from scratch is not feasible due to data constraints.

  8. Training Methods:

    • Word embeddings can be trained using various methods, such as Word2Vec, GloVe, and more recently, transformer-based models like BERT and GPT. The training method can influence the properties of the resulting word embeddings.

  9. Polysemy Handling:

    • Word embeddings often exhibit the ability to handle polysemy by assigning different vector representations to words based on their context. For example, the word "bank" in the context of finance would have a different vector than the word "bank" in the context of a river.

  10. Word Analogies:

    • Word embeddings are known for their ability to solve word analogy problems. For instance, the relationship of "man" to "woman" is analogous to the relationship of "king" to "queen."

Understanding these properties is crucial for effectively using word embeddings in natural language processing tasks.
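
As a toy demonstration of the vector-arithmetic property above (properties 2 and 5), the sketch below uses hand-built 3-dimensional "embeddings" chosen so the analogy works; real embeddings are learned and far higher-dimensional.

import numpy as np

emb = {
    "king":  np.array([1.0, 0.0, 1.0]),   # male + royal
    "queen": np.array([0.0, 1.0, 1.0]),   # female + royal
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 1.0, 0.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w != "king"), key=lambda w: cosine(emb[w], target))
print(best)   # -> queen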

Embedding Matrix

In natural language processing (NLP), an embedding matrix is a crucial component used to convert words or tokens into dense vector representations, often referred to as word embeddings. This matrix is essentially a lookup table that maps each word/token in a given vocabulary to its corresponding embedding vector. The embedding vectors are trained in a way that captures semantic relationships between words based on their contextual usage.

Here's a step-by-step explanation of how an embedding matrix works in NLP:

1. Vocabulary and Indexing:

  • Vocabulary: Before training an embedding matrix, you need to create a vocabulary, which is a set of unique words or tokens present in your corpus.

  • Indexing: Each word in the vocabulary is assigned a unique index. For example, if your vocabulary is ["apple", "orange", "banana"], you might assign indices like {"apple": 0, "orange": 1, "banana": 2}.

2. Embedding Matrix Initialization:

  • Once you have your vocabulary and indices, you initialize an embedding matrix. The rows of this matrix correspond to the words in your vocabulary, and the columns correspond to the dimensions of the embedding space.

  • For example, if you choose a 50-dimensional embedding space, and your vocabulary has 1000 words, your embedding matrix might have dimensions 1000x50.

3. Training/Updating Embeddings:

  • The values in the embedding matrix are initially random. During training, these values are updated to minimize a certain objective function, such as predicting nearby words in a context window (as in Word2Vec) or optimizing a language modeling objective.

  • The training process adjusts the embedding vectors such that words with similar contexts have similar vectors.

4. Embedding Lookup:

  • Once the embedding matrix is trained, you can use it to convert words or tokens in your text into dense vectors. This is done by looking up the row corresponding to the index of each word in the embedding matrix.

  • For example, if the word "orange" has an index of 1, you look up the 2nd row in the embedding matrix to get the corresponding 50-dimensional vector for "orange."

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding

# Example vocabulary and indices
vocab = {"apple": 0, "orange": 1, "banana": 2}

# Example embedding matrix dimensions
vocab_size = len(vocab)
embedding_dim = 50

# Initializing the embedding matrix with random values
embedding_matrix = np.random.rand(vocab_size, embedding_dim)

# Example word index lookup
word_index = vocab["orange"]
embedding_vector = embedding_matrix[word_index]

# Using a Keras Embedding layer, initialised from our matrix so that
# the embedding matrix is trainable during model training
embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=True,
)
Word2Vec

The Magic of Word Embeddings

Imagine if words had a secret code that revealed their meanings, making it easier for machines to understand them. That’s precisely what Word2Vec does. It transforms words into numerical representations called “word embeddings” by deciphering the hidden patterns in language. These word embeddings capture the essence of words, unlocking their meaning and context.

How It Works: The Word2Vec Journey

Word2Vec embarks on a captivating journey to make language more accessible to machines. Here’s how it works:

Data Collection: It all begins with collecting a vast amount of text data. This can be anything from books and articles to social media posts.

Data Preprocessing: Before diving in, the text data undergoes a meticulous makeover. It’s split into individual words, converted to lowercase, and any distracting punctuation is stripped away.

Vocabulary Building: A dictionary of sorts is created, containing every unique word from the preprocessed text. This becomes the palette from which Word2Vec will draw its linguistic artistry.

Data Transformation: The text is then transformed into pairs of words. Imagine each word as a star in the night sky, and these pairs as connections between those stars. For instance, in the sentence “I love ice cream,” Word2Vec would create pairs like (“love”, “I”), (“love”, “ice”), and (“love”, “cream”).

Neural Network Training: Word2Vec employs a neural network, much like the human brain, to understand these word connections. Two main approaches are used: Continuous Bag of Words (CBOW) and Skip-Gram. In CBOW, the model predicts a word based on its context, while Skip-Gram predicts the context from a word.

Vector Space Creation: This is where the magic happens. The neural network learns to map words into a high-dimensional vector space. Words that share similar contexts end up closer to each other in this space. It’s as if Word2Vec gives words a new home in this abstract world, where their neighbors reflect their meaning.

Word Embeddings: The result is a set of word embeddings, each like a key that unlocks the word’s meaning. These embeddings encode relationships and context, and they’re what make Word2Vec a game-changer in NLP.
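
The data-transformation step described above can be sketched in a few lines; the window size and sentence are illustrative assumptions.

def skipgram_pairs(tokens, window=2):
    # Generate (center, context) pairs with a sliding window
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("i love ice cream".split()))
# [('i', 'love'), ('i', 'ice'), ('love', 'i'), ('love', 'ice'), ('love', 'cream'), ...]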

Real-World Enchantments with Word2Vec

The applications of Word2Vec are nothing short of enchanting:

  • Document Retrieval: It helps find documents with similar content, aiding in search engines and recommendation systems.

  • Sentiment Analysis: Uncovering the emotions behind text, which is invaluable for understanding user feedback and social media sentiment.

  • Named Entity Recognition: Recognizing entities like names, places, and organizations in text, which is crucial for information extraction.

  • Machine Translation: Improving the accuracy of translating words based on their context.

  • Information Retrieval: Enhancing search engines to deliver more relevant results.

Unlocking the Secrets of Language

Word2Vec has cracked the code, allowing machines to delve into the secrets of language. With word embeddings, we’ve bridged the gap between human understanding and machine learning. As NLP continues to evolve, Word2Vec remains a powerful tool, opening doors to a world where language is not just data but a rich tapestry of meaning.

The next time you see a machine understanding the subtleties of your words, you’ll know that Word2Vec is the wizard behind the curtain, making magic happen in the world of language and technology.

Negative Sampling

Negative sampling is a technique used in word embedding algorithms, such as Word2Vec, to improve training efficiency and scalability. In the original Word2Vec training objective, the model is trained to distinguish the target word from all other words in the vocabulary, which can be computationally expensive. Negative sampling addresses this issue by sampling a small number of negative (non-context) words for each training instance, making the training process more efficient.

Here's an explanation of negative sampling with examples:

Original Training Objective (Softmax):

In the original Word2Vec training objective, a softmax function is used to calculate the probability distribution over the entire vocabulary for each training instance. Given a target word and its context, the softmax function is applied to all words in the vocabulary:

P(w_i \mid w_{\text{context}}) = \frac{e^{v_{w_i} \cdot v_{w_{\text{context}}}}}{\sum_{j=1}^{V} e^{v_{w_j} \cdot v_{w_{\text{context}}}}}

Where:

  • w_i is a word in the vocabulary.

  • v_{w_i} is the vector representation (embedding) of word w_i.

  • v_{w_{\text{context}}} is the vector representation of the context word.

  • V is the size of the vocabulary.

The objective is to maximize the likelihood of the correct word given its context.

Negative Sampling:

In negative sampling, the training objective is modified to make the optimization task more computationally efficient. Instead of considering the entire vocabulary in the denominator of the softmax function, negative samples (words not in the context) are randomly selected.

The objective is to maximize the probability of the target word and minimize the probability of the negative samples.

P(w_i \mid w_{\text{context}}) = \frac{e^{v_{w_i} \cdot v_{w_{\text{context}}}}}{1 + \sum_{j=1}^{k} e^{v_{w_j'} \cdot v_{w_{\text{context}}}}}

Where:

  • w_j' is a randomly sampled negative word.

  • k is the number of negative samples.
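As a quick sanity check of the formula above, here is a small NumPy sketch that evaluates the denominator over only k sampled words instead of the full vocabulary (the embedding table and word indices are random toy assumptions):

```python
# Toy evaluation of the negative-sampling objective above (NumPy; all values random).
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 10000, 50, 5                       # vocabulary size, embedding dim, negatives
E = rng.normal(scale=0.1, size=(V, d))       # toy embedding table

target, context = 42, 7                      # arbitrary word indices
negatives = rng.integers(0, V, size=k)       # k randomly sampled negative words

numerator = np.exp(E[target] @ E[context])
denominator = 1 + np.sum(np.exp(E[negatives] @ E[context]))  # sums k terms, not V
print(numerator / denominator)
```

The point of the sketch: each training instance touches k + 1 dot products instead of V, which is where the efficiency gain comes from.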

From Bard

Negative sampling is a technique used in natural language processing (NLP) to train word vectors efficiently. It is a method of approximating the probability of a word being in the context of another word, without having to calculate the probability of all possible word pairs.

Negative sampling is typically used in conjunction with the skip-gram model, which is a model that learns to predict the context of a given word. The skip-gram model is trained on a corpus of text, and it learns to represent each word as a vector of numbers. These vectors are called word vectors, and they capture the semantic and syntactic relationships between words.

To train the skip-gram model using negative sampling, we first select a positive sample from the corpus. A positive sample is a word pair in which the context word actually appears near the target word in the text. For example, if the target word is "orange" and the context word is "juice", then ("orange", "juice") is a positive sample.

Next, we randomly select k negative samples. Negative samples are word pairs in which the second word is drawn at random from the vocabulary and is therefore unlikely to appear in the target word's context. For example, if the target word is "orange" and the randomly drawn word is "apple", then ("orange", "apple") is a negative sample.

We then train the skip-gram model to predict whether a given word pair is a positive sample or a negative sample. This is done by updating the weights of the word vectors so that the model is more likely to predict a positive outcome for positive samples and a negative outcome for negative samples.

Negative sampling is a very efficient way to train word vectors, because it only requires us to calculate the probability of a small number of word pairs for each training iteration. This makes it possible to train word vectors on large corpora of text, which would be impractical using other methods.

Example

Here is an example of how negative sampling can be used to train word vectors:

Target word: "orange"
Context word: "juice"
Positive sample: ("orange", "juice")
Negative samples:
    ("orange", "apple")
    ("orange", "dinner")
    ("orange", "dog")
    ("orange", "chair")
    ("orange", "house")

We would train the skip-gram model to predict whether each of these word pairs is a positive or negative sample. The model is initially poor at this task, but training updates the word-vector weights so that the model becomes more likely to predict a positive outcome for positive samples and a negative outcome for negative samples.

By the end of the training process, the skip-gram model will have learned to represent "orange" and "juice" as vectors that are close to each other. This is because the two words often appear together in the same contexts, and that co-occurrence is exactly the signal the embeddings encode.

Negative sampling is a powerful technique for training word vectors efficiently and effectively. It is a core component of popular embedding models such as word2vec and fastText.
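To make the example concrete, here is a minimal NumPy sketch of a single skip-gram negative-sampling update for the ("orange", "juice") pair, using the sigmoid-based formulation common in word2vec implementations (the vocabulary, dimensions, and learning rate are toy assumptions):

```python
# One toy skip-gram negative-sampling update (NumPy sketch; sigmoid-based loss).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["orange", "juice", "apple", "dinner", "dog", "chair", "house"]
idx = {w: i for i, w in enumerate(vocab)}
dim, lr = 8, 0.1
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))   # target-word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))  # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target = idx["orange"]
pairs = [(idx["juice"], 1.0)]                          # positive sample, label 1
pairs += [(idx[w], 0.0)                                # k = 5 negatives, label 0
          for w in ("apple", "dinner", "dog", "chair", "house")]

v_t = W_in[target].copy()
grad_v = np.zeros(dim)
for j, label in pairs:
    p = sigmoid(v_t @ W_out[j])        # predicted probability of "positive"
    g = p - label                      # gradient of the logistic loss w.r.t. the score
    grad_v += g * W_out[j]             # accumulate gradient for the target vector
    W_out[j] -= lr * g * v_t           # update the context/negative vector
W_in[target] -= lr * grad_v            # single update for the target vector

print("P(positive | orange, juice):", sigmoid(W_in[target] @ W_out[idx["juice"]]))
```

Repeating such updates over a corpus pulls "orange" and "juice" together while pushing the sampled negatives away.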

Sequence Models & Attention Mechanism

BLEU Score

BLEU (Bilingual Evaluation Understudy) is a metric used for evaluating the quality of machine-generated text, especially in the context of machine translation. It was proposed as a way to measure the effectiveness of automated language translation systems.

Here's a brief explanation of BLEU, along with examples:

BLEU Score Calculation:

  1. N-gram Precision:

    • BLEU measures precision by comparing the n-grams (contiguous sequences of n items, usually words) in the machine-generated translation to those in a reference (human-generated) translation.

  2. Brevity Penalty:

    • BLEU penalizes short translations, as shorter translations tend to get higher precision scores. A brevity penalty is applied to the score to account for this.

  3. Cumulative BLEU Score:

    • BLEU is often computed for multiple values of n (usually up to 4), and a cumulative BLEU score is calculated as the geometric mean of the individual n-gram precisions.

Example:

Consider a machine-generated translation and two reference translations:

  • Machine-generated translation (Hypothesis): "The cat in the hat."

  • Reference 1: "The cat is wearing a hat."

  • Reference 2: "The cat is in the hat."

Now, let's calculate a cumulative BLEU score up to bigrams (2-grams):

  1. Counting N-grams:

    • Hypothesis bigrams: [("The", "cat"), ("cat", "in"), ("in", "the"), ("the", "hat")]

    • Reference 1 bigrams: [("The", "cat"), ("cat", "is"), ("is", "wearing"), ("wearing", "a"), ("a", "hat")]

    • Reference 2 bigrams: [("The", "cat"), ("cat", "is"), ("is", "in"), ("in", "the"), ("the", "hat")]

  2. Calculating Precision:

    • Precision_n = (clipped count of hypothesis n-grams that also appear in a reference) / (total number of n-grams in the Hypothesis)

    • Unigram precision: all 5 hypothesis words appear in the references, so Precision_1 = 5 / 5 = 1.0

    • Bigram precision: ("The", "cat"), ("in", "the"), and ("the", "hat") appear in the references, while ("cat", "in") does not, so Precision_2 = 3 / 4 = 0.75

  3. Brevity Penalty:

    • Brevity Penalty = 1 if c > r, otherwise exp(1 - r/c), where c is the number of words in the Hypothesis and r is the number of words in the reference closest in length

    • Here c = 5 and r = 6, so Brevity Penalty = exp(1 - 6/5) ≈ 0.819

  4. Cumulative BLEU Score (Bigrams):

    • BLEU = Brevity Penalty * exp((1/2) * (log(Precision_1) + log(Precision_2)))

    • BLEU = 0.819 * exp((1/2) * (log(1.0) + log(0.75)))

    • BLEU ≈ 0.819 * 0.866 ≈ 0.71

So, the BLEU score for this example, considering up to bigrams, is approximately 0.71. Higher BLEU scores (closer to 1) indicate better translation quality.
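The arithmetic above can be verified with a short script. This is a minimal sketch of clipped n-gram precision plus the brevity penalty, not a full BLEU implementation (production tools such as sacrebleu also handle smoothing and tokenization):

```python
# Minimal BLEU-2 sketch reproducing the worked example (no smoothing).
from collections import Counter
import math

def clipped_precision(hyp, refs, n):
    hyp_ngrams = Counter(zip(*[hyp[i:] for i in range(n)]))
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(zip(*[ref[i:] for i in range(n)])).items():
            max_ref[g] = max(max_ref[g], c)          # clip counts per n-gram
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
    return clipped / max(sum(hyp_ngrams.values()), 1)

hyp = "The cat in the hat".split()
refs = ["The cat is wearing a hat".split(), "The cat is in the hat".split()]

p1 = clipped_precision(hyp, refs, 1)                         # 1.0
p2 = clipped_precision(hyp, refs, 2)                         # 0.75
c = len(hyp)
r = min((len(ref) for ref in refs), key=lambda L: abs(L - c))
bp = 1.0 if c > r else math.exp(1 - r / c)                   # ~0.819
bleu2 = bp * math.exp(0.5 * (math.log(p1) + math.log(p2)))
print(round(bleu2, 3))                                       # ~0.709
```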

Attention Model

The attention mechanism is a key component in many state-of-the-art natural language processing (NLP) and computer vision models. It enables models to focus on different parts of input sequences with varying degrees of attention, allowing them to weigh the importance of different elements when making predictions.

In the context of NLP, let's consider a simple example of a machine translation task where we want to translate a sentence from English to French. Traditional sequence-to-sequence models without attention mechanisms process the entire input sequence and produce a fixed-size context vector, which is then used to generate the entire output sequence. However, this approach can be limiting, especially for long sentences or when specific words need more emphasis.

With attention mechanisms, the model dynamically selects which parts of the input sequence are more relevant at each step of generating the output sequence. Here's a simplified step-by-step explanation:

  1. Encoding:

    • The input sequence (e.g., English sentence) is initially processed by an encoder, typically a recurrent neural network (RNN) or a transformer.

    • Each word in the input sequence is associated with a hidden state.

  2. Calculating Attention Weights:

    • For each word in the output sequence (e.g., French translation), the model calculates attention weights.

    • These weights indicate how much focus or attention should be given to each word in the input sequence when predicting the current word in the output sequence.

    • The calculation often involves comparing the current decoder hidden state with each encoder hidden state.

  3. Weighted Sum:

    • The attention weights are used to compute a weighted sum of the encoder hidden states.

    • This weighted sum represents the context vector, which is specific to the current word being generated in the output sequence.

  4. Decoding:

    • The context vector is combined with the decoder's input (the previously generated word) to predict the next word in the output sequence.

    • The process is repeated iteratively for each word in the output sequence.

Here's a simplified mathematical representation:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

  • Q: Query vector (typically the decoder hidden state)

  • K: Key vectors (typically the encoder hidden states)

  • V: Value vectors (typically the encoder hidden states)

  • softmax: Function that normalizes the attention scores into weights

  • d_k: Dimensionality of the key vectors

This mechanism allows the model to attend to different parts of the input sequence when generating each word in the output sequence, improving the overall performance, especially for long and complex sentences.

Attention mechanisms are not limited to machine translation and are widely used in various tasks, such as text summarization, image captioning, and speech recognition.

Attention Mechanism:

Given a sequence of input vectors X = \{x_1, x_2, \ldots, x_n\} and a query vector q, the attention mechanism assigns a weight to each element in the sequence based on its relevance to the query.

1. Compute Attention Scores:

Calculate the attention scores for each element in the sequence:

e_i = \text{score}(q, x_i)

The score function can be implemented in various ways, and a common one is the dot product:

e_i = q \cdot x_i

2. Calculate Attention Weights:

Normalize the attention scores using the softmax function to get attention weights:

\alpha_i = \frac{e^{e_i}}{\sum_{j=1}^{n} e^{e_j}}

3. Compute the Weighted Sum:

Compute the weighted sum of the input sequence vectors using the attention weights:

\text{Attention}(q, X) = \sum_{i=1}^{n} \alpha_i \cdot x_i
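A minimal NumPy sketch of these three steps (random vectors stand in for a real query and input sequence):

```python
# Dot-product attention: scores -> softmax weights -> weighted sum (NumPy sketch).
import numpy as np

def softmax(e):
    e = e - e.max()                  # subtract max for numerical stability
    return np.exp(e) / np.exp(e).sum()

def attention(q, X):
    e = X @ q                        # 1. scores e_i = q . x_i
    alpha = softmax(e)               # 2. attention weights (sum to 1)
    return alpha @ X, alpha          # 3. weighted sum of the x_i

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))          # n = 5 input vectors of dimension 4
q = rng.normal(size=4)               # query vector
context, alpha = attention(q, X)
print(alpha.round(3), alpha.sum())   # weights, summing to 1
print(context.shape)                 # (4,)
```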

Transformer Network

Self-Attention

Self-attention is the core mechanism of the Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. The Transformer has since become a foundational architecture for natural language processing tasks due to its ability to capture long-range dependencies and parallelize computation efficiently.

In a self-attention model, the input sequence is represented as a set of vectors, often referred to as embeddings. The key idea is to calculate attention scores between each pair of elements in the sequence, allowing the model to focus more on relevant parts of the input when making predictions.

Here's a high-level overview of the self-attention mechanism:

  1. Input Representation:

    • Let X = \{x_1, x_2, \ldots, x_n\} be the input sequence of vectors (embeddings).

  2. Self-Attention Calculation:

    • For each position i, calculate the attention score a_{ij} between the element at position i and every element in the sequence. This is done using the following equation:

      a_{ij} = \frac{e^{s_{ij}}}{\sum_{k=1}^{n} e^{s_{ik}}}

      where s_{ij} is the scaled dot-product score:

    s_{ij} = \frac{Q_i \cdot K_j^T}{\sqrt{d_k}}

    Here, Q_i, K_j, and V_j are the query, key, and value vectors, respectively. They are linear projections of the input vectors x_i using learned weight matrices W_Q, W_K, and W_V.

  3. Weighted Sum:

    • Use the attention scores to compute a weighted sum of the values:

      \text{Attention}(Q, K, V) = \sum_{j=1}^{n} a_{ij} \cdot V_j

    • This weighted sum is the attended representation for the element at position i.

  4. Multi-Head Attention:

    • The self-attention mechanism is often extended with multiple heads, each with its own set of learned weight matrices. The outputs from each head are concatenated and linearly transformed to produce the final output.

The self-attention mechanism allows the model to weigh the importance of different elements in the input sequence for each element in the output sequence, capturing complex dependencies. The use of multiple heads enables the model to attend to different aspects of the relationships within the input sequence.

It's important to note that the self-attention mechanism is a key component of the Transformer architecture, which also includes positional encoding and feedforward layers. The complete Transformer model is widely used in various natural language processing tasks, such as machine translation, text summarization, and language understanding.

[Figure: scaled dot-product attention pipeline — Step 1: form Q, K, V; Step 2: MatMul (QKᵀ); Steps 3-4: Scale + Softmax; Step 5: MatMul with V; Output]
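Putting the steps together, here is a single-head self-attention sketch in NumPy (the projection matrices are random stand-ins for learned weights):

```python
# Single-head scaled dot-product self-attention (NumPy sketch; weights are random).
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # Step 1: project inputs to Q, K, V
    S = Q @ K.T / np.sqrt(K.shape[-1])       # Steps 2-3: MatMul, then scale
    A = softmax(S, axis=-1)                  # Step 4: softmax over each row
    return A @ V                             # Step 5: MatMul with the values

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 16, 8
X = rng.normal(size=(n, d_model))            # n token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (6, 8): one attended vector per token
```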

Differences Between Self-Attention, RNN, and LSTM

What do we gain by introducing self-attention? Put differently, what patterns does self-attention learn, and what features does it extract? Two attention-visualization figures (not reproduced here) from the original post illustrate this:

As those figures show, self-attention can capture syntactic features between words within the same sentence (for example, a phrase structure spanning some distance, shown in the first figure) as well as semantic features (for example, the second figure shows that "its" refers to "Law").

With that background, we can now compare self-attention with RNNs and LSTMs:

  • RNN/LSTM: An RNN or LSTM must compute step by step over the sequence. For features that depend on each other across a long distance, information has to accumulate over many time steps before the two ends can be linked, and the greater the distance, the less likely the dependency is captured effectively.

  • Self-Attention:

    • As the figures make clear, self-attention captures long-distance dependencies within a sentence far more easily, because it connects any two words in the sentence directly in a single computation step. The effective distance between long-range dependent features is drastically shortened, which makes those features easier to exploit.

    • In addition, self-attention computes the attention values for each word in a sentence independently, so it directly benefits parallel computation; an RNN, which must process the sequence in order, cannot be parallelized in this way.

From the computation steps and figures above, we can see that no matter how long the sequence is, self-attention can capture dependencies between any two positions in the context, which lets it extract both syntactic and semantic features well; moreover, the computation for each word in a sentence can be processed in parallel.

In theory, self-attention solves the long-sequence dependency problem of RNN models (one rule of thumb from the original post: Transformers work best at around 50 words). However, because the computation grows quadratically as text length increases, it is not necessarily better than traditional recurrent models such as LSTMs (reportedly best at around 200 words) on long-text tasks.

The points above explain why self-attention has gradually replaced RNNs and LSTMs in widespread use.

Multi-Head Attention

Multi-head attention is an extension of the basic attention mechanism in neural networks, particularly prominent in the Transformer model. It involves using multiple attention heads to capture different aspects of relationships within the input sequence. Each attention head operates independently, and their outputs are typically concatenated and linearly transformed.

Multi-Head Attention:

Given an input sequence X = \{x_1, x_2, \ldots, x_n\} and a set of queries, keys, and values (Q, K, V), multi-head attention is computed as follows:

1. Linear Projection:

For each attention head h, linearly project the queries Q, keys K, and values V using learned weight matrices W_{Q_h}, W_{K_h}, and W_{V_h}:

q_h = Q \cdot W_{Q_h}, \quad k_h = K \cdot W_{K_h}, \quad v_h = V \cdot W_{V_h}

2. Apply Scaled Dot-Product Attention:

For each attention head h, apply the scaled dot-product attention mechanism:

\text{Attention}(q_h, k_h, v_h) = \text{softmax}\left(\frac{q_h \cdot k_h^T}{\sqrt{d_k}}\right) \cdot v_h

Here, d_k is the dimension of the key vectors.

3. Concatenate and Linear Transformation:

Concatenate the outputs from all attention heads and apply a linear transformation with a learned weight matrix W_O:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h) \cdot W_O
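The three steps can be composed into a short NumPy sketch (head count, dimensions, and the random weight matrices are illustrative assumptions):

```python
# Multi-head attention: per-head projections, attention, concat, output projection.
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1) @ v

def multi_head_attention(Q, K, V, heads, W_O):
    # heads: one (W_Qh, W_Kh, W_Vh) projection triple per attention head
    outs = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
            for wq, wk, wv in heads]
    return np.concatenate(outs, axis=-1) @ W_O  # concat heads, project with W_O

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_k = d_model // h
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, X, X, heads, W_O).shape)  # (6, 16)
```

For self-attention, the queries, keys, and values all come from the same input X, as in the call above.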

Position Embedding

Position embedding is a technique used in natural language processing and computer vision tasks to incorporate the sequential or spatial information of tokens or elements in a sequence. In the context of transformers, position embedding helps the model distinguish the order or position of tokens within a sequence. The most common approach is to add position embeddings to the input embeddings before feeding them into the transformer model.

Let's consider the 1D case for simplicity, where we have a sequence of tokens. The position embedding for each position i in the sequence can be represented as a vector PE_i. The final input embedding for a token at position i is obtained by adding the token's embedding X_i to its corresponding position embedding:

\text{Input Embedding}_i = X_i + PE_i

One common way to compute the position embeddings is with trigonometric functions, namely sine and cosine. The position embedding for each position i and dimension index d can be computed as follows:

\text{PE}_{i, 2d} = \sin\left(\frac{i}{10000^{2d/d_{\text{model}}}}\right)

\text{PE}_{i, 2d+1} = \cos\left(\frac{i}{10000^{2d/d_{\text{model}}}}\right)

Here, d_{\text{model}} is the dimensionality of the model, and 2d and 2d+1 are the even and odd indices of the position embedding vector. The division by 10000^{2d/d_{\text{model}}} is a scaling factor that makes the sine and cosine values vary smoothly across positions.
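A short NumPy sketch of the sinusoidal formulas above (the sequence length and model dimension are arbitrary choices):

```python
# Sinusoidal position embeddings: sin on even indices, cos on odd indices.
import numpy as np

def position_embeddings(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]               # positions i, shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2d
    angles = pos / np.power(10000, dims / d_model)  # i / 10000^(2d / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even indices get sine
    pe[:, 1::2] = np.cos(angles)                    # odd indices get cosine
    return pe

token_embeddings = np.random.default_rng(0).normal(size=(10, 64))   # toy X_i
inputs = token_embeddings + position_embeddings(10, 64)             # X_i + PE_i
print(inputs.shape)  # (10, 64)
```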

Transformer
