Study Card: Limitations of RNNs Addressed by Transformers

Direct Answer

Transformers address two key limitations of Recurrent Neural Networks (RNNs): poor handling of long sequences and a lack of parallelism. RNNs process input one time step at a time, so computation cannot be parallelized across the sequence, and backpropagating through many steps causes gradients to vanish, which makes long-range dependencies hard to learn. Transformers instead use self-attention to process the entire sequence at once, enabling parallel computation and letting any position attend directly to any other, no matter how far apart; this also significantly speeds up training. In short, the key improvements are parallelization, better handling of long-range dependencies, and mitigation of vanishing gradients.
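
As a rough illustration of why distant information fades in an RNN, the snippet below (illustrative only; the per-step factor of 0.9 is a made-up assumption, not a measured value) shows how a gradient signal shrinks geometrically with sequence length, whereas self-attention links any two positions through a single step:

# Illustrative sketch: assume each RNN time step scales the backpropagated gradient by ~0.9 (made-up factor).
per_step_factor = 0.9
steps = 100
print(per_step_factor ** steps)  # ~2.7e-05: the signal from the first token has effectively vanished
# Self-attention connects any two positions through one weighted sum, so the gradient path
# between distant tokens has constant length instead of growing with their distance.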

Key Terms

Self-attention: the mechanism that lets each position in a sequence attend directly to every other position.
Vanishing gradients: the tendency of gradients to shrink toward zero when backpropagated through many sequential steps.
Long-range dependency: a relationship between tokens that are far apart in a sequence.
Parallelization: computing attention for all positions simultaneously rather than one time step at a time.

Example

Consider the sentence "The cat, which was sitting on the mat, purred softly." An RNN would process each word sequentially, making it difficult to capture the relationship between "cat" and "purred" due to the intervening words. The information about the "cat" might be diluted or lost by the time the RNN processes "purred." A transformer, however, can directly attend to "cat" when processing "purred," regardless of the distance between them, using the self-attention mechanism. This parallel processing also drastically reduces training time compared to a sequential RNN.

Code Implementation

import torch
import torch.nn as nn

# Simplified example demonstrating parallel computation with self-attention

class SelfAttention(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SelfAttention, self).__init__()
        self.query = nn.Linear(input_dim, hidden_dim)
        self.key = nn.Linear(input_dim, hidden_dim)
        self.value = nn.Linear(input_dim, hidden_dim)

    def forward(self, x):
        # x: (sequence_length, batch_size, input_dim)
        Q = self.query(x)  # (sequence_length, batch_size, hidden_dim)
        K = self.key(x)    # (sequence_length, batch_size, hidden_dim)
        V = self.value(x)  # (sequence_length, batch_size, hidden_dim)

        # Scaled dot-product attention: scores for every pair of positions,
        # computed for the whole sequence in one batched matrix multiply
        attention_scores = torch.bmm(Q.transpose(0, 1), K.transpose(0, 1).transpose(1, 2)) / (K.size(-1) ** 0.5)  # (batch_size, sequence_length, sequence_length)
        attention_weights = torch.softmax(attention_scores, dim=-1)

        # Weighted sum of values
        output = torch.bmm(attention_weights, V.transpose(0, 1)).transpose(0, 1)  # (sequence_length, batch_size, hidden_dim)
        return output

# Example usage
input_dim = 512
hidden_dim = 256
sequence_length = 10
batch_size = 32

x = torch.randn(sequence_length, batch_size, input_dim) # Example input
attention_layer = SelfAttention(input_dim, hidden_dim)
output = attention_layer(x)
print(output.shape) # Output shape: (sequence_length, batch_size, hidden_dim)
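
To connect this back to the "cat"/"purred" example above, the attention weights can be recomputed outside the module for inspection. This is a sketch that reuses the layer's own query/key projections; the token positions (1 for "cat", 9 for "purred") are hypothetical indices chosen for illustration:

# Sketch: recompute the attention weights so a single position-to-position weight can be read off.
with torch.no_grad():
    Qb = attention_layer.query(x).transpose(0, 1)  # (batch_size, sequence_length, hidden_dim)
    Kb = attention_layer.key(x).transpose(0, 1)    # (batch_size, sequence_length, hidden_dim)
    weights = torch.softmax(torch.bmm(Qb, Kb.transpose(1, 2)) / (hidden_dim ** 0.5), dim=-1)
print(weights[0, 9, 1])  # weight that position 9 ("purred") places on position 1 ("cat"), first batch element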

# RNN Example (for comparison - showing sequential processing)
rnn = nn.RNN(input_dim, hidden_dim, batch_first=False)
output, hidden = rnn(x) # Each time step depends on the previous hidden state, so steps run sequentially
print(output.shape) # Output shape: (sequence_length, batch_size, hidden_dim)
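
For contrast with the single batched matrix multiply used by self-attention, the same recurrence can be written as an explicit loop with nn.RNNCell (a minimal sketch): each hidden state depends on the previous one, so the time steps cannot run in parallel.

# Sketch: the RNN recurrence unrolled as an explicit loop over time steps.
rnn_cell = nn.RNNCell(input_dim, hidden_dim)
h = torch.zeros(batch_size, hidden_dim)
step_outputs = []
for t in range(sequence_length):
    h = rnn_cell(x[t], h)  # step t cannot start until step t-1 has produced h
    step_outputs.append(h)
print(torch.stack(step_outputs).shape)  # (sequence_length, batch_size, hidden_dim)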

Related Concepts