Study Card: Limitations of RNNs Addressed by Transformers

Direct Answer

Transformers address two key limitations of Recurrent Neural Networks (RNNs): poor handling of long sequences and a lack of parallelism. RNNs process input one time step at a time, so computation cannot be parallelized across the sequence, and backpropagating through many steps causes gradients to vanish, which makes long-range dependencies hard to learn. Transformers instead use self-attention to process the entire sequence at once, enabling parallel computation and letting any position attend directly to any other, no matter how far apart; this also significantly speeds up training. In short, the key improvements are parallelization, better handling of long-range dependencies, and mitigation of vanishing gradients.
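
As a rough illustration of why distant information fades in an RNN, the snippet below (illustrative only; the per-step factor of 0.9 is a made-up assumption, not a measured value) shows how a gradient signal shrinks geometrically with sequence length, whereas self-attention links any two positions through a single step:

# Illustrative sketch: assume each RNN time step scales the backpropagated gradient by ~0.9 (made-up factor).
per_step_factor = 0.9
steps = 100
print(per_step_factor ** steps)  # ~2.7e-05: the signal from the first token has effectively vanished
# Self-attention connects any two positions through one weighted sum, so the gradient path
# between distant tokens has constant length instead of growing with their distance.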

Key Terms

Self-attention: the mechanism that lets each position in a sequence attend directly to every other position.
Vanishing gradients: the tendency of gradients to shrink toward zero when backpropagated through many sequential steps.
Long-range dependency: a relationship between tokens that are far apart in a sequence.
Parallelization: computing attention for all positions simultaneously rather than one time step at a time.

Example

Consider the sentence "The cat, which was sitting on the mat, purred softly." An RNN would process each word sequentially, making it difficult to capture the relationship between "cat" and "purred" due to the intervening words. The information about the "cat" might be diluted or lost by the time the RNN processes "purred." A transformer, however, can directly attend to "cat" when processing "purred," regardless of the distance between them, using the self-attention mechanism. This parallel processing also drastically reduces training time compared to a sequential RNN.

Code Implementation

import torch
import torch.nn as nn

# Simplified example demonstrating parallel computation with self-attention

class SelfAttention(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SelfAttention, self).__init__()
        self.query = nn.Linear(input_dim, hidden_dim)
        self.key = nn.Linear(input_dim, hidden_dim)
        self.value = nn.Linear(input_dim, hidden_dim)

    def forward(self, x):
        # x: (sequence_length, batch_size, input_dim)
        Q = self.query(x)  # (sequence_length, batch_size, hidden_dim)
        K = self.key(x)    # (sequence_length, batch_size, hidden_dim)
        V = self.value(x)  # (sequence_length, batch_size, hidden_dim)

        # Scaled dot-product attention: scores for every pair of positions,
        # computed for the whole sequence in one batched matrix multiply
        attention_scores = torch.bmm(Q.transpose(0, 1), K.transpose(0, 1).transpose(1, 2)) / (K.size(-1) ** 0.5)  # (batch_size, sequence_length, sequence_length)
        attention_weights = torch.softmax(attention_scores, dim=-1)

        # Weighted sum of values
        output = torch.bmm(attention_weights, V.transpose(0, 1)).transpose(0, 1)  # (sequence_length, batch_size, hidden_dim)
        return output

# Example usage
input_dim = 512
hidden_dim = 256
sequence_length = 10
batch_size = 32

x = torch.randn(sequence_length, batch_size, input_dim) # Example input
attention_layer = SelfAttention(input_dim, hidden_dim)
output = attention_layer(x)
print(output.shape) # Output shape: (sequence_length, batch_size, hidden_dim)
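
To connect this back to the "cat"/"purred" example above, the attention weights can be recomputed outside the module for inspection. This is a sketch that reuses the layer's own query/key projections; the token positions (1 for "cat", 9 for "purred") are hypothetical indices chosen for illustration:

# Sketch: recompute the attention weights so a single position-to-position weight can be read off.
with torch.no_grad():
    Qb = attention_layer.query(x).transpose(0, 1)  # (batch_size, sequence_length, hidden_dim)
    Kb = attention_layer.key(x).transpose(0, 1)    # (batch_size, sequence_length, hidden_dim)
    weights = torch.softmax(torch.bmm(Qb, Kb.transpose(1, 2)) / (hidden_dim ** 0.5), dim=-1)
print(weights[0, 9, 1])  # weight that position 9 ("purred") places on position 1 ("cat"), first batch element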

# RNN Example (for comparison - showing sequential processing)
rnn = nn.RNN(input_dim, hidden_dim, batch_first=False)
output, hidden = rnn(x) # Each time step depends on the previous hidden state, so steps run sequentially
print(output.shape) # Output shape: (sequence_length, batch_size, hidden_dim)
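
For contrast with the single batched matrix multiply used by self-attention, the same recurrence can be written as an explicit loop with nn.RNNCell (a minimal sketch): each hidden state depends on the previous one, so the time steps cannot run in parallel.

# Sketch: the RNN recurrence unrolled as an explicit loop over time steps.
rnn_cell = nn.RNNCell(input_dim, hidden_dim)
h = torch.zeros(batch_size, hidden_dim)
step_outputs = []
for t in range(sequence_length):
    h = rnn_cell(x[t], h)  # step t cannot start until step t-1 has produced h
    step_outputs.append(h)
print(torch.stack(step_outputs).shape)  # (sequence_length, batch_size, hidden_dim)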

Related Concepts