Study Card: Positional Encoding in Transformers

Direct Answer

Positional encoding is crucial in transformer models because self-attention is permutation-invariant: it treats the input as an unordered set of tokens and carries no information about where each token appears. Positional encodings, added to the input word embeddings, inject this missing order information, letting the model distinguish different word orderings and represent the sequential structure of the input. This matters for tasks such as natural language understanding and machine translation, where word order significantly affects meaning.
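
Concretely, the sinusoidal scheme from the original Transformer paper, which the code below implements, assigns position pos and dimension pair index i the values:

```latex
PE_{(pos,\,2i)}   = \sin\!\bigl(pos / 10000^{2i/d_{\mathrm{model}}}\bigr) \\
PE_{(pos,\,2i+1)} = \cos\!\bigl(pos / 10000^{2i/d_{\mathrm{model}}}\bigr)
```

Each dimension pair is a sinusoid of a different wavelength, so nearby positions receive similar encodings while distant positions remain distinguishable.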

Key Terms

- Positional encoding: a vector added to each token embedding that encodes the token's position in the sequence.
- Self-attention: the mechanism by which each token attends to every other token; on its own it is permutation-invariant (order-blind).
- Sinusoidal (absolute) positional encoding: the fixed sine/cosine scheme from the original Transformer, implemented in the code below.
- Embedding dimension (d_model): the size of each token embedding, which the positional encoding must match so the two can be summed.

Example

Consider the sentences "Dog bites man" and "Man bites dog." Without positional encoding, a transformer would treat these sentences identically because the set of words is the same. Positional encoding adds information about the position of "dog" and "man," enabling the model to differentiate between these two sentences with opposite meanings.
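
This order-blindness can be checked directly. The sketch below uses a toy scaled dot-product self-attention with random stand-in embeddings (not real word vectors, and no learned projections): reordering the tokens simply reorders the outputs, so each token's representation is identical in both sentences.

```python
import math
import torch

torch.manual_seed(0)
d = 8
x = torch.randn(3, d)  # stand-in embeddings for ["dog", "bites", "man"]

def self_attention(x):
    # Scaled dot-product attention with no positional information.
    scores = x @ x.T / math.sqrt(x.shape[-1])
    return torch.softmax(scores, dim=-1) @ x

out = self_attention(x)
perm = torch.tensor([2, 1, 0])       # "man bites dog"
out_perm = self_attention(x[perm])

# Per-token outputs match, just permuted: attention alone cannot tell
# the two orderings apart.
print(torch.allclose(out[perm], out_perm))  # True
```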

Code Implementation

import torch
import torch.nn as nn
import math

# Example implementation of absolute positional encoding

def positional_encoding(max_len, d_model):
    """
    Generates positional encodings for a sequence.

    Args:
        max_len: Maximum sequence length.
        d_model: Embedding dimension.

    Returns:
        Tensor of shape (max_len, d_model).
    """
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Example Usage:
max_len = 50
d_model = 512
pe = positional_encoding(max_len, d_model)

print(pe.shape)  # Output: torch.Size([50, 512])

# How to use it with word embeddings:
batch_size = 4
seq_len = 10
word_embeddings = torch.randn(batch_size, seq_len, d_model) # Example word embeddings

# Add positional encodings to word embeddings (assuming seq_len <= max_len);
# unsqueeze(0) lets the (1, seq_len, d_model) encodings broadcast over the batch
encoded_embeddings = word_embeddings + pe[:seq_len, :].unsqueeze(0)

print(encoded_embeddings.shape)  # Output: torch.Size([4, 10, 512])
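
In practice the function above is often packaged as a module. The sketch below (a common pattern, not part of the original snippet) registers the table as a buffer so it is saved in the state dict and moved with `.to(device)` without being trained:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to input embeddings."""

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer, not a Parameter: persistent but not trainable.
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model)
        return x + self.pe[: x.size(1)].unsqueeze(0)

pos_enc = PositionalEncoding(d_model=512, max_len=50)
x = torch.randn(4, 10, 512)
print(pos_enc(x).shape)  # torch.Size([4, 10, 512])
```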

Related Concepts

- Self-attention and the Transformer architecture
- Learned (trainable) positional embeddings, an alternative to the fixed sinusoidal scheme
- Relative positional encoding and rotary positional embeddings (RoPE)
- Word embeddings and the embedding dimension d_model