Handling out-of-vocabulary (OOV) words in LLM embedding generation involves strategies such as subword tokenization (BPE, WordPiece), character-level embeddings, or fallback mechanisms. Subword tokenization breaks OOV words into smaller, known units, while character-level embeddings construct word representations from character embeddings. Fallback mechanisms assign a default vector, use a special <UNK> token, or leverage external resources to generate embeddings for unknown words. Choosing the right strategy depends on the specific application and LLM architecture.
Consider an LLM encountering the word "biotechnological" during text generation. If this word is OOV, subword tokenization might break it into pieces such as "bio," "##tech," and "##nological," each of which has a corresponding embedding. Character-level embeddings would instead combine embeddings for each character ('b', 'i', 'o', 't', 'e', 'c', 'h', ...). As a fallback, if none of the subword units exist in the vocabulary, the tokenizer returns an <UNK> token with a learned embedding that represents all unknown words. Alternatively, one could query a knowledge base to retrieve information related to the OOV word and generate an embedding from that retrieved content. These strategies allow the LLM to produce somewhat meaningful representations of previously unseen words.
import transformers
import torch

# Example: using subword tokenization with Hugging Face Transformers
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
text = "This is a sentence with an out-of-vocabulary word: biotechnological."
print(tokenizer.tokenize(text))  # observe how "biotechnological" is split into subword pieces

# Example: the unknown token
# BERT's unknown token is "[UNK]" (not "<UNK>"); a literal "<UNK>" in the input
# would itself be tokenized. Tokens with no known subword decomposition map to it.
print(tokenizer.unk_token)       # "[UNK]"
print(tokenizer.tokenize("☃"))   # a symbol absent from the vocabulary typically maps to the unknown token
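Once an OOV word has been split into subword pieces, one common way to obtain a single vector for the whole word is to average the input embeddings of its pieces. The sketch below illustrates this with a random stand-in embedding matrix and hypothetical subword ids; in a real Hugging Face model the matrix would come from `model.get_input_embeddings().weight` and the ids from the tokenizer.

```python
import torch

# Stand-in for a pretrained model's input-embedding matrix
# (vocab size and hidden dim chosen to match bert-base-uncased).
vocab_size, hidden_dim = 30522, 768
embedding_matrix = torch.randn(vocab_size, hidden_dim)

def subword_average_embedding(subword_ids, embedding_matrix):
    """Averages the embedding rows of the given subword token ids."""
    rows = embedding_matrix[torch.tensor(subword_ids)]  # (num_pieces, hidden_dim)
    return rows.mean(dim=0)                             # (hidden_dim,)

# Hypothetical subword ids for an OOV word such as "biotechnological"
oov_ids = [16012, 15007, 12589]
vec = subword_average_embedding(oov_ids, embedding_matrix)
print(vec.shape)  # torch.Size([768])
```

Averaging is a simple baseline; weighting pieces by frequency or using the model's contextual output embeddings are common refinements.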
# Example: illustrative implementation of character-level embeddings
def char_embedding(word, char_embedding_dict, embedding_dim):
    """Generates a character-level embedding by averaging character embeddings."""
    word_embedding = torch.zeros(embedding_dim)
    char_count = 0
    for char in word:
        if char in char_embedding_dict:
            word_embedding += char_embedding_dict[char]
            char_count += 1
    if char_count > 0:
        word_embedding /= char_count
    return word_embedding
# Example usage (illustrative):
char_embedding_dict = {'a': torch.randn(100), 'b': torch.randn(100), 'c': torch.randn(100)}  # random stand-ins for pretrained character embeddings
word = "abc"
word_emb = char_embedding(word, char_embedding_dict, 100)
print(word_emb.shape)
# Example: fallback with a default vector for OOV words
def get_embedding(word, word_embeddings, default_vector):
    """Retrieves a word embedding, or returns the default vector if OOV."""
    return word_embeddings.get(word, default_vector)
# Example usage
word_embeddings = {"hello": torch.randn(300), "world": torch.randn(300)}
default_vector = torch.zeros(300)
embedding = get_embedding("biotechnological", word_embeddings, default_vector)
print(embedding)
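The remaining strategy mentioned above, querying an external resource, can be sketched as follows. Here the knowledge base and embedding table are toy stand-ins (a real system would query an actual knowledge base and use pretrained vectors): the OOV word's definition is retrieved, and the embeddings of known words in that definition are averaged.

```python
import torch

# Toy stand-ins for an external knowledge base and a pretrained embedding table.
knowledge_base = {
    "biotechnological": "relating to technology based on biology",
}
word_embeddings = {
    "technology": torch.randn(300),
    "biology": torch.randn(300),
}

def retrieval_fallback_embedding(word, knowledge_base, word_embeddings, dim=300):
    """Embeds an OOV word by averaging embeddings of known words in its retrieved definition."""
    definition = knowledge_base.get(word, "")
    vectors = [word_embeddings[w] for w in definition.split() if w in word_embeddings]
    if not vectors:
        return torch.zeros(dim)  # nothing retrievable: fall back to a default vector
    return torch.stack(vectors).mean(dim=0)

emb = retrieval_fallback_embedding("biotechnological", knowledge_base, word_embeddings)
print(emb.shape)  # torch.Size([300])
```

Note that this composes two fallbacks: retrieval first, then a default zero vector when retrieval yields nothing usable.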