Handling out-of-vocabulary (OOV) words in LLM embedding generation involves strategies such as subword tokenization (BPE, WordPiece), character-level embeddings, or fallback mechanisms. Subword tokenization breaks OOV words into smaller, known units, while character-level embeddings construct word representations from character embeddings. Fallback mechanisms assign a default vector, use a special <UNK> token, or leverage external resources to generate embeddings for unknown words. Choosing the right strategy depends on the specific application and LLM architecture.
Consider an LLM encountering the word "biotechnological" during text generation. If this word is OOV, subword tokenization might break it into pieces such as "bio," "##tech," and "##nological," each of which has a corresponding embedding. Character-level embeddings would instead combine embeddings for each character ('b', 'i', 'o', 't', 'e', 'c', 'h', ...). As a fallback, if none of the subword units exist in the vocabulary, the tokenizer returns an <UNK> token with a learned embedding that represents all unknown words. Alternatively, one could query a knowledge base to retrieve information related to the OOV word and generate an embedding from that retrieved content. These strategies allow the LLM to produce somewhat meaningful representations of previously unseen words.
import transformers
import torch

# Example: using subword tokenization with Hugging Face Transformers
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
text = "This is a sentence with an out-of-vocabulary word: biotechnological."
print(tokenizer.tokenize(text))  # observe how "biotechnological" is split into subword pieces

# Example: the unknown token
# BERT's unknown token is "[UNK]" (not "<UNK>"); a literal "<UNK>" in the input
# would itself be tokenized. Tokens with no known subword decomposition map to it.
print(tokenizer.unk_token)       # "[UNK]"
print(tokenizer.tokenize("☃"))   # a symbol absent from the vocabulary typically maps to the unknown token
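Once an OOV word has been split into subword pieces, one common way to obtain a single vector for the whole word is to average the input embeddings of its pieces. The sketch below illustrates this with a random stand-in embedding matrix and hypothetical subword ids; in a real Hugging Face model the matrix would come from `model.get_input_embeddings().weight` and the ids from the tokenizer.

```python
import torch

# Stand-in for a pretrained model's input-embedding matrix
# (vocab size and hidden dim chosen to match bert-base-uncased).
vocab_size, hidden_dim = 30522, 768
embedding_matrix = torch.randn(vocab_size, hidden_dim)

def subword_average_embedding(subword_ids, embedding_matrix):
    """Averages the embedding rows of the given subword token ids."""
    rows = embedding_matrix[torch.tensor(subword_ids)]  # (num_pieces, hidden_dim)
    return rows.mean(dim=0)                             # (hidden_dim,)

# Hypothetical subword ids for an OOV word such as "biotechnological"
oov_ids = [16012, 15007, 12589]
vec = subword_average_embedding(oov_ids, embedding_matrix)
print(vec.shape)  # torch.Size([768])
```

Averaging is a simple baseline; weighting pieces by frequency or using the model's contextual output embeddings are common refinements.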
# Example: illustrative implementation of character-level embeddings
def char_embedding(word, char_embedding_dict, embedding_dim):
    """Generates a character-level embedding by averaging character embeddings."""
    word_embedding = torch.zeros(embedding_dim)
    char_count = 0
    for char in word:
        if char in char_embedding_dict:
            word_embedding += char_embedding_dict[char]
            char_count += 1
    if char_count > 0:
        word_embedding /= char_count
    return word_embedding
# Example usage (illustrative):
char_embedding_dict = {'a': torch.randn(100), 'b': torch.randn(100), 'c': torch.randn(100)}  # random stand-ins for pretrained character embeddings
word = "abc"
word_emb = char_embedding(word, char_embedding_dict, 100)
print(word_emb.shape)
# Example: fallback with a default vector for OOV words
def get_embedding(word, word_embeddings, default_vector):
    """Retrieves a word embedding, or returns the default vector if OOV."""
    return word_embeddings.get(word, default_vector)
# Example usage
word_embeddings = {"hello": torch.randn(300), "world": torch.randn(300)}
default_vector = torch.zeros(300)
embedding = get_embedding("biotechnological", word_embeddings, default_vector)
print(embedding)
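The remaining strategy mentioned above, querying an external resource, can be sketched as follows. Here the knowledge base and embedding table are toy stand-ins (a real system would query an actual knowledge base and use pretrained vectors): the OOV word's definition is retrieved, and the embeddings of known words in that definition are averaged.

```python
import torch

# Toy stand-ins for an external knowledge base and a pretrained embedding table.
knowledge_base = {
    "biotechnological": "relating to technology based on biology",
}
word_embeddings = {
    "technology": torch.randn(300),
    "biology": torch.randn(300),
}

def retrieval_fallback_embedding(word, knowledge_base, word_embeddings, dim=300):
    """Embeds an OOV word by averaging embeddings of known words in its retrieved definition."""
    definition = knowledge_base.get(word, "")
    vectors = [word_embeddings[w] for w in definition.split() if w in word_embeddings]
    if not vectors:
        return torch.zeros(dim)  # nothing retrievable: fall back to a default vector
    return torch.stack(vectors).mean(dim=0)

emb = retrieval_fallback_embedding("biotechnological", knowledge_base, word_embeddings)
print(emb.shape)  # torch.Size([300])
```

Note that this composes two fallbacks: retrieval first, then a default zero vector when retrieval yields nothing usable.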