Study Card: Evaluating LLM Embeddings

Direct Answer

Evaluating LLM embeddings involves assessing how well they capture semantic relationships between texts. Cosine similarity and Euclidean distance quantify how close two embedding vectors are, while rank correlation (Spearman's rho) measures how well model similarity scores agree with human judgments. For semantic similarity tasks, benchmark datasets with human-labeled similarity scores are used to compare model performance. In information retrieval, metrics like precision, recall, and mean average precision (MAP) measure how well embeddings retrieve relevant documents. Evaluating on downstream tasks and visualizing embeddings provide further insight into their quality.
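As a minimal sketch of the vector-level metrics (using random vectors in place of real model outputs):

import torch
import torch.nn.functional as F

# Random vectors standing in for two real embedding outputs
emb_a = torch.randn(768)
emb_b = torch.randn(768)

cosine = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()  # in [-1, 1]; higher means more similar
euclidean = torch.dist(emb_a, emb_b).item()                                  # >= 0; lower means more similar
print(f"Cosine similarity: {cosine:.3f}, Euclidean distance: {euclidean:.3f}")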

Key Terms

Cosine similarity: the cosine of the angle between two embedding vectors; values near 1 indicate semantic closeness.
Euclidean distance: the straight-line distance between two embedding vectors; smaller values indicate closeness.
Spearman's rho: a rank correlation measuring agreement between model similarity scores and human judgments.
Mean average precision (MAP): the mean of per-query average precision over an evaluation set in information retrieval.
t-SNE: a dimensionality-reduction method for visualizing high-dimensional embeddings in two dimensions.
STS Benchmark: a dataset of sentence pairs with human-labeled similarity scores.

Example

Consider evaluating embeddings generated by an LLM for sentence similarity. The model embeds two sentences, "The cat sat on the mat" and "The feline rested on the rug." Cosine similarity between these embeddings reflects their semantic closeness. Comparing such similarity scores against human judgments on a benchmark dataset like the STS Benchmark provides a quantitative measure. In information retrieval, given a query, the model retrieves the documents whose embeddings are closest to the query embedding, and MAP evaluates the ranking of retrieved documents by relevance. Visualizing embeddings of related concepts with t-SNE offers a qualitative check: if the embeddings capture semantic relationships, similar concepts should form clusters.
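As a concrete sketch, assuming the sentence-transformers library and an off-the-shelf model such as all-MiniLM-L6-v2 are available (both are illustrative choices, not requirements), the two example sentences can be embedded and scored directly:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
sentences = ["The cat sat on the mat", "The feline rested on the rug"]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {score:.3f}")
# On a benchmark like the STS Benchmark, such scores would be compared
# against human ratings across many pairs, e.g., with Spearman's rho.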

Code Implementation

import torch
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import spearmanr
from sklearn.metrics import average_precision_score

# Example: Calculating cosine similarity
embeddings1 = torch.randn(10, 512)  # Batch of 10 embeddings, dimension 512
embeddings2 = torch.randn(10, 512)

# cosine_similarity returns a 10x10 matrix of all pairwise similarities;
# the diagonal holds the similarity of each row in embeddings1 with the
# corresponding row in embeddings2.
cosine_sim = cosine_similarity(embeddings1.numpy(), embeddings2.numpy())
print("Cosine similarity matrix:")
print(cosine_sim)
print("Per-pair similarities (diagonal):")
print(cosine_sim.diagonal())

# Example: Calculating Spearman's rho between model scores and human judgments
human_similarity_scores = [4, 3, 5, 1, 2]            # Human-annotated similarity ratings for five text pairs
model_similarity_scores = [0.8, 0.7, 0.9, 0.2, 0.5]  # Model-predicted similarities for the same pairs (e.g., per-pair cosine similarities)
rho, p_value = spearmanr(human_similarity_scores, model_similarity_scores)
print(f"Spearman's rho: {rho:.3f} (p-value: {p_value:.3f})")

# Example: Average precision (AP) for a single query in information retrieval
true_relevance = [1, 0, 1, 1, 0]              # True relevance of retrieved documents (1: relevant, 0: irrelevant)
retrieved_scores = [0.9, 0.8, 0.7, 0.6, 0.5]  # Scores assigned by the model (e.g., similarity to the query)
ap_score = average_precision_score(true_relevance, retrieved_scores)
print(f"Average Precision (AP) for this query: {ap_score:.3f}")

# Example: Visualizing embeddings with t-SNE (using scikit-learn)
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = torch.randn(100, 768)  # Example embeddings: 100 points of dimension 768
# perplexity must be smaller than the number of samples; random_state makes the layout reproducible
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings.numpy())

plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])  # 2D projection; related items should form clusters
plt.title("t-SNE projection of embeddings")
plt.show()
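The Direct Answer also mentions evaluation on downstream tasks. A minimal sketch of this is a linear probe: train a simple classifier on top of the frozen embeddings and report its accuracy (random embeddings and labels stand in for a real labeled dataset here):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = np.random.randn(200, 512)          # Embeddings for 200 texts (placeholder data)
y = np.random.randint(0, 2, size=200)  # Binary labels (placeholder data)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Downstream classification accuracy: {accuracy:.3f}")  # ~0.5 here, since the placeholder data is random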

Related Concepts