Sparse loss functions such as cross-entropy are commonly used for classification tasks where the target is a single class label (or a probability distribution over a set of classes). Because the loss term involves only the predicted probability of the true class, they remain computationally cheap to evaluate even for large vocabulary sizes. Dense loss functions such as mean squared error (MSE) are used for regression tasks where the target is a continuous value. In language modeling, using MSE with one-hot encoded targets forces a comparison against every element of a vocabulary-sized target vector, which is computationally expensive and tends to produce less sharp (less peaked) probability distributions than cross-entropy training. The key differences are the target representation (class indices versus continuous vectors), the computational cost (cheap for sparse targets, expensive for dense targets over large vocabularies), and the sharpness of the predicted probability distributions (sharper with cross-entropy).
In next-word prediction with a vocabulary of 10,000 words, if the true next word is "cat" (stored at index 200), cross-entropy considers only the predicted probability at index 200. MSE with one-hot encoding would compute the squared error between the full predicted distribution and the one-hot vector representing "cat" (a vector of length 10,000 with a 1 at index 200 and 0 everywhere else), touching all 10,000 components and incurring a significantly higher computational cost.
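As a rough sketch of this 10,000-word case, the snippet below contrasts what each loss actually uses: cross-entropy reads off a single log-probability at the true index, while MSE compares all 10,000 components against the one-hot target. The vocabulary size, the index 200 for "cat", and the random logits are illustrative stand-ins, not real model outputs.

import torch
import torch.nn.functional as F

vocab_size = 10_000
cat_index = 200                         # hypothetical index of "cat" in the vocabulary
logits = torch.randn(1, vocab_size)     # stand-in for raw model scores at one prediction step

# Cross-entropy: only the log-probability at the true index enters the loss value
log_probs = F.log_softmax(logits, dim=-1)
ce_loss = -log_probs[0, cat_index]

# MSE with a one-hot target: every one of the 10,000 components contributes
one_hot = F.one_hot(torch.tensor([cat_index]), num_classes=vocab_size).float()
mse_loss = ((F.softmax(logits, dim=-1) - one_hot) ** 2).mean()

print("Cross-entropy:", ce_loss.item(), " MSE:", mse_loss.item())

The smaller, self-contained example below makes the same comparison on just five classes, where both losses are easy to inspect.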
import torch
import torch.nn as nn
import torch.nn.functional as F

# Example comparing cross-entropy and MSE loss on a small 5-class problem
predictions = torch.randn(1, 5, requires_grad=True)              # Raw logits: batch size 1, 5 classes
target_class = torch.tensor([2])                                 # True class label (index 2)
target_one_hot = F.one_hot(target_class, num_classes=5).float()  # One-hot encoding of the target

# Cross-entropy loss: takes raw logits and the class index directly
loss_cross_entropy = nn.CrossEntropyLoss()(predictions, target_class)

# MSE loss: compares the full predicted probability distribution to the one-hot vector
predictions_softmax = F.softmax(predictions, dim=-1)
loss_mse = nn.MSELoss()(predictions_softmax, target_one_hot)

print("Cross-entropy Loss:", loss_cross_entropy.item())
print("MSE Loss:", loss_mse.item())
print("Predicted probabilities (softmax):", predictions_softmax)

# Experiment: increase the number of classes (vocabulary size)
# and observe the increase in computation time for MSE loss.
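One way to run the experiment suggested in the comments above is sketched below. The vocabulary sizes, batch size, and repeat count are arbitrary choices for illustration, and the measured gap will depend on hardware; the point is simply that the MSE path materializes and compares a full one-hot matrix whose size grows with the number of classes.

import time
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_losses(num_classes, batch_size=256, repeats=100):
    # Roughly time cross-entropy vs. MSE-with-one-hot for a given vocabulary size
    logits = torch.randn(batch_size, num_classes)
    targets = torch.randint(0, num_classes, (batch_size,))
    one_hot = F.one_hot(targets, num_classes=num_classes).float()

    start = time.perf_counter()
    for _ in range(repeats):
        nn.CrossEntropyLoss()(logits, targets)
    ce_time = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(repeats):
        nn.MSELoss()(F.softmax(logits, dim=-1), one_hot)
    mse_time = time.perf_counter() - start

    print(f"{num_classes:>7} classes  cross-entropy: {ce_time:.4f}s  MSE: {mse_time:.4f}s")

for vocab in (1_000, 10_000, 100_000):
    time_losses(vocab)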