Study Card: L1 vs L2 Regularization

Direct Answer

L1 (Lasso) and L2 (Ridge) regularization both prevent overfitting by adding a penalty term to the loss function; they differ in the form of that penalty. L1 regularization penalizes the sum of the absolute values of the coefficients, promoting sparsity by driving some coefficients to exactly zero. L2 regularization penalizes the sum of the squared coefficients, shrinking all coefficients towards zero but almost never setting them exactly to zero. Choose L1 when implicit feature selection is desired, and L2 when all features are expected to contribute, just with smaller weights.
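For reference, the two penalized least-squares objectives (scaling conventions vary by library; scikit-learn's Lasso includes the $\frac{1}{2n}$ factor, while its Ridge does not):

$$\hat{\beta}_{\text{L1}} = \arg\min_{\beta} \frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda \sum_{j} |\beta_j|, \qquad \hat{\beta}_{\text{L2}} = \arg\min_{\beta} \frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda \sum_{j} \beta_j^2$$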

Key Terms

Penalty term: the extra cost added to the loss; $\lambda \sum_j |\beta_j|$ for L1, $\lambda \sum_j \beta_j^2$ for L2.
Sparsity: a coefficient vector with many entries exactly zero; L1 induces it, giving implicit feature selection.
Regularization strength: the hyperparameter $\lambda$ (called alpha in scikit-learn) that scales the penalty.
Shrinkage: the reduction of coefficient magnitudes towards zero caused by either penalty.

Example

Consider a linear regression model with 100 features, predicting sales based on advertising spend across different channels. Some channels might have a negligible impact on sales.

Numerically, imagine two coefficients: $\beta_1=10$ and $\beta_2=2$. With L1 regularization and a penalty factor $\lambda=0.5$, the penalty is $0.5(|10|+|2|)=6$. With L2 regularization and the same $\lambda=0.5$, the penalty is $0.5(10^2+2^2)=52$. Because the L2 penalty grows quadratically with coefficient magnitude, it shrinks large coefficients far more aggressively than L1; conversely, the constant slope of the L1 penalty keeps pushing small coefficients all the way to exactly zero, which is why Lasso produces sparse solutions.
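A quick check of this arithmetic in NumPy, reusing the coefficients and $\lambda$ from above:

import numpy as np

beta = np.array([10.0, 2.0])
lam = 0.5
l1_penalty = lam * np.sum(np.abs(beta))  # 0.5 * (10 + 2) = 6.0
l2_penalty = lam * np.sum(beta ** 2)     # 0.5 * (100 + 4) = 52.0
print(l1_penalty, l2_penalty)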

Code Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Generate synthetic data with many features, only a few being relevant
np.random.seed(42)
n_samples, n_features = 50, 100
X = np.random.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:5] = np.random.randn(5) # only first 5 features are relevant
y = X.dot(true_coef) + 0.5 * np.random.randn(n_samples)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Lasso (L1) regression
lasso = Lasso(alpha=0.1)  # alpha is the regularization strength (lambda above)
lasso.fit(X_train, y_train)

# Fit Ridge (L2) regression
ridge = Ridge(alpha=1.0)  # alpha plays the same role, but values are not directly comparable across models
ridge.fit(X_train, y_train)

# Evaluate both fits on the held-out test set
print(f"Lasso test R^2: {lasso.score(X_test, y_test):.3f}")
print(f"Ridge test R^2: {ridge.score(X_test, y_test):.3f}")

# Plot coefficients
plt.figure(figsize=(10, 6))

plt.plot(lasso.coef_, label='Lasso (L1) Coefficients', marker='o')
plt.plot(ridge.coef_, label='Ridge (L2) Coefficients', marker='x')
plt.plot(true_coef, label='True Coefficients', marker='*')

plt.xlabel('Coefficient Index')
plt.ylabel('Coefficient Value')
plt.title('Lasso vs Ridge Coefficients')
plt.legend()
plt.grid(True)

# Annotate the plot with counts of exactly-zero coefficients
# (Lasso typically zeroes out many; Ridge almost never produces exact zeros)
lasso_sparsity = np.sum(lasso.coef_ == 0)
ridge_sparsity = np.sum(ridge.coef_ == 0)
plt.text(0.6, 0.8, f'Lasso Sparsity (zero coefs): {lasso_sparsity}', transform=plt.gca().transAxes)
plt.text(0.6, 0.75, f'Ridge Sparsity (zero coefs): {ridge_sparsity}', transform=plt.gca().transAxes)

plt.show()
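In practice, alpha is tuned rather than fixed by hand. Below is a minimal sketch using scikit-learn's built-in cross-validation estimators, continuing from the X_train/X_test split above; the alpha grid here is an arbitrary choice for illustration, not a recommendation.

from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-3, 1, 30)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_:.4f}, test R^2: {lasso_cv.score(X_test, y_test):.3f}")
print(f"Best Ridge alpha: {ridge_cv.alpha_:.4f}, test R^2: {ridge_cv.score(X_test, y_test):.3f}")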

Related Concepts

Elastic Net (a weighted combination of the L1 and L2 penalties), the bias-variance tradeoff, feature selection, and cross-validation for choosing the regularization strength.