Study Card: L1 vs L2 Regularization

Direct Answer

L1 (Lasso) and L2 (Ridge) regularization both prevent overfitting by adding a penalty term to the loss function; they differ in the form of that penalty. L1 regularization penalizes the sum of the absolute values of the coefficients, promoting sparsity by driving some coefficients to exactly zero. L2 regularization penalizes the sum of the squared coefficients, shrinking all coefficients towards zero but almost never setting them exactly to zero. Choose L1 when implicit feature selection is desired, and L2 when all features are expected to contribute, just with smaller weights.
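For reference, the two penalized least-squares objectives (scaling conventions vary by library; scikit-learn's Lasso includes the $\frac{1}{2n}$ factor, while its Ridge does not):

$$\hat{\beta}_{\text{L1}} = \arg\min_{\beta} \frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda \sum_{j} |\beta_j|, \qquad \hat{\beta}_{\text{L2}} = \arg\min_{\beta} \frac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda \sum_{j} \beta_j^2$$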

Key Terms

Penalty term: the extra cost added to the loss; $\lambda \sum_j |\beta_j|$ for L1, $\lambda \sum_j \beta_j^2$ for L2.
Sparsity: a coefficient vector with many entries exactly zero; L1 induces it, giving implicit feature selection.
Regularization strength: the hyperparameter $\lambda$ (called alpha in scikit-learn) that scales the penalty.
Shrinkage: the reduction of coefficient magnitudes towards zero caused by either penalty.

Example

Consider a linear regression model with 100 features, predicting sales based on advertising spend across different channels. Some channels might have a negligible impact on sales.

Numerically, imagine two coefficients: $\beta_1=10$ and $\beta_2=2$. With L1 regularization and a penalty factor $\lambda=0.5$, the penalty is $0.5(|10|+|2|)=6$. With L2 regularization and the same $\lambda=0.5$, the penalty is $0.5(10^2+2^2)=52$. Because the L2 penalty grows quadratically with coefficient magnitude, it shrinks large coefficients far more aggressively than L1; conversely, the constant slope of the L1 penalty keeps pushing small coefficients all the way to exactly zero, which is why Lasso produces sparse solutions.
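A quick check of this arithmetic in NumPy, reusing the coefficients and $\lambda$ from above:

import numpy as np

beta = np.array([10.0, 2.0])
lam = 0.5
l1_penalty = lam * np.sum(np.abs(beta))  # 0.5 * (10 + 2) = 6.0
l2_penalty = lam * np.sum(beta ** 2)     # 0.5 * (100 + 4) = 52.0
print(l1_penalty, l2_penalty)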

Code Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Generate synthetic data with many features, only a few being relevant
np.random.seed(42)
n_samples, n_features = 50, 100
X = np.random.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:5] = np.random.randn(5) # only first 5 features are relevant
y = X.dot(true_coef) + 0.5 * np.random.randn(n_samples)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Lasso (L1) regression
lasso = Lasso(alpha=0.1)  # alpha is the regularization strength (lambda above)
lasso.fit(X_train, y_train)

# Fit Ridge (L2) regression
ridge = Ridge(alpha=1.0)  # alpha plays the same role, but values are not directly comparable across models
ridge.fit(X_train, y_train)

# Evaluate both fits on the held-out test set
print(f"Lasso test R^2: {lasso.score(X_test, y_test):.3f}")
print(f"Ridge test R^2: {ridge.score(X_test, y_test):.3f}")

# Plot coefficients
plt.figure(figsize=(10, 6))

plt.plot(lasso.coef_, label='Lasso (L1) Coefficients', marker='o')
plt.plot(ridge.coef_, label='Ridge (L2) Coefficients', marker='x')
plt.plot(true_coef, label='True Coefficients', marker='*')

plt.xlabel('Coefficient Index')
plt.ylabel('Coefficient Value')
plt.title('Lasso vs Ridge Coefficients')
plt.legend()
plt.grid(True)

# Annotate the plot with counts of exactly-zero coefficients
# (Lasso typically zeroes out many; Ridge almost never produces exact zeros)
lasso_sparsity = np.sum(lasso.coef_ == 0)
ridge_sparsity = np.sum(ridge.coef_ == 0)
plt.text(0.6, 0.8, f'Lasso Sparsity (zero coefs): {lasso_sparsity}', transform=plt.gca().transAxes)
plt.text(0.6, 0.75, f'Ridge Sparsity (zero coefs): {ridge_sparsity}', transform=plt.gca().transAxes)

plt.show()
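In practice, alpha is tuned rather than fixed by hand. Below is a minimal sketch using scikit-learn's built-in cross-validation estimators, continuing from the X_train/X_test split above; the alpha grid here is an arbitrary choice for illustration, not a recommendation.

from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-3, 1, 30)
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)
print(f"Best Lasso alpha: {lasso_cv.alpha_:.4f}, test R^2: {lasso_cv.score(X_test, y_test):.3f}")
print(f"Best Ridge alpha: {ridge_cv.alpha_:.4f}, test R^2: {ridge_cv.score(X_test, y_test):.3f}")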

Related Concepts

Elastic Net (a weighted combination of the L1 and L2 penalties), the bias-variance tradeoff, feature selection, and cross-validation for choosing the regularization strength.