MSE (Mean Squared Error) is not ideal for logistic regression because composing the sigmoid with a squared-error loss produces a non-convex function of the model parameters, with potentially multiple local minima. This makes it difficult for optimization algorithms like gradient descent to find the optimal solution. Furthermore, when a prediction saturates near 0 or 1 with the opposite true label, the MSE gradient through the sigmoid nearly vanishes, so training stalls precisely where the model is most confidently wrong. Cross-entropy (log loss), which is convex in the parameters for logistic regression, is preferred: it simplifies optimization, allows gradient descent to converge to the global minimum, and penalizes confident wrong predictions heavily, keeping gradients informative across the probability range.
Consider predicting click-through rates (CTR) for ads. If the true label is 1 (click) and the model predicts 0.1, MSE penalizes this prediction, but the penalty is bounded; if the true label is 0 (no click) and the model predicts 0.9, the same holds. Worse, the MSE gradients in these saturated regions are tiny, and combined with the non-convex surface they can cause gradient descent to get stuck in suboptimal solutions. The same absolute error in predicted probability produces a very different gradient magnitude near 0 or 1 than near 0.5, which makes training unstable. Cross-entropy, by contrast, creates a convex loss landscape when used with logistic regression. Its gradients push the model toward confident, accurate predictions while providing a smooth trajectory for optimization algorithms to descend.
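As a quick numeric check of the CTR example above (a minimal sketch; the values follow directly from the two loss definitions):

import numpy as np

# Two confidently wrong predictions from the CTR example:
#   true label 1 (click),    predicted probability 0.1
#   true label 0 (no click), predicted probability 0.9
for y, p in [(1.0, 0.1), (0.0, 0.9)]:
    mse = (y - p) ** 2                               # bounded above by 1
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # unbounded as p nears the wrong extreme
    print(f"y={y:.0f}, p={p:.1f}: MSE={mse:.2f}, cross-entropy={ce:.2f}")

Both mistakes cost 0.81 under MSE but about 2.30 under cross-entropy, and the cross-entropy penalty grows without bound as the prediction approaches the wrong extreme.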
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Sigmoid function."""
    return 1 / (1 + np.exp(-z))

def mse_loss(y_true, y_pred):
    """Mean Squared Error loss."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy_loss(y_true, y_pred):
    """Binary Cross-Entropy / Log Loss."""
    epsilon = 1e-15  # clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# Example probabilities and true labels
y_true = np.array([0, 1, 1, 0])
y_pred_prob = np.linspace(0.01, 0.99, 4)
mse_losses = mse_loss(y_true, y_pred_prob)
ce_losses = cross_entropy_loss(y_true, y_pred_prob)
print("MSE Loss:", mse_losses)
print("Cross-Entropy Loss:", ce_losses)
# Visualize MSE and Cross-Entropy loss for logistic regression as a
# function of the logit x when the true class is 1. Note: the loss
# functions above average over samples, so we compute pointwise losses
# here to obtain plottable curves.
x = np.linspace(-5, 5, 500)
p = sigmoid(x)
mse_curve = (1 - p) ** 2                  # per-point squared error for y_true = 1
ce_curve = -np.log(np.clip(p, 1e-15, 1))  # per-point log loss for y_true = 1

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x, mse_curve)
plt.title("MSE Loss with Sigmoid (y_true = 1)")
plt.xlabel("logit x")
plt.ylabel("Loss")
plt.subplot(1, 2, 2)
plt.plot(x, ce_curve)
plt.title("Cross-Entropy Loss with Sigmoid (y_true = 1)")
plt.xlabel("logit x")
plt.ylabel("Loss")
plt.tight_layout()
plt.show()
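To make the gradient argument concrete, here is a minimal sketch (assuming the same parameterization p = sigmoid(z) and a true label of 1) comparing dLoss/dz for the two losses. By the chain rule, the MSE gradient is 2 * (sigmoid(z) - 1) * sigmoid(z) * (1 - sigmoid(z)), which vanishes as the prediction saturates, while the cross-entropy gradient simplifies to sigmoid(z) - 1, which stays large whenever the prediction is badly wrong:

import numpy as np

def sigmoid(z):
    """Sigmoid function."""
    return 1 / (1 + np.exp(-z))

# Gradients of the per-example losses with respect to the logit z, for y_true = 1:
#   MSE:           d/dz (sigmoid(z) - 1)**2 = 2 * (sigmoid(z) - 1) * sigmoid(z) * (1 - sigmoid(z))
#   Cross-entropy: d/dz -log(sigmoid(z))    = sigmoid(z) - 1
for z in [-4.0, 0.0, 4.0]:
    p = sigmoid(z)
    grad_mse = 2 * (p - 1) * p * (1 - p)
    grad_ce = p - 1
    print(f"z={z:+.1f}, p={p:.3f}: dMSE/dz={grad_mse:+.4f}, dCE/dz={grad_ce:+.4f}")

At z = -4 the model is confidently wrong (p is about 0.018), yet the MSE gradient is roughly 30 times smaller than the cross-entropy gradient. This is why training with MSE stalls in exactly the regions where the parameters most need to move.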