MSE (Mean Squared Error) is not ideal for logistic regression because composing the sigmoid with a squared-error loss produces a non-convex function of the model parameters, with potentially multiple local minima. This makes it difficult for optimization algorithms like gradient descent to find the optimal solution. Furthermore, when a prediction saturates near 0 or 1 with the opposite true label, the MSE gradient through the sigmoid nearly vanishes, so training stalls precisely where the model is most confidently wrong. Cross-entropy (log loss), which is convex in the parameters for logistic regression, is preferred: it simplifies optimization, allows gradient descent to converge to the global minimum, and penalizes confident wrong predictions heavily, keeping gradients informative across the probability range.
Consider predicting click-through rates (CTR) for ads. If the true label is 1 (click) and the model predicts 0.1, MSE penalizes this prediction, but the penalty is bounded; if the true label is 0 (no click) and the model predicts 0.9, the same holds. Worse, the MSE gradients in these saturated regions are tiny, and combined with the non-convex surface they can cause gradient descent to get stuck in suboptimal solutions. The same absolute error in predicted probability produces a very different gradient magnitude near 0 or 1 than near 0.5, which makes training unstable. Cross-entropy, by contrast, creates a convex loss landscape when used with logistic regression. Its gradients push the model toward confident, accurate predictions while providing a smooth trajectory for optimization algorithms to descend.
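As a quick numeric check of the CTR example above (a minimal sketch; the values follow directly from the two loss definitions):

import numpy as np

# Two confidently wrong predictions from the CTR example:
#   true label 1 (click),    predicted probability 0.1
#   true label 0 (no click), predicted probability 0.9
for y, p in [(1.0, 0.1), (0.0, 0.9)]:
    mse = (y - p) ** 2                               # bounded above by 1
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))  # unbounded as p nears the wrong extreme
    print(f"y={y:.0f}, p={p:.1f}: MSE={mse:.2f}, cross-entropy={ce:.2f}")

Both mistakes cost 0.81 under MSE but about 2.30 under cross-entropy, and the cross-entropy penalty grows without bound as the prediction approaches the wrong extreme.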
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """Sigmoid function."""
    return 1 / (1 + np.exp(-z))

def mse_loss(y_true, y_pred):
    """Mean Squared Error loss."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy_loss(y_true, y_pred):
    """Binary Cross-Entropy / Log Loss."""
    epsilon = 1e-15  # clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# Example probabilities and true labels
y_true = np.array([0, 1, 1, 0])
y_pred_prob = np.linspace(0.01, 0.99, 4)
mse_losses = mse_loss(y_true, y_pred_prob)
ce_losses = cross_entropy_loss(y_true, y_pred_prob)
print("MSE Loss:", mse_losses)
print("Cross-Entropy Loss:", ce_losses)
# Visualize MSE and Cross-Entropy loss for logistic regression as a
# function of the logit x when the true class is 1. Note: the loss
# functions above average over samples, so we compute pointwise losses
# here to obtain plottable curves.
x = np.linspace(-5, 5, 500)
p = sigmoid(x)
mse_curve = (1 - p) ** 2                  # per-point squared error for y_true = 1
ce_curve = -np.log(np.clip(p, 1e-15, 1))  # per-point log loss for y_true = 1

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x, mse_curve)
plt.title("MSE Loss with Sigmoid (y_true = 1)")
plt.xlabel("logit x")
plt.ylabel("Loss")
plt.subplot(1, 2, 2)
plt.plot(x, ce_curve)
plt.title("Cross-Entropy Loss with Sigmoid (y_true = 1)")
plt.xlabel("logit x")
plt.ylabel("Loss")
plt.tight_layout()
plt.show()
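To make the gradient argument concrete, here is a minimal sketch (assuming the same parameterization p = sigmoid(z) and a true label of 1) comparing dLoss/dz for the two losses. By the chain rule, the MSE gradient is 2 * (sigmoid(z) - 1) * sigmoid(z) * (1 - sigmoid(z)), which vanishes as the prediction saturates, while the cross-entropy gradient simplifies to sigmoid(z) - 1, which stays large whenever the prediction is badly wrong:

import numpy as np

def sigmoid(z):
    """Sigmoid function."""
    return 1 / (1 + np.exp(-z))

# Gradients of the per-example losses with respect to the logit z, for y_true = 1:
#   MSE:           d/dz (sigmoid(z) - 1)**2 = 2 * (sigmoid(z) - 1) * sigmoid(z) * (1 - sigmoid(z))
#   Cross-entropy: d/dz -log(sigmoid(z))    = sigmoid(z) - 1
for z in [-4.0, 0.0, 4.0]:
    p = sigmoid(z)
    grad_mse = 2 * (p - 1) * p * (1 - p)
    grad_ce = p - 1
    print(f"z={z:+.1f}, p={p:.3f}: dMSE/dz={grad_mse:+.4f}, dCE/dz={grad_ce:+.4f}")

At z = -4 the model is confidently wrong (p is about 0.018), yet the MSE gradient is roughly 30 times smaller than the cross-entropy gradient. This is why training with MSE stalls in exactly the regions where the parameters most need to move.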