Logistic regression uses binary cross-entropy (log loss) as its loss function. This loss measures the dissimilarity between the predicted probabilities and the actual binary labels (0 or 1). Minimizing the log loss pushes the model to assign high probability to the correct class and low probability to the incorrect class, so it learns the relationship between the input features and the target variable. Log loss is chosen because it is convex in the model parameters for logistic regression, so optimization algorithms like gradient descent can reliably find the global minimum, which makes training efficient.
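For a single example with true label y in {0, 1} and predicted probability p, the log loss is

    loss(y, p) = -(y * log(p) + (1 - y) * log(1 - p))

and the training objective is the mean of this quantity over all training examples.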
Consider a model predicting whether a customer will click on an ad (1 for click, 0 for no click). For a given customer, if the true label is 1 (click) and the predicted probability of a click is 0.9, the log loss will be low. If the predicted probability is 0.1, the log loss will be high. Minimizing the log loss across all training examples encourages the model to make accurate probability predictions. For instance, if the model predicts probabilities close to 0.5 for all instances, the log loss will be relatively high because the model isn't confident in its predictions.
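To make the ad example concrete with y = 1: a predicted probability of 0.9 gives -log(0.9) ≈ 0.105, a prediction of 0.1 gives -log(0.1) ≈ 2.303, and an unconfident prediction of 0.5 gives -log(0.5) ≈ 0.693, so confident correct predictions incur almost no loss while confident wrong ones are penalized heavily.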
import numpy as np
import matplotlib.pyplot as plt
def sigmoid(z):
    """Sigmoid function."""
    return 1 / (1 + np.exp(-z))
def log_loss(y_true, y_pred):
    """
    Calculates the binary cross-entropy/log loss.
    Handles extreme probability values (0 or 1) using epsilon to avoid numerical instability.
    Args:
      y_true: True binary labels (0 or 1).
      y_pred: Predicted probabilities.
    Returns:
      The log loss.
    """
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon) # clip predicted probability
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss
# Example usage
y_true = np.array([0, 1, 1, 0, 1])
y_pred_prob = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
loss = log_loss(y_true, y_pred_prob)
print(f"Log Loss: {loss}")
# Visualizing log loss across predicted probabilities when y_true = 1
y_pred_prob_example = np.linspace(0.001, 0.999, 100)
loss_curve = -np.log(y_pred_prob_example)  # per-example loss when the true label is 1
plt.figure(figsize=(8, 6))
plt.plot(y_pred_prob_example, loss_curve)
plt.xlabel("Predicted Probability (y_pred) with y_true=1")
plt.ylabel("Log Loss")
plt.title("Log Loss vs Predicted Probability (y_true=1)")
plt.grid(True)

# Visualizing log loss across predicted probabilities when y_true = 0
loss_curve = -np.log(1 - y_pred_prob_example)  # per-example loss when the true label is 0
plt.figure(figsize=(8, 6))
plt.plot(y_pred_prob_example, loss_curve)
plt.xlabel("Predicted Probability (y_pred) with y_true=0")
plt.ylabel("Log Loss")
plt.title("Log Loss vs Predicted Probability (y_true=0)")
plt.grid(True)
plt.show()
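To tie this back to training: because the log loss is convex in the weights, plain gradient descent converges to the global minimum. Below is a minimal sketch that reuses the sigmoid and log_loss functions defined above to fit logistic regression weights; the synthetic data, learning rate, and iteration count are illustrative assumptions rather than tuned choices.

# Gradient-descent sketch: fit logistic regression weights by minimizing the log loss above.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 2
X = rng.normal(size=(n_samples, n_features))
true_w = np.array([2.0, -1.0])                                   # "ground truth" weights for the synthetic data
y = (sigmoid(X @ true_w) > rng.uniform(size=n_samples)).astype(float)  # labels drawn with Bernoulli noise

w = np.zeros(n_features)   # start from zero weights
learning_rate = 0.1        # illustrative choice
for step in range(500):
    p = sigmoid(X @ w)                      # predicted probabilities
    gradient = X.T @ (p - y) / n_samples    # gradient of the mean log loss with respect to w
    w -= learning_rate * gradient
    if step % 100 == 0:
        print(f"step {step}: log loss = {log_loss(y, p):.4f}")

print(f"Learned weights: {w}")

The printed log loss should decrease over the iterations, and the learned weights should move toward the synthetic ground-truth weights, which is exactly the behavior the convexity argument above predicts.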