Study Card: Stochastic Gradient Descent (SGD)

Title

Stochastic Gradient Descent (SGD)

Direct Answer

Stochastic Gradient Descent (SGD) is an iterative optimization algorithm used to find the minimum of a function, most commonly the loss function in machine learning. Unlike batch Gradient Descent, which uses the entire dataset to compute the gradient at each iteration, SGD uses a single data point (or a small mini-batch) to estimate the gradient, making it much faster for large datasets. Because each update is based on only one or a few data points, SGD introduces noise into the optimization and therefore requires careful tuning of the learning rate. Key points: this "stochasticity" can help the optimizer escape local minima, and SGD is widely used for training many kinds of machine learning models.
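
The core update is the same in both cases; only the number of samples used to estimate the gradient differs. As a minimal sketch (the helper names gd_step and sgd_step are hypothetical, for illustration only), one full-batch step and one single-sample step for linear regression with squared error look like this:

import numpy as np

def gd_step(X, y, theta, lr):
    # Full-batch gradient: averaged over all samples, so each step is expensive but exact.
    m = len(y)
    gradient = X.T.dot(X.dot(theta) - y) / m
    return theta - lr * gradient

def sgd_step(x_i, y_i, theta, lr):
    # Stochastic gradient: estimated from a single sample (x_i, y_i), so each step is cheap but noisy.
    gradient = x_i * (x_i.dot(theta) - y_i)
    return theta - lr * gradient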

Key Terms

Gradient: the direction of steepest increase of the loss function; parameters are updated in the opposite direction.
Learning Rate: the step size applied to each gradient update.
Mini-batch: a small subset of the training data used to estimate the gradient for one update.
Epoch: one full pass through the training data.
Stochasticity: the randomness introduced by estimating the gradient from a subset of the data.

Example

Imagine training a linear regression model to predict house prices. Gradient Descent would calculate the error over all houses in the training dataset and adjust the model parameters accordingly in each iteration. SGD, on the other hand, would pick a single house, calculate its prediction error, and update the model parameters based on that single error, repeating with another randomly selected house in the next iteration. If the learning rate is too large, updates can overshoot and the model may fail to converge to good weights; if it is too small, convergence takes many iterations. With a large dataset, say 1 million data points, batch gradient descent must process every point before making a single update, so each iteration is slow. SGD updates far more frequently, but each update is noisy because it is based on only one point. Mini-batch SGD balances the two by computing each gradient from a small batch, for example 128 points.
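
The noise/efficiency trade-off can be made concrete with a small numeric check. This is a hypothetical illustration (not part of the house-price example): it draws many mini-batches of different sizes from a synthetic regression problem and measures how much the averaged gradient estimate fluctuates; the spread shrinks roughly as one over the square root of the batch size.

import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x = rng.normal(size=m)
y = 2.0 * x + rng.normal(scale=0.5, size=m)    # true slope 2, plus observation noise
w = 0.0                                        # current parameter guess
per_sample_grads = (w * x - y) * x             # gradient of 0.5 * (w*x - y)**2 for every sample

for batch_size in (1, 8, 128, 1024):
    batches = rng.choice(per_sample_grads, size=(1000, batch_size))  # 1000 random mini-batches
    estimates = batches.mean(axis=1)                                 # one gradient estimate per mini-batch
    print(f"batch_size={batch_size:>4}: std of gradient estimate = {estimates.std():.3f}")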

Code Implementation

import numpy as np

def stochastic_gradient_descent(X, y, theta, learning_rate=0.01, n_epochs=100, batch_size=1):
    """
    Stochastic Gradient Descent for Linear Regression.

    Args:
        X: Input features (NumPy array).
        y: Target values (NumPy array).
        theta: Model parameters (NumPy array).
        learning_rate: Step size for updates.
        n_epochs: Number of passes through the data.
        batch_size: Number of data points to use in each update.
    """
    m = len(y)  # Number of samples
    n_batches = m // batch_size  # Mini-batches needed to cover the dataset once (a final partial batch is dropped)
    for epoch in range(n_epochs):
        indices = np.random.permutation(m)  # Shuffle so each epoch visits the samples in a new random order
        for i in range(n_batches):
            batch_idx = indices[i * batch_size:(i + 1) * batch_size]
            X_batch = X[batch_idx, :]  # features for this mini-batch
            y_batch = y[batch_idx]     # targets for this mini-batch
            predictions = X_batch.dot(theta)        # predict with the current parameters
            errors = predictions - y_batch          # prediction error
            gradient = X_batch.T.dot(errors) / batch_size  # gradient averaged over the batch; larger batches give a less noisy estimate
            theta = theta - learning_rate * gradient       # noisy ("stochastic") parameter update
        if epoch % 10 == 0:
            loss = np.mean((X.dot(theta) - y) ** 2)  # mean squared error over the full dataset
            print(f"Epoch: {epoch}, Loss: {loss:.4f}")
    return theta

# Sample Data (replace with your own dataset)
X = np.array([[1, 2], [2, 3], [3, 1], [4, 3], [5, 3], [6, 2]])
y = np.array([3, 5, 3, 7, 8, 7])
theta = np.array([0.0, 0.0])  # initialize the parameters

theta = stochastic_gradient_descent(X, y, theta, batch_size=2)  # mini-batches of 2 samples; try other batch sizes or learning rates to see the effect
print(f"Learned parameters: {theta}")

Related Concepts