Study Card: Stochastic Gradient Descent (SGD)

Title

Stochastic Gradient Descent (SGD)

Direct Answer

Stochastic Gradient Descent (SGD) is an iterative optimization algorithm used to find the minimum of a function, most commonly the loss function in machine learning. Unlike batch Gradient Descent, which uses the entire dataset to compute the gradient at each iteration, SGD uses a single data point (or a small mini-batch) to estimate the gradient, making it much faster for large datasets. Because each update is based on only one or a few data points, SGD introduces noise into the optimization and therefore requires careful tuning of the learning rate. Key points: this "stochasticity" can help the optimizer escape local minima, and SGD is widely used for training many kinds of machine learning models.
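
The core update is the same in both cases; only the number of samples used to estimate the gradient differs. As a minimal sketch (the helper names gd_step and sgd_step are hypothetical, for illustration only), one full-batch step and one single-sample step for linear regression with squared error look like this:

import numpy as np

def gd_step(X, y, theta, lr):
    # Full-batch gradient: averaged over all samples, so each step is expensive but exact.
    m = len(y)
    gradient = X.T.dot(X.dot(theta) - y) / m
    return theta - lr * gradient

def sgd_step(x_i, y_i, theta, lr):
    # Stochastic gradient: estimated from a single sample (x_i, y_i), so each step is cheap but noisy.
    gradient = x_i * (x_i.dot(theta) - y_i)
    return theta - lr * gradient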

Key Terms

Gradient: the direction of steepest increase of the loss function; parameters are updated in the opposite direction.
Learning Rate: the step size applied to each gradient update.
Mini-batch: a small subset of the training data used to estimate the gradient for one update.
Epoch: one full pass through the training data.
Stochasticity: the randomness introduced by estimating the gradient from a subset of the data.

Example

Imagine training a linear regression model to predict house prices. Gradient Descent would calculate the error over all houses in the training dataset and adjust the model parameters accordingly in each iteration. SGD, on the other hand, would pick a single house, calculate its prediction error, and update the model parameters based on that single error, repeating with another randomly selected house in the next iteration. If the learning rate is too large, updates can overshoot and the model may fail to converge to good weights; if it is too small, convergence takes many iterations. With a large dataset, say 1 million data points, batch gradient descent must process every point before making a single update, so each iteration is slow. SGD updates far more frequently, but each update is noisy because it is based on only one point. Mini-batch SGD balances the two by computing each gradient from a small batch, for example 128 points.
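
The noise/efficiency trade-off can be made concrete with a small numeric check. This is a hypothetical illustration (not part of the house-price example): it draws many mini-batches of different sizes from a synthetic regression problem and measures how much the averaged gradient estimate fluctuates; the spread shrinks roughly as one over the square root of the batch size.

import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x = rng.normal(size=m)
y = 2.0 * x + rng.normal(scale=0.5, size=m)    # true slope 2, plus observation noise
w = 0.0                                        # current parameter guess
per_sample_grads = (w * x - y) * x             # gradient of 0.5 * (w*x - y)**2 for every sample

for batch_size in (1, 8, 128, 1024):
    batches = rng.choice(per_sample_grads, size=(1000, batch_size))  # 1000 random mini-batches
    estimates = batches.mean(axis=1)                                 # one gradient estimate per mini-batch
    print(f"batch_size={batch_size:>4}: std of gradient estimate = {estimates.std():.3f}")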

Code Implementation

import numpy as np

def stochastic_gradient_descent(X, y, theta, learning_rate=0.01, n_epochs=100, batch_size=1):
    """
    Stochastic Gradient Descent for Linear Regression.

    Args:
        X: Input features (NumPy array).
        y: Target values (NumPy array).
        theta: Model parameters (NumPy array).
        learning_rate: Step size for updates.
        n_epochs: Number of passes through the data.
        batch_size: Number of data points to use in each update.
    """
    m = len(y)  # Number of samples
    n_batches = m // batch_size  # Mini-batches needed to cover the dataset once (a final partial batch is dropped)
    for epoch in range(n_epochs):
        indices = np.random.permutation(m)  # Shuffle so each epoch visits the samples in a new random order
        for i in range(n_batches):
            batch_idx = indices[i * batch_size:(i + 1) * batch_size]
            X_batch = X[batch_idx, :]  # features for this mini-batch
            y_batch = y[batch_idx]     # targets for this mini-batch
            predictions = X_batch.dot(theta)        # predict with the current parameters
            errors = predictions - y_batch          # prediction error
            gradient = X_batch.T.dot(errors) / batch_size  # gradient averaged over the batch; larger batches give a less noisy estimate
            theta = theta - learning_rate * gradient       # noisy ("stochastic") parameter update
        if epoch % 10 == 0:
            loss = np.mean((X.dot(theta) - y) ** 2)  # mean squared error over the full dataset
            print(f"Epoch: {epoch}, Loss: {loss:.4f}")
    return theta

# Sample Data (replace with your own dataset)
X = np.array([[1, 2], [2, 3], [3, 1], [4, 3], [5, 3], [6, 2]])
y = np.array([3, 5, 3, 7, 8, 7])
theta = np.array([0.0, 0.0])  # initialize the parameters

theta = stochastic_gradient_descent(X, y, theta, batch_size=2)  # mini-batches of 2 samples; try other batch sizes or learning rates to see the effect
print(f"Learned parameters: {theta}")

Related Concepts