Study Card: Vanishing Gradient Problem

Direct Answer

The vanishing gradient problem occurs when training deep neural networks with gradient-based optimization. As gradients are backpropagated through the network's layers, they can shrink exponentially, becoming extremely small by the time they reach the earliest layers. This makes it difficult to update the weights of those layers, effectively stalling their learning. The problem is especially severe with saturating activation functions like sigmoid and tanh, whose derivatives are close to zero in their saturated regions. Non-saturating activations like ReLU, proper weight initialization, and alternative architectures such as residual connections can mitigate this issue.
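A quick illustrative sketch (not part of the original card): the exponential shrinkage follows from the fact that the sigmoid derivative never exceeds 0.25, so a gradient passing through n sigmoid layers is scaled by at most 0.25 raised to the n-th power.

```python
# Upper bound on how much a gradient can survive n sigmoid layers:
# sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) peaks at 0.25 (at x = 0).
MAX_SIGMOID_DERIVATIVE = 0.25

for n in [1, 5, 10, 20]:
    bound = MAX_SIGMOID_DERIVATIVE ** n
    print(f"{n:2d} layers: gradient scaled by at most {bound:.2e}")
```

Even in the best case, 10 layers shrink the gradient by a factor of about a million, and 20 layers by about a trillion.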

Key Terms

Backpropagation: the algorithm that computes gradients of the loss with respect to each weight by applying the chain rule layer by layer.
Saturation: the flat regions of an activation function (e.g., sigmoid outputs near 0 or 1) where the derivative is close to zero.
ReLU: the rectified linear unit, max(0, x), whose derivative is 1 for all positive inputs and therefore does not shrink gradients there.

Example

Imagine a deep neural network with many layers that use sigmoid activations. During backpropagation, the gradient of the loss with respect to each weight is computed via the chain rule, multiplying per-layer derivatives together. The sigmoid derivative is at most 0.25, and when a neuron's output is close to 0 or 1 (the saturated regions) it is nearly zero, so the product shrinks rapidly as it propagates back toward the earlier layers. Those layers' weights then update very slowly or not at all, effectively stopping them from learning. In a deep CNN for image classification, for example, the initial layers that extract general features like edges would not be trained effectively.
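To make the saturation point above concrete, here is a small sketch evaluating the sigmoid derivative at a few inputs: near zero it reaches its maximum of 0.25, and for large |x| (the saturated regions) it collapses toward zero.

```python
import numpy as np

def sigmoid_derivative(x):
    """Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# The derivative is largest at x = 0 and nearly zero once the neuron saturates.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_derivative(x):.6f}")
```

A neuron sitting at x = 10 contributes a factor of roughly 0.000045 to the chain-rule product, which is what starves earlier layers of gradient signal.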

Code Implementation

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1 - s)

# Demonstrate vanishing gradient with sigmoid
x = np.linspace(-5, 5, 200)
y = sigmoid_derivative(x)  # derivative of sigmoid
plt.plot(x, y)
plt.title("Derivative of Sigmoid Function (Illustrates Vanishing Gradient)")
plt.xlabel("x")
plt.ylabel("Derivative")
plt.grid(True)
plt.show()

# Simulate backpropagation through multiple layers with sigmoid
num_layers = 10
gradients = [1.0] # Initial gradient

for i in range(num_layers):
    x = np.random.rand()  # Random pre-activation in [0, 1)
    # Chain rule: each layer multiplies the gradient by a sigmoid
    # derivative, which is at most 0.25, so the product shrinks fast.
    gradient = gradients[-1] * sigmoid_derivative(x)
    gradients.append(gradient)

plt.plot(range(num_layers + 1), gradients)
plt.title("Gradient Magnitude over Layers (Sigmoid)")
plt.xlabel("Layer")
plt.ylabel("Gradient Magnitude")
plt.yscale('log') # Plot on log scale
plt.grid(True)
plt.show()
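For contrast, a hedged counterpart to the simulation above (an illustrative sketch, not a full network): if the layers instead used ReLU, whose derivative is exactly 1 for any positive input, the same chain-rule product would not shrink at all.

```python
import numpy as np

def relu_derivative(x):
    """Derivative of ReLU: 1 for x > 0, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

num_layers = 10
gradients = [1.0]  # Initial gradient
rng = np.random.default_rng(0)

for _ in range(num_layers):
    x = rng.uniform(0.1, 1.0)  # Positive pre-activation, so ReLU' = 1
    gradients.append(gradients[-1] * relu_derivative(x))

print(gradients[-1])  # Stays at 1.0: the gradient does not vanish
```

The flip side, not simulated here, is that neurons with negative pre-activations pass zero gradient (the "dying ReLU" problem), which is why initialization and architecture still matter.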

Related Concepts

Exploding gradients, ReLU and other non-saturating activations, weight initialization schemes (e.g., Xavier/Glorot, He), batch normalization, residual connections (ResNets), and gated recurrent architectures such as LSTM.