The sigmoid function is an activation function that maps any real-valued input to a value between 0 and 1 via sigmoid(z) = 1 / (1 + e^(-z)). Its primary advantages are that its output can be interpreted as a probability and that it introduces non-linearity into a model. However, it suffers from vanishing gradients, slow convergence during training, and non-zero-centered output, all of which can hinder optimization. Despite these limitations, it remains widely used in binary classification problems, particularly in the output layer, because of its probability interpretation.
In a model predicting whether a customer will click on an ad (click/no-click), the sigmoid function is used in the output layer: an output of 0.8 signifies an 80% predicted probability of a click. During backpropagation, however, when the output saturates near 0 or 1, the local gradient sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) approaches 0. This becomes a problem in deep networks because these small factors are multiplied through the layers via the chain rule, so the gradient vanishes and the earlier layers learn very slowly (see the sketch after the code below). In addition, because the output is always positive (between 0 and 1), the gradients of all weights feeding a given neuron share the same sign, which can produce a zig-zag pattern in weight updates during gradient descent.
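The snippet below implements the sigmoid with NumPy and plots both the function and its derivative. The derivative plot makes the vanishing-gradient problem visible: sigmoid'(z) peaks at only 0.25 (at z = 0) and falls toward 0 once |z| grows past roughly 4.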
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """
    Computes the sigmoid of z.

    Args:
        z: A scalar or numpy array of any size.

    Returns:
        Sigmoid of z, with the same shape as the input.
    """
    # Clip z to keep np.exp from overflowing for large negative inputs.
    # Note: np.exp on arrays does not raise OverflowError; it silently
    # returns inf, so clipping is the reliable way to stay numerically safe.
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))
# Generate data for plotting
z = np.linspace(-10, 10, 200)
sigmoid_z = sigmoid(z)
# Plotting the sigmoid function
plt.figure(figsize=(8, 6))
plt.plot(z, sigmoid_z)
plt.xlabel("z")
plt.ylabel("sigmoid(z)")
plt.title("Sigmoid Function")
plt.grid(True)
# Illustrating vanishing gradients
plt.figure(figsize=(8,6))
plt.plot(z, sigmoid(z)*(1-sigmoid(z))) # Plot the derivative of the sigmoid function
plt.xlabel("z")
plt.ylabel("sigmoid'(z)")
plt.title("Derivative of Sigmoid Function (Illustrates Vanishing Gradients)")
plt.grid(True)
plt.show()
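To make the multiplication-through-layers effect from the ad-click example concrete, the short sketch below chains the local sigmoid derivative across a few layers. It is a minimal illustration rather than full backpropagation: the pre-activation values are made up, the weight terms of the chain rule are ignored, and it reuses the sigmoid function defined above, so only the shrinking effect of the sigmoid derivative is visible.

# Minimal sketch: how sigmoid derivatives shrink the gradient layer by layer.
# The pre-activation values are hypothetical; in a real network they would
# come from the forward pass.
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)  # derivative of the sigmoid, at most 0.25

pre_activations = [3.0, -2.5, 4.0, -3.5, 2.0]  # one assumed value per hidden layer

grad = 1.0  # gradient arriving from the loss at the output
for layer, z_value in enumerate(reversed(pre_activations), start=1):
    grad *= sigmoid_grad(z_value)  # chain rule: multiply by the local derivative
    print(f"gradient factor after {layer} layer(s): {grad:.6f}")

Running this shows the gradient factor collapsing by several orders of magnitude after only five layers, which is exactly the slow learning in early layers described above.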