Study Card: How does Bagging work?

Direct Answer

Bagging, or Bootstrap Aggregating, is an ensemble learning technique that improves the stability and accuracy of machine learning algorithms. It works by training multiple instances of the same algorithm on different bootstrap samples of the training data, each drawn by random sampling with replacement and typically the same size as the original set. The final prediction is then obtained by averaging the individual models' predictions (for regression) or taking a majority vote (for classification). The key benefit is reduced variance and less overfitting, which yields more robust models, especially for high-variance learners such as decision trees.
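
To make the bootstrapping step concrete, here is a minimal NumPy-only sketch (the dataset size of 10 rows is a hypothetical stand-in): drawing indices with replacement means some rows appear more than once in a sample while others are left out entirely.

import numpy as np

rng = np.random.default_rng(42)
n_rows = 10  # size of a hypothetical training set

# One bootstrap sample: n_rows indices drawn WITH replacement,
# so some rows repeat and others are omitted from this sample.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

print("bootstrap indices:", sorted(bootstrap_idx.tolist()))
print("unique rows used :", len(set(bootstrap_idx.tolist())), "of", n_rows)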

Key Terms

- Bootstrap sample: a random sample of the training data drawn with replacement, typically the same size as the original set.
- Base estimator: the individual model (e.g., a decision tree) trained on each bootstrap sample.
- Aggregation: combining the individual models' predictions by majority vote (classification) or averaging (regression).
- Variance: a model's sensitivity to fluctuations in the training data; bagging reduces it by combining many models.
- Ensemble: the collection of trained models whose combined output is the final prediction.

Example

Imagine training a decision tree to predict customer churn. Instead of training one tree on the entire dataset, bagging creates, say, 100 bootstrap samples (random samples drawn with replacement) and trains a separate decision tree on each. For a new customer, each tree predicts whether they will churn. If 70 trees predict "churn" and 30 predict "no churn", the final bagged prediction is "churn" (majority vote). For a regression problem such as predicting customer lifetime value, we would instead average the 100 trees' predicted values to get the final LTV estimate.
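
As a small sketch of the aggregation step only, assuming the hypothetical numbers above (70 of 100 trees voting "churn") and simulated per-tree LTV values, the vote and the average could be computed like this:

import numpy as np

# Classification: 100 hypothetical tree votes -- 70 "churn" (1), 30 "no churn" (0)
votes = np.array([1] * 70 + [0] * 30)
final_class = np.bincount(votes).argmax()  # majority vote
print("bagged prediction:", "churn" if final_class == 1 else "no churn")

# Regression: average 100 hypothetical per-tree LTV predictions (simulated, not real data)
rng = np.random.default_rng(0)
ltv_preds = rng.normal(loc=500.0, scale=50.0, size=100)
print(f"bagged LTV estimate: {ltv_preds.mean():.2f}")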

Code Implementation

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Tiny illustrative dataset (10 points, 2 features)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10],
              [1,3],[2,4],[6,7],[8,9],[3,6]])
y = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 0])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Base Estimator (Decision Tree)
base_estimator = DecisionTreeClassifier()

# Bagging Classifier: 10 trees, each trained on its own bootstrap sample
# (the `estimator` parameter name requires scikit-learn >= 1.2; older versions use `base_estimator`)
bagging_model = BaggingClassifier(estimator=base_estimator, n_estimators=10, random_state=42)

# Fit the model
bagging_model.fit(X_train, y_train)

# Predictions
y_pred = bagging_model.predict(X_test)

# Evaluate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Model Accuracy: {accuracy}")

# For comparison: manually bootstrap and train individual trees,
# so their accuracies can be contrasted with the bagged ensemble above
np.random.seed(42)  # for reproducible bootstrap samples
for i in range(10):
    tree = DecisionTreeClassifier(random_state=42)
    # Draw one bootstrap sample: len(X_train) indices sampled with replacement
    indices = np.random.choice(len(X_train), len(X_train), replace=True)
    X_sample, y_sample = X_train[indices], y_train[indices]
    tree.fit(X_sample, y_sample)
    tree_pred = tree.predict(X_test)
    tree_accuracy = accuracy_score(y_test, tree_pred)
    print(f"Decision Tree {i+1} Accuracy: {tree_accuracy}")

Related Concepts