Study Card: How does Bagging work?

Direct Answer

Bagging, or Bootstrap Aggregating, is an ensemble learning technique that improves the stability and accuracy of machine learning algorithms. It works by training multiple instances of the same algorithm on different bootstrap samples of the training data, each drawn by random sampling with replacement and typically the same size as the original set. The final prediction is then obtained by averaging the individual models' predictions (for regression) or taking a majority vote (for classification). The key benefit is reduced variance and less overfitting, which yields more robust models, especially for high-variance learners such as decision trees.
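
To make the bootstrapping step concrete, here is a minimal NumPy-only sketch (the dataset size of 10 rows is a hypothetical stand-in): drawing indices with replacement means some rows appear more than once in a sample while others are left out entirely.

import numpy as np

rng = np.random.default_rng(42)
n_rows = 10  # size of a hypothetical training set

# One bootstrap sample: n_rows indices drawn WITH replacement,
# so some rows repeat and others are omitted from this sample.
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

print("bootstrap indices:", sorted(bootstrap_idx.tolist()))
print("unique rows used :", len(set(bootstrap_idx.tolist())), "of", n_rows)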

Key Terms

- Bootstrap sample: a random sample of the training data drawn with replacement, typically the same size as the original set.
- Base estimator: the individual model (e.g., a decision tree) trained on each bootstrap sample.
- Aggregation: combining the individual models' predictions by majority vote (classification) or averaging (regression).
- Variance: a model's sensitivity to fluctuations in the training data; bagging reduces it by combining many models.
- Ensemble: the collection of trained models whose combined output is the final prediction.

Example

Imagine training a decision tree to predict customer churn. Instead of training one tree on the entire dataset, bagging creates, say, 100 bootstrap samples (random samples drawn with replacement) and trains a separate decision tree on each. For a new customer, each tree predicts whether they will churn. If 70 trees predict "churn" and 30 predict "no churn", the final bagged prediction is "churn" (majority vote). For a regression problem such as predicting customer lifetime value, we would instead average the 100 trees' predicted values to get the final LTV estimate.
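
As a small sketch of the aggregation step only, assuming the hypothetical numbers above (70 of 100 trees voting "churn") and simulated per-tree LTV values, the vote and the average could be computed like this:

import numpy as np

# Classification: 100 hypothetical tree votes -- 70 "churn" (1), 30 "no churn" (0)
votes = np.array([1] * 70 + [0] * 30)
final_class = np.bincount(votes).argmax()  # majority vote
print("bagged prediction:", "churn" if final_class == 1 else "no churn")

# Regression: average 100 hypothetical per-tree LTV predictions (simulated, not real data)
rng = np.random.default_rng(0)
ltv_preds = rng.normal(loc=500.0, scale=50.0, size=100)
print(f"bagged LTV estimate: {ltv_preds.mean():.2f}")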

Code Implementation

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Tiny illustrative dataset (10 points, 2 features)
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10],
              [1,3],[2,4],[6,7],[8,9],[3,6]])
y = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 0])

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Base Estimator (Decision Tree)
base_estimator = DecisionTreeClassifier()

# Bagging Classifier: 10 trees, each trained on its own bootstrap sample
# (the `estimator` parameter name requires scikit-learn >= 1.2; older versions use `base_estimator`)
bagging_model = BaggingClassifier(estimator=base_estimator, n_estimators=10, random_state=42)

# Fit the model
bagging_model.fit(X_train, y_train)

# Predictions
y_pred = bagging_model.predict(X_test)

# Evaluate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Bagging Model Accuracy: {accuracy}")

# For comparison: manually bootstrap and train individual trees,
# so their accuracies can be contrasted with the bagged ensemble above
np.random.seed(42)  # for reproducible bootstrap samples
for i in range(10):
    tree = DecisionTreeClassifier(random_state=42)
    # Draw one bootstrap sample: len(X_train) indices sampled with replacement
    indices = np.random.choice(len(X_train), len(X_train), replace=True)
    X_sample, y_sample = X_train[indices], y_train[indices]
    tree.fit(X_sample, y_sample)
    tree_pred = tree.predict(X_test)
    tree_accuracy = accuracy_score(y_test, tree_pred)
    print(f"Decision Tree {i+1} Accuracy: {tree_accuracy}")

Related Concepts