Study Card: How to choose the number of features in a random forest?

Direct Answer

The number of features considered at each split in a Random Forest, exposed as the max_features hyperparameter, controls the randomness and diversity of the trees and therefore the bias-variance tradeoff: smaller values decorrelate the trees (reducing variance) at the cost of weaker individual trees (higher bias), while larger values do the reverse. Common rules of thumb are the square root of the total number of features for classification and one third for regression, but the optimal value is dataset-dependent. Careful tuning, typically via cross-validation or out-of-bag (OOB) scores, is needed to find the best max_features for a specific problem.
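
As a quick sanity check, the two rules of thumb can be computed directly. The feature count here is illustrative (it matches the wine dataset used later):

```python
import math

n_features = 13  # illustrative; the wine dataset below has 13 features

sqrt_rule = max(1, int(math.sqrt(n_features)))  # classification rule of thumb
third_rule = max(1, n_features // 3)            # regression rule of thumb
print(sqrt_rule, third_rule)  # -> 3 4
```

These are only starting points for a search, not final answers.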

Key Terms

max_features: the number (or fraction) of candidate features each node considers when searching for the best split.
Out-of-bag (OOB) score: performance estimated on the bootstrap samples a given tree never saw during training; a built-in alternative to cross-validation.
Bias-variance tradeoff: smaller max_features yields more diverse, less correlated trees (lower variance) but weaker individual trees (higher bias).

Example

Consider predicting customer churn from features such as age, contract length, monthly bill, and data usage.

Code Implementation

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset (replace with your own)
from sklearn.datasets import load_wine
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use GridSearchCV for hyperparameter tuning including max_features
param_grid = {
    'n_estimators': [50, 100, 200],  # widen or narrow this range for your data via grid or random search
    'max_features': ['sqrt', 'log2', None, 0.33, 0.5, 0.75],  # strings, None (all features), or fractions
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
rf = RandomForestClassifier(random_state=0)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

rf_optimized = RandomForestClassifier(**best_params, random_state=0)
rf_optimized.fit(X_train, y_train)
y_pred = rf_optimized.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Best parameters: {best_params}")
print(f"Test accuracy: {test_accuracy}")

# Alternatively, evaluate using OOB scores: faster than K-fold cross-validation
# (each forest is fit only once) but a noisier estimate.

oob_scores = []
# Sweep max_features from 1 to all features; for high-dimensional data,
# start around sqrt(n_features) or n_features / 3 and narrow the range later.
for max_features in range(1, X.shape[1] + 1):
    rf = RandomForestClassifier(n_estimators=100, oob_score=True, max_features=max_features, random_state=42)
    rf.fit(X_train, y_train)
    oob_scores.append(rf.oob_score_)

# Plot oob_scores and pick the max_features where the curve plateaus. For more
# robustness, confirm the winner with k-fold cross-validation, or use this sweep
# to narrow the range handed to GridSearchCV or RandomizedSearchCV.
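
The "pick where it plateaus" rule can also be applied without a plot. A sketch, using hypothetical OOB scores in place of the list built above:

```python
# Hypothetical OOB scores, one per candidate max_features value (1, 2, 3, ...)
oob_scores = [0.90, 0.94, 0.96, 0.965, 0.963, 0.962]

# Approximate the plateau: the smallest max_features whose OOB score
# is within a small tolerance of the best score seen.
tol = 0.005
best_score = max(oob_scores)
best_max_features = min(i + 1 for i, s in enumerate(oob_scores)
                        if s >= best_score - tol)
print(best_max_features)  # -> 3: near-best accuracy with fewer features per split
```

The tolerance encodes a preference for smaller max_features when scores are effectively tied, which keeps the trees more diverse.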

Related Concepts

Bias-variance tradeoff; cross-validation; out-of-bag (OOB) error; hyperparameter tuning with GridSearchCV and randomized search.