Study Card: How to choose the number of features in a random forest?

Direct Answer

The number of features considered at each split in a Random Forest, exposed as the max_features hyperparameter, controls the randomness and diversity of the trees and therefore the bias-variance tradeoff: smaller values decorrelate the trees (reducing variance) at the cost of weaker individual trees (higher bias), while larger values do the reverse. Common rules of thumb are the square root of the total number of features for classification and one third for regression, but the optimal value is dataset-dependent. Careful tuning, typically via cross-validation or out-of-bag (OOB) scores, is needed to find the best max_features for a specific problem.
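
As a quick sanity check, the two rules of thumb can be computed directly. The feature count here is illustrative (it matches the wine dataset used later):

```python
import math

n_features = 13  # illustrative; the wine dataset below has 13 features

sqrt_rule = max(1, int(math.sqrt(n_features)))  # classification rule of thumb
third_rule = max(1, n_features // 3)            # regression rule of thumb
print(sqrt_rule, third_rule)  # -> 3 4
```

These are only starting points for a search, not final answers.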

Key Terms

max_features: the number (or fraction) of candidate features each node considers when searching for the best split.
Out-of-bag (OOB) score: performance estimated on the bootstrap samples a given tree never saw during training; a built-in alternative to cross-validation.
Bias-variance tradeoff: smaller max_features yields more diverse, less correlated trees (lower variance) but weaker individual trees (higher bias).

Example

Consider predicting customer churn from features such as age, contract length, monthly bill, and data usage.

Code Implementation

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the dataset (replace with your own)
from sklearn.datasets import load_wine
data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Use GridSearchCV for hyperparameter tuning including max_features
param_grid = {
    'n_estimators': [50, 100, 200],  # widen or narrow this range for your data via grid or random search
    'max_features': ['sqrt', 'log2', None, 0.33, 0.5, 0.75],  # strings, None (all features), or fractions
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
rf = RandomForestClassifier(random_state=0)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

rf_optimized = RandomForestClassifier(**best_params, random_state=0)
rf_optimized.fit(X_train, y_train)
y_pred = rf_optimized.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Best parameters: {best_params}")
print(f"Test accuracy: {test_accuracy}")

# Alternatively, evaluate using OOB scores: faster than K-fold cross-validation
# (each forest is fit only once) but a noisier estimate.

oob_scores = []
# Sweep max_features from 1 to all features; for high-dimensional data,
# start around sqrt(n_features) or n_features / 3 and narrow the range later.
for max_features in range(1, X.shape[1] + 1):
    rf = RandomForestClassifier(n_estimators=100, oob_score=True, max_features=max_features, random_state=42)
    rf.fit(X_train, y_train)
    oob_scores.append(rf.oob_score_)

# Plot oob_scores and pick the max_features where the curve plateaus. For more
# robustness, confirm the winner with k-fold cross-validation, or use this sweep
# to narrow the range handed to GridSearchCV or RandomizedSearchCV.
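
The "pick where it plateaus" rule can also be applied without a plot. A sketch, using hypothetical OOB scores in place of the list built above:

```python
# Hypothetical OOB scores, one per candidate max_features value (1, 2, 3, ...)
oob_scores = [0.90, 0.94, 0.96, 0.965, 0.963, 0.962]

# Approximate the plateau: the smallest max_features whose OOB score
# is within a small tolerance of the best score seen.
tol = 0.005
best_score = max(oob_scores)
best_max_features = min(i + 1 for i, s in enumerate(oob_scores)
                        if s >= best_score - tol)
print(best_max_features)  # -> 3: near-best accuracy with fewer features per split
```

The tolerance encodes a preference for smaller max_features when scores are effectively tied, which keeps the trees more diverse.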

Related Concepts

Bias-variance tradeoff; cross-validation; out-of-bag (OOB) error; hyperparameter tuning with GridSearchCV and randomized search.