Study Card: Choosing k in k-Nearest Neighbors (k-NN)

Direct Answer

The value of k controls the bias-variance tradeoff in k-NN. A small k (e.g., k=1) gives a flexible, low-bias model that can overfit and is sensitive to noise and outliers; a large k smooths the decision boundary but can underfit, in the extreme always predicting the majority class. In practice, select k empirically with cross-validation over a range of candidates. For binary classification, prefer odd k to avoid voting ties; a common starting heuristic is k near the square root of the number of training samples.

Key Terms

k (n_neighbors): the number of nearest training points whose labels are combined, by majority vote, to classify a new point.
Distance metric: how "nearest" is measured; scikit-learn defaults to Euclidean distance, which is scale-sensitive.
Voting weights: neighbors can vote uniformly (the default) or be weighted by inverse distance so closer points count more.
Overfitting / underfitting: a too-small k fits noise; a too-large k washes out local structure.
Cross-validation: splitting the training data into folds to estimate how well each candidate k generalizes.
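
To make the voting terms concrete, here is a minimal sketch (the dataset and seed are illustrative) contrasting uniform majority voting with distance-weighted voting in scikit-learn:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for w in ('uniform', 'distance'):
    # 'uniform': plain majority vote; 'distance': votes weighted by 1/distance
    clf = KNeighborsClassifier(n_neighbors=5, weights=w).fit(X_tr, y_tr)
    print(w, round(clf.score(X_te, y_te), 3))
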
Example

Imagine classifying images as cats or dogs. If k=1, a new image is classified by the single closest image in the training set, which makes the prediction sensitive to noise and outliers. If k is very large (say, the entire training-set size), every new image is predicted as the majority class (whichever animal you have more of). A moderate k (e.g., 3, 5, or 7) often provides a good balance, and cross-validation on the training data lets you select the best k for your dataset empirically.
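
The sketch below makes the two extremes concrete; the dataset and seed are illustrative, so exact scores will vary, but the pattern (perfect training accuracy at k=1, majority-class behavior at the largest k) holds:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, len(X_tr)):  # smallest and largest possible k
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # k=1 memorizes the training set; k=len(X_tr) predicts the majority class
    print(f"k={k}: train={clf.score(X_tr, y_tr):.2f}, test={clf.score(X_te, y_te):.2f}")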

Code Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification # Creating synthetic data

# Generate a synthetic dataset
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Find optimal k using GridSearchCV with cross-validation
param_grid = {'n_neighbors': np.arange(1, 21)} # Test k values from 1 to 20
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy') # 5-fold cross-validation

grid_search.fit(X_train, y_train)
best_k = grid_search.best_params_['n_neighbors']

print(f"Best k: {best_k}")

# Train k-NN with the best k
best_knn = KNeighborsClassifier(n_neighbors=best_k)
best_knn.fit(X_train, y_train)

# Predict on the test set
y_pred = best_knn.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with best k: {accuracy}")

# Plot cross-validated accuracy for a wider range of k values

k_range = list(range(1, 31))
param_grid = {'n_neighbors': k_range}

# Run a fresh cross-validated search over the wider range.
# Note: fitting on the full dataset (X, y) below is for visualization only;
# for unbiased model selection, search on the training split as done above.
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')  # cv = number of folds

grid.fit(X, y)

# Store results
scores = grid.cv_results_['mean_test_score']

# Plot results for visualization
plt.figure()
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
plt.xticks(k_range)  # Show all k values on the x-axis
plt.grid(True)
plt.show()
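
To read the best value off the search rather than eyeballing the curve, the fitted GridSearchCV object exposes it directly:

print(f"Best k over 1-30: {grid.best_params_['n_neighbors']} "
      f"(mean CV accuracy: {grid.best_score_:.3f})")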

Related Concepts

Bias-variance tradeoff: small k means high variance; large k means high bias.
Feature scaling: distance-based methods need comparably scaled features (e.g., StandardScaler).
Curse of dimensionality: in high dimensions, distances concentrate and "nearest" loses meaning.
Weighted k-NN: votes weighted by inverse distance (weights='distance').
k-NN regression: the same idea with neighbor averaging instead of voting (KNeighborsRegressor).
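
Feature scaling deserves special attention because unscaled features distort distances. A minimal sketch, assuming the X_train and y_train from the code above, of folding standardization into the k search via a Pipeline so each cross-validation fold is scaled without leakage:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])

# The 'knn__' prefix routes the parameter to the pipeline's classifier step
param_grid = {'knn__n_neighbors': list(range(1, 21))}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)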