Both standardization and normalization are feature scaling techniques that transform data onto a similar range. Standardization (z-score normalization) rescales each feature to have zero mean and unit variance, while normalization (min-max scaling) typically maps features into the range [0, 1]. Standardization is less sensitive to outliers but does not bound values to a fixed range; normalization guarantees a common bounded scale but is sensitive to outliers. The choice between them depends on the algorithm and the data distribution: distance-based methods such as Support Vector Machines (SVMs) and K-Nearest Neighbors (KNN) often benefit from standardization, while algorithms that expect inputs in a specific range, such as some neural networks with sigmoid activations, may benefit from normalization.
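Under the hood, both transforms are one-line formulas. Here is a minimal NumPy sketch of the two (the array of ages is made up for illustration):

```python
import numpy as np

# Standardization: z = (x - mean) / std
# Min-max normalization: x' = (x - min) / (max - min)
x = np.array([20.0, 35.0, 50.0, 65.0, 80.0])  # hypothetical ages

standardized = (x - x.mean()) / x.std()              # zero mean, unit variance
normalized = (x - x.min()) / (x.max() - x.min())     # values in [0, 1]

print(standardized)
print(normalized)
```

This is what `StandardScaler` and `MinMaxScaler` compute internally, with the added convenience of storing the fitted statistics so the same transform can be applied to new data.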
Consider a dataset with features "age" (ranging from 20 to 80) and "income" (ranging from $30,000 to $300,000). In KNN, the "income" feature would dominate the distance calculation due to its much larger range; standardizing both features gives them equal weight, preventing "income" from overshadowing "age." When training a neural network with sigmoid activation functions, normalization can help because the sigmoid saturates outside a narrow input range. Outliers change the picture: if one individual had an unusually high income (e.g., $1,000,000), min-max normalization would compress all other incomes into a narrow band near zero, potentially distorting the learning process. Standardization is also affected, since the outlier inflates the mean and standard deviation, but it compresses the remaining data points less severely.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample data with different scales
data = np.array([[1, 1000], [2, 2000], [3, 3000], [4, 4000], [5, 5000]])
# Standardization
scaler = StandardScaler()
data_standardized = scaler.fit_transform(data)
print("Standardized Data:\n", data_standardized)
# Normalization
scaler = MinMaxScaler()
data_normalized = scaler.fit_transform(data)
print("\nNormalized Data:\n", data_normalized)
# Illustrative example with outlier
data_with_outlier = np.array([[1, 10], [2, 20], [3, 30], [4, 1000]]) # added an outlier
# Standardization
scaler = StandardScaler()
standardized_with_outlier = scaler.fit_transform(data_with_outlier)
print("\nStandardized data with outlier:\n", standardized_with_outlier)
# Normalization
scaler = MinMaxScaler()
normalized_with_outlier = scaler.fit_transform(data_with_outlier)
print("\nNormalized data with outlier:\n", normalized_with_outlier)