Study Card: K-Means Clustering Steps

Direct Answer

K-means clustering involves iteratively partitioning data into k clusters. The steps include choosing the number of clusters (k), initializing centroids, assigning data points to the nearest centroid, updating centroids based on the assigned points, and repeating the assignment and update steps until convergence.
Key points: Choosing k, Initializing centroids, Assigning points, Updating centroids, Convergence.
Practical Applications: Customer segmentation, image compression, anomaly detection.

Key Terms

Centroid: The center of a cluster, calculated as the mean of all data points assigned to that cluster.
Cluster: A group of similar data points.
K: The number of clusters.
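Convergence: The point where the centroids no longer change significantly between iterations. In practice the algorithm also stops when a maximum number of iterations is reached, even if the centroids are still moving.
Euclidean Distance: The straight-line distance between two points, and the standard way to measure distance between data points and centroids in k-means. Other measures such as Manhattan distance can be used, but they are usually paired with a different centroid update (for example, the median in k-medians), because the mean is the point that minimizes squared Euclidean distance.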
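
The terms above can be made concrete with a small NumPy sketch. The four points and two centroids are made-up toy values, not data from this card; the sketch computes Euclidean and Manhattan distances, assigns each point to its nearest centroid, and recomputes each centroid as the mean of its cluster.

```python
import numpy as np

# Hypothetical toy data: four 2-D points and two centroids.
points = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
centroids = np.array([[0.0, 1.0], [4.0, 1.0]])

# Pairwise differences, shape (n_points, n_centroids, n_features).
diff = points[:, None, :] - centroids[None, :, :]

euclidean = np.sqrt((diff ** 2).sum(axis=2))  # straight-line distance
manhattan = np.abs(diff).sum(axis=2)          # sum of absolute coordinate differences

# Each point joins the cluster of its nearest centroid (Euclidean here).
labels = euclidean.argmin(axis=1)

# A centroid is the mean of the points assigned to its cluster.
new_centroids = np.array(
    [points[labels == j].mean(axis=0) for j in range(len(centroids))]
)
```

In this toy case the recomputed centroids equal the starting ones, which is what convergence looks like: another assignment and update pass would change nothing.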

Example

Imagine grouping customers into market segments.
1. Choose k: Decide on the number of segments (e.g., k=3).
2. Initialize: Randomly select 3 customers as initial segment representatives (centroids).
3. Assign: Assign each customer to the closest representative based on their purchase history.
4. Update: Recalculate the representative for each segment by averaging the purchases of its assigned customers.
5. Repeat: Repeat steps 3 and 4 until the segment representatives and customer groups stabilize.
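
The five steps above can be sketched from scratch in NumPy. This is a minimal illustration under stated assumptions, not how any particular library implements k-means: the function name kmeans_sketch, its parameters (max_iter, tol, seed), and the customer table are all hypothetical, and the two purchase features are invented for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-4, seed=0):
    """Plain k-means on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2, Initialize: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3, Assign: label each point with the index of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4, Update: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5, Repeat: stop once the centroids move less than tol.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Hypothetical customers: columns are (orders per month, average order value).
customers = np.array([
    [1.0, 20.0], [2.0, 22.0], [1.5, 19.0],
    [10.0, 25.0], [11.0, 24.0], [9.5, 26.0],
    [5.0, 90.0], [6.0, 95.0], [5.5, 88.0],
])

# Step 1, Choose k: three segments.
centroids, labels = kmeans_sketch(customers, k=3)
print(labels)
print(centroids)
```

Because the starting centroids are random, a different seed can produce a different grouping; running several restarts and keeping the best result is the usual remedy.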

Code Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # For generating sample data

# 1. Generate Sample Data (replace with your own data)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0) # Example with 4 clusters

# 2. Choose the number of clusters (k)
k = 4

# 3. Initialize centroids by picking k random data points (other methods, such as k-means++, are covered in Related Concepts)
rng = np.random.default_rng(42)
initial_centroids = X[rng.choice(len(X), size=k, replace=False)]  # Store the initial centroids for later visualization

# 4. Fit the KMeans model (the assign, update, and repeat steps all run inside kmeans.fit())
# n_init=1 because the starting centroids are passed in explicitly; with init='random' or 'k-means++',
# a higher value such as 10 is typical so that the best of several random starts is kept.
kmeans = KMeans(n_clusters=k, init=initial_centroids, n_init=1)
kmeans.fit(X)

# Get cluster labels
labels = kmeans.labels_
final_centroids = kmeans.cluster_centers_  # Centroids after the algorithm stops

# Plot the clustered points with the initial and final centroids to show how far the centroids moved.
plt.figure(figsize=(10,6))
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis')
plt.scatter(initial_centroids[:, 0], initial_centroids[:, 1], c='red', marker='X', s=200, label='Initial Centroids')
plt.scatter(final_centroids[:, 0], final_centroids[:, 1], c='black', marker='X', s=200, label='Final Centroids')
plt.title(f'K-Means Clustering Results (k={k})')
plt.legend()
plt.show()

Related Concepts

Choosing k: Determining the optimal number of clusters is crucial. Techniques like the Elbow Method, Silhouette Analysis, and the Gap Statistic can be used. Follow-up question: Explain how to select k in k-means.
K-means++: An improved centroid initialization method that spreads the initial centroids apart, which often leads to faster convergence and better final clusters than purely random initialization. Follow-up question: Why is k-means++ initialization preferred over random initialization?
Distance Metrics: K-means typically uses Euclidean distance, but other distance metrics can be applied depending on the data and desired behavior, usually through variants such as k-medians (Manhattan distance) or k-medoids (arbitrary dissimilarities). Follow-up question: Explain different distance metrics that can be used with k-means, and when one is preferred over another.
Convergence Criteria: Understanding how k-means determines when to stop iterating (e.g., no change in centroids, maximum iterations).
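
Several of these concepts can be tried together in scikit-learn. The sketch below is one possible approach, assuming the same synthetic blobs as the Code Implementation section; the range of k values (2 to 8) is an arbitrary choice for illustration. It uses k-means++ initialization, sets the stopping parameters max_iter and tol explicitly, and records both inertia (for the Elbow Method) and the silhouette score for each k.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 blobs, as in the Code Implementation section.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

inertias = {}
silhouettes = {}
for k in range(2, 9):
    # init='k-means++' spreads the starting centroids apart;
    # n_init=10 keeps the best of 10 runs;
    # max_iter and tol are the stopping criteria for each run.
    model = KMeans(n_clusters=k, init="k-means++", n_init=10,
                   max_iter=300, tol=1e-4, random_state=42)
    labels = model.fit_predict(X)
    inertias[k] = model.inertia_                  # within-cluster sum of squared distances
    silhouettes[k] = silhouette_score(X, labels)  # mean silhouette, between -1 and 1

# Elbow Method: look for the k where inertia stops dropping sharply.
# Silhouette Analysis: prefer the k with the highest score.
best_k = max(silhouettes, key=silhouettes.get)
print("inertia by k:", inertias)
print("silhouette by k:", silhouettes)
print("k with the highest silhouette:", best_k)
```

Inertia almost always keeps decreasing as k grows, so it is read by eye for a bend rather than minimized; the silhouette score gives a single number to compare, though both are heuristics rather than proofs of the right k.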