3.b) Explain K-means algorithm with example.
Answer:
k-means Algorithm
- K-means is an unsupervised learning technique that aims to define the correct answer by identifying clusters within the data. Consider some user level data and assume each row of your dataset corresponds to a user as follows: age, gender, income, state, household, size.
- The goal is to segment users, a process known as segmenting, stratifying, grouping, or clustering the data. All these terms refer to finding similar types of users and grouping them together.
- k-means is a clustering algorithm where k is the number of bins. k-means algorithm looks for clusters in d dimensions, where d is the number of features for each data point.
Algorithm
Randomly pick 𝑘 centroids (or points that will be the center of your clusters) in 𝑑-space, ensuring they are near the data but distinct from one another.
- Assign each data point to the closest centroid.
- Move the centroids to the average location of the data points assigned to them.
- Repeat the previous two steps until the assignments don’t change or change very little.
K-means advantages:
- k-means is pretty fast (compared to other clustering algorithms),
- There are broad applications in marketing, computer vision (partitioning an image).
- Can be a starting point for other models.
Example:
- Consider a simpler example than the five-dimensional one previously discussed. Suppose there is data on users, including the number of ads shown to each user (impressions) and the number of times each user clicked on an ad (clicks).
- Clustering in two dimensions; look at the panels in the left column from top to bottom, and then the right column from top to bottom.
- In practice, k-means is just one line of code in R:
The kmeans function in R has several parameters:
- x: The dataset, which should be a numeric matrix or data frame where each row represents an observation and each column represents a feature.
- centers: Specifies the number of clusters (k) to create or the initial cluster centers. If an integer, it indicates the number of clusters; if a matrix, each row represents an initial cluster center.
- iter.max: The maximum number of iterations allowed. The default is 10. This parameter controls how long the algorithm will run before stopping.
- nstart: The number of random sets of initial cluster centers. The algorithm will run nstart times with different initial centers and return the best solution based on the total within-cluster sum of squares. The default is 1.
algorithm: Specifies the algorithm to use for clustering. Options include:
- “Hartigan-Wong” (default): The standard algorithm proposed by Hartigan and Wong.
- “Lloyd”: Also known as the standard k-means algorithm.
- “Forgy”: Another version of the k-means algorithm.
- “MacQueen”: A variation of the k-means algorithm.
Example: kmeans(x, centers = 3, iter.max = 10, nstart = 1, algorithm = “Hartigan-Wong”)