DENSITY-BASED METHODS: DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
- Clusters are dense regions separated by areas of low density (noise).
- Matches how humans intuitively see clusters.
Key ideas:
- Uses two parameters:
- ε (epsilon): radius of neighborhood.
- m: minimum number of points required inside ε to define a dense region.
- Three types of points:
- Core point: has ≥ m points in its ε-neighborhood.
- Border point: fewer than m points in ε-neighborhood, but reachable from a core point.
- Noise point: neither core nor border.
Density connectivity concepts:
- Directly density-reachable:
- X is directly reachable from Y if:
- X is within ε of Y.
- Y is a core point.
- X is directly reachable from Y if:
- Density-reachable:
- X is density-reachable from Y if there’s a chain of core points from Y to X.
- Density-connected:
- X and Y are density-connected if there exists a core point Z from which both X and Y are density-reachable.
DBSCAN Algorithm
- Randomly pick an unvisited point p.
- Find its ε-neighborhood:
- If it has ≥ m points ⇒ mark as core point, start a new cluster or add to an existing one.
- Else if it’s a border point ⇒ mark as visited but don’t start a new cluster.
- Else ⇒ mark as noise.
- Expand the cluster by visiting neighbors of core points.
- Repeat until all points are visited.
- Merge clusters if mergeable.
Advantages:
- Number of clusters k need not be specified.
- Can detect arbitrary shaped clusters.
- Robust to noise.
- Needs only a few parameters.
Complexity: depends on number of data points and ε-neighborhood computation.
GRID-BASED APPROACH: CLIQUE
CLIQUE (Clustering in Quest):
- Designed for high-dimensional data.
- Uses a multi-resolution grid structure to partition the data space.
- Divide each dimension into equal-sized intervals ⇒ creates grid cells.
- Identify dense cells: cells containing ≥ τ points (threshold).
- Dense cells that are adjacent or connected are merged to form clusters.
Monotonicity property:
- If a k-dimensional cell is dense ⇒ all its (k–1)-dimensional projections must also be dense.
- Like Apriori rule in association mining:
- If a cell is not dense, its higher-dimensional extensions can’t be dense.
- Helps prune search space.
CLIQUE Algorithm
Stage 1: Identify dense units
- Divide data space into grids.
- Identify dense cells in 1D.
- Use Apriori principle to find dense units in higher dimensions.
Stage 2: Cluster formation
- Merge dense cells that are adjacent into maximal regions.
- Form final clusters from these merged dense units.
Advantages:
- Works well in high-dimensional spaces.
- Finds clusters in subspaces.
- Insensitive to data order.
Disadvantages:
- Needs careful tuning of:
- Grid size
- Density threshold τ
- Improper choice may lead to too many or too few clusters.