HOME

Hierarchical clustering

Hierarchical clustering builds a nested sequence of clusters without requiring the number of clusters \(K\) in advance. The result is a dendrogram: a tree where each leaf is an observation and each internal node records when two clusters merged. Cutting the tree at any height yields a flat clustering. The linkage method, which defines the distance between clusters, has a larger impact on the result than almost any other choice.

Agglomerative clustering

The most common variant is agglomerative (bottom-up):

Start with \(n\) clusters, one per observation.
Compute the \(n \times n\) dissimilarity matrix.
Merge the two closest clusters into one.
Update the dissimilarity matrix (how the new cluster’s distances are computed depends on the linkage).
Repeat steps 3-4 until one cluster remains.

The merge history is stored in the dendrogram. The height at which two observations (or clusters) merge reflects how dissimilar they are: low merges = similar, high merges = dissimilar.

Divisive clustering works top-down: start with all observations in one cluster and recursively split. Computationally heavier and less commonly used.

Linkage methods

The linkage method defines the distance \(d(A, B)\) between two clusters \(A\) and \(B\) based on their constituent point distances:

Linkage	Formula	Shape of clusters
Single	\(\min_{a \in A, b \in B} d(a,b)\)	Elongated, chaining
Complete	\(\max_{a \in A, b \in B} d(a,b)\)	Compact, roughly spherical
Average (UPGMA)	\(\frac{1}{\|A\|\|B\|}\sum_{a,b} d(a,b)\)	Between single and complete
Centroid	\(d(\bar{a}, \bar{b})\)	Can invert (non-monotone)
Ward	Increase in total WCSS from merging \(A\) and \(B\)	Compact, equal-sized

Ward’s linkage merges the pair of clusters that minimizes the increase in total within-cluster sum of squares. It tends to produce compact, roughly equal-sized clusters and is the default choice for most applications with continuous data.

Four dendrograms showing the same data clustered with single, complete, average and Ward linkage illustrating how linkage choice affects the tree shape

Single linkage (top-left) chains clusters together, often producing elongated groups. Complete linkage produces compact clusters. Average linkage is intermediate. Ward (bottom-right) produces the most balanced, compact clusters and usually matches the visual grouping most closely.

Reading the dendrogram

Dendrogram from Ward hierarchical clustering with a horizontal cut line showing how three clusters are obtained

The dashed red line cuts the tree at a chosen height. Every vertical line it crosses becomes a cluster. Here cutting at this height gives three clusters (matching the true structure). The height axis represents dissimilarity: merges low in the tree join similar observations; merges near the top join very different groups.

Choosing the cut height: look for the largest gap between consecutive merge heights. A big gap means the two levels of the hierarchy are very different: the cut just below the gap is a natural number of clusters.

Distance matrices and non-Euclidean data

Hierarchical clustering works with any dissimilarity matrix, not just Euclidean distance. This makes it applicable to data types where Euclidean distance is meaningless:

Text: cosine distance or edit distance between documents.
Categorical data: Gower distance (handles mixed types).
Sequences: edit distance (Levenshtein) for DNA or protein sequences.
Time series: dynamic time warping distance.
Correlation-based: \(1 - r_{ij}\) where \(r_{ij}\) is the Pearson correlation. Widely used in genomics to cluster genes with similar expression profiles.

K-means requires computing means (centroids) and thus only works with Euclidean distance on continuous data. Hierarchical clustering works with any custom dissimilarity.

Hierarchical clustering vs K-means

	Hierarchical	K-means
Requires K upfront	No	Yes
Result	Full hierarchy (dendrogram)	Flat partition
Deterministic	Yes	No (random init)
Scalability	\(O(n^2)\) to \(O(n^3)\)	\(O(nKI)\), fast
Non-Euclidean data	Yes (any dissimilarity)	No
Cluster shapes	Flexible (depends on linkage)	Spherical only
Sensitivity to outliers	High (especially single linkage)	Moderate

Use hierarchical clustering when: you do not know \(K\), you want to explore the hierarchy, the data is not Euclidean, or \(n\) is small enough for the quadratic complexity (\(n < 10{,}000\) typically).

Use K-means when: \(K\) is known or can be estimated, the data is large, and clusters are roughly spherical.

⚠️ Single linkage is almost never the right choice

Single linkage is notorious for the chaining effect: a single bridge point connecting two clusters causes them to merge prematurely, creating long, string-like clusters that do not reflect the true structure. Unless the true clusters are genuinely filamentous (rare), always start with Ward, complete, or average linkage and only consider single linkage if there is a specific reason.

💡 Hierarchical clustering in R

# Compute dissimilarity matrix
d <- dist(X, method="euclidean")   # or "manhattan", "maximum"

# Fit hierarchical clustering
hc <- hclust(d, method="ward.D2")  # or "complete", "average", "single"

# Visualize dendrogram
plot(hc, labels=FALSE, hang=-1)
rect.hclust(hc, k=3, border="red")  # draw K clusters

# Cut to K clusters
clusters <- cutree(hc, k=3)
# Or cut at a specific height
clusters <- cutree(hc, h=5)

# Non-Euclidean: Gower distance for mixed data
library(cluster)
d_gower <- daisy(df_mixed, metric="gower")
hc_g    <- hclust(d_gower, method="ward.D2")

# Pretty dendrograms
library(ggdendro)
ggdendrogram(hc, rotate=FALSE, labels=FALSE)