The following code implements some of the concepts of the paper:

Michael Hahsler and Kurt Hornik. Dissimilarity plots: A visual exploration tool for partitional clustering. Journal of Computational and Graphical Statistics, 20(2):335–354, June 2011.

The main question is: “Is a clustering good?”

Data

library(seriation)
library(cluster)
library(ggplot2)
library(ggdendro)
set.seed(9999)

We use the Ruspini data set with 4 clusters (available in the cluster package). We shuffle the data to get a random order.

data("ruspini", package = "cluster")
ruspini <- ruspini[sample(nrow(ruspini)), ]
ggplot(ruspini, aes(x, y)) + geom_point()

Dendrograms, clusplot, etc.

Dendrograms are a good visualization for judging clusterability when using hierarchical clustering.

ggdendrogram(hclust(dist(ruspini)))

clusplot(pam(ruspini, k = 6))

plot(silhouette(pam(ruspini, k = 6)))

But how can we visualize the output of non-hierarchical clustering?

Dissimilarity matrix shading

Original data

d <- dist(ruspini)
ggpimage(d)

Reorder using cluster labels

This is called coarse seriation: the objects are simply grouped by their cluster label. Look for dark blocks (small distances) along the diagonal. Note: Ruspini has 4 clusters, but we deliberately use an incorrect k of 10!

l <- kmeans(ruspini, 10)$cluster
ggdissplot(d, method = NA, labels = l)
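For intuition, coarse seriation can also be done by hand (a minimal sketch; ggpimage accepts a permutation for dist objects): simply permute the distance matrix so that objects with the same cluster label end up next to each other.

# Coarse seriation by hand: order objects by their cluster label and
# shade the permuted distance matrix.
o <- ser_permutation(order(l))
ggpimage(d, order = o)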

With reordering

ggdissplot(d, labels = l)

Note that the reordered plot reassembles the four true clusters: fragments of the same cluster are placed next to each other.

Using too few clusters

l <- kmeans(ruspini, 3)$cluster

ggdissplot(d, labels = l)

ggdissplot(d, labels = l) + lims(fill = c(0, 40)) 
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

ggdissplot(d, labels = l) + 
  scale_fill_gradient(low = "blue", high = "white", 
    limit = c(0,40), na.value = "white") 
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

Note that one of the three clusters actually consists of two clusters (visible as two dark blocks along the diagonal)!
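A quick numeric cross-check (a sketch using pam() and the silhouette information from the cluster package loaded above): average silhouette widths over a range of k should peak at k = 4, supporting what the dark blocks suggest.

# Average silhouette width for k = 2, ..., 8; a peak at k = 4 would
# confirm the visual impression of four clusters.
avg_sil <- sapply(2:8, function(k) pam(ruspini, k = k)$silinfo$avg.width)
names(avg_sil) <- 2:8
round(avg_sil, 2)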

Use different seriation methods

(see ?seriate)
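The methods registered for distance matrices can be listed directly with list_seriation_methods() from seriation:

list_seriation_methods("dist")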

Linear Seriation Criterion

ggdissplot(d,
  labels = l,
  method = list(intra = "ARSA", inter = "ARSA"))

Hamiltonian path

ggdissplot(d,
  labels = l,
  method = list(intra = "TSP", inter = "TSP"))

Hierarchical clustering

ggdissplot(d,
  labels = l,
  method = list(intra = "HC_average", inter = "HC_average"))

ggdissplot(d,
  labels = l,
  method = list(intra = "OLO", inter = "OLO"))

Scaling and others

ggdissplot(d,
  labels = l,
  method = list(intra = "MDS", inter = "MDS"))

ggdissplot(d,
  labels = l,
  method = list(intra = "Spectral", inter = "Spectral"))

ggdissplot(d,
  labels = l,
  method = list(intra = "QAP_2SUM", inter = "QAP_2SUM"))

ggdissplot(d,
  labels = l,
  method = list(intra = "QAP_LS", inter = "QAP_LS"))

ggdissplot(d,
  labels = l,
  method = list(intra = "R2E", inter = "R2E"))

Clusterability

Using dissplot

ggdissplot(d)

ggdissplot(d) +
  scale_fill_gradient(low = "blue", high = "white", 
    limit = c(0,40), na.value = "white") 
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

ggdissplot(d) + 
  scale_fill_gradient2(low = "darkred", high = "darkblue", midpoint = 70)
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

Visual Analysis for Cluster Tendency Assessment (VAT/iVAT)

see ?VAT

ggVAT(d)
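As an aside, "VAT" is also registered as a seriation method, so the same display can presumably be reproduced by reordering d manually (a sketch using seriate() and ggpimage()):

# Reorder d with the VAT seriation method and shade the result; this
# should match the ggVAT(d) display above.
o <- seriate(d, method = "VAT")
ggpimage(d, order = o)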

ggiVAT(d)

iVAT redefines the distance between two objects as the minimax path distance: over all possible paths connecting the two objects, take the largest distance between consecutive objects on each path, and then use the minimum of these values.
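To make this concrete, here is a small illustrative sketch (not the package's implementation; seriation computes these distances internally and, if I recall correctly, also exports them as path_dist()) that obtains all minimax path distances with a Floyd-Warshall-style update:

# Sketch: all-pairs minimax path distance. O(n^3), fine for small data.
minimax_dist <- function(D) {
  D <- as.matrix(D)
  n <- nrow(D)
  for (k in seq_len(n))
    for (i in seq_len(n))
      for (j in seq_len(n))
        D[i, j] <- min(D[i, j], max(D[i, k], D[k, j]))
  D
}
# VAT applied to the transformed distances should resemble ggiVAT(d).
ggVAT(as.dist(minimax_dist(d)))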

Random data

Random data contains no cluster structure, so any clustering imposed on it should look bad.

rand_data <- data.frame(x = runif(200), y = runif(200))
ggplot(rand_data, aes(x, y)) + geom_point()

d <- dist(rand_data)

Clusterability

ggdissplot(d)

ggVAT(d)

ggiVAT(d)

Try k-means

cl <- kmeans(rand_data, 3)
ggdissplot(d, labels = cl$cluster)
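As a numeric counterpart to the plot (a sketch using silhouette() from cluster): the average silhouette width of this forced 3-cluster solution should be clearly lower than for the well-separated Ruspini clusters.

# Average silhouette width of the forced clustering on random data.
mean(silhouette(cl$cluster, d)[, "sil_width"])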