The following code implements some of the concepts of the paper:
Michael Hahsler and Kurt Hornik. Dissimilarity plots: A visual exploration tool for partitional clustering. Journal of Computational and Graphical Statistics, 20(2):335–354, June 2011.
The main question is: “Is a clustering good?”
library(seriation)
set.seed(1234)
We use the Ruspini data set with 4 clusters (available in the cluster package). We shuffle the data to get a random order.
data("ruspini", package="cluster")
ruspini <- ruspini[sample(nrow(ruspini)),]
plot(ruspini)
Dendrograms are a good visualization to judge clusterability for hierarchical clustering.
plot(hclust(dist(ruspini)))
But how can we visualize the output of non-hierarchical clustering?
d <- dist(ruspini)
pimage(d)
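The shuffled matrix shows no visible structure. As a quick illustration (an addition, not part of the original code), seriate() can find a good order and pimage() can apply it:
o <- seriate(d)          # compute a seriation order for the objects
pimage(d, order = o)     # dark blocks of small distances appear along the diagonal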
dissplot combines this idea with cluster labels. Ordering the objects by their cluster labels only (method = NA) is called coarse seriation. Look for dark blocks (small distances) along the diagonal. Note: Ruspini has 4 clusters, but we deliberately misspecify k = 10!
l <- kmeans(ruspini, 10)$cluster
dissplot(d, method = NA, labels = l)  # coarse seriation: order by cluster labels only
By default, dissplot additionally seriates the objects within each cluster and the clusters themselves:
dissplot(d, labels = l)
Note that even though we asked for 10 clusters, the reordered plot reassembles the four true clusters as four dark blocks.
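For comparison, here is a sketch with the correct number of clusters (k = 4; this call is an addition, not in the original):
l4 <- kmeans(ruspini, 4)$cluster
dissplot(d, labels = l4)  # four clean dark blocks along the diagonal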
l <- kmeans(ruspini, 3)$cluster
dissplot(d, labels = l)
dissplot(d, labels = l, zlim = c(0, 40))  # fix the color scale to emphasize small distances
Note that one of the three clusters actually consists of two true clusters (visible as two dark blocks along the diagonal)!
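One way to quantify this (an addition using cluster::silhouette, not part of the paper's code): compute the average silhouette width for several values of k; it should peak near the true number of clusters.
library(cluster)  # for silhouette()
sapply(2:6, function(k) {
  cl <- kmeans(ruspini, k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])  # average silhouette width for this k
})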
Different seriation methods can be used within and between clusters (see ? seriate):
dissplot(d, labels = l, method = list(intra = "ARSA", inter = "ARSA"))
dissplot(d, labels = l, method = list(intra = "TSP", inter = "TSP"))
dissplot(d, labels = l, method = list(intra = "HC_average", inter = "HC_average"))
dissplot(d, labels = l, method = list(intra = "OLO", inter = "OLO"))
dissplot(d, labels = l, method = list(intra = "MDS", inter = "MDS"))
dissplot(d, labels = l, method = list(intra = "Spectral", inter = "Spectral"))
dissplot(d, labels = l, method = list(intra = "QAP_2SUM", inter = "QAP_2SUM"))
dissplot(d, labels = l, method = list(intra = "QAP_LS", inter = "QAP_LS"))
dissplot(d, labels = l, method = list(intra = "R2E", inter = "R2E"))
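The same comparison can be written more compactly as a loop (an equivalent rewrite, not in the original):
methods <- c("ARSA", "TSP", "HC_average", "OLO", "MDS",
             "Spectral", "QAP_2SUM", "QAP_LS", "R2E")
for (m in methods) dissplot(d, labels = l, method = list(intra = m, inter = m))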
Without cluster labels, dissplot just seriates the whole matrix, which helps to judge clusterability:
dissplot(d)
dissplot(d, zlim = c(0, 40))
dissplot(d, col = bluered(100, bias = 1))
VAT (visual assessment of cluster tendency) and iVAT are alternative visualizations (see ? VAT):
VAT(d)
iVAT(d)
iVAT redefines the distance between two objects as the minimum, over all paths connecting the two objects, of the largest distance between consecutive objects on the path.
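A minimal sketch of this minimax path distance, using a Floyd-Warshall-style update (minimax_dist is a hypothetical helper for illustration; iVAT() computes these distances for you):
minimax_dist <- function(d) {  # hypothetical helper, for illustration only
  D <- as.matrix(d)
  n <- nrow(D)
  for (k in seq_len(n)) {
    # going through k replaces d(i, j) by max(d(i, k), d(k, j)) if that is smaller
    D <- pmin(D, outer(seq_len(n), seq_len(n),
                       function(i, j) pmax(D[i, k], D[k, j])))
  }
  as.dist(D)
}
pimage(minimax_dist(d))  # the transformed distances; iVAT() additionally reorders them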
Random data without cluster structure should produce a bad clustering.
rand_data <- data.frame(x = runif(200), y = runif(200))
plot(rand_data)
d <- dist(rand_data)
Clusterability
dissplot(d)
VAT(d)
iVAT(d)
Now we try k-means on the random data.
cl <- kmeans(rand_data, 3)
dissplot(d, labels = cl$cluster)
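As a numeric check (an addition): the average silhouette width of this clustering can be compared with the values obtained for the Ruspini clusters above.
library(cluster)  # for silhouette()
mean(silhouette(cl$cluster, d)[, "sil_width"])  # typically markedly lower than for Ruspini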