The following code demonstrates some of the concepts from the paper:
Michael Hahsler and Kurt Hornik. Dissimilarity plots: A visual exploration tool for partitional clustering. Journal of Computational and Graphical Statistics, 20(2):335–354, 2011.
The main question is: “Is a clustering good?”
library(seriation)
library(cluster)
library(ggplot2)
library(ggdendro)
set.seed(9999)
We use the Ruspini data set (available in the cluster package), which contains 4 well-separated clusters. We shuffle the rows to get a random order.
data("ruspini", package = "cluster")
ruspini <- ruspini[sample(nrow(ruspini)), ]
ggplot(ruspini, aes(x, y)) + geom_point()
Dendrograms are a good visualization for judging clusterability when using hierarchical clustering.
ggdendrogram(hclust(dist(ruspini)))
For partitional clustering, cluster plots and silhouette plots are standard visualizations.
clusplot(pam(ruspini, k = 6))
plot(silhouette(pam(ruspini, k = 6)))
But how can we visualize the output of non-hierarchical clustering?
d <- dist(ruspini)
ggpimage(d)
First, we group the objects by cluster label without seriating them (method = NA); this is called coarse seriation. Look for dark blocks (small distances) along the diagonal. Note: Ruspini has 4 clusters, but we incorrectly use 10!
l <- kmeans(ruspini, 10)$cluster
ggdissplot(d, method = NA, labels = l)
ggdissplot(d, labels = l)
Note that the seriated plot places similar clusters next to each other and thus reassembles the four true clusters from the 10 k-means clusters.
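As a quick sanity check, we can cross-tabulate the 10 k-means labels with a 4-cluster pam solution; if the small clusters nest within the four true clusters, each column should have only one nonzero count.
# Sanity check: each of the 10 k-means clusters should fall entirely
# within one of the 4 clusters found by pam.
table(pam(ruspini, k = 4)$clustering, l)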
Now we use too few clusters.
l <- kmeans(ruspini, 3)$cluster
ggdissplot(d, labels = l)
ggdissplot(d, labels = l) + lims(fill = c(0, 40))
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
ggdissplot(d, labels = l) +
scale_fill_gradient(low = "blue", high = "white",
limit = c(0,40), na.value = "white")
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
Note that one of the three clusters actually consists of two clusters (two dark blocks along the diagonal)!
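To quantify this, we can compare average silhouette widths for k = 3 and k = 4; the four-cluster solution should score higher.
# Compare the average silhouette width of k-means solutions with
# k = 3 and k = 4 (uses ruspini and d from above).
sapply(c(3, 4), function(k)
  mean(silhouette(kmeans(ruspini, k)$cluster, d)[, "sil_width"]))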
Different seriation methods can be used to order the objects within clusters and to arrange the clusters themselves (see ?seriate).
ggdissplot(d,
labels = l,
method = list(intra = "ARSA", inter = "ARSA"))
ggdissplot(d,
labels = l,
method = list(intra = "TSP", inter = "TSP"))
ggdissplot(d,
labels = l,
method = list(intra = "HC_average", inter = "HC_average"))
ggdissplot(d,
labels = l,
method = list(intra = "OLO", inter = "OLO"))
ggdissplot(d,
labels = l,
method = list(intra = "MDS", inter = "MDS"))
ggdissplot(d,
labels = l,
method = list(intra = "Spectral", inter = "Spectral"))
ggdissplot(d,
labels = l,
method = list(intra = "QAP_2SUM", inter = "QAP_2SUM"))
ggdissplot(d,
labels = l,
method = list(intra = "QAP_LS", inter = "QAP_LS"))
ggdissplot(d,
labels = l,
method = list(intra = "R2E", inter = "R2E"))
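Seriation quality can also be compared numerically with criterion(), which computes merit/loss measures for a given order; here is a minimal sketch comparing two whole-matrix orders.
# Minimal sketch: compare two seriation orders by two criteria.
criterion(d, seriate(d, method = "OLO"), method = c("AR_events", "Path_length"))
criterion(d, seriate(d, method = "Spectral"), method = c("AR_events", "Path_length"))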
Without cluster labels, dissplot simply seriates the entire dissimilarity matrix.
ggdissplot(d)
ggdissplot(d) +
scale_fill_gradient(low = "blue", high = "white",
limit = c(0,40), na.value = "white")
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
ggdissplot(d) +
scale_fill_gradient2(low = "darkred", high = "darkblue", midpoint = 70)
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
VAT (Visual Assessment of cluster Tendency) and its improved version iVAT reorder the dissimilarity matrix to make potential cluster structure visible (see ?VAT).
ggVAT(d)
ggiVAT(d)
iVAT redefines the distance between two objects as the minimum, over all paths connecting the two objects, of the largest distance between consecutive objects along the path (a min-max path distance).
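As an illustrative sketch (minmax_dist is just a helper defined here, not the seriation implementation), this distance can be computed with a Floyd-Warshall-style update.
# Sketch: d*(i, j) = min over paths i -> j of the largest edge on the path.
minmax_dist <- function(d) {
  m <- as.matrix(d)
  for (k in seq_len(nrow(m)))
    for (i in seq_len(nrow(m)))
      # the best path is either the old one or one routed through object k
      m[i, ] <- pmin(m[i, ], pmax(m[i, k], m[k, ]))
  as.dist(m)
}
ggiVAT(d) then essentially corresponds to ggVAT(minmax_dist(d)).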
Random data should produce a bad clustering.
rand_data <- data.frame(x = runif(200), y = runif(200))
ggplot(rand_data, aes(x, y)) + geom_point()
d <- dist(rand_data)
Clusterability
ggdissplot(d)
ggVAT(d)
ggiVAT(d)
Try k-means
cl <- kmeans(rand_data, 3)
ggdissplot(d, labels = cl$cluster)
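As a final sanity check, the average silhouette width for this forced clustering of uniform random data should be low, confirming the lack of cluster structure.
# Average silhouette width of the 3-cluster solution on random data;
# expect a value well below that of a good clustering.
mean(silhouette(cl$cluster, d)[, "sil_width"])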