Common components

Several clustering methods have these common components:

  • Denoising using local KDE (typically a uniform kernel).
  • Calculating similarity as a graph.
  • Applying a threshold to the similarity graph and finding connected components.

These algorithms can be implemented using these building blocks.

library(dbscan)
library(igraph)
## 
## Attaching package: 'igraph'
## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum
## The following object is masked from 'package:base':
## 
##     union

Denoising (DBSCAN)

remove_noise <- function(x, eps, minPts)
  x[!sapply(frNN(x, eps)$id, length)+1L < minPts, ]

Similarity graph

JP clustering and SNN clustering use the shared nearest neighbor distance

nn_SNN <- function(x, k, kt) sNN(x, k = k, kt = kt)

DBSCAN uses a direct density-reachability graph

nn_reachability <- function(x, eps) frNN(x, eps)

Find connected components in a graph

connected <- function(nn_graph) 
  components(graph_from_adj_list(adjacencylist(nn_graph), 
    mode = "total"))$membership

Dataset: Spirals

library(mlbench)
set.seed(1234)
x <- mlbench.spirals(500,1,0.05)
plot(x)

x <- x$x

DBSCAN*

cl <- dbscan(x, eps = .1, minPts = 4, borderPoints = FALSE)
cl
## DBSCAN clustering for 500 objects.
## Parameters: eps = 0.1, minPts = 4
## The clustering contains 2 cluster(s) and 16 noise points.
## 
##   0   1   2 
##  16 240 244 
## 
## Available fields: cluster, eps, minPts
plot(x, col = cl$cluster + 1L, xlim = c(-1,1), ylim = c(-1,1))

Unified version

  1. Remove noise
x2 <- remove_noise(x, eps = .1, minPts = 4)
  1. Calculate reachability graph
g <- nn_reachability(x2, eps = .1)
  1. Find connected components in the reachability graph
comp <- connected(g)

plot(g, x2, xlim = c(-1,1), ylim = c(-1,1))
points(x)
points(x2, col = comp+1L, xlim = c(-1,1), ylim = c(-1,1), pch = 19)

JP clustering

cl <- jpclust(x, k = 10, kt = 6)
plot(x, col = cl$cluster + 1L, xlim = c(-1,1), ylim = c(-1,1))

Unified version

  1. Create the shared nearest neighbor graph
g <- nn_SNN(x, k = 10, kt = 6)
  1. Find connected components
comp <- connected(g)

plot(g, data = x, xlim = c(-1,1), ylim = c(-1,1))
points(x, col = comp+1L, xlim = c(-1,1), ylim = c(-1,1), pch = 19)

SNN clustering

cl <- sNNclust(x, k = 10, eps = 5, minPts = 5)
plot(x, col = cl$cluster + 1L, xlim = c(-1,1), ylim = c(-1,1))

Unified version

eps <- 5
minPts <- 5
  1. Create shared nearest neighbor graph (kt == eps)
g <- nn_SNN(x, k = 10, kt = eps)
  1. use DBSCAN (i.e., remove noise, reachability and connected components)
g2 <- structure(list(id = adjacencylist(g), eps = eps), class = "frNN")
cl <- dbscan(g2, minPts = minPts)

plot(g, x,
  xlim = c(-1,1), ylim = c(-1,1))
points(x, col = cl$cluster+1L, xlim = c(-1,1), ylim = c(-1,1), pch = 19)