Common components

Several clustering methods have these common components:

Denoising using local KDE (typically a uniform kernel).
Calculating similarity as a graph.
Applying a threshold to the similarity graph and finding connected components.

These algorithms can be implemented using these building blocks.

library(dbscan)
library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

Denoising (DBSCAN)

remove_noise <- function(x, eps, minPts)
  x[!sapply(frNN(x, eps)$id, length)+1L < minPts, ]

Similarity graph

JP clustering and SNN clustering use the shared nearest neighbor distance

nn_SNN <- function(x, k, kt) sNN(x, k = k, kt = kt)

DBSCAN uses a direct density-reachability graph

nn_reachability <- function(x, eps) frNN(x, eps)

Find connected components in a graph

connected <- function(nn_graph) 
  components(graph_from_adj_list(adjacencylist(nn_graph), 
    mode = "total"))$membership

Dataset: Spirals

library(mlbench)
set.seed(1234)
x <- mlbench.spirals(500,1,0.05)
plot(x)

x <- x$x

DBSCAN*

cl <- dbscan(x, eps = .1, minPts = 4, borderPoints = FALSE)
cl

## DBSCAN clustering for 500 objects.
## Parameters: eps = 0.1, minPts = 4
## The clustering contains 2 cluster(s) and 16 noise points.
## 
##   0   1   2 
##  16 240 244 
## 
## Available fields: cluster, eps, minPts

plot(x, col = cl$cluster + 1L, xlim = c(-1,1), ylim = c(-1,1))

Unified version

Remove noise

x2 <- remove_noise(x, eps = .1, minPts = 4)

Calculate reachability graph

g <- nn_reachability(x2, eps = .1)

Find connected components in the reachability graph

comp <- connected(g)

plot(g, x2, xlim = c(-1,1), ylim = c(-1,1))
points(x)
points(x2, col = comp+1L, xlim = c(-1,1), ylim = c(-1,1), pch = 19)

JP clustering

cl <- jpclust(x, k = 10, kt = 6)
plot(x, col = cl$cluster + 1L, xlim = c(-1,1), ylim = c(-1,1))

Unified version

Create the shared nearest neighbor graph

g <- nn_SNN(x, k = 10, kt = 6)

Find connected components

comp <- connected(g)

plot(g, data = x, xlim = c(-1,1), ylim = c(-1,1))
points(x, col = comp+1L, xlim = c(-1,1), ylim = c(-1,1), pch = 19)

SNN clustering

cl <- sNNclust(x, k = 10, eps = 5, minPts = 5)
plot(x, col = cl$cluster + 1L, xlim = c(-1,1), ylim = c(-1,1))

Unified version

eps <- 5
minPts <- 5

Create shared nearest neighbor graph (kt == eps)

g <- nn_SNN(x, k = 10, kt = eps)

use DBSCAN (i.e., remove noise, reachability and connected components)

g2 <- structure(list(id = adjacencylist(g), eps = eps), class = "frNN")
cl <- dbscan(g2, minPts = minPts)

plot(g, x,
  xlim = c(-1,1), ylim = c(-1,1))
points(x, col = cl$cluster+1L, xlim = c(-1,1), ylim = c(-1,1), pch = 19)

EMIS/CSE 8331: A Unified View on DBSCAN, JP Clustering and SNN Clustering

Michael Hahsler

Sat Feb 16 18:55:32 2019

Common components

Denoising (DBSCAN)

Similarity graph

JP clustering and SNN clustering use the shared nearest neighbor distance

DBSCAN uses a direct density-reachability graph

Find connected components in a graph

Dataset: Spirals

DBSCAN*

Unified version

JP clustering

Unified version

SNN clustering

Unified version