Density-Based Spatial Clustering of Applications with Noise

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It was first proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. You can read the original paper here. The core concept is to find clusters based on the density of points. DBSCAN determines the number of clusters by itself, unlike the other popular clustering algorithm, K-means, which requires the number of clusters to be specified up front.

The behaviour of DBSCAN is controlled by setting two parameters:

  • Epsilon (eps), which defines the radius of the neighbourhood around a point. It determines how close points have to be to each other to be considered part of a cluster.

  • Minimum points (minPts), which defines the minimum number of points required to form a dense region.

Simply speaking, the procedure of clustering with DBSCAN is as follows:

1. Set the parameters eps and minPts.

2. Classify each point into one of three categories:

  • Core points — a point that has at least minPts points within its eps radius,

  • Border points — a point that has fewer than minPts within its eps radius but is close to a core point,

  • Noise points — a point that is neither a core point nor a border point.

3. Cluster formation — starting from a random core point, DBSCAN clusters points by recursively traversing reachable points and adding them to the cluster. Border points form, of course, the border of the cluster.

4. Repeat step 3 until all core points belong to a cluster.

5. Assign all non-clustered points to the outlier (noise) cluster.

This algorithm is included in the extremely popular data science library scikit-learn, and we will use that implementation here.
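Before moving on to real data, here is a minimal sketch of the scikit-learn API on a few made-up 2D points (illustrative data only, not the dataset used later); note that the label -1 marks noise points:

import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2D points: two dense groups plus one isolated point (illustrative only)
points = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.2],
                   [9.0, 9.0]])

clustering = DBSCAN(eps=0.5, min_samples=3).fit(points)
print(clustering.labels_)  # e.g. [ 0  0  0  1  1  1 -1]; -1 means noise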

Ball-tree algorithm

After dealing with the correct calculation of distance on a sphere, there is one more thing to consider. DBSCAN in its naïve implementation uses brute force to calculate the distance between every pair of points, which is extremely memory inefficient. As long as you have a small or moderately sized dataset and the distance matrix fits into memory, this is not an issue. For huge datasets, however, it is better to use a more memory-efficient structure for the distance computations: a ball tree. It makes DBSCAN more sensitive to the epsilon value, which is worth keeping in mind, and it is more complex to implement, but it is already supported in scikit-learn. We simply have to pass an additional parameter, algorithm='ball_tree', to scikit-learn's DBSCAN.
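To get a feel for what the ball tree does on its own, here is a small sketch (the coordinates and the 200 m radius are assumptions for illustration) that uses scikit-learn's BallTree with the haversine metric to find all points within a given radius of a query point:

import numpy as np
from sklearn.neighbors import BallTree

kms_per_radian = 6371.0088

# Illustrative coordinates (degrees): three nearby points and one far away
coords_deg = np.array([[39.2900, -76.6100],
                       [39.2905, -76.6105],
                       [39.2910, -76.6095],
                       [40.7100, -74.0100]])
coords_rad = np.deg2rad(coords_deg)

# Haversine expects [latitude, longitude] in radians
tree = BallTree(coords_rad, metric='haversine')

# All points within 200 m of the first point (radius must also be in radians)
idx = tree.query_radius(coords_rad[:1], r=0.2 / kms_per_radian)
print(idx[0])  # indices of the neighbouring points, e.g. [0 1 2]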

Example: Clustering of gun violence dataset

The data set contains multiple columns which are not required for this demonstration, therefore we will only use three of them: incident_id, latitude and longitude. After dealing with missing values we have to convert degrees to radians.

import numpy as np

# data is assumed to be a DataFrame with incident_id, latitude and longitude columns
data.dropna(inplace=True)
data["latitude_rad"] = np.deg2rad(data["latitude"])
data["longitude_rad"] = np.deg2rad(data["longitude"])

Next, it's time to configure the DBSCAN parameters. Let's say we want to cluster all incidents that are within 200 m of each other, provided there are at least 10 of them within that radius. This is, de facto, the definition of a core point.

from sklearn.cluster import DBSCAN

kms_per_radian = 6371.0088  # radius of Earth in kilometres
epsilon = 0.2 / kms_per_radian  # 200 m expressed in radians

# The haversine metric expects [latitude, longitude] in radians
clustering = DBSCAN(eps=epsilon,
                    min_samples=10,
                    algorithm='ball_tree',
                    metric='haversine'
                    ).fit(data[["latitude_rad", "longitude_rad"]])

Now we can explore the clusters generated by the algorithm. In this case the biggest cluster will be the one containing the outliers (ca. 196,000 records). Below there is a box plot depicting the sizes of the 20 biggest clusters (excluding outliers). Cluster number 18 contains almost 1,000 incidents. Let's visualise it on the map.
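One way to inspect the cluster sizes (the exact numbers above come from the full dataset, so treat this as a sketch of the approach rather than a reproduction of those figures) is to attach the labels back to the DataFrame and count them:

# Attach cluster labels to the data; label -1 marks outliers (noise)
data["cluster"] = clustering.labels_

cluster_sizes = data["cluster"].value_counts()
print(cluster_sizes.head())                       # biggest clusters, noise usually on top
print((data["cluster"] == -1).sum(), "outliers")  # number of noise points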

Of course, you have to optimise the DBSCAN parameters to get clusters that meet your needs. In this case clusters 60 and 1 are both in Baltimore but are not treated as a single cluster (picture below). This is due to the narrow "bridge" between them. If required, this can be fixed by post-processing or by further optimisation of the DBSCAN parameters.
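If a manual merge is acceptable, a minimal post-processing sketch (using the cluster ids 1 and 60 mentioned above; the exact ids depend on your run) is simply to relabel one cluster as the other:

# Merge cluster 60 into cluster 1 (ids are run-specific; adjust to your results)
data.loc[data["cluster"] == 60, "cluster"] = 1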
