Clustering using HDBSCAN

Tarique Akhtar
2 min readJan 10, 2023

--

HDBSCAN is one of the most popular and more recent algorithm in clustering.

And as the name suggests, this clustering algorithm is being derived or extended from the DBSCAN algorithm. In other words, it overcomes some of its limitations.

Remember, one of the key limitations of the DBSCAN is the tuning process for the parameters i.e Epsilon and Min Points. It involves a lot of trials to reach out and obtain its optimum values for a particular dataset.

We will see how HDBSCAN overcomes these limitations and do an amazing job on challenging clustering tasks.

Moreover, the HDBSCAN also overcomes one of the key limitations of the hierarchical clustering. One of the limitations in the hierarchical clustering is that we aren’t always sure where to put the thresholds in the Dendrogram.

It is a density-based clustering algorithm that is particularly useful for handling datasets with varying densities and clusters of different sizes and shapes. It is implemented in the python library hdbscan.

Here’s an example of how to use the HDBSCAN algorithm to cluster a 2D dataset using Python:

import numpy as np
import hdbscan
import matplotlib.pyplot as plt

# Generate a synthetic dataset
np.random.seed(0)
data = np.random.randn(200, 2)

# Compute the clustering using HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
clusterer.fit(data)

# Plot the results
plt.scatter(data[:, 0], data[:, 1], c=clusterer.labels_, cmap='rainbow')
plt.title("HDBSCAN Clustering")
plt.show()
Result of above program

In this example, we first generate a synthetic dataset using the numpy library. The min_cluster_size parameter is set to 5, which means that HDBSCAN will only form clusters that have at least 5 points. The fit method of the HDBSCAN class takes the data as input, and the result of the clustering can be accessed via the labels_ attribute.

Then we use the matplotlib library to plot the results, where each point is colored according to its cluster label. The clustering algorithm detects the structure of the data, forming clusters of varying densities and shapes.

The HDBSCAN algorithm also allows for the specification of additional parameters, like the distance metric for the clustering, the minimum probability to form a cluster and others.

With the HDBSCAN algorithm, and the hdbscan library in Python, we can easily cluster and analyze datasets with complex structures and varying densities.

Thanks for reading.

--

--

Tarique Akhtar

Data Science Professional, Love to learn new things!!!We can get connected through LinkedIn (https://www.linkedin.com/in/tarique-akhtar-6b902651/)