Scikit-Learn Dbscan Memory Usage

scikit-learn DBSCAN memory usage

The problem apparently is a non-standard DBSCAN implementation in scikit-learn.

DBSCAN does not need a distance matrix. The algorithm was designed around using a database that can accelerate a regionQuery function, and return the neighbors within the query radius efficiently (a spatial index should support such queries in O(log n)).

The implementation in scikit however, apparently, computes the full O(n^2) distance matrix, which comes at a cost both memory-wise and runtime-wise.

So I see two choices:

You may want to try the DBSCAN implementation in ELKI instead, which when used with an R*-tree index usually is substantially faster than a naive implementation.
Otherwise, you may want to reimplement DBSCAN, as the implementation in scikit apparently isn't too good. Don't be scared of that: DBSCAN is really simple to implement yourself. The trickiest part of a good DBSCAN implementation is actually the regionQuery function. If you can get this query fast, DBSCAN will be fast. And you can actually reuse this function for other algorithms, too.

Update: by now, sklearn no longer computes a distance matrix and can, e.g., use a kd-tree index. However, because of "vectorization" it will still precompute the neighbors of every point, so the memory usage of sklearn for large epsilon is O(n²), whereas to my understanding the version in ELKI will only use O(n) memory. So if you run out of memory, choose a smaller epsilon and/or try ELKI.

Scikit-Learn DBSCAN creating a cluster with two highly disconnected islands

This is interesting. I added the cluster number and we see that for the purple it is -1.

Sample Image

That means NO CLUSTER . So that's how the pair in the upper left can "share" the same "cluster" with the ones on the far right: they do not actually share the cluster but just the identifier that sckit-learn uses for NO cluster.

Unexpected dbscan result

i found the problem. limited color list! i should add more colors because DBSCAN generates new cluster but with same color and that was my mistake.

so this is working fine now:

dbscan_opt=DBSCAN(eps=3.0,min_samples=1).fit(data[[0,1]])

data['DBSCAN_opt_labels']=dbscan_opt.labels_
data['DBSCAN_opt_labels'].value_counts()

print(data)

# Plotting the resulting clusters
colors=['purple','red','blue','green', 'magenta', 'teal', 'black']
plt.figure(figsize=(6,6))
plt.scatter(data[0],data[1],c=data['DBSCAN_opt_labels'],cmap=matplotlib.colors.ListedColormap(colors),s=60)
plt.title('DBSCAN Clustering',fontsize=20)
plt.xlabel('Feature 1',fontsize=14)
plt.ylabel('Feature 2',fontsize=14)
plt.show()

Can we refit or fit in in parts clustering algorithms?

If the number of features in your dataset is not too much (below 20-25), you can consider using BIRCH. It's an iterative method that can be used for large datasets. In each iteration it builds a tree with only a small sample of data and put each instance into clusters.

Scikit-Learn Dbscan Memory Usage