Choosing eps and minPts for DBSCAN (R)

Choosing eps and minPts for DBSCAN (R)?

There is no general way of choosing minPts. It depends on what you want to find. A low minPts means it will build more clusters from noise, so don't choose it too small.

For epsilon, several things matter. It again boils down to choosing whatever works on this data set with this minPts, this distance function, and this normalization. You can try a k-NN distance plot (the k-nearest-neighbor distance of every point, sorted) and choose a "knee" there, but there might be no visible knee, or several.
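To make the knee idea concrete, here is a rough sketch in plain Python (not R, and not any package's implementation). The helper names `k_distances` and `knee_index` and the toy data are made up for illustration. The "farthest from the chord" rule is only one crude way to pick a knee, so treat the result as a starting candidate for epsilon, not an answer.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances to every other point (brute force, O(n^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

def knee_index(sorted_values):
    """Index of the value farthest from the straight line joining first and last."""
    n = len(sorted_values)
    x0, y0 = 0, sorted_values[0]
    x1, y1 = n - 1, sorted_values[-1]
    best_i, best_d = 0, -1.0
    for i, y in enumerate(sorted_values):
        # Perpendicular distance to the chord, up to a constant factor.
        d = abs((y1 - y0) * i - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if d > best_d:
            best_i, best_d = i, d
    return best_i

# Toy data (made up): a tight square of four points plus one far-away outlier.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
eps_candidate = kd[knee_index(kd)]
print(kd, eps_candidate)
```

On real data you would still look at the sorted curve yourself. If it is smooth, any automatically picked knee is close to arbitrary.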

OPTICS is a successor to DBSCAN that does not need the epsilon parameter (except for performance reasons with index support, see Wikipedia). It's much nicer, but I believe it is a pain to implement in R, because it needs advanced data structures (ideally, a data index tree for acceleration and an updatable heap for the priority queue), and R is all about matrix operations.

Naively, one can imagine OPTICS as running DBSCAN for all values of epsilon at the same time, and putting the results in a cluster hierarchy.

The first thing you need to check, however, pretty much independent of whichever clustering algorithm you are going to use, is that you have a useful distance function and appropriate data normalization. If your distance degenerates, no clustering algorithm will work.
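A small illustration of why normalization matters, again as a plain Python sketch. The `standardize` helper and the age/income rows are invented for this example. It uses the population standard deviation, whereas R's `scale()` divides by the sample standard deviation, so the numbers will differ slightly from what R gives.

```python
import math
import statistics

def standardize(rows):
    """Z-score every column: subtract the column mean, divide by the column std dev."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, sds))
        for row in rows
    ]

# Made-up rows: (age in years, income in dollars).
rows = [(25, 30000), (26, 90000), (60, 31000), (61, 91000)]

# Raw Euclidean distance is dominated by income, so age barely counts.
raw_same_age = math.dist(rows[0], rows[1])     # similar age, very different income
raw_same_income = math.dist(rows[0], rows[2])  # very different age, similar income

scaled = standardize(rows)
scaled_same_age = math.dist(scaled[0], scaled[1])
scaled_same_income = math.dist(scaled[0], scaled[2])

print(raw_same_age, raw_same_income)
print(scaled_same_age, scaled_same_income)
```

Whether z-scoring is the right normalization depends on the data. The point is only that eps means nothing until the distance itself is meaningful.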

Is minPts = 4 the best setting for any dataset when clustering with DBSCAN?

In later work, the authors suggest using minPts = 2 * dim as the default, where dim is the number of dimensions of the data.

J. Sander, M. Ester, H.-P. Kriegel, and X. Xu. 1998. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications. Data Mining and Knowledge Discovery 2, 2 (1998), 169–194.
http://dx.doi.org/10.1023/A:1009745219419

If you have duplicates, use a larger value:
"Our experiments indicate that this value works well for databases D where each point occurs only once, i.e., if D is really a set of points."
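As a tiny sketch of that rule of thumb in plain Python: the helper names `suggested_min_pts` and `has_duplicates` are made up here, and how much larger to go when duplicates are present is left to you, since the quote gives no number.

```python
def suggested_min_pts(points):
    """Starting value from the 2 * dim heuristic (dim = number of coordinates)."""
    dim = len(points[0])
    return 2 * dim

def has_duplicates(points):
    """True if any point occurs more than once, i.e. the data is not a true set."""
    return len(set(map(tuple, points))) < len(points)

# Made-up 2-D data containing one repeated point.
points = [(0.0, 1.0), (2.0, 3.0), (2.0, 3.0)]
min_pts = suggested_min_pts(points)
if has_duplicates(points):
    print("duplicates present: consider a value above", min_pts)
else:
    print("starting value:", min_pts)
```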

Smaller values are usually more computationally efficient. Thus, keep minPts small but not too small.

Always study your result. Never use it without double checking.

How do I determine the distance / eps for DBSCAN in R?

First calculate the distance matrix of your data. Then, instead of using method='raw', use method='dist' (these look like the method options of dbscan in the fpc package). That way dbscan treats its input as a distance matrix, so you don't need to worry about how the distance function is implemented. Note that this can require much more memory, since you precompute the full distance matrix and keep it in memory.
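To show what "treat the input as a distance matrix" amounts to, here is a minimal DBSCAN written against a precomputed matrix. It is a plain Python sketch for illustration, not the code of any R package. The function names, the toy points, and the choice of 0 as the noise label are all assumptions of this example.

```python
import math

NOISE = 0  # label used for noise points in this sketch

def distance_matrix(points):
    """Full n x n Euclidean distance matrix (O(n^2) memory)."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

def dbscan_from_distances(dist, eps, min_pts):
    """Cluster labels 1..k (0 = noise), using only the precomputed distances."""
    n = len(dist)
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]  # includes i itself
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [m for m in range(n) if dist[j][m] <= eps]
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
    return labels

# Made-up data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan_from_distances(distance_matrix(points), eps=1.5, min_pts=3)
print(labels)
```

Because the algorithm only ever reads `dist[i][j]`, you can swap in any distance you like when building the matrix. The price is the quadratic memory mentioned above.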


