Clustering Algorithm for Obtaining Equal Sized Clusters

Clustering with Specific Sized Groups

There is a tutorial on how to implement such an algorithm in ELKI:

http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans

Also have a look at constraint clustering algorithms; although usually these algorithms only support "Must link" and "cannot link" constraints, not size constraints.

You should be able to do a similar modification where you first specify the group sizes, then assign points randomly, and swap cluster members as long as your objective function improves; similar to k-means / k-medoids. As you may get stuck in local minima, restart a number of times and only keep the best.

See also earlier questions, e.g.
K-means algorithm variation with equal cluster size
and
Group n points in k clusters of equal size

How to create clusters with equal sizes

There are a few discussions on this topic.

https://elki-project.github.io/tutorial/same-size_k_means

Group n points in k clusters of equal size

K-means algorithm variation with equal cluster size

Also, check out Affinity Propagation and DBSCAN. Both are great alternatives to the very popular K-Means algo, and both find the optimal number of clusters automatically, unlike K-Means.

https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

I'm not saying that these will give you clusters of equal sizes, but it's good to know about these other alternatives, and using these methodologies is are probably more practical than forcing clusters to have an equal number of data points. Clustering is an unsupervised type of analysis. It seems like forcing clusters to have equal sizes of results is somewhat of a forced method, and almost supervised, which it is not designed to be.

Algorithm for clustering with minimum size constraints

To get a minimal (unfortunately not minimum) solution:

  1. First, greedily recluster any points that you can without violating
    the minimum size constraint.
  2. Next, build a directed multigraph as follows:
    1. Every cluster becomes a node.
    2. An edge (A,B) is added for every point in A that is closer to the center of B (so that there are potentially multiple edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.
  3. Looking for cycles in this graph will allow you to find moves
    (where a move consists of moving every vertex in the cycle).
  4. Pick the cycle with the highest total weight and recluster the nodes corresponding to the edges.
  5. Repeat steps 1-4 until there are no more cycles.

Creating the graph will have a complexity of O(kn), where you have k and n total points, and can create the same number of multiedges. Tarjan's algorithm will have complexity of O(k2), assuming that you skip multiedges to the same destination in the DFS. Every time you eliminate a cycle, you decrease the global distance by some amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O(k4m2). That is quite extravagant; I'm sure there could be heuristics (and probably more detailed analysis) to improve the lower bound.



Related Topics



Leave a reply



Submit