Clustering with Specific Sized Groups
There is a tutorial on how to implement such an algorithm in ELKI:
http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans
Also have a look at constraint clustering algorithms; although usually these algorithms only support "Must link" and "cannot link" constraints, not size constraints.
You should be able to do a similar modification where you first specify the group sizes, then assign points randomly, and swap cluster members as long as your objective function improves; similar to k-means / k-medoids. As you may get stuck in local minima, restart a number of times and only keep the best.
See also earlier questions, e.g.
K-means algorithm variation with equal cluster size
and
Group n points in k clusters of equal size
How to create clusters with equal sizes
There are a few discussions on this topic.
https://elki-project.github.io/tutorial/same-size_k_means
Group n points in k clusters of equal size
K-means algorithm variation with equal cluster size
Also, check out Affinity Propagation and DBSCAN. Both are great alternatives to the very popular K-Means algo, and both find the optimal number of clusters automatically, unlike K-Means.
https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
I'm not saying that these will give you clusters of equal sizes, but it's good to know about these other alternatives, and using these methodologies is are probably more practical than forcing clusters to have an equal number of data points. Clustering is an unsupervised type of analysis. It seems like forcing clusters to have equal sizes of results is somewhat of a forced method, and almost supervised, which it is not designed to be.
Algorithm for clustering with minimum size constraints
To get a minimal (unfortunately not minimum) solution:
- First, greedily recluster any points that you can without violating
the minimum size constraint. - Next, build a directed multigraph as follows:
- Every cluster becomes a node.
- An edge (A,B) is added for every point in A that is closer to the center of B (so that there are potentially multiple edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.
- Looking for cycles in this graph will allow you to find moves
(where a move consists of moving every vertex in the cycle). - Pick the cycle with the highest total weight and recluster the nodes corresponding to the edges.
- Repeat steps 1-4 until there are no more cycles.
Creating the graph will have a complexity of O(kn), where you have k and n total points, and can create the same number of multiedges. Tarjan's algorithm will have complexity of O(k2), assuming that you skip multiedges to the same destination in the DFS. Every time you eliminate a cycle, you decrease the global distance by some amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O(k4m2). That is quite extravagant; I'm sure there could be heuristics (and probably more detailed analysis) to improve the lower bound.
Related Topics
Rank a Vector Based on Order and Replace Ties with Their Average
R Partial Reshape Data from Long to Wide
Keep Before and After Date of an External List
Change Level of Multiple Factor Variables
Using If Else Conditions on Vectors
R Ggplot2 Add Today's Date to the Title
Running Multiple Linear Regressions Across Several Columns of a Data Frame in R
Scale_Color_Manual Colors Won't Change
Trying to Find Row Associated with Max Value in Dataframe R
Difference Between Mean(C(1,2,21)) and Mean(1,2,21)
Multiple Graphs of Each Time Series
How to Plot the Results of a Mixed Model
How to Remove "Rows" with a Na Value
Read Lines by Number from a Large File
How to Prevent Exposure of My Password When Using Rgoogledocs
Traceback() for Interactive and Non-Interactive R Sessions