Clustering Very Large Dataset in R

clustering very large dataset in R

You can use kmeans, which is normally suitable for this amount of data, to compute a large number of centers (1000, 2000, ...) and then perform hierarchical clustering on the coordinates of those centers. This way the distance matrix is much smaller.

## Example
# Data
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
           matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Hierarchical clustering (CAH) directly on the data: does not necessarily work,
# the distance matrix gets too large
library(FactoMineR)
cah.test <- HCPC(x, graph = FALSE, nb.clust = -1)

# CAH on the kmeans centers: works quickly
cl <- kmeans(x, 1000, iter.max = 20)
cah <- HCPC(cl$centers, graph = FALSE, nb.clust = -1)
plot.HCPC(cah, choice = "tree")

How to cluster big data using Python or R without memory error?

The function dist needs quadratic memory, for the simple reason that it stores every pairwise distance.

So if you have 1 million (10^6) points, the full distance matrix has 10^12 entries. With double precision, each entry takes 8 bytes. Exploiting symmetry, you only need to store half of the entries, but that is still 4*10^12 bytes, i.e. 4 terabytes just to store this matrix. Even if you stored it on an SSD or upgraded your system to 4 TB of RAM, computing all these distances would take an insane amount of time.
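As a quick sketch of that arithmetic (plain R, nothing library-specific):

# Memory dist() would need for n points, lower triangle only
n <- 1e6
entries <- n * (n - 1) / 2   # number of pairwise distances stored
bytes <- entries * 8         # 8 bytes per double
bytes                        # ~4e12 bytes, i.e. about 4 terabytes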

And 1 million is still pretty small, isn't it?

Using dist on big data is impossible. End of story.

For larger data sets, you'll need to

  • use methods such as k-means that do not use pairwise distances
  • use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n)
  • subsample your data to make it smaller

In particular, that last option is a good idea if you don't have a working solution yet. There is no point in struggling with the scalability of a method that does not work.
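A minimal sketch of the last two options combined, assuming your data is in a numeric matrix x and the dbscan package is installed (eps and minPts are placeholder values you would have to tune):

library(dbscan)

# Subsample first to get something workable
set.seed(1)
idx <- sample(nrow(x), 50000)
x.sub <- x[idx, ]

# DBSCAN never materialises the full distance matrix; with a spatial
# index the neighbourhood queries are roughly O(n log n)
db <- dbscan(x.sub, eps = 0.15, minPts = 10)
table(db$cluster)   # cluster 0 is noise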

How to create cluster plots for large datasets in R

First off, it seems like plot() for clara objects gives two plots, the first being identical to that returned by clusplot(). If the former finished but the latter did not, I'm guessing that's just because you're clogging up the plot history. If you save large plots to png you won't run into this problem. They'll still take a while, but it won't interfere with whatever else it is you're doing.
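For example, something along these lines (file name and dimensions are arbitrary; clusplot() is in the cluster package):

library(cluster)
png("clara_clusplot.png", width = 1200, height = 1200, res = 150)
clusplot(clara.x)
dev.off()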

Regarding reducing the number of plotted points, we can do this manually by adjusting the list elements of clara.x. You just have to choose which points you want to plot. Below, I give an example that uses only the samples from the clara method, but if you want to plot more points you can choose them with sample() or something similar:

# Manually shrinking clara object
samp <- clara.x$sample
clara.x$data <- clara.x$data[samp, ]
clara.x$clustering <- clara.x$clustering[samp]
clara.x$i.med <- match(clara.x$i.med, samp) # point medoid indices to samp

# plot the cluster solution
clusplot(clara.x)

One subtlety is that the medoid samples must always be among whatever indices you choose to plot, otherwise the 5th line above won't work. To ensure this for any given samp, add the following after the 2nd line above:

samp <- union(samp, clara.x$i.med)
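Putting it together, a hypothetical variant that plots 5,000 random rows (plus the medoids) instead of only the built-in clara samples:

# Plot a larger random subset instead of clara's own samples
samp <- sample(nrow(clara.x$data), 5000)
samp <- union(samp, clara.x$i.med)                  # medoids must be included
clara.x$data <- clara.x$data[samp, ]
clara.x$clustering <- clara.x$clustering[samp]
clara.x$i.med <- match(clara.x$i.med, samp)
clusplot(clara.x)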

ADDENDUM: I just saw the first answer, which is different from mine: it suggests re-computing the clustering. A benefit of my approach is that it keeps the original clustering computation and only adjusts which points you plot.

Clustering with non-independent variables and a very large data set

Try doing a principal components analysis on the data, then k-means or k-NN on however many dimensions you decide to keep.

There are a couple of different packages that are fairly straightforward to use for this. You'll have to mean-center and scale your data first. You'll also have to convert any factors to numeric using one-hot encoding (one column for every possible level of the original factor column).

Look into prcomp() or princomp().
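A minimal sketch of that workflow, where df, the number of components kept, and the number of clusters are all placeholders you would choose for your data:

# Expand factors to dummy columns (one-hot), then centre and scale
X <- scale(model.matrix(~ . - 1, data = df))

# PCA; look at the variance explained to decide how many components to keep
pc <- prcomp(X)
summary(pc)

# k-means on, say, the first 5 principal components
km <- kmeans(pc$x[, 1:5], centers = 10, nstart = 20)
table(km$cluster)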


