Clustering a very large dataset in R
You can use kmeans, which is normally suitable for this amount of data, to compute a large number of centers (1000, 2000, ...) and then perform hierarchical clustering on the coordinates of these centers. This way the distance matrix will be much smaller.
## Example
# Data
x <- rbind(matrix(rnorm(70000, sd = 0.3), ncol = 2),
matrix(rnorm(70000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
# Hierarchical clustering (CAH) without kmeans: may fail or be very slow on data this size
library(FactoMineR)
cah.test <- HCPC(x, graph=FALSE, nb.clust=-1)
# Hierarchical clustering (CAH) on the kmeans centers: runs quickly
cl <- kmeans(x, 1000, iter.max=20)
cah <- HCPC(cl$centers, graph=FALSE, nb.clust=-1)
plot.HCPC(cah, choice="tree")
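Once the centers have been grouped, each original point can inherit the group of its kmeans center. The following is only a minimal sketch of that step: it swaps in base R's hclust() and cutree() instead of HCPC so that it runs without extra packages, and the number of centers (200), the linkage method and the final number of groups (2) are illustrative assumptions, not values from the answer above.

```r
set.seed(42)

# Smaller toy data with the same shape as the example above
x <- rbind(matrix(rnorm(20000, sd = 0.3), ncol = 2),
           matrix(rnorm(20000, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

# Step 1: compress the data into many kmeans centers
cl <- kmeans(x, centers = 200, iter.max = 50)

# Step 2: hierarchical clustering on the centers only (200 x 200 distances)
hc <- hclust(dist(cl$centers), method = "ward.D2")
center_group <- unname(cutree(hc, k = 2))  # k = 2 is an assumption

# Step 3: every point takes the group of the center it was assigned to
point_group <- center_group[cl$cluster]
table(point_group)
```

The distance matrix here only ever covers the centers, so its size depends on the number of centers rather than on the number of rows in x.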
How to cluster big data using Python or R without memory error?
For trivial reasons, the function dist needs quadratic memory.
So if you have 1 million (10^6) points, the full distance matrix has 10^12 entries. With double precision, each entry takes 8 bytes. Exploiting symmetry, you only need to store half of the entries, but that is still 4*10^12 bytes, i.e. 4 terabytes, just to store this matrix. Even if you stored it on SSD or upgraded your system to 4 TB of RAM, computing all these distances would take an insane amount of time.
And 1 million is still pretty small, isn't it?
Using dist on big data is impossible. End of story.
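The arithmetic above can be checked quickly. This is a small sketch; dist_bytes is a hypothetical helper name introduced here, and it counts only the n(n-1)/2 doubles of the lower triangle, ignoring any object overhead.

```r
# Bytes needed to hold the lower triangle of an n x n distance matrix
# (hypothetical helper, 8 bytes per double-precision entry)
dist_bytes <- function(n) {
  n * (n - 1) / 2 * 8
}

n <- c(1e3, 1e4, 1e5, 1e6)
data.frame(points = n,
           gigabytes = dist_bytes(n) / 1e9)
```

For 10^6 points this comes to roughly 4000 GB, which matches the 4 terabyte estimate.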
For larger data sets, you'll need to
- use methods such as k-means that do not use pairwise distances
- use methods such as DBSCAN that do not need a distance matrix, and where in some cases an index can reduce the effort to O(n log n)
- subsample your data to make it smaller
In particular, that last option is a good idea if you don't have a working solution yet. There is no use in struggling with the scalability of a method that does not work.
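The subsampling option can be sketched as follows. This is only an illustration on toy data: the sample size (2000), the linkage method and the number of clusters (2) are assumptions chosen so that dist() stays small.

```r
set.seed(1)

# Toy data standing in for a large data set
x <- rbind(matrix(rnorm(100000, sd = 0.3), ncol = 2),
           matrix(rnorm(100000, mean = 1, sd = 0.3), ncol = 2))

# Draw a random subsample that dist() can handle comfortably
idx <- sample(nrow(x), 2000)
x_small <- x[idx, ]

# Prototype the method on the subsample first
hc <- hclust(dist(x_small), method = "ward.D2")
groups <- cutree(hc, k = 2)
table(groups)
```

If the result on the subsample is not useful, scaling the same method up to the full data would not have helped either.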
How to create cluster plots for large datasets in R
First off, it seems like plot() for clara objects gives two plots, the first being identical to that returned by clusplot(). If the former finished but the latter did not, I'm guessing that's just because you're clogging up the plot history. If you save large plots to png you won't run into this problem. They'll still take a while, but it won't interfere with whatever else it is you're doing.
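Saving the plot to a png file might look like the sketch below. The toy data, the clara() settings and the output path are assumptions made up for the example; in practice clara.x would be your existing clara object.

```r
library(cluster)  # provides clara() and clusplot()

set.seed(1)
x <- rbind(matrix(rnorm(20000, sd = 0.3), ncol = 2),
           matrix(rnorm(20000, mean = 1, sd = 0.3), ncol = 2))
clara.x <- clara(x, k = 2, samples = 5)

# Write the plot straight to a file instead of the interactive device
out_file <- file.path(tempdir(), "clusplot.png")  # hypothetical output path
png(out_file, width = 1200, height = 900)
clusplot(clara.x)
dev.off()
```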
Regarding reducing the number of plotted points, we can do this manually by adjusting the list elements of clara.x. You just have to choose which points you want to plot. Below, I give an example where I just use the samples from the clara method. But if you want to plot more, you can choose them with sample() or something similar:
# Manually shrinking clara object
samp <- clara.x$sample
clara.x$data <- clara.x$data[samp, ]
clara.x$clustering <- clara.x$clustering[samp]
clara.x$i.med <- match(clara.x$i.med, samp) # point medoid indx to samp
# plot the cluster solution
clusplot(clara.x)
One subtlety is that the medoid samples must always be among the indices you choose to plot, otherwise the 5th line above won't work. To ensure this for any given samp, add the following after the 2nd line above:
samp <- union(samp, clara.x$i.med)
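Put together with a random choice of points, the whole procedure could look like this. It is a sketch under assumptions: the toy data, the clara() settings and the 500 sampled points are made up for illustration, and the shrunken copy is stored in a new object (small, a hypothetical name) so the original clara.x stays untouched.

```r
library(cluster)  # provides clara() and clusplot()

set.seed(1)
x <- rbind(matrix(rnorm(20000, sd = 0.3), ncol = 2),
           matrix(rnorm(20000, mean = 1, sd = 0.3), ncol = 2))
clara.x <- clara(x, k = 2, samples = 5)

# Choose which points to plot, and make sure the medoids are included
samp <- sample(nrow(x), 500)
samp <- union(samp, clara.x$i.med)

# Shrink a copy of the clara object to the chosen points
small <- clara.x
small$data <- clara.x$data[samp, ]
small$clustering <- clara.x$clustering[samp]
small$i.med <- match(clara.x$i.med, samp)  # medoid positions within samp

clusplot(small)
```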
ADDENDUM: I just saw the first answer, which differs from mine. It suggests re-computing the clustering. A benefit of my approach is that it keeps the original clustering computation and only changes which points you plot.
Clustering with non independent variables and very large data set
Try doing a principal components analysis on the data, then run kmeans or knn on however many components you decide to keep.
There are a couple of different packages that make this fairly straightforward; you'll have to mean-center and scale your data beforehand. You'll also have to convert any factors to numeric columns using one-hot encoding (one column for every possible level of the original factor column).
Look into 'prcomp' or 'princomp'.
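A minimal sketch of that pipeline with prcomp and kmeans is shown below. The data frame, the 90% variance cutoff and the choice of 3 clusters are all assumptions made up for the example. Note that model.matrix(~ . - 1) produces a full set of indicator columns only for the first factor in the formula, so data with several factors would need each one expanded separately.

```r
set.seed(1)

# Toy data: correlated numeric columns plus one factor
df <- data.frame(a = rnorm(500),
                 b = rnorm(500),
                 g = factor(sample(c("u", "v", "w"), 500, replace = TRUE)))
df$c <- df$a + rnorm(500, sd = 0.1)  # deliberately correlated with a

# One-hot encode the factor (one column per level), keep numerics as-is
X <- model.matrix(~ . - 1, data = df)

# PCA with mean centering and scaling
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Keep enough components to explain 90% of the variance (assumed cutoff)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(var_explained >= 0.9)[1]
scores <- pca$x[, 1:k, drop = FALSE]

# Cluster on the retained components
km <- kmeans(scores, centers = 3, nstart = 10)
table(km$cluster)
```

Clustering on the component scores rather than the raw columns means the correlated variables a and c no longer count twice in the distance computation.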