Clustering a large, very sparse, binary matrix in R
I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than dist(x, method = "binary")
. It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the correct order internally). This is then used in some C++ code which handles the data as 64 bit unsigned integers for speed. The Jaccard distance of two vectors x and y is equal to x ^ y / (x | y)
where ^
is the xor operator. The Hamming Weight calculation is used to count the number of bits set if the result of the xor
or or
is non-zero.
I've put together the code on github at https://github.com/NikNakk/binaryDist/ and reproduced the two files below. I've confirmed that the results are the same as dist(x, method = "binary")
for a few random datasets.
On a dataset of 39000 rows by 14000 columns with 1-5 ones per row, it took about 11 minutes. The output distance matrix was 5.7 GB.
bDist.cpp
#include <Rcpp.h>
using namespace Rcpp;
//countBits function taken from https://en.wikipedia.org/wiki/Hamming_weight#Efficient_implementation
const uint64_t m1 = 0x5555555555555555; //binary: 0101...
const uint64_t m2 = 0x3333333333333333; //binary: 00110011..
const uint64_t m4 = 0x0f0f0f0f0f0f0f0f; //binary: 4 zeros, 4 ones ...
const uint64_t h01 = 0x0101010101010101; //the sum of 256 to the power of 0,1,2,3...
int countBits(uint64_t x) {
x -= (x >> 1) & m1; //put count of each 2 bits into those 2 bits
x = (x & m2) + ((x >> 2) & m2); //put count of each 4 bits into those 4 bits
x = (x + (x >> 4)) & m4; //put count of each 8 bits into those 8 bits
return (x * h01)>>56; //returns left 8 bits of x + (x<<8) + (x<<16) + (x<<24) + ...
}
// [[Rcpp::export]]
int countBitsFromRaw(RawVector rv) {
uint64_t* x = (uint64_t*)RAW(rv);
return(countBits(*x));
}
// [[Rcpp::export]]
NumericVector bDist(RawMatrix mat) {
int nr(mat.nrow()), nc(mat.ncol());
int nw = nr / 8;
NumericVector res(nc * (nc - 1) / 2);
// Access the raw data as unsigned 64 bit integers
uint64_t* data = (uint64_t*)RAW(mat);
uint64_t a(0);
// Work through each possible combination of columns (rows in the original integer matrix)
for (int i = 0; i < nc - 1; i++) {
for (int j = i + 1; j < nc; j++) {
uint64_t sx = 0;
uint64_t so = 0;
// Work through each 64 bit integer and calculate the sum of (x ^ y) and (x | y)
for (int k = 0; k < nw; k++) {
uint64_t o = data[nw * i + k] | data[nw * j + k];
// If (x | y == 0) then (x ^ y) will also be 0
if (o) {
// Use Hamming weight method to calculate number of set bits
so = so + countBits(o);
uint64_t x = data[nw * i + k] ^ data[nw * j + k];
if (x) {
sx = sx + countBits(x);
}
}
}
res(a++) = (double)sx / so;
}
}
return (res);
}
R source
library("Rcpp")
library("plyr")
sourceCpp("bDist.cpp")
# Converts a binary integer vector into a packed raw vector,
# padding out at the end to make the input length a multiple of packWidth
packRow <- function(row, packWidth = 64L) {
packBits(as.raw(c(row, rep(0, (packWidth - length(row)) %% packWidth))))
}
as.PackedMatrix <- function(x, packWidth = 64L) {
UseMethod("as.PackedMatrix")
}
# Converts a binary integer matrix into a packed raw matrix
# padding out at the end to make the input length a multiple of packWidth
as.PackedMatrix.matrix <- function(x, packWidth = 64L) {
stopifnot(packWidth %% 8 == 0, class(x) %in% c("matrix", "Matrix"))
storage.mode(x) <- "raw"
if (ncol(x) %% packWidth != 0) {
x <- cbind(x, matrix(0L, nrow = nrow(x), ncol = packWidth - (ncol(x) %% packWidth)))
}
out <- packBits(t(x))
dim(out) <- c(ncol(x) %/% 8, nrow(x))
class(out) <- "PackedMatrix"
out
}
# Converts back to an integer matrix
as.matrix.PackedMatrix <- function(x) {
out <- rawToBits(x)
dim(out) <- c(nrow(x) * 8L, ncol(x))
storage.mode(out) <- "integer"
t(out)
}
# Generates random sparse data for testing the main function
makeRandomData <- function(nObs, nVariables, maxBits, packed = FALSE) {
x <- replicate(nObs, {
y <- integer(nVariables)
y[sample(nVariables, sample(maxBits, 1))] <- 1L
if (packed) {
packRow(y, 64L)
} else {
y
}
})
if (packed) {
class(x) <- "PackedMatrix"
x
} else {
t(x)
}
}
# Reads a binary matrix from file or character vector
# Borrows the first bit of code from read.table
readPackedMatrix <- function(file = NULL, text = NULL, packWidth = 64L) {
if (missing(file) && !missing(text)) {
file <- textConnection(text)
on.exit(close(file))
}
if (is.character(file)) {
file <- file(file, "rt")
on.exit(close(file))
}
if (!inherits(file, "connection"))
stop("'file' must be a character string or connection")
if (!isOpen(file, "rt")) {
open(file, "rt")
on.exit(close(file))
}
lst <- list()
i <- 1
while(length(line <- readLines(file, n = 1)) > 0) {
lst[[i]] <- packRow(as.integer(strsplit(line, "", fixed = TRUE)[[1]]), packWidth = packWidth)
i <- i + 1
}
out <- do.call("cbind", lst)
class(out) <- "PackedMatrix"
out
}
# Wrapper for the C++ code which
binaryDist <- function(x) {
if (class(x) != "PackedMatrix") {
x <- as.PackedMatrix(x)
}
dst <- bDist(x)
attr(dst, "Size") <- ncol(x)
attr(dst, "Diag") <- attr(dst, "Upper") <- FALSE
attr(dst, "method") <- "binary"
attr(dst, "call") <- match.call()
class(dst) <- "dist"
dst
}
x <- makeRandomData(2000, 400, maxBits = 5, packed = TRUE)
system.time(bd <- binaryDist(x))
From original answer:
Other things to consider would be doing some prefiltering of comparisons between two rows with single ones since the distance will either be 0 for duplicates or 1 for any other possibility.
A couple of relatively straightforward options that might be faster without needing much code are the vegdist
function from the vegan package and the Dist
function from the amap package. The latter will probably only be quicker if you have multiple cores and take advantage of the fact it supports parallelisation.
Hierarchical Clustering Large Sparse Distance Matrix R
While using a sparse matrix in the first place seems like a good idea, I think there is a bit of a problem with that approach: your missing distances will be coded as 0
s, not as NA
s (see Creating (and Accessing) a Sparse Matrix with NA default entries). As you know, when clustering, a zero dissimilarity has a totally different meaning than a missing one...
So anyway, what you need is a dist object with a lot of NA
s for your missing dissimilarities. Unfortunately, your problem is so big that it would require too much memory:
d <- dist(x = rep(NA_integer_, 50000))
# Error: cannot allocate vector of size 9.3 Gb
And that's only dealing with the input... Even with a 64 bit machine with a lot of memory, I'm not sure the clustering algorithm itself would not choke or run indefinitely.
You should consider breaking your problem into smaller pieces.
K-means with really large matrix
Does it have to be K-means? Another possible approach is to transform your data into a network first, then apply graph clustering. I am the author of MCL, an algorithm used quite often in bioinformatics. The implementation linked to should easily scale up to networks with millions of nodes - your example would have 300K nodes, assuming that you have 100K attributes. With this approach, the data will be naturally pruned in the data transformation step - and that step will quite likely become the bottleneck. How do you compute the distance between two vectors? In the applications that I have dealt with I used the Pearson or Spearman correlation, and MCL is shipped with software to efficiently perform this computation on large scale data (it can utilise multiple CPUs and multiple machines).
There is still an issue with the data size, as most clustering algorithms will require you to at least perform all pairwise comparisons at least once. Is your data really stored as a giant matrix? Do you have many zeros in the input? Alternatively, do you have a way of discarding smaller elements? Do you have access to more than one machine in order to distribute these computations?
Related Topics
How to Set Seed for Random Simulations with Foreach and Domc Packages
Multiple Lines for Text Per Legend Label in Ggplot2
Three-Way Color Gradient Fill in R
Keeping Zero Count Combinations When Aggregating with Data.Table
Increase the API Limit in Ggmap's Geocode Function (In R)
Writing to a Dataframe from a For-Loop in R
Creating a Facet_Wrap Plot with Ggplot2 with Different Annotations in Each Plot
How to Include Rmarkdown File in R Package
Replacing Nas in R with Nearest Value
R: Remove Multiple Empty Columns of Character Variables
R 3.4.1 "Single Candle" Personal Library Path Error: Unable to Create 'Na'
Reversed Order After Coord_Flip in R
Apply Grouped Model Back Onto Data
Adding Text to Ggplot Geom_Jitter Points That Match a Condition
Anti-Aliasing in R Graphics Under Windows (As Per MAC)
Assign New Data Point to Cluster in Kernel K-Means (Kernlab Package in R)