Can I Use a List as a Hash in R? If So, Why Is It So Slow?

Can I use a list as a hash in R? If so, why is it so slow?

The underlying reason is that R lists with named elements are not hashed. Hash lookups are O(1) on average because, on insert, the key is converted to an integer by a hash function and the value is stored in slot hash(key) % num_spots of an array num_spots long (this is a big simplification that ignores the complexity of dealing with collisions). Looking up a key only requires hashing it again to find the value's position, which is O(1), versus an O(n) scan of an array. R lists look names up by scanning them, which is O(n).
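The difference can be seen with a small sketch that stores the same pairs in a named list and in an environment. The key names, sizes, and iteration counts are arbitrary choices for illustration, and timings vary by machine, so no expected numbers are shown.

```r
# Same key/value pairs stored two ways (sizes are arbitrary, for illustration).
n <- 2000
kv <- as.list(seq_len(n))
names(kv) <- paste0("key", seq_len(n))

as_list <- kv                          # names are matched by scanning
as_env  <- list2env(kv, hash = TRUE)   # environment backed by a hash table

# Both structures return the same value for the same key.
from_list <- as_list[["key2000"]]
from_env  <- get("key2000", envir = as_env, inherits = FALSE)
identical(from_list, from_env)

# Rough timing of repeated lookups of the last key.
system.time(for (i in seq_len(10000)) as_list[["key2000"]])
system.time(for (i in seq_len(10000)) as_env[["key2000"]])
```

The last key is used on purpose, since it is the worst case for a linear scan over the names.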

As Dirk says, use the hash package. Be aware that it is built on environments (which are hashed) and overrides the [ methods to mimic hash tables. Base R environments can hold other environments as values, so nesting is possible at the environment level, but check how the hash package's own functions handle nested hashes before relying on that.

A while back I worked on implementing a pure hash table data structure in C/R that could be nested, but it went on my project back burner while I worked on other things. It would be nice to have though :-)
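In base R, at least, an environment can be stored as a value inside another environment, so a two-level lookup can be sketched without extra packages. The names `parent_tbl` and `child_tbl` are hypothetical, and the sketch says nothing about how the hash package's own accessors behave.

```r
# Hypothetical names; base R only, the hash package is not used here.
parent_tbl <- new.env(hash = TRUE)
child_tbl  <- new.env(hash = TRUE)

assign("x", 42, envir = child_tbl)              # value in the inner table
assign("child", child_tbl, envir = parent_tbl)  # inner table stored as a value

# Two-level lookup: fetch the inner environment, then the key inside it.
inner <- get("child", envir = parent_tbl, inherits = FALSE)
value <- get("x", envir = inner, inherits = FALSE)
value

# Environments are references, so a change made through `inner`
# is visible when going through `parent_tbl` again.
assign("y", "new", envir = inner)
ls(get("child", envir = parent_tbl, inherits = FALSE))
```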

R: Fast hash search in lists (environment)

Here's an example using an environment and data.table; the code is fairly self-explanatory:

library(data.table)

# create a big random example (160k rows)
set.seed(123)
fromTo <- expand.grid(1:400,1:400)
colnames(fromTo) <- c('a','b')
DF <- as.data.frame(cbind(fromTo,time=as.integer(runif(nrow(fromTo), min = 1, max=500))))

# setup the environment to use it as hashtable:
# we simply put the times inside an environment using 
# a|b (concatenation of a with b) as key
timesList <- as.list(DF$time)
names(timesList) <- paste(DF$a,DF$b,sep='|')
timesEnv <- list2env(timesList)  

# setup the data.table to use it as hashtable
DT <- setDT(DF,key=c('a','b'))

# create search functions
searchUsingEnv <- function(a,b){
  time <- get(paste(a,b,sep='|'),envir=timesEnv,inherits=FALSE)  
  return(time)
}
searchUsingDataTable <- function(from,to){
  time <- DT[.(from,to),time]
  return(time)
}

Benchmark:

# benchmark functions
# i.e. we search ~16K rows in our two kinds of hashtables
benchEnv <- function(){
  n <- nrow(fromTo)
  s <- as.integer(n * 0.9)
  for(i in s:n){
    searchUsingEnv(fromTo[i,'a'],fromTo[i,'b'])
  }
}
benchDT <- function(){
  n <- nrow(fromTo)
  s <- as.integer(n * 0.9)
  for(i in s:n){
    searchUsingDataTable(fromTo[i,'a'],fromTo[i,'b'])
  }
}

# let's measure the performance
> system.time(benchEnv(), gcFirst = TRUE)
   user  system elapsed 
   2.26    0.00    2.30 
> system.time(benchDT(), gcFirst = TRUE)
   user  system elapsed 
  42.34    0.00   42.56 

Conclusions:

In this benchmark the environment was roughly 18 times faster than data.table (2.30 s versus 42.56 s elapsed) for repeated single-key access, so it is worth trying.

EDIT:

Environments have fast access, but they can only have string keys, which occupy more memory than doubles. So I've added an example using Rcpp and std::map<> with a key made of multiple values. Note that std::map is an ordered tree with O(log n) lookups rather than a hash table.

(Note: on Windows you need to install Rtools for Rcpp to work.)

library(data.table)
library(Rcpp)
library(inline)

nRows <- 1e7

############# create data.table "DT" containing coordinates and times
generate_routes_dt <- function(nmax) {
  set.seed(123)
  routes <- data.table(lat1 = numeric(nmax),
    lng1 = numeric(nmax),
    lat2 = numeric(nmax),
    lng2 = numeric(nmax),
    time = numeric(nmax))
  tmp <- sample(seq(46, 49, length.out = nmax), nmax)
  routes$lat1 <- tmp
  tmp <- sample(seq(8, 10, length.out = nmax), nmax)
  routes$lng1 <- tmp
  tmp <- sample(seq(46, 49, length.out = nmax), nmax)
  routes$lat2 <- tmp
  tmp <- sample(seq(8, 10, length.out = nmax), nmax)
  routes$lng2 <- tmp
  tmp <- sample(seq(0, 1e7, length.out = nmax), nmax)
  routes$time <- as.integer(tmp)
  data.table::setkey(routes, lat1, lng1, lat2, lng2)
  return(routes)
}

DT <- generate_routes_dt(nRows)

############# create data.table search function
searchUsingDataTable <- function(lat_1,lng_1,lat_2,lng_2){
  time <- DT[.(lat_1,lng_1,lat_2,lng_2),time]
  return(time)
}
#############

############# create Rcpp search function
# the following code creates 2 functions: createMap and getTime
# usage:
#   map <- createMap(lat1Vec,lng1Vec,lat2Vec,lng2Vec,timesVec)
#   t <- getTime(map,lat1,lng1,lat2,lng2)
sourceCpp(code=
'
#include <Rcpp.h>
#include <map>

  class MultiKey {
  public:
    double  lat1;
    double  lng1;
    double  lat2;
    double  lng2;

    MultiKey(double la1, double ln1, double la2, double ln2)
      : lat1(la1), lng1(ln1), lat2(la2), lng2(ln2) {}  

    bool operator<(const MultiKey &right) const 
    {
      if ( lat1 == right.lat1 ) {
            if ( lng1 == right.lng1 ) {
                if ( lat2 == right.lat2 ) {
                    return lng2 < right.lng2;
                }
                else {
                    return lat2 < right.lat2;
                }
            }
            else {
                return lng1 < right.lng1;
            }
        }
        else {
            return lat1 < right.lat1;
        }
    }    
  };


  // [[Rcpp::export]]
  SEXP createMap(Rcpp::NumericVector lat1, 
                 Rcpp::NumericVector lng1, 
                 Rcpp::NumericVector lat2, 
                 Rcpp::NumericVector lng2, 
                 Rcpp::NumericVector times){
    std::map<MultiKey, double>* map = new std::map<MultiKey, double>;
    int n1 = lat1.size();
    int n2 = lng1.size();
    int n3 = lat2.size();
    int n4 = lng2.size();
    int n5 = times.size();
    if(!(n1 == n2 && n2 == n3 && n3 == n4 && n4 == n5)){
      throw std::range_error("input vectors lengths are different");
    }
    for(int i = 0; i < n1; i++){
      MultiKey key(lat1[i],lng1[i],lat2[i],lng2[i]);
      map->insert(std::pair<MultiKey, double>(key, times[i]));
    }
    Rcpp::XPtr< std::map<MultiKey, double> > p(map, true);
    return( p );
  }

  // [[Rcpp::export]]
  Rcpp::NumericVector getTime(SEXP mapPtr, 
                              double lat1, 
                              double lng1, 
                              double lat2, 
                              double lng2){
    Rcpp::XPtr< std::map<MultiKey, double> > ptr(mapPtr);
    MultiKey key(lat1,lng1,lat2,lng2);
    std::map<MultiKey,double>::iterator it = ptr->find(key);
    if(it == ptr->end())
        return R_NilValue;

    return Rcpp::wrap(it->second);
  }

')

map <- createMap(DT$lat1,DT$lng1,DT$lat2,DT$lng2,DT$time)

searchUsingRcpp <- function(lat_1,lng_1,lat_2,lng_2){
  time <- getTime(map,lat_1,lng_1,lat_2,lng_2)
  return(time)
}
#############

############# benchmark
set.seed(1234)
rowsToSearchOneByOne <- DT[sample.int(nrow(DT),size=nrow(DT),replace=FALSE),]

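bench <- function(searchFun2Use){
  # NOTE: nrow(...) is a single number, so this loop body runs exactly once
  # (for the last row) and each bench() call times a single lookup.
  # Use seq_len(nrow(rowsToSearchOneByOne)) to search every row instead.
  for(i in nrow(rowsToSearchOneByOne)){
    key <- rowsToSearchOneByOne[i,]
    searchFun2Use(key$lat1,key$lng1,key$lat2,key$lng2)
  }
}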

microbenchmark::microbenchmark(
  bench(searchUsingRcpp),
  bench(searchUsingDataTable),
  times=100)
#############

Benchmark result:

Unit: microseconds
                        expr      min        lq      mean   median        uq      max neval
      bench(searchUsingRcpp)  360.959  381.7585  400.4466  391.999  403.9985  665.597   100
 bench(searchUsingDataTable) 1103.034 1138.0740 1214.3008 1163.514 1224.9530 2035.828   100

Note:

I really don't think using doubles as keys is a good idea. Floating-point values should be searched with a certain tolerance or within a range, not looked up as an exact match in a map.
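A short sketch of why exact matching on doubles is fragile, followed by one possible workaround. The helper `make_key` and the choice of 6 decimals are assumptions made for illustration; they are not part of the benchmark code above.

```r
# Two values that print the same but are not bit-identical.
a <- 0.1 + 0.2
b <- 0.3
a == b                    # exact comparison
isTRUE(all.equal(a, b))   # comparison with a tolerance

# One possible workaround (an assumption, not from the benchmark above):
# round to a fixed number of decimals and use the resulting text as the key.
make_key <- function(x, digits = 6) sprintf(paste0("%.", digits, "f"), x)

coords <- new.env(hash = TRUE)
assign(make_key(a), "stored under 0.1 + 0.2", envir = coords)
exists(make_key(b), envir = coords, inherits = FALSE)
```

Rounding trades exactness for a fixed resolution, so two genuinely different coordinates closer than that resolution would share a key.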

Is there a way to use arbitrary type of value as key in environment or named list in R?

The reason people keep asking you for a specific example is that most problems for which hash tables are the appropriate technique in Python have a good solution in R that does not involve hash tables.

That said, there are certainly times when a real hash table is useful in R, and I recommend you check out the hash package for R. It uses environments as its base but lets you do a lot of R-like vector work with them. It's efficient and I've never run into a problem with it.

Just keep in mind that if you're using hash tables a lot while working with R and your code is running slowly or is buggy, you may be able to get some mileage from figuring out a more R-like way of doing it :)
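As an illustration of what a more R-like approach can look like, here are two common dictionary tasks done with vectorized base R instead of a hash table. The data and the names `words` and `full_name` are made up for the example.

```r
words <- c("a", "b", "a", "c", "b", "a")

# Counting occurrences: a dictionary of counters in Python, one call in R.
counts <- table(words)
counts[["a"]]

# Key -> value translation: a named vector indexed by a whole vector of keys.
full_name <- c(a = "alpha", b = "beta", c = "gamma")
translated <- unname(full_name[words])
translated
```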

Working with dictionaries/lists to get list of keys

Yes, the list type is a good approximation. You can use names() on your list to set and retrieve the 'keys':

> foo <- vector(mode="list", length=3)
> names(foo) <- c("tic", "tac", "toe")
> foo[[1]] <- 12; foo[[2]] <- 22; foo[[3]] <- 33
> foo
$tic
[1] 12

$tac
[1] 22

$toe
[1] 33

> names(foo)
[1] "tic" "tac" "toe"
>
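The other usual dictionary operations map onto lists in the same way. A minimal sketch, reusing the key names from the transcript above:

```r
foo <- list(tic = 12, tac = 22, toe = 33)

"tac" %in% names(foo)      # does the key exist?
foo[["tac"]]               # value for one key
foo[["new"]] <- 44         # add a key
foo[["tic"]] <- NULL       # remove a key
names(foo)                 # remaining keys, in insertion order
```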

What are the advantages of placing data in a new.env in r?

There are advantages to this if your data is large and you have to modify it by passing it through functions. When you send data.frames or vectors to functions that modify them, R will make a copy of the data before making changes to it. You'd then return the modified data from the function and overwrite the old data to complete the modification step.

If your data is large, copying the data for each function call may result in an undesirable amount of overhead. Using environments provides a way around this overhead, because functions handle environments differently: if you pass an environment to a function and modify its contents, R operates directly on the environment without making a copy of it. So by putting your data in an environment and passing the environment to the function instead of passing the data directly, you can avoid copying the large dataset.

# here I create a data.frame inside an environment and pass the environment
# to a function that modifies the data.
e <- new.env()
e$k <- data.frame(a=1:3)
f <- function(e) {e$k[1,1] <- 10}
f(e)
# you can see that the original data was changed.
e$k
   a
1 10
2  2
3  3

# alternatively, if I pass just the data.frame, the manipulations do not affect the 
# original data.
k <- data.frame(a=1:3)
f2 <- function(k) {k[1,1] <- 10}
f2(k)
k
  a
1 1
2 2
3 3

Integer hash function colliding after few iterations

The reason is that your multiplication part is moving the bits out to the left, and if you have enough loop iterations the bits obtained from the first numbers in the list will eventually be thrown out completely and no longer have an effect on the final result.

The number 9176 can be written in binary as 10001111011000, and in practice the lowest 1-bit will dictate how many rounds you need to run before the first entry completely falls off the list.

The lowest 1-bit is at position 3 (the 4th position from the right), so 9176 is 8 times an odd number, and each multiplication moves the bits contributed by the first number at least 3 positions to the left. After 11 iterations (3 × 11 = 33 bits) that number has been shifted completely out of the 32-bit buffer (int is 32-bit).
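The effect can be reproduced with a small sketch. The update rule `hash * 9176 + element` is an assumption about the code in question (only the multiplier is given here), and doubles with a modulus of 2^32 stand in for 32-bit integer overflow.

```r
# Assumed update rule: hash = hash * 9176 + element, kept to 32 bits.
# Doubles are used so intermediate products stay exact (they are below 2^53).
mult_hash <- function(xs) {
  h <- 0
  for (x in xs) {
    h <- (h * 9176 + x) %% 2^32
  }
  h
}

tail_part <- rep(5, 12)                 # 12 further elements after the first
h1 <- mult_hash(c(1, tail_part))
h2 <- mult_hash(c(2, tail_part))
h1 == h2                                # the first element no longer matters

# A difference in the last element is still visible.
mult_hash(c(1, tail_part)) == mult_hash(c(1, rep(5, 11), 6))
```

Two lists that differ only in their first element end up with the same hash once enough elements follow, which is exactly the kind of collision described above.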

A better method (but see my comment below) would be to at least ensure no bits are completely lost, so a different but still fairly simple way of calculating the hash code could be like this:

hashCode = ((hashCode << 27) | (hashCode >> 5)) ^ c;

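This rotates the current hash code 27 bits to the left (equivalently, 5 bits to the right): the bits pushed off one end are fed back in at the other, so none are lost, and the exclusive OR with c then bakes the new value into the number as well. For this to be a true rotation, hashCode must be an unsigned 32-bit value or the right shift must be a logical one; with a signed int and an arithmetic >>, the sign bit is copied into the upper bits.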

You should, however, use a more standardized way of calculating these hashes. My suggested change above is bound to have problems of its own; they're just not as obvious.

And really, because of the pigeonhole principle, you cannot calculate a unique 32-bit number for every possible list of numbers, and this has nothing to do with which hash code algorithm you're using. None of them will solve this part of the problem. So I would really ask you to rethink what you're doing in the first place.