R Fast Single Item Lookup from List VS Data.Table VS Hash

For a non-vectorized access pattern, you might want to try the built-in environment objects:

require(microbenchmark)

# test_lookup_list, test_lookup_dt, lookup_hash and lookup_tests are the
# objects from the question's setup
test_lookup_env <- list2env(test_lookup_list)

x <- lookup_tests[[1]][1]
microbenchmark(
  lookup_hash[[x]],
  test_lookup_list[[x]],
  test_lookup_dt[x],
  test_lookup_env[[x]]
)
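
(The test_lookup_* objects come from the original question. For a self-contained run, a hypothetical setup along those lines might look like the following; the size and key format are assumptions, only the product_id column name comes from the answer below.)

library(data.table)
library(hash)

## hypothetical stand-ins for the question's objects
n <- 1e5
keys <- sprintf("key_%06d", seq_len(n))
test_lookup_list <- setNames(as.list(seq_len(n)), keys)
test_lookup_dt <- data.table(product_id = keys, value = seq_len(n), key = "product_id")
lookup_hash <- hash(keys, seq_len(n))
lookup_tests <- list(sample(keys, 10), sample(keys, 10))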

Here you can see it's even zippier than hash:

Unit: microseconds
expr min lq mean median uq max neval
lookup_hash[[x]] 10.767 12.9070 22.67245 23.2915 26.1710 68.654 100
test_lookup_list[[x]] 847.700 853.2545 887.55680 863.0060 893.8925 1369.395 100
test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273 100
test_lookup_env[[x]] 1.588 1.9450 4.61595 2.5255 6.6430 27.977 100

EDIT:

Stepping through data.table:::`[.data.table` shows why you are seeing dt slow down. When you index with a character and there is a key set, it does quite a bit of bookkeeping, then drops down into bmerge, which is a binary search. Binary search is O(log n), so lookups get slower as n increases.

Environments, on the other hand, use hashing (by default) and have constant access time with respect to n.

To work around this, you can manually build a map and index through it:

x <- lookup_tests[[2]][2]

# map each product_id to its row number once, then index by integer
e <- list2env(setNames(as.list(1:nrow(test_lookup_dt)), test_lookup_dt$product_id))

# example access:
test_lookup_dt[e[[x]], ]

However, seeing so much bookkeeping code in the data.table method, I'd try out plain old data.frames as well:

test_lookup_df <- as.data.frame(test_lookup_dt)

rownames(test_lookup_df) <- test_lookup_df$product_id

If we are really paranoid, we could skip the [ methods altogether and lapply over the columns directly.
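
For example (a sketch reusing the objects above; row_as_list is just an illustrative name, and the result is a plain list of column slices rather than a one-row data.frame):

# pull element e[[x]] from every column, bypassing `[.data.frame` dispatch
row_as_list <- lapply(test_lookup_df, `[`, e[[x]])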

Here are some more timings (from a different machine than above):

> microbenchmark(
+ test_lookup_dt[x,],
+ test_lookup_dt[x],
+ test_lookup_dt[e[[x]],],
+ test_lookup_df[x,],
+ test_lookup_df[e[[x]],],
+ lapply(test_lookup_df, `[`, e[[x]]),
+ lapply(test_lookup_dt, `[`, e[[x]]),
+ lookup_hash[[x]]
+ )
Unit: microseconds
expr min lq mean median uq max neval
test_lookup_dt[x, ] 1658.585 1688.9495 1992.57340 1758.4085 2466.7120 2895.592 100
test_lookup_dt[x] 1652.181 1695.1660 2019.12934 1764.8710 2487.9910 2934.832 100
test_lookup_dt[e[[x]], ] 1040.869 1123.0320 1356.49050 1280.6670 1390.1075 2247.503 100
test_lookup_df[x, ] 17355.734 17538.6355 18325.74549 17676.3340 17987.6635 41450.080 100
test_lookup_df[e[[x]], ] 128.749 151.0940 190.74834 174.1320 218.6080 366.122 100
lapply(test_lookup_df, `[`, e[[x]]) 18.913 25.0925 44.53464 35.2175 53.6835 146.944 100
lapply(test_lookup_dt, `[`, e[[x]]) 37.483 50.4990 94.87546 81.2200 124.1325 241.637 100
lookup_hash[[x]] 6.534 15.3085 39.88912 49.8245 55.5680 145.552 100

Overall, to answer your questions, you are not using data.table "wrong", but you are also not using it in the way it was intended (vectorized access). However, you can manually build a map to index through and get most of the performance back.

Fast R lookup table

I finally broke down and made this a more generic function:

set.seed(0); K <- 1000; M <- K*K
rint <- function(n, minv=0, maxv=NA) sample(minv:(if (is.na(maxv)) n else maxv), n, replace=TRUE)

dict.lookup <- function( dwords, dictionary, by=NULL, by.w=NULL, by.d=NULL ) {
  # bad style, but just (mostly symmetric) error checking
  if (is.null(by.d)) by.d <- by; if (is.null(by.w)) by.w <- by
  stopifnot( (!is.null(by.d)) & (!is.null(by.w)) )
  # valid input checking
  stopifnot( is.data.frame( dwords ) ); stopifnot( is.data.frame( dictionary ) )
  stopifnot( nrow( dwords ) > 0 ); stopifnot( nrow( dictionary ) > 0 )
  stopifnot( is.character(by.w) ); stopifnot( is.character(by.d) )
  stopifnot( length(by.w)==1 ); stopifnot( length(by.d)==1 )
  stopifnot( by.w %in% names(dwords) ); stopifnot( by.d %in% names(dictionary) )
  # you cannot pass the words directly; hash them to numbers first
  stopifnot( is.numeric( dwords[[by.w]] ) )
  stopifnot( is.numeric( dictionary[[by.d]] ) )
  # a dictionary should have only unique entries
  stopifnot( anyDuplicated( dictionary[[by.d]] ) == 0 )

  # the actual work: match() does one vectorized lookup, then we bind
  # the matched dictionary columns to the right of dwords
  toright <- dictionary[ match(dwords[[by.w]], dictionary[[by.d]]), ]
  cbind(dwords, toright[ , names(toright) != by.d ])
}
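
A quick small-scale sanity check (toy data with the same column roles as the big demo below):

d <- data.frame(idictwords = c(10, 20, 30),
                cost2print = c(1, 2, 3),
                tiresomeness = c(7, 8, 9))
w <- data.frame(imywords = c(20, 99), frombook = c(100, 101))
dict.lookup(w, d, by.w = "imywords", by.d = "idictwords")
##    imywords frombook cost2print tiresomeness
## 2        20      100          2            8
## NA       99      101         NA           NA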

L <- 100*M ## only 100M entries for 1GB*4 of int data
dictionary <- data.frame( idictwords=sample(1:L), cost2print=rint(L, 1,100), tiresomeness=rint(L, 100,200) )
message("created dictionary")

## look up 10+1 words
dwords <- data.frame( imywords= c(dictionary[ sample(1:L, 10) , "idictwords"], -99), frombook=rint(11, 200,300) )
message("created my words")

print( system.time( o <- dict.lookup( dwords, dictionary, by.w= "imywords", by.d= "idictwords" ) ) )
message("looked up my words in dictionary done")

print(o)

gives me

   user  system elapsed 
 13.746   0.739  14.489 

         imywords frombook cost2print tiresomeness
68533657 88509161      263         25          110
87030862 75614422      297         23          164
16923080 79185053      249         84          105
62235248 84542527      292         72          141
4044547  35212219      201         13          155
95995528 67895828      257          4          122
43031831 24227004      281         86          101
76602707 55164521      270         52          151
53380001 87665273      207         35          121
24278223 30085231      238          6          153
NA            -99      205         NA           NA

The rownames are the row numbers of the matching rows in the dictionary data frame.

I often tinker with functions like this one (usually to get better handling). Feel free to suggest changes.

Fast data.table assign of multiple columns by group from lookup

You can also use lapply:

cols <- paste0("value_", 1:10)

random[lookup, (cols) := lapply(cols, function(x) get(x) * get(paste0("i.", x))), by = .EACHI]
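
To see the pattern end to end, here is a minimal self-contained sketch with toy tables and two value_ columns; all names are illustrative, chosen to match the snippet above:

library(data.table)

random <- data.table(id = c(1L, 1L, 2L), value_1 = c(1, 2, 3), value_2 = c(4, 5, 6))
lookup <- data.table(id = 1:2, value_1 = c(10, 100), value_2 = c(2, 3))
setkey(random, id)

cols <- paste0("value_", 1:2)

## for each group of random matched by the join, multiply each value_
## column by the corresponding i.-prefixed column coming from lookup
random[lookup, (cols) := lapply(cols, function(x) get(x) * get(paste0("i.", x))), by = .EACHI]
print(random)
##    id value_1 value_2
## 1:  1      10       8
## 2:  1      20      10
## 3:  2     300      18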

If your dataset is large and you want to see a progress bar for the operation, you can use pblapply:

library(pbapply)

random[lookup, (cols) := pblapply(cols, function(x) get(x) * get(paste0("i.", x))), by = .EACHI ]

Which is faster, Hash lookup or Binary search?

OK, I'll try to be brief.

C# short answer:

Test the two different approaches.

.NET gives you the tools to change your approach with a single line of code.
Otherwise, use System.Collections.Generic.Dictionary and be sure to initialize it with a large number as the initial capacity, or you'll spend the rest of your life inserting items because of the work the GC has to do to collect old bucket arrays.

Longer answer:

A hashtable has ALMOST constant lookup times, and getting to an item in a hash table in the real world does not just require computing a hash.

To get to an item, your hashtable will do something like this:

  • Get the hash of the key
  • Get the bucket number for that hash (usually the map function looks like this: bucket = hash % bucketsCount)
  • Traverse the chain of items that starts at that bucket (basically a list of items that share the same bucket; most hashtables use this method of handling bucket/hash collisions) and compare each key with that of the item you are trying to add/delete/update/check for.
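
Those steps can be sketched in a few lines of R (the language used elsewhere on this page); this is a toy illustration, not production code, and simple_hash is a deliberately naive stand-in:

simple_hash <- function(key) sum(utf8ToInt(key))   # toy hash function

make_table <- function(n_buckets = 8L) list(n = n_buckets, buckets = vector("list", n_buckets))

ht_insert <- function(ht, key, value) {
  b <- simple_hash(key) %% ht$n + 1L               # bucket = hash %% bucketsCount
  ht$buckets[[b]] <- c(ht$buckets[[b]], list(list(key = key, value = value)))
  ht
}

ht_lookup <- function(ht, key) {
  b <- simple_hash(key) %% ht$n + 1L
  for (item in ht$buckets[[b]])                    # walk the collision chain
    if (item$key == key) return(item$value)
  NULL
}

ht <- ht_insert(make_table(), "apple", 1)
ht <- ht_insert(ht, "pear", 2)
ht_lookup(ht, "pear")   # 2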

Lookup times depend on how "good" (how evenly distributed the output is) and how fast your hash function is, on the number of buckets you are using, and on how fast the key comparer is; a hashtable is not always the best solution.

A better and deeper explanation: http://en.wikipedia.org/wiki/Hash_table

R speed of data.table

data.table is fast for lookups and manipulations in very large tables of data, but it's not going to be fast at adding rows one by one the way Python dictionaries are. I'd expect it to copy the whole table each time you add a row, which is clearly not what you want.

You can either try environments (which are something like a hashmap), or, if you really want to do this in R, you may need a specialist package; here's a link to an answer with a few options.
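
For illustration, a minimal sketch of using an environment as a hash map for one-at-a-time inserts (the key format and values here are made up):

e <- new.env(hash = TRUE)          # environments hash their names by default

## insert items one at a time, keyed by string -- cheap, unlike growing a data.table
for (i in 1:1000) assign(paste0("id_", i), i * 2, envir = e)

e[["id_42"]]                                   # fast single-item lookup: 84
exists("id_9999", envir = e, inherits = FALSE) # FALSE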

Can Set be faster than Hashtable?

Can Set be faster than Hashtable?

A "set" is any container that tracks distinct keys, supporting operations "insert(Key)", "erase(Key)", and "has(Key)".

A hash table can be used to do this - e.g. C++'s std::unordered_set.

A balanced binary tree can be used to do this - e.g. C++'s std::set.

While not efficient for more than a few tens or hundreds of Keys, contiguous memory (i.e. an array or std::vector) can store sorted elements for fairly fast (but still O(log N)) binary-search lookup, with much slower insertion and erasure. This can sometimes be optimal because it's CPU-cache friendly. But because it quickly gets inefficient, C++ doesn't provide a contiguous-storage container with those three set operations (insert(Key), erase(Key), has(Key) as above); you'd have to download an extra library or write your own. C++ does have std::binary_search, std::lower_bound and std::upper_bound, which make it easy to implement.
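
In R terms (sticking to this page's language), the sorted-contiguous-storage idea looks roughly like the sketch below; findInterval performs the binary search, and the names are illustrative:

sorted_set <- sort(c(3L, 7L, 11L, 42L))   # contiguous, sorted storage

has <- function(s, key) {
  i <- findInterval(key, s)               # binary search: last position with s[i] <= key
  i > 0L && s[i] == key
}

has(sorted_set, 11L)  # TRUE
has(sorted_set, 5L)   # FALSE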

Returning to your question:

Can Set be faster than Hashtable?

They're not necessarily distinct things. A set may be implemented using a hashtable, and sometimes that may be the best choice (particularly when there are a lot of keys, it's not expensive to hash the keys well enough to avoid excessive collisions, and it's cheap to compare keys). Even then, there are different types of hash tables, and sometimes unordered_set won't be the fastest; you can google "fastest C++ hash table" and do some reading, e.g. https://engineering.fb.com/2019/04/25/developer-tools/f14/

Sometimes those other implementation choices - balanced binary trees or contiguous memory - will be faster than using a hash table.

Balanced binary trees tend to work well when there's a middling number of elements (a few thousand to a million, say) and it's cheap to work out whether one key is "less than" another key.

Turning to your other questions / assertions:

Sets are typically implemented as hashtables where keys are set members and values are some constant (True or something).

Wrong: sets only store keys; they don't have "values" (boolean or otherwise). If you want keys and values, that's called a map. Maps can be implemented using all the same data structures as sets (hash tables, binary trees, contiguous storage), with the same performance compromises. Maps are easily implemented akin to set<std::pair<Key, Value>>, where the needed lookup operations (hash, equals and/or less-than) only consider the key.

I think I've adequately addressed your other speculation / questions....


