R Fast Single Item Lookup from List VS Data.Table VS Hash

For a non-vectorized access pattern, you might want to try the built-in environment objects:

require(microbenchmark)

# test_lookup_list, test_lookup_dt, lookup_hash and lookup_tests are the
# objects from the question's setup
test_lookup_env <- list2env(test_lookup_list)

x <- lookup_tests[[1]][1]
microbenchmark(
  lookup_hash[[x]],
  test_lookup_list[[x]],
  test_lookup_dt[x],
  test_lookup_env[[x]]
)
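
(The test_lookup_* objects come from the original question. For a self-contained run, a hypothetical setup along those lines might look like the following; the size and key format are assumptions, only the product_id column name comes from the answer below.)

library(data.table)
library(hash)

## hypothetical stand-ins for the question's objects
n <- 1e5
keys <- sprintf("key_%06d", seq_len(n))
test_lookup_list <- setNames(as.list(seq_len(n)), keys)
test_lookup_dt <- data.table(product_id = keys, value = seq_len(n), key = "product_id")
lookup_hash <- hash(keys, seq_len(n))
lookup_tests <- list(sample(keys, 10), sample(keys, 10))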

Here you can see it's even zippier than hash:

Unit: microseconds
expr min lq mean median uq max neval
lookup_hash[[x]] 10.767 12.9070 22.67245 23.2915 26.1710 68.654 100
test_lookup_list[[x]] 847.700 853.2545 887.55680 863.0060 893.8925 1369.395 100
test_lookup_dt[x] 2652.023 2711.9405 2771.06400 2758.8310 2803.9945 3373.273 100
test_lookup_env[[x]] 1.588 1.9450 4.61595 2.5255 6.6430 27.977 100

EDIT:

Stepping through data.table:::`[.data.table` shows why you are seeing dt slow down. When you index with a character and there is a key set, it does quite a bit of bookkeeping, then drops down into bmerge, which is a binary search. Binary search is O(log n), so lookups get slower as n increases.

Environments, on the other hand, use hashing (by default) and have constant access time with respect to n.

To work around this, you can manually build a map and index through it:

x <- lookup_tests[[2]][2]

# map each product_id to its row number once, then index by integer
e <- list2env(setNames(as.list(1:nrow(test_lookup_dt)), test_lookup_dt$product_id))

# example access:
test_lookup_dt[e[[x]], ]

However, seeing so much bookkeeping code in the data.table method, I'd try out plain old data.frames as well:

test_lookup_df <- as.data.frame(test_lookup_dt)

rownames(test_lookup_df) <- test_lookup_df$product_id

If we are really paranoid, we could skip the [ methods altogether and lapply over the columns directly.
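
For example (a sketch reusing the objects above; row_as_list is just an illustrative name, and the result is a plain list of column slices rather than a one-row data.frame):

# pull element e[[x]] from every column, bypassing `[.data.frame` dispatch
row_as_list <- lapply(test_lookup_df, `[`, e[[x]])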

Here are some more timings (from a different machine than above):

> microbenchmark(
+ test_lookup_dt[x,],
+ test_lookup_dt[x],
+ test_lookup_dt[e[[x]],],
+ test_lookup_df[x,],
+ test_lookup_df[e[[x]],],
+ lapply(test_lookup_df, `[`, e[[x]]),
+ lapply(test_lookup_dt, `[`, e[[x]]),
+ lookup_hash[[x]]
+ )
Unit: microseconds
expr min lq mean median uq max neval
test_lookup_dt[x, ] 1658.585 1688.9495 1992.57340 1758.4085 2466.7120 2895.592 100
test_lookup_dt[x] 1652.181 1695.1660 2019.12934 1764.8710 2487.9910 2934.832 100
test_lookup_dt[e[[x]], ] 1040.869 1123.0320 1356.49050 1280.6670 1390.1075 2247.503 100
test_lookup_df[x, ] 17355.734 17538.6355 18325.74549 17676.3340 17987.6635 41450.080 100
test_lookup_df[e[[x]], ] 128.749 151.0940 190.74834 174.1320 218.6080 366.122 100
lapply(test_lookup_df, `[`, e[[x]]) 18.913 25.0925 44.53464 35.2175 53.6835 146.944 100
lapply(test_lookup_dt, `[`, e[[x]]) 37.483 50.4990 94.87546 81.2200 124.1325 241.637 100
lookup_hash[[x]] 6.534 15.3085 39.88912 49.8245 55.5680 145.552 100

Overall, to answer your questions, you are not using data.table "wrong", but you are also not using it in the way it was intended (vectorized access). However, you can manually build a map to index through and get most of the performance back.

Fast R lookup table

I finally broke down and made this a more generic function:

set.seed(0); K <- 1000; M <- K*K
rint <- function(n, minv=0, maxv=NA) sample(minv:(if (is.na(maxv)) n else maxv), n, replace=TRUE)

dict.lookup <- function( dwords, dictionary, by=NULL, by.w=NULL, by.d=NULL ) {
  # bad style, but just (mostly symmetric) error checking
  if (is.null(by.d)) by.d <- by; if (is.null(by.w)) by.w <- by
  stopifnot( (!is.null(by.d)) & (!is.null(by.w)) )
  # valid input checking
  stopifnot( is.data.frame( dwords ) ); stopifnot( is.data.frame( dictionary ) )
  stopifnot( nrow( dwords ) > 0 ); stopifnot( nrow( dictionary ) > 0 )
  stopifnot( is.character(by.w) ); stopifnot( is.character(by.d) )
  stopifnot( length(by.w)==1 ); stopifnot( length(by.d)==1 )
  stopifnot( by.w %in% names(dwords) ); stopifnot( by.d %in% names(dictionary) )
  # you cannot pass the words directly; hash them to numbers first
  stopifnot( is.numeric( dwords[[by.w]] ) )
  stopifnot( is.numeric( dictionary[[by.d]] ) )
  # a dictionary should have only unique entries
  stopifnot( anyDuplicated( dictionary[[by.d]] ) == 0 )

  # the actual work: match() does one vectorized lookup, then we bind
  # the matched dictionary columns to the right of dwords
  toright <- dictionary[ match(dwords[[by.w]], dictionary[[by.d]]), ]
  cbind(dwords, toright[ , names(toright) != by.d ])
}
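
A quick small-scale sanity check (toy data with the same column roles as the big demo below):

d <- data.frame(idictwords = c(10, 20, 30),
                cost2print = c(1, 2, 3),
                tiresomeness = c(7, 8, 9))
w <- data.frame(imywords = c(20, 99), frombook = c(100, 101))
dict.lookup(w, d, by.w = "imywords", by.d = "idictwords")
##    imywords frombook cost2print tiresomeness
## 2        20      100          2            8
## NA       99      101         NA           NA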

L <- 100*M ## only 100M entries for 1GB*4 of int data
dictionary <- data.frame( idictwords=sample(1:L), cost2print=rint(L, 1,100), tiresomeness=rint(L, 100,200) )
message("created dictionary")

## look up 10+1 words
dwords <- data.frame( imywords= c(dictionary[ sample(1:L, 10) , "idictwords"], -99), frombook=rint(11, 200,300) )
message("created my words")

print( system.time( o <- dict.lookup( dwords, dictionary, by.w= "imywords", by.d= "idictwords" ) ) )
message("looked up my words in dictionary done")

print(o)

gives me

   user  system elapsed 
 13.746   0.739  14.489 

         imywords frombook cost2print tiresomeness
68533657 88509161      263         25          110
87030862 75614422      297         23          164
16923080 79185053      249         84          105
62235248 84542527      292         72          141
4044547  35212219      201         13          155
95995528 67895828      257          4          122
43031831 24227004      281         86          101
76602707 55164521      270         52          151
53380001 87665273      207         35          121
24278223 30085231      238          6          153
NA            -99      205         NA           NA

The rownames are the row numbers of the matching rows in the dictionary data frame.

I often tinker with functions like this one (usually to get better handling). Feel free to suggest changes.

Fast data.table assign of multiple columns by group from lookup

You can also use lapply:

cols <- paste0("value_", 1:10)

random[lookup, (cols) := lapply(cols, function(x) get(x) * get(paste0("i.", x))), by = .EACHI]
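
To see the pattern end to end, here is a minimal self-contained sketch with toy tables and two value_ columns; all names are illustrative, chosen to match the snippet above:

library(data.table)

random <- data.table(id = c(1L, 1L, 2L), value_1 = c(1, 2, 3), value_2 = c(4, 5, 6))
lookup <- data.table(id = 1:2, value_1 = c(10, 100), value_2 = c(2, 3))
setkey(random, id)

cols <- paste0("value_", 1:2)

## for each group of random matched by the join, multiply each value_
## column by the corresponding i.-prefixed column coming from lookup
random[lookup, (cols) := lapply(cols, function(x) get(x) * get(paste0("i.", x))), by = .EACHI]
print(random)
##    id value_1 value_2
## 1:  1      10       8
## 2:  1      20      10
## 3:  2     300      18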

If your dataset is large and you want to see a progress bar for the operation, you can use pblapply:

library(pbapply)

random[lookup, (cols) := pblapply(cols, function(x) get(x) * get(paste0("i.", x))), by = .EACHI ]

Which is faster, Hash lookup or Binary search?

OK, I'll try to be brief.

C# short answer:

Test the two different approaches.

.NET gives you the tools to change your approach with a single line of code.
Otherwise, use System.Collections.Generic.Dictionary and be sure to initialize it with a large number as the initial capacity, or you'll spend the rest of your life inserting items because of the work the GC has to do to collect old bucket arrays.

Longer answer:

A hashtable has ALMOST constant lookup times, and getting to an item in a hash table in the real world does not just require computing a hash.

To get to an item, your hashtable will do something like this:

  • Get the hash of the key
  • Get the bucket number for that hash (usually the map function looks like this: bucket = hash % bucketsCount)
  • Traverse the chain of items that starts at that bucket (basically a list of items that share the same bucket; most hashtables use this method of handling bucket/hash collisions) and compare each key with that of the item you are trying to add/delete/update/check for.
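
Those steps can be sketched in a few lines of R (the language used elsewhere on this page); this is a toy illustration, not production code, and simple_hash is a deliberately naive stand-in:

simple_hash <- function(key) sum(utf8ToInt(key))   # toy hash function

make_table <- function(n_buckets = 8L) list(n = n_buckets, buckets = vector("list", n_buckets))

ht_insert <- function(ht, key, value) {
  b <- simple_hash(key) %% ht$n + 1L               # bucket = hash %% bucketsCount
  ht$buckets[[b]] <- c(ht$buckets[[b]], list(list(key = key, value = value)))
  ht
}

ht_lookup <- function(ht, key) {
  b <- simple_hash(key) %% ht$n + 1L
  for (item in ht$buckets[[b]])                    # walk the collision chain
    if (item$key == key) return(item$value)
  NULL
}

ht <- ht_insert(make_table(), "apple", 1)
ht <- ht_insert(ht, "pear", 2)
ht_lookup(ht, "pear")   # 2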

Lookup times depend on how "good" (how evenly distributed the output is) and how fast your hash function is, on the number of buckets you are using, and on how fast the key comparer is; a hashtable is not always the best solution.

A better and deeper explanation: http://en.wikipedia.org/wiki/Hash_table

R speed of data.table

data.table is fast for lookups and manipulations in very large tables of data, but it's not going to be fast at adding rows one by one the way Python dictionaries are. I'd expect it to copy the whole table each time you add a row, which is clearly not what you want.

You can either try environments (which are something like a hashmap), or, if you really want to do this in R, you may need a specialist package; here's a link to an answer with a few options.
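
For illustration, a minimal sketch of using an environment as a hash map for one-at-a-time inserts (the key format and values here are made up):

e <- new.env(hash = TRUE)          # environments hash their names by default

## insert items one at a time, keyed by string -- cheap, unlike growing a data.table
for (i in 1:1000) assign(paste0("id_", i), i * 2, envir = e)

e[["id_42"]]                                   # fast single-item lookup: 84
exists("id_9999", envir = e, inherits = FALSE) # FALSE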

Can Set be faster than Hashtable?

Can Set be faster than Hashtable?

A "set" is any container that tracks distinct keys, supporting operations "insert(Key)", "erase(Key)", and "has(Key)".

A hash table can be used to do this - e.g. C++'s std::unordered_set.

A balanced binary tree can be used to do this - e.g. C++'s std::set.

While not efficient for more than a few tens or hundreds of Keys, contiguous memory (i.e. an array or std::vector) can store sorted elements for fairly fast (but still O(log N)) binary-search lookup, with much slower insertion and erasure. This can sometimes be optimal because it's CPU-cache friendly. But because it quickly gets inefficient, C++ doesn't provide a contiguous-storage container with those three set operations (insert(Key), erase(Key), has(Key) as above); you'd have to download an extra library or write your own. C++ does have std::binary_search, std::lower_bound and std::upper_bound, which make it easy to implement.
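
In R terms (sticking to this page's language), the sorted-contiguous-storage idea looks roughly like the sketch below; findInterval performs the binary search, and the names are illustrative:

sorted_set <- sort(c(3L, 7L, 11L, 42L))   # contiguous, sorted storage

has <- function(s, key) {
  i <- findInterval(key, s)               # binary search: last position with s[i] <= key
  i > 0L && s[i] == key
}

has(sorted_set, 11L)  # TRUE
has(sorted_set, 5L)   # FALSE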

Returning to your question:

Can Set be faster than Hashtable?

They're not necessarily distinct things. A set may be implemented using a hashtable, and sometimes that may be the best choice (particularly when there are a lot of keys, it's not expensive to hash the keys well enough to avoid excessive collisions, and it's cheap to compare keys). Even then, there are different types of hash tables, and sometimes unordered_set won't be the fastest; you can google "fastest C++ hash table" and do some reading, e.g. https://engineering.fb.com/2019/04/25/developer-tools/f14/

Sometimes those other implementation choices - balanced binary trees or contiguous memory - will be faster than using a hash table.

Balanced binary trees tend to work well when there's a middling number of elements (a few thousand to a million, say) and it's cheap to work out whether one key is "less than" another key.

Turning to your other questions / assertions:

Sets are typically implemented as hashtables where keys are set members and values are some constant (True or something).

Wrong: sets only store keys; they don't have "values" (boolean or otherwise). If you want keys and values, that's called a map. Maps can be implemented using all the same data structures as sets (hash tables, binary trees, contiguous storage), with the same performance compromises. Maps are easily implemented akin to set<std::pair<Key, Value>>, where the needed lookup operations (hash, equals and/or less-than) only consider the key.

I think I've adequately addressed your other speculation / questions....


