Keyed Lookup on Data.Table Without 'With'

There is an item in the NEWS for 1.8.2 suggesting that a ..() syntax will be added at some point, allowing this:

New DT[.(...)] syntax (in the style of package plyr) is identical to
DT[list(...)], DT[J(...)] and DT[data.table(...)]. We plan to add ..(), too, so
that .() and ..() are analogous to the file system's ./ and ../; i.e., .()
evaluates within the frame of DT and ..() in the parent scope.

In the meantime, you can get the value from the appropriate environment:

dt[J(get('x', envir = parent.frame(3)))]
## x y
## 1: 3 5
## 2: 4 6

or you could eval the whole call to list(x) or J(x):

dt[eval(list(x))]
dt[eval(J(x))]
dt[eval(.(x))]

variable usage in data.table

Since you have two variables named b, one inside DT and one outside the scope of DT, we have to go and get b <- 7 from the global environment. We can do that with get().

DT[b == get("b", globalenv())]
# ID a b c
# 1: b 1 7 13

Update: You mention in the comments that the variables are inside a function environment. In that case, you can use parent.frame() instead of globalenv().

f <- function(b, dt) dt[b == get("b", parent.frame(3))] 

f(7, DT)
# ID a b c
# 1: b 1 7 13
f(12, DT)
# ID a b c
# 1: c 6 12 18

Using a key to replace values across a whole data.table

You can use melt and dcast:

dcast(
  rating[melt(df, id = c("V1", "V2"), value.name = "Rating"), on = "Rating"],
  V1 + V2 ~ variable, value.var = "CreditQuality"
)

Output:

             V1              V2 V3 V4 V5 V6 V7 V8 V9
1: XS0041971275 TR.IssuerRating  1  1  1  1  2  2  1
2: XS0043098127 TR.IssuerRating  6  6  6  6  6  6  6
3: XS0285400197 TR.IssuerRating  2  2  2  2  2  2  2

Note: I'm assuming your source data is df and your Rating data is rating. I see that your frames are already of class data.table.

How to do an X[Y] data.table join, without losing an existing main key on X?

With secondary keys implemented (since v1.9.6) and the recent bug fix on retaining/discarding keys properly (in v1.9.7), you can now do this using on=:

# join
DT[x2y, on="x"] # key is removed as row order gets changed.

# update using joins
DT[x2y, y:=y, on="x"] # key is retained, as row order isn't changed.
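As a self-contained sketch of the difference (DT and x2y here are hypothetical stand-ins for the question's tables):

```r
library(data.table)

# Hypothetical stand-ins for the question's tables
DT  <- data.table(x = c("a", "b", "c"), v = 1:3, key = "x")
x2y <- data.table(x = c("b", "a", "c"), y = c(20, 10, 30))

# Plain join: result rows come back in x2y's order, so the key on x is dropped
res <- DT[x2y, on = "x"]
key(res)  # NULL

# Update join: DT's row order is untouched, so the key survives
DT[x2y, y := y, on = "x"]
key(DT)   # "x"
```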

data.table - subsetting based on variable whose name is a column, too

data.table evaluates expressions within the environment of the data.table itself, so you may need to specify where you want to get the value from:

DT[cyl == get("cyl", envir = parent.frame())]

How to efficiently add a date column from a lookup table, without plyr?

When the columns you want to merge by do not have the same name in both data frames, you need to specify how they should line up. In merge, this is done with the by argument (or by.x and by.y). You can also use data.table's [ syntax for merging, which uses an on argument.

Whether or not you set key, either of these will work:

merge(proc, allo, by.x = "Pseudonym", by.y = "pseudonym")
proc[allo, on = .(Pseudonym = pseudonym)]

So, what does setkey do? Most importantly, it speeds up any merges involving key columns. As for the merge defaults, we can look at ?data.table::merge, which begins:

...by default, it attempts to merge

  • at first based on the shared key columns, and if there are none,

  • then based on key columns of the first argument x, and if there are none,

  • then based on the common columns between the two data.tables.


Set the by, or by.x and by.y arguments explicitly to override this default.

This is different from base::merge, in that base::merge will always try to merge on all shared columns, while data.table::merge prioritizes shared key columns. Neither will attempt to merge columns with different names.
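A small sketch of that difference, with made-up tables A and B keyed on id and sharing a second, non-key column grp:

```r
library(data.table)

# Hypothetical tables: both keyed on id, both also contain a grp column
A <- data.table(id = 1:3, grp = c("g1", "g1", "g2"), a = 1:3, key = "id")
B <- data.table(id = 2:4, grp = c("g1", "g2", "g2"), b = 4:6, key = "id")

# data.table::merge uses the shared key column only, so it merges on id
# and the duplicated grp column gets .x/.y suffixes
names(merge(A, B))
# [1] "id" "grp.x" "a" "grp.y" "b"

# base::merge on plain data.frames merges on all shared columns (id and grp)
names(merge(as.data.frame(A), as.data.frame(B)))
# [1] "id" "grp" "a" "b"
```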

data.table := assignments when variable has same name as a column

You can always use get, which allows you to specify the environment:

dt1[1, a := get("a", envir = .GlobalEnv)]
# a
#1: 18

Or just:

a <- 42
dt1[1, a := .GlobalEnv$a]
# a
#1: 42

Chaining factors for lookup - is this the most efficient way?

You can do a lot better if you use lookup vectors instead of lookup lists. Basically, I changed list to c(), and then cut out all the as.character bits.

vState <- c("A" = "Alaska", "T" = "Texas", "G" = "Georgia")    
vCap <- c("Alaska" = "Juneau", "Texas" = "Austin", "Georgia" = "Atlanta")

vCap[vState[foo]]

Benchmarking methods so far:

microbenchmark::microbenchmark(
  recode = foo %>%
    dplyr::recode(!!!iState, .default = NA_character_) %>%
    dplyr::recode(!!!sCap, .default = NA_character_),
  lists = sCap[iState[foo] %>% as.character() %>% na_if("NULL")] %>%
    as.character() %>% na_if("NULL"),
  lists_no_pipe = na_if(as.character(sCap[na_if(as.character(iState[foo]), "NULL")]), "NULL"),
  vectors = unname(vCap[vState[foo]])
)
# Unit: microseconds
#           expr   min     lq    mean median     uq   max neval
#         recode 227.1 244.05 305.203 268.05 319.55 591.1   100
#          lists 182.2 198.85 244.964 222.10 254.20 562.6   100
#  lists_no_pipe  11.4  13.25  17.726  15.45  18.70  64.5   100
#        vectors   2.5   3.85   5.269   4.90   6.40  12.9   100

If you want things to be as fast as possible, don't use %>%; it adds overhead. If you are doing complicated things, the extra microseconds from piping don't really matter. But in this case, the operations are already so quick that those few microseconds account for a significant share of the execution time.

You may be able to go even faster, especially if your lookup tables are large, by using a join to a keyed data.table instead.
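For example, the two chained lookups could be collapsed into a single keyed table (a sketch; the one-step code-to-capital mapping here is an assumption, not the original data):

```r
library(data.table)

# Hypothetical one-step lookup table mapping state codes straight to capitals
lut <- data.table(code    = c("A", "T", "G"),
                  capital = c("Juneau", "Austin", "Atlanta"),
                  key     = "code")

foo <- c("T", "A", "T", "G")

# Keyed binary-search join: returns capitals in the order of foo
lut[.(foo), capital]
# [1] "Austin" "Juneau" "Austin" "Atlanta"
```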


