Data.Table Join Then Add Columns to Existing Data.Frame Without Re-Copy

This is easy to do:

X[Y, z := i.z]

It works because the only difference between Y[X] and X[Y] here is the handling of rows of X that have no match in Y; presumably you'd want z to be NA for those rows, which is exactly what the assignment above does.

It would also work just as well for many variables:

X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]

Since you require the operation Y[X], you can add the argument nomatch=0 (as @mnel points out) so that you don't get NAs where X doesn't contain the key values from Y. That is:

X[Y, z := i.z, nomatch=0]
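As a concrete sketch of the update join, with X and Y invented here purely for illustration:

```r
library(data.table)
X <- data.table(id = 1:4, a = letters[1:4], key = "id")
Y <- data.table(id = c(2L, 3L, 5L), z = c(20, 30, 50), key = "id")

X[Y, z := i.z]  # update join: fills z where ids match, NA elsewhere
X
#    id a  z
# 1:  1 a NA
# 2:  2 b 20
# 3:  3 c 30
# 4:  4 d NA
```

Note that Y's id=5 row is simply ignored by the update join, so no extra rows appear in X.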

From the NEWS for data.table

    **********************************************
    **                                          **
    **   CHANGES IN DATA.TABLE VERSION 1.7.10   **
    **                                          **
    **********************************************

NEW FEATURES

o   The prefix i. can now be used in j to refer to join inherited
columns of i that are otherwise masked by columns in x with
the same name.

data.table join and then add all columns from one table to another

Just create a function that takes the column names as arguments and constructs the expression for you, then eval() it each time, passing the names of the columns you need from each data.table. Here's an illustration:

get_expr <- function(x) {
    # 'x' is the vector of column names
    expr = paste0("i.", x)
    expr = lapply(expr, as.name)
    setattr(expr, 'names', x)
    as.call(c(quote(`:=`), expr))
}

> get_expr('value') ## generates the required expression
# `:=`(value = i.value)
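The template, x and y objects aren't defined in the answer; a setup consistent with the output below could look like this (all names and contents here are assumptions reconstructed from the printed result):

```r
library(data.table)
# 'template' holds every id1/id2 combination; x and y each supply
# values for a subset of rows (assumed shapes, not from the answer)
template <- CJ(id1 = c("a", "b"), id2 = 1:5)  # CJ() returns a keyed data.table
x <- data.table(id1 = "a", id2 = 2:4, value = rnorm(3), key = c("id1", "id2"))
y <- data.table(id1 = "b", id2 = 3:5, value = rnorm(3), key = c("id1", "id2"))
```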

template[x, eval(get_expr("value"))]
template[y, eval(get_expr("value"))]

#     id1 id2       value
#  1:   a   1          NA
#  2:   a   2  0.01649728
#  3:   a   3 -0.27918482
#  4:   a   4 -1.16343900
#  5:   a   5          NA
#  6:   b   1          NA
#  7:   b   2          NA
#  8:   b   3  0.86933718
#  9:   b   4  2.26787200
# 10:   b   5  1.08325800

How can I add additional columns to an existing data.frame, that are aligned on one specific column already in the data.frame?

All right, my best guess is your data looks something like this (though probably much bigger):

library(data.table)
set.seed(47)
nn_data_sample = data.table(
  yearbuilt = rep(c(1938, 1942, 1951, 1963), each = 4),
  ZIP = sample(c(90210, 19145, 19146, 19147, 19148, 19149), size = 16, replace = TRUE)
)
nn_data_sample
#    yearbuilt   ZIP
# 1:      1938 19149
# 2:      1938 19146
# 3:      1938 19148
# 4:      1938 19148
# 5:      1942 19147
# 6:      1942 19148
# 7:      1942 19146
# 8:      1942 19146
# 9:      1951 19147

This is nicely formatted data, in long format, which is easy to work with. You seem to want to (a) count rows by zipcode and by the decade they were built (more-or-less, with a little more granularity recently), and then (b) convert the long data (with one zipcode column and one time column) into a wide format, where the times are spread across many columns.

For (a), we will use the cut function to divide the years into the decade-like intervals you want, and then aggregate the rows by zip code and decade.

decade_data = nn_data_sample[, decade_built := cut(
    yearbuilt,
    breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf))
  ][, .(n = .N), by = .(decade_built, ZIP)]

decade_data
#    decade_built   ZIP n
# 1:     (0,1939] 19149 1
# 2:     (0,1939] 19146 1
# 3:     (0,1939] 19148 2
# 4:  (1939,1949] 19147 1
# 5:  (1939,1949] 19148 1
# 6:  (1939,1949] 19146 2
# 7:  (1949,1959] 19147 1
# 8:  (1949,1959] 19149 1
# ...

For a lot of use cases, this is a great format to work with: data.table makes it easy to do things "by group", so if you have more operations you want to do to each decade, this should be your starting point. (Since we used :=, the decade_built column became part of the original data; you can look at it to verify that it worked right.)
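For example, further per-decade summaries follow naturally from this long format. A small self-contained sketch (the stand-in decade_data below mimics the shape of the table computed above):

```r
library(data.table)
# stand-in with the same columns as the decade_data computed above
decade_data <- data.table(decade_built = c("(0,1939]", "(0,1939]", "(1939,1949]"),
                          ZIP = c(19149, 19146, 19147),
                          n = c(1, 1, 1))

# any further "by group" operation works directly on the long format
decade_data[, .(total = sum(n)), by = decade_built]
#    decade_built total
# 1:     (0,1939]     2
# 2:  (1939,1949]     1
```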

But, if you want to change to wide format, dcast does that for us:

dcast(decade_data, ZIP ~ decade_built, value.var = "n")
#      ZIP (0,1939] (1939,1949] (1949,1959] (1959,1969]
# 1: 19146        1           2          NA          NA
# 2: 19147       NA           1           1           2
# 3: 19148        2           1           1          NA
# 4: 19149        1          NA           1           1
# 5: 90210       NA          NA           1           1

If you want to edit the column names, you can either specify what you want from the top, using the labels argument of the cut function, or simply rename the columns at the end. Or do it in the middle, modifying the values of the decade_built column after it's created---do it wherever feels easiest.
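A sketch of the labels approach (the label strings here are invented; pick whatever names you prefer — there must be exactly one label per interval):

```r
library(data.table)
nn_data_sample <- data.table(yearbuilt = c(1938, 1942, 1951, 1963))

# 10 breaks define 9 intervals, so 9 labels are required
nn_data_sample[, decade_built := cut(
  yearbuilt,
  breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf),
  labels = c("pre1940", "1940s", "1950s", "1960s", "1970s",
             "1980s", "1990s", "2000_2004", "post2004")
)]
nn_data_sample
#    yearbuilt decade_built
# 1:      1938      pre1940
# 2:      1942        1940s
# 3:      1951        1950s
# 4:      1963        1960s
```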

How to conditionally replace R data.table columns upon merge?

We can use the on based approach

dt1[dt2, column1 := i.column1, on = .(index_column)]
dt1
#    index_column  column1 column2
# 1:           12      dog     482
# 2:           17      cat     391
# 3:           29  penguin     567
# 4:           34 elephant     182
# 5:           46     bird     121
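For reference, a self-contained sketch reconstructing plausible inputs (the pre-merge values "fish" and "horse" are invented; only the table shapes are taken from the output above):

```r
library(data.table)
dt1 <- data.table(index_column = c(12, 17, 29, 34, 46),
                  column1 = c("dog", "cat", "fish", "horse", "bird"),  # pre-merge values assumed
                  column2 = c(482, 391, 567, 182, 121))
dt2 <- data.table(index_column = c(29, 34),
                  column1 = c("penguin", "elephant"))

# only the rows of dt1 that match dt2 on index_column are overwritten
dt1[dt2, column1 := i.column1, on = .(index_column)]
```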

How to take data from one dataframe and copy it into existing columns in another dataframe based on the shared ID of a third column

We cannot use the original dataset 'df1' columns after the join because it is a left_join. In tidyverse, we specify the unquoted column names. There is no all.x argument in left_join; that argument belongs to merge.

library(dplyr)
left_join(x = df1, y = df2, by = "color.band") %>%
  mutate(y = ifelse(is.na(color.band), bandnum, color.band))

Adding columns to inner join (data.table)

To select columns while merging two data.tables (as you are doing), you should not use the $ operator. You can prefix the names of the columns in A and B (in the merge A[B]) with x. and i. respectively (see below).

What you're missing is that, in your examples, you are selecting the columns in the original dataset (which has 3 rows) and not in the (inner)joined dataset which has 2 rows.
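For reference, a minimal A and B consistent with the results below (their exact contents are assumptions reconstructed from the printed output):

```r
library(data.table)
A <- data.table(id = 1:3, x_val = c("x1", "x2", "x3"))  # 3 rows
B <- data.table(id = 1:2, y_val = c("y1", "y2"))        # inner join keeps 2 rows
```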

A[B, .(A.id = x.id), on = .(id), nomatch = NULL]  # prefixing id with x. selects id from A
#    A.id
# 1:    1
# 2:    2
A[B, .(i.id), on = .(id), nomatch = NULL]  # prefixing id with i. selects id from B
#    i.id
# 1:    1
# 2:    2
A[B, .(A.id = x.id, B.id = i.id, A.x_val = x.x_val, i.y_val), on = .(id), nomatch = NULL]
#    A.id B.id A.x_val i.y_val
# 1:    1    1      x1      y1
# 2:    2    2      x2      y2

How to do an X[Y] data.table join, without losing an existing main key on X?

With secondary keys implemented (since v1.9.6) and the recent bug fix on retaining/discarding keys properly (in v1.9.7), you can now do this using on=:

# join
DT[x2y, on="x"] # key is removed as row order gets changed.

# update using joins
DT[x2y, y:=y, on="x"] # key is retained, as row order isn't changed.
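A small sketch of the difference (DT and x2y are invented for illustration):

```r
library(data.table)
DT  <- data.table(x = c("b", "a", "c"), v = 1:3, key = "x")
x2y <- data.table(x = c("a", "b", "c"), y = c(10, 20, 30))

DT[x2y, on = "x"]          # plain join: result rows follow x2y's order, so the key is dropped
DT[x2y, y := y, on = "x"]  # update join: DT keeps its row order and its key
key(DT)
# [1] "x"
```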

Join then mutate using data.table without intermediate table

It's more efficient to simply add columns to DFI (in an "update join"), rather than making a new table:

DFI[DF_Lookup, on = .(PO_ID, SO_ID, F_Year, Product_ID),
    `:=`(newrev = i.Revenue/.N, newqty = i.Quantity/.N),
    by = .EACHI]

    PO_ID SO_ID F_Year Product_ID Revenue Quantity Location1    newrev newqty
 1: P1234    S1   2012       385X       1        1        MA  16.66667      1
 2: P1234    S1   2012       385X       2        2        NY  16.66667      1
 3: P1234    S1   2012       385X       3        3        WA  16.66667      1
 4: P1234    S2   2013       450X      34        8        NY  35.00000     10
 5: P1234    S2   2013       450X      34        8        WA  35.00000     10
 6: P1234    S2   2013       900X       6        6        NY  35.00000     20
 7: P2345    S3   2011       3700       7        7        IL 100.00000     20
 8: P2345    S4   2011       3700      88        8        IL -50.00000    -10
 9: P3456    S7   2014       A11U       9        9        MN  50.00000     20
10: P4567   S10   2015       2700     100       40        CA 100.00000     40

This is a pretty natural extension of the Q&A linked in the OP.

The by=.EACHI groups by each row of i in x[i, on=, j]; and .N is the number of rows in the group.

If you want the rev and qty cols overwritten, use `:=`(Revenue = i.Revenue/.N, Quantity = i.Quantity/.N).

Copy only one variable from one R data.table to another after matching on a variable

If I understand correctly, the OP wants to append column z from DT1 to DT2 as column y where the id columns match.

With data.table, this can be solved using an update join:

library(data.table)
DT2[DT1, on = .(id2 = id1), y := i.z]
DT2
   id2 x  z  y
1:   1 E 21 NA
2:   2 F 22 11
3:   3 G 23 12
4:   4 H 24 13

Note that DT2 is updated by reference, i.e., without copying the whole data object. This might be handy for OP's large production datasets of millions of rows.

Caveat

This works because the values in id1 and id2 are unique, as is the case in the sample data. So, make sure that you get what you want when you do update joins on duplicate values.

Let's see what will happen if there are duplicate values in the id1 column, e.g.

In case DT1 has id1 == 4 duplicated

(DT1 <- data.table(id1 = c(2:4, 4), x = LETTERS[1:4], z = 11:14))
   id1 x  z
1:   2 A 11
2:   3 B 12
3:   4 C 13
4:   4 D 14

then

DT2[DT1, on = .(id2 = id1), y := i.z][]

returns

   id2 x  z  y
1:   1 E 21 NA
2:   2 F 22 11
3:   3 G 23 12
4:   4 H 24 14

So, the update join

  • has not created additional rows in DT2 (which is probably what you want, to avoid copying a large dataset),
  • has picked the last occurrence of z in case of multiple matches.
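If the last-match behavior is not what you want, it can help to verify uniqueness of the join column before running the update join. A minimal sketch (continuing the example names above):

```r
library(data.table)
DT1 <- data.table(id1 = c(2:4, 4), x = LETTERS[1:4], z = 11:14)

# anyDuplicated() returns 0 when there are no duplicates,
# otherwise the index of the first duplicate
has_dups <- anyDuplicated(DT1$id1) > 0
has_dups
# [1] TRUE
```

You could wrap this in a stop() or warning() before the join, so duplicates fail loudly instead of silently picking the last match.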

