Data.Table Join Then Add Columns to Existing Data.Frame Without Re-Copy

This is easy to do:

X[Y, z := i.z]

It works because the only difference between Y[X] and X[Y] here is the handling of rows of X that have no match in Y; presumably you'd want z to be NA for those rows, which is exactly what the assignment above does.

It would also work just as well for many variables:

X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]

Since you require the operation Y[X], you can add the argument nomatch=0 (as @mnel points out) so that you don't get NAs where X doesn't contain the key values from Y. That is:

X[Y, z := i.z, nomatch=0]
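As a concrete sketch of the update join, with X and Y invented here purely for illustration:

```r
library(data.table)
X <- data.table(id = 1:4, a = letters[1:4], key = "id")
Y <- data.table(id = c(2L, 3L, 5L), z = c(20, 30, 50), key = "id")

X[Y, z := i.z]  # update join: fills z where ids match, NA elsewhere
X
#    id a  z
# 1:  1 a NA
# 2:  2 b 20
# 3:  3 c 30
# 4:  4 d NA
```

Note that Y's id=5 row is simply ignored by the update join, so no extra rows appear in X.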

From the NEWS for data.table

    **********************************************
    **                                          **
    **   CHANGES IN DATA.TABLE VERSION 1.7.10   **
    **                                          **
    **********************************************

NEW FEATURES

o   The prefix i. can now be used in j to refer to join inherited
columns of i that are otherwise masked by columns in x with
the same name.

data.table join and then add all columns from one table to another

Just create a function that takes the column names as arguments and constructs the expression for you, then eval() it each time, passing the names of the columns you need from each data.table. Here's an illustration:

get_expr <- function(x) {
    # 'x' is the vector of column names
    expr = paste0("i.", x)
    expr = lapply(expr, as.name)
    setattr(expr, 'names', x)
    as.call(c(quote(`:=`), expr))
}

> get_expr('value') ## generates the required expression
# `:=`(value = i.value)
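The template, x and y objects aren't defined in the answer; a setup consistent with the output below could look like this (all names and contents here are assumptions reconstructed from the printed result):

```r
library(data.table)
# 'template' holds every id1/id2 combination; x and y each supply
# values for a subset of rows (assumed shapes, not from the answer)
template <- CJ(id1 = c("a", "b"), id2 = 1:5)  # CJ() returns a keyed data.table
x <- data.table(id1 = "a", id2 = 2:4, value = rnorm(3), key = c("id1", "id2"))
y <- data.table(id1 = "b", id2 = 3:5, value = rnorm(3), key = c("id1", "id2"))
```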

template[x, eval(get_expr("value"))]
template[y, eval(get_expr("value"))]

#     id1 id2       value
#  1:   a   1          NA
#  2:   a   2  0.01649728
#  3:   a   3 -0.27918482
#  4:   a   4 -1.16343900
#  5:   a   5          NA
#  6:   b   1          NA
#  7:   b   2          NA
#  8:   b   3  0.86933718
#  9:   b   4  2.26787200
# 10:   b   5  1.08325800

How can I add additional columns to an existing data.frame, that are aligned on one specific column already in the data.frame?

All right, my best guess is your data looks something like this (though probably much bigger):

library(data.table)
set.seed(47)
nn_data_sample = data.table(
  yearbuilt = rep(c(1938, 1942, 1951, 1963), each = 4),
  ZIP = sample(c(90210, 19145, 19146, 19147, 19148, 19149), size = 16, replace = TRUE)
)
nn_data_sample
#    yearbuilt   ZIP
# 1:      1938 19149
# 2:      1938 19146
# 3:      1938 19148
# 4:      1938 19148
# 5:      1942 19147
# 6:      1942 19148
# 7:      1942 19146
# 8:      1942 19146
# 9:      1951 19147

This is nicely formatted data, in long format, which is easy to work with. You seem to want to (a) count rows by zipcode and by the decade they were built (more-or-less, with a little more granularity recently), and then (b) convert the long data (with one zipcode column and one time column) into a wide format, where the times are spread across many columns.

For (a), we will use the cut function to divide the years into the decade-like intervals you want, and then aggregate the rows by zip code and decade.

decade_data = nn_data_sample[, decade_built := cut(
    yearbuilt,
    breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf))
  ][, .(n = .N), by = .(decade_built, ZIP)]

decade_data
#    decade_built   ZIP n
# 1:     (0,1939] 19149 1
# 2:     (0,1939] 19146 1
# 3:     (0,1939] 19148 2
# 4:  (1939,1949] 19147 1
# 5:  (1939,1949] 19148 1
# 6:  (1939,1949] 19146 2
# 7:  (1949,1959] 19147 1
# 8:  (1949,1959] 19149 1
# ...

For a lot of use cases, this is a great format to work with: data.table makes it easy to do things "by group", so if you have more operations you want to do to each decade, this should be your starting point. (Since we used :=, the decade_built column became part of the original data; you can look at it to verify that it worked right.)
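For example, further per-decade summaries follow naturally from this long format. A small self-contained sketch (the stand-in decade_data below mimics the shape of the table computed above):

```r
library(data.table)
# stand-in with the same columns as the decade_data computed above
decade_data <- data.table(decade_built = c("(0,1939]", "(0,1939]", "(1939,1949]"),
                          ZIP = c(19149, 19146, 19147),
                          n = c(1, 1, 1))

# any further "by group" operation works directly on the long format
decade_data[, .(total = sum(n)), by = decade_built]
#    decade_built total
# 1:     (0,1939]     2
# 2:  (1939,1949]     1
```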

But, if you want to change to wide format, dcast does that for us:

dcast(decade_data, ZIP ~ decade_built, value.var = "n")
#      ZIP (0,1939] (1939,1949] (1949,1959] (1959,1969]
# 1: 19146        1           2          NA          NA
# 2: 19147       NA           1           1           2
# 3: 19148        2           1           1          NA
# 4: 19149        1          NA           1           1
# 5: 90210       NA          NA           1           1

If you want to edit the column names, you can either specify what you want from the top, using the labels argument of the cut function, or simply rename the columns at the end. Or do it in the middle, modifying the values of the decade_built column after it's created---do it wherever feels easiest.
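A sketch of the labels approach (the label strings here are invented; pick whatever names you prefer — there must be exactly one label per interval):

```r
library(data.table)
nn_data_sample <- data.table(yearbuilt = c(1938, 1942, 1951, 1963))

# 10 breaks define 9 intervals, so 9 labels are required
nn_data_sample[, decade_built := cut(
  yearbuilt,
  breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf),
  labels = c("pre1940", "1940s", "1950s", "1960s", "1970s",
             "1980s", "1990s", "2000_2004", "post2004")
)]
nn_data_sample
#    yearbuilt decade_built
# 1:      1938      pre1940
# 2:      1942        1940s
# 3:      1951        1950s
# 4:      1963        1960s
```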

How to conditionally replace R data.table columns upon merge?

We can use the on based approach

dt1[dt2, column1 := i.column1, on = .(index_column)]
dt1
#    index_column  column1 column2
# 1:           12      dog     482
# 2:           17      cat     391
# 3:           29  penguin     567
# 4:           34 elephant     182
# 5:           46     bird     121
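For reference, a self-contained sketch reconstructing plausible inputs (the pre-merge values "fish" and "horse" are invented; only the table shapes are taken from the output above):

```r
library(data.table)
dt1 <- data.table(index_column = c(12, 17, 29, 34, 46),
                  column1 = c("dog", "cat", "fish", "horse", "bird"),  # pre-merge values assumed
                  column2 = c(482, 391, 567, 182, 121))
dt2 <- data.table(index_column = c(29, 34),
                  column1 = c("penguin", "elephant"))

# only the rows of dt1 that match dt2 on index_column are overwritten
dt1[dt2, column1 := i.column1, on = .(index_column)]
```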

How to take data from one dataframe and copy it into existing columns in another dataframe based on the shared ID of a third column

We cannot use the original dataset 'df1' columns after the join because it is a left_join. In tidyverse, we specify the unquoted column names. There is no all.x argument in left_join; that argument belongs to merge.

library(dplyr)
left_join(x = df1, y = df2, by = "color.band") %>%
  mutate(y = ifelse(is.na(color.band), bandnum, color.band))

Adding columns to inner join (data.table)

To select columns while merging two data.tables (as you are doing), you should not use the $ operator. You can prefix the names of the columns in A and B (in the merge A[B]) with x. and i. respectively (see below).

What you're missing is that, in your examples, you are selecting the columns in the original dataset (which has 3 rows) and not in the (inner)joined dataset which has 2 rows.
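For reference, a minimal A and B consistent with the results below (their exact contents are assumptions reconstructed from the printed output):

```r
library(data.table)
A <- data.table(id = 1:3, x_val = c("x1", "x2", "x3"))  # 3 rows
B <- data.table(id = 1:2, y_val = c("y1", "y2"))        # inner join keeps 2 rows
```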

A[B, .(A.id = x.id), on = .(id), nomatch = NULL]  # prefixing id with x. selects id from A
#    A.id
# 1:    1
# 2:    2
A[B, .(i.id), on = .(id), nomatch = NULL]  # prefixing id with i. selects id from B
#    i.id
# 1:    1
# 2:    2
A[B, .(A.id = x.id, B.id = i.id, A.x_val = x.x_val, i.y_val), on = .(id), nomatch = NULL]
#    A.id B.id A.x_val i.y_val
# 1:    1    1      x1      y1
# 2:    2    2      x2      y2

How to do an X[Y] data.table join, without losing an existing main key on X?

With secondary keys implemented (since v1.9.6) and the recent bug fix on retaining/discarding keys properly (in v1.9.7), you can now do this using on=:

# join
DT[x2y, on="x"] # key is removed as row order gets changed.

# update using joins
DT[x2y, y:=y, on="x"] # key is retained, as row order isn't changed.
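A small sketch of the difference (DT and x2y are invented for illustration):

```r
library(data.table)
DT  <- data.table(x = c("b", "a", "c"), v = 1:3, key = "x")
x2y <- data.table(x = c("a", "b", "c"), y = c(10, 20, 30))

DT[x2y, on = "x"]          # plain join: result rows follow x2y's order, so the key is dropped
DT[x2y, y := y, on = "x"]  # update join: DT keeps its row order and its key
key(DT)
# [1] "x"
```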

Join then mutate using data.table without intermediate table

It's more efficient to simply add columns to DFI (in an "update join"), rather than making a new table:

DFI[DF_Lookup, on = .(PO_ID, SO_ID, F_Year, Product_ID),
    `:=`(newrev = i.Revenue/.N, newqty = i.Quantity/.N),
    by = .EACHI]

    PO_ID SO_ID F_Year Product_ID Revenue Quantity Location1    newrev newqty
 1: P1234    S1   2012       385X       1        1        MA  16.66667      1
 2: P1234    S1   2012       385X       2        2        NY  16.66667      1
 3: P1234    S1   2012       385X       3        3        WA  16.66667      1
 4: P1234    S2   2013       450X      34        8        NY  35.00000     10
 5: P1234    S2   2013       450X      34        8        WA  35.00000     10
 6: P1234    S2   2013       900X       6        6        NY  35.00000     20
 7: P2345    S3   2011       3700       7        7        IL 100.00000     20
 8: P2345    S4   2011       3700      88        8        IL -50.00000    -10
 9: P3456    S7   2014       A11U       9        9        MN  50.00000     20
10: P4567   S10   2015       2700     100       40        CA 100.00000     40

This is a pretty natural extension of the Q&A linked in the OP.

The by=.EACHI groups by each row of i in x[i, on=, j]; and .N is the number of rows in the group.

If you want the rev and qty cols overwritten, use `:=`(Revenue = i.Revenue/.N, Quantity = i.Quantity/.N).

Copy only one variable from one R data.table to another after matching on a variable

If I understand correctly, the OP wants to append column z from DT1 to DT2 as column y where the id columns match.

With data.table, this can be solved using an update join:

library(data.table)
DT2[DT1, on = .(id2 = id1), y := i.z]
DT2
   id2 x  z  y
1:   1 E 21 NA
2:   2 F 22 11
3:   3 G 23 12
4:   4 H 24 13

Note that DT2 is updated by reference, i.e., without copying the whole data object. This might be handy for OP's large production datasets of millions of rows.

Caveat

This works because the values in id1 and id2 are unique, as is the case in the sample data. So, make sure that you get what you want when you do update joins on duplicate values.

Let's see what will happen if there are duplicate values in the id1 column, e.g.

In case DT1 has id1 == 4 duplicated

(DT1 <- data.table(id1 = c(2:4, 4), x = LETTERS[1:4], z = 11:14))
   id1 x  z
1:   2 A 11
2:   3 B 12
3:   4 C 13
4:   4 D 14

then

DT2[DT1, on = .(id2 = id1), y := i.z][]

returns

   id2 x  z  y
1:   1 E 21 NA
2:   2 F 22 11
3:   3 G 23 12
4:   4 H 24 14

So, the update join

  • has not created additional rows in DT2 (which is probably what you want, to avoid copying a large dataset),
  • has picked the last occurrence of z in case of multiple matches.
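If the last-match behavior is not what you want, it can help to verify uniqueness of the join column before running the update join. A minimal sketch (continuing the example names above):

```r
library(data.table)
DT1 <- data.table(id1 = c(2:4, 4), x = LETTERS[1:4], z = 11:14)

# anyDuplicated() returns 0 when there are no duplicates,
# otherwise the index of the first duplicate
has_dups <- anyDuplicated(DT1$id1) > 0
has_dups
# [1] TRUE
```

You could wrap this in a stop() or warning() before the join, so duplicates fail loudly instead of silently picking the last match.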

