data.table join then add columns to existing data.frame without re-copy
This is easy to do:
X[Y, z := i.z]
It works because the only difference between Y[X] and X[Y] here is when some rows of X have no match in Y, in which case you would presumably want z to be NA, which is exactly what the assignment above does.
It would also work just as well for many variables:
X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]
Since you require the operation Y[X], you can add the argument nomatch=0 (as @mnel points out) so as not to get NAs where X doesn't contain the key values from Y. That is:
X[Y, z := i.z, nomatch=0]
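As a minimal runnable sketch of the idiom (table contents are hypothetical; only the names X, Y, and z come from the answer):

```r
library(data.table)

# Hypothetical keyed tables (names X, Y, z taken from the answer)
X <- data.table(id = 1:4, key = "id")
Y <- data.table(id = 2:3, z = c("a", "b"), key = "id")

# Update join: rows of X matched in Y get z from Y; unmatched rows get NA
X[Y, z := i.z]
# X$z is now NA, "a", "b", NA
```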
From the NEWS for data.table
**********************************************
** **
** CHANGES IN DATA.TABLE VERSION 1.7.10 **
** **
**********************************************
NEW FEATURES
o The prefix i. can now be used in j to refer to join inherited
columns of i that are otherwise masked by columns in x with
the same name.
data.table join and then add all columns from one table to another
Just create a function that takes names as arguments and constructs the expression for you, then eval it each time, passing the names from whichever data.table you require. Here's an illustration:
get_expr <- function(x) {
    # 'x' is the names vector
    expr = paste0("i.", x)
    expr = lapply(expr, as.name)
    setattr(expr, 'names', x)
    as.call(c(quote(`:=`), expr))
}
> get_expr('value') ## generates the required expression
# `:=`(value = i.value)
template[x, eval(get_expr("value"))]
template[y, eval(get_expr("value"))]
# id1 id2 value
# 1: a 1 NA
# 2: a 2 0.01649728
# 3: a 3 -0.27918482
# 4: a 4 -1.16343900
# 5: a 5 NA
# 6: b 1 NA
# 7: b 2 NA
# 8: b 3 0.86933718
# 9: b 4 2.26787200
# 10: b 5 1.08325800
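The template, x, and y tables are not shown in the answer; a construction consistent with the usage above might look like this (column names inferred from the output, values arbitrary):

```r
library(data.table)

# get_expr as defined above
get_expr <- function(x) {
    expr = paste0("i.", x)
    expr = lapply(expr, as.name)
    setattr(expr, 'names', x)
    as.call(c(quote(`:=`), expr))
}

# Hypothetical inputs; names and shape inferred from the output above
template <- CJ(id1 = c("a", "b"), id2 = 1:5)   # all combinations, keyed
x <- data.table(id1 = "a", id2 = 2:4, value = rnorm(3), key = c("id1", "id2"))
y <- data.table(id1 = "b", id2 = 3:5, value = rnorm(3), key = c("id1", "id2"))

template[x, eval(get_expr("value"))]   # fills the rows matched by x
template[y, eval(get_expr("value"))]   # fills the rows matched by y; rest stay NA
```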
How can I add additional columns to an existing data.frame, that are aligned on one specific column already in the data.frame?
All right, my best guess is your data looks something like this (though probably much bigger):
library(data.table)
set.seed(47)
nn_data_sample = data.table(
yearbuilt = rep(c(1938, 1942, 1951, 1963), each = 4),
ZIP = sample(c(90210, 19145, 19146, 19147, 19148, 19149), size = 16, replace = TRUE)
)
nn_data_sample
# yearbuilt ZIP
# 1: 1938 19149
# 2: 1938 19146
# 3: 1938 19148
# 4: 1938 19148
# 5: 1942 19147
# 6: 1942 19148
# 7: 1942 19146
# 8: 1942 19146
# 9: 1951 19147
This is nicely formatted data, in long format, which is easy to work with. You seem to want to (a) count rows by zipcode and by the decade they were built (more-or-less, with a little more granularity recently), and then (b) convert the long data (with one zipcode column and one time column) into a wide format, where the times are spread across many columns.
For (a), we will use the cut function to divide the years into the decade-like intervals you want, and then aggregate the rows by zip code and decade.
decade_data = nn_data_sample[, decade_built := cut(
yearbuilt,
breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf))
][, .(n = .N), by = .(decade_built, ZIP)]
decade_data
# decade_built ZIP n
# 1: (0,1939] 19149 1
# 2: (0,1939] 19146 1
# 3: (0,1939] 19148 2
# 4: (1939,1949] 19147 1
# 5: (1939,1949] 19148 1
# 6: (1939,1949] 19146 2
# 7: (1949,1959] 19147 1
# 8: (1949,1959] 19149 1
# ...
For a lot of use cases, this is a great format to work with: data.table makes it easy to do things "by group", so if you have more operations you want to apply to each decade, this should be your starting point. (Since we used :=, the decade_built column became part of the original data; you can look at it to verify that it worked right.)
But, if you want to change to wide format, dcast does that for us:
dcast(decade_data, ZIP ~ decade_built, value.var = "n")
# ZIP (0,1939] (1939,1949] (1949,1959] (1959,1969]
# 1: 19146 1 2 NA NA
# 2: 19147 NA 1 1 2
# 3: 19148 2 1 1 NA
# 4: 19149 1 NA 1 1
# 5: 90210 NA NA 1 1
If you want to edit the column names, you can either specify what you want from the top, using the labels argument of the cut function, or simply rename the columns at the end. Or do it in the middle, modifying the values of the decade_built column after it's created; do it wherever feels easiest.
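For instance, the same breaks as above can be given human-readable labels up front (the label text here is my own choice):

```r
# Same breaks as above, with labels supplied up front (label text is my choice)
yearbuilt <- c(1938, 1942, 1951, 1963)
decade_built <- cut(
  yearbuilt,
  breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf),
  labels = c("pre-1940", "1940s", "1950s", "1960s", "1970s",
             "1980s", "1990s", "2000-2004", "2005+")
)
as.character(decade_built)
# "pre-1940" "1940s" "1950s" "1960s"
```

With labels set here, the dcast step produces the friendlier column names automatically.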
How to conditionally replace R data.table columns upon merge?
We can use the on-based approach
dt1[dt2, column1 := i.column1, on = .(index_column)]
dt1
# index_column column1 column2
#1: 12 dog 482
#2: 17 cat 391
#3: 29 penguin 567
#4: 34 elephant 182
#5: 46 bird 121
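A self-contained version of that update, with hypothetical input data chosen to be consistent with the output shown (which values came from dt2 is my assumption):

```r
library(data.table)

# Hypothetical data: dt2 carries replacement column1 values for two index rows
dt1 <- data.table(index_column = c(12, 17, 29, 34, 46),
                  column1 = c("dog", "cat", "unknown", "elephant", "unknown"),
                  column2 = c(482, 391, 567, 182, 121))
dt2 <- data.table(index_column = c(29, 46),
                  column1 = c("penguin", "bird"))

# Matched rows take dt2's column1; unmatched rows keep their existing value
dt1[dt2, column1 := i.column1, on = .(index_column)]
```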
How to take data from one dataframe and copy it into existing columns in another dataframe based on the shared ID of a third column
We cannot use the original dataset 'df1' columns after the join because it is a left_join. In tidyverse, we specify the unquoted column names. There is no all.x argument in left_join; that belongs to merge.
library(dplyr)
left_join(x=df1, y=df2, by = "color.band") %>%
mutate(y = ifelse(is.na(color.band), bandnum, color.band))
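To illustrate the equivalence the answer asserts between left_join and merge with all.x (frame contents below are my own toy data):

```r
library(dplyr)

# Toy frames (contents assumed); left_join(x, ...) keeps every row of x,
# exactly like merge(..., all.x = TRUE) in base R
df1 <- data.frame(color.band = c("r1", "g2", "b3"), bandnum = 1:3)
df2 <- data.frame(color.band = c("r1", "b3"), site = c("A", "B"))

out_dplyr <- left_join(df1, df2, by = "color.band")
out_base  <- merge(df1, df2, by = "color.band", all.x = TRUE)
nrow(out_dplyr)  # 3 rows: the unmatched "g2" gets NA for site
```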
Adding columns to inner join (data.table)
To select columns while merging two data.tables (like you are doing), you should not use a dollar symbol. You can prefix the names of the columns in A and B (in the merge A[B]) with x. and i. respectively (see below).
What you're missing is that, in your examples, you are selecting the columns in the original dataset (which has 3 rows) and not in the inner-joined dataset, which has 2 rows.
A[B, .(A.id = x.id), on=.(id), nomatch=NULL] # prefixing id with x. selects id from A
# A.id
# 1: 1
# 2: 2
A[B, .(i.id), on=.(id), nomatch=NULL] # prefixing id with i. selects id from B
# i.id
# 1: 1
# 2: 2
A[B, .(A.id = x.id, B.id= i.id, A.x_val = x.x_val, i.y_val), on=.(id), nomatch=NULL]
# A.id B.id A.x_val i.y_val
# 1: 1 1 x1 y1
# 2: 2 2 x2 y2
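For completeness, A and B reconstructed to be consistent with the outputs shown (my reconstruction; the question's actual tables may differ):

```r
library(data.table)

# A and B reconstructed from the outputs above (values are my assumption)
A <- data.table(id = 1:3, x_val = c("x1", "x2", "x3"))
B <- data.table(id = 1:2, y_val = c("y1", "y2"))

# Inner join keeping columns from both sides via the x. and i. prefixes
A[B, .(A.id = x.id, B.id = i.id, A.x_val = x.x_val, i.y_val),
  on = .(id), nomatch = NULL]
```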
How to do an X[Y] data.table join, without losing an existing main key on X?
With secondary keys implemented (since v1.9.6) and the recent bug fix on retaining/discarding keys properly (in v1.9.7), you can now do this using on=:
# join
DT[x2y, on="x"] # key is removed as row order gets changed.
# update using joins
DT[x2y, y:=y, on="x"] # key is retained, as row order isn't changed.
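A runnable sketch of that difference (the names DT and x2y come from the answer; the contents are hypothetical, and I use the explicit i. prefix for clarity):

```r
library(data.table)

# Hypothetical keyed table and lookup (names from the answer, contents mine)
DT  <- data.table(x = c("c", "a", "b"), v = 1:3, key = "x")  # key sorts rows by x
x2y <- data.table(x = c("b", "c", "a"), y = c(20, 30, 10))

DT[x2y, on = "x"]            # plain join: rows follow x2y's order, key dropped
DT[x2y, y := i.y, on = "x"]  # update join: DT keeps its row order and its key
key(DT)
# [1] "x"
```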
Join then mutate using data.table without intermediate table
It's more efficient to simply add columns to DFI (in an "update join"), rather than making a new table:
DFI[DF_Lookup, on=.(PO_ID, SO_ID, F_Year, Product_ID),
`:=`(newrev = i.Revenue/.N, newqty = i.Quantity/.N)
, by=.EACHI]
PO_ID SO_ID F_Year Product_ID Revenue Quantity Location1 newrev newqty
1: P1234 S1 2012 385X 1 1 MA 16.66667 1
2: P1234 S1 2012 385X 2 2 NY 16.66667 1
3: P1234 S1 2012 385X 3 3 WA 16.66667 1
4: P1234 S2 2013 450X 34 8 NY 35.00000 10
5: P1234 S2 2013 450X 34 8 WA 35.00000 10
6: P1234 S2 2013 900X 6 6 NY 35.00000 20
7: P2345 S3 2011 3700 7 7 IL 100.00000 20
8: P2345 S4 2011 3700 88 8 IL -50.00000 -10
9: P3456 S7 2014 A11U 9 9 MN 50.00000 20
10: P4567 S10 2015 2700 100 40 CA 100.00000 40
This is a pretty natural extension of the Q&A linked in the OP.
The by=.EACHI groups by each row of i in x[i, on=, j]; and .N is how many rows the group has.
If you want the rev and qty cols overwritten, use `:=`(Revenue = i.Revenue/.N, Quantity = i.Quantity/.N).
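The by=.EACHI and .N mechanics can be seen on a toy example (not the OP's data; names and values are mine):

```r
library(data.table)

# Toy illustration: spread each lookup amount evenly
# over the detail rows that match it
detail <- data.table(id = c("a", "a", "b"), loc = c("MA", "NY", "WA"))
lookup <- data.table(id = c("a", "b"), amount = c(10, 6))

# by=.EACHI evaluates j once per row of lookup; .N is the match count
detail[lookup, on = .(id), share := i.amount / .N, by = .EACHI]
detail$share
# 5 5 6
```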
Copy only one variable from one R data.table to another after matching on a variable
If I understand correctly, the OP wants to append column z from DT1 to DT2 as column y where the id columns match.
With data.table, this can be solved using an update join:
library(data.table)
DT2[DT1, on = .(id2 = id1), y := i.z]
DT2
id2 x z y
1: 1 E 21 NA
2: 2 F 22 11
3: 3 G 23 12
4: 4 H 24 13
Note that DT2 is updated by reference, i.e., without copying the whole data object. This might be handy for the OP's large production datasets of millions of rows.
Caveat
This works because id1 and id2 are unique, which is the case for the sample use case. So make sure that you get what you want when you do update joins on duplicate values.
Let's see what happens if there are duplicate values in the id1 column, e.g., in case DT1 has id1 == 4 duplicated:
(DT1 <- data.table(id1 = c(2:4, 4), x = LETTERS[1:4], z = 11:14))
id1 x z
1: 2 A 11
2: 3 B 12
3: 4 C 13
4: 4 D 14
then
DT2[DT1, on = .(id2 = id1), y := i.z][]
returns
id2 x z y
1: 1 E 21 NA
2: 2 F 22 11
3: 3 G 23 12
4: 4 H 24 14
So, the update join
- has not created additional rows in DT2 (which is probably what you want, since it avoids copying a large dataset),
- has picked the last occurrence of z in case of multiple matches.
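If you would rather have the first occurrence win, one option (my suggestion, not from the answer) is to deduplicate DT1 before joining, since unique() keeps the first row per key:

```r
library(data.table)

# Tables reconstructed from the answer (values taken from the output above)
DT2 <- data.table(id2 = 1:4, x = LETTERS[5:8], z = 21:24)
DT1 <- data.table(id1 = c(2:4, 4), x = LETTERS[1:4], z = 11:14)

# Deduplicate DT1 first so the *first* occurrence wins instead of the last
DT2[unique(DT1, by = "id1"), on = .(id2 = id1), y := i.z]
DT2$y
# NA 11 12 13
```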