Join two data tables and use only one column from second dt
To perform a left join on dt1 and add the H column from dt2, you can combine a binary join with the update-by-reference operator (:=):
setkey(setDT(dt1), A)
dt1[dt2, H := i.H]
See here and here for a detailed explanation of how it works.
With data.table v1.9.5+, we can make it even shorter by specifying the key within setDT (as pointed out by @Arun):
setDT(dt1, key = "A")[dt2, H := i.H]
Edit 24/7/2015: You can now run a binary join using the new on parameter, without setting keys:
setDT(dt1)[dt2, H := i.H, on = c(A = "E")]
How to combine two or more columns of data table to one column?
You may use tidyr::unite:
dt.test <- tidyr::unite(dt.test, date, year:hour, sep = '-')
dt.test
# date id type
#1: 2018-01-01-00 8750 ist
#2: 2018-01-02-01 3048 plan
#3: 2018-01-03-02 3593 ist
#4: 2018-01-04-03 8475 plan
Combine two data tables in R by a condition referring to two columns
So, we have two solutions that work!
Version 1:
Adapted from Frank's comment above:
library(dplyr)
dt2[col1 > col2, c("col1", "col2") := .(col2, col1)] # swap in place so pairs are consistently ordered
final <- dt1[dt2, on = .(col1, col2)]
final <- select(final, col1, col2, x, y) # keep the relevant columns
final
col1 col2 x y
1: bb zz 29 34
2: aa bb 130 567
3: cc dd 122 56
4: dd ff 85 101
Version 2: This is just a tweak of PritamJ's answer that simplifies a few things and makes this solution more applicable to large data tables. Hope it helps other people as well!
library(dplyr)
dt1$pairs <- paste(dt1$col1, dt1$col2) # new column with col1 and col2 merged into one
dt2$pairs <- paste(dt2$col1, dt2$col2) # same here
dt2$revpairs <- paste(dt2$col2, dt2$col1) # creates new column with reverse pairs
f1 <- merge(dt1, dt2, by="pairs") # merge by pairs as they are in dt1
f1 <- select(f1, col1.x, col2.x, x, y) # select by name (easier for big dt)
f2 <- merge(dt1, dt2, by.x = "pairs", by.y = "revpairs") # merge by pairs and reverse pairs
colnames(f2)[ncol(f2)] <- "revpairs" # rename last column because it has the same name as the first, which can cause errors
f2 <- select(f2, col1.x, col2.x, x, y)
final <- bind_rows(f2, f1) # bind the two together
colnames(final)[1:2] <- c("col1", "col2") # this is not necessary, just for clarity
final
col1 col2 x y
1: aa bb 130 567
2: bb zz 29 34
3: dd ff 85 101
4: cc dd 122 56
data.table join (multiple) selected columns with new names
Updated answer based on Arun's recommendation:
cols_old <- c('i.a', 'i.b')
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]
You could also generate cols_old by doing:
paste0('i.', gsub('_new', '', cols_new, fixed = TRUE))
See history for the old answer.
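A minimal self-contained sketch of the mget() update-join idiom above; the data and the cols_new values here are made up for illustration, not taken from the question:

```r
library(data.table)

DT1 <- data.table(id = 1:3)
DT2 <- data.table(id = 1:3, a = letters[1:3], b = LETTERS[1:3])

cols_new <- c("a_new", "b_new")                                     # desired names in DT1
cols_old <- paste0("i.", gsub("_new", "", cols_new, fixed = TRUE))  # "i.a", "i.b"

# update join: pull DT2's columns into DT1 under the new names
DT1[DT2, (cols_new) := mget(cols_old), on = "id"]
DT1
#    id a_new b_new
# 1:  1     a     A
# 2:  2     b     B
# 3:  3     c     C
```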
r data.table Join In Place Multiple Columns
I should have looked at one more question, which linked to this awesome reference. All I needed to do was use the functional form of the := operator.
dt2[dt1, `:=` (col2 = i.col2,
col3 = i.col3)]
dt2
col1 another_col and_anouther col2 col3
1: a 3 FALSE 1 TRUE
2: a 8 TRUE 1 TRUE
3: a 8 TRUE 1 TRUE
4: b 2 TRUE 2 FALSE
5: b 7 FALSE 2 FALSE
6: b 10 TRUE 2 FALSE
7: b 4 FALSE 2 FALSE
8: c 4 TRUE 3 FALSE
9: c 5 TRUE 3 FALSE
10: c 8 TRUE 3 FALSE
Data Table R: Merge selected columns from multiple data.table
Just change by = "ID" to by = c("ID", "FDR", "logFC"), and pass the allow.cartesian argument inside merge:
DT.comb <- Reduce(function(...) merge.data.table(...,
    by = c("ID", "FDR", "logFC"), all = TRUE, allow.cartesian = TRUE), dt.list)
R data.table join two tables and keep all rows
This is a cross join: assign a constant Key to both tables to enable the merge.
DT1$Key <- 1
DT2$Key <- 1
DT3 <- merge(DT1, DT2, by = 'Key')
DT3 # DT3$Key <- NULL removes the helper key afterwards
Key ID_1 val_1 ID_2 val_2
1: 1 1 1 3 3
2: 1 1 1 4 4
3: 1 2 2 3 3
4: 1 2 2 4 4
Which data.table syntax for left join (one column) to prefer
I prefer the "update join" idiom for efficiency and maintainability:**
DT[WHERE, v := FROM[.SD, on=, x.v]]
It's an extension of what is shown in vignette("datatable-reference-semantics")
under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.
This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].

It makes my code more readable since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance since, if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.
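For instance, the NA-filling use just mentioned can be sketched like this (DT and FROM here are made-up data):

```r
library(data.table)

DT   <- data.table(id = c("a", "b", "c"), v = c("keep", NA, NA))
FROM <- data.table(id = c("b", "c"), v = c("fill_b", "fill_c"))

# only the NA rows are touched; existing values of v survive
DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]
DT
#    id      v
# 1:  a   keep
# 2:  b fill_b
# 3:  c fill_c
```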
Compared with the other update join approach, DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:
library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
id = c("c", "d", "e", "e"),
ord = 1:4,
comment = c("big", "slow", "nice", "nooice")
)
# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu)
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu)
# Detected that j uses these columns: comment,i.comment
# Assigning to 4 row subset of 10 rows
# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
# Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)
id amount comment comment2
1: a 0.20000990 <NA> <NA>
2: b 1.42146573 <NA> <NA>
3: c 0.73047544 big big
4: d 0.04128676 slow slow
5: e 0.82195377 nooice nice
6: f 0.39013550 <NA> nooice
7: g 0.27019768 <NA> <NA>
8: h 0.36017876 <NA> <NA>
9: i 1.81865721 <NA> <NA>
10: j 4.86711754 <NA> <NA>
In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative: it says there are four rows being updated, but doesn't say that one row is being updated twice.
I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.
** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.
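As a small illustration of one of those rules, mult = "first" resolves duplicate matches explicitly instead of relying on silent overwrites (DT and FROM here are made-up data):

```r
library(data.table)

DT   <- data.table(id = c("a", "b", "c"))
FROM <- data.table(id = c("a", "a", "b"), v = c("x", "y", "z"))

# take only the first match per row of DT; unmatched rows get NA
DT[, v := FROM[.SD, on = .(id), x.v, mult = "first"]]
DT
#    id    v
# 1:  a    x
# 2:  b    z
# 3:  c <NA>
```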
Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row
Another related example: Left join using data.table
Join two tables with common column names but no related data
I think what you are looking for is a UNION query, like below:
select userid, username, incomeid, incomeamount, null as expenseid, null as expenseamount
from table1
union
select userid, username, null as incomeid, null as incomeamount, expenseid, expenseamount
from table2
R: Group and merge datatables without resulting in dataframe
Update:
Here are two ways to go from your dt1
and dt2
to your dt
:
cbind(dt1, dt2[,3:4])
# prm1 prm2 obs1 obs2 obs3 obs4
# 1: 1 1 1 7 13 19
# 2: 1 1 2 8 14 20
# 3: 2 2 3 9 15 21
# 4: 2 2 4 10 16 22
# 5: 3 3 5 11 17 23
# 6: 3 3 6 12 18 24
That is obviously very sensitive to the number and order of rows, and will break without much effort.
An alternative is to add a "row-number within a group" (using your group assumption), and include that in the join parameters.
dt1[,n := seq_len(.N), by = .(prm1, prm2)]
dt2[,n := seq_len(.N), by = .(prm1, prm2)]
merge(dt1, dt2, by = c("prm1", "prm2", "n"))
# prm1 prm2 n obs1 obs2 obs3 obs4
# 1: 1 1 1 1 7 13 19
# 2: 1 1 2 2 8 14 20
# 3: 2 2 1 3 9 15 21
# 4: 2 2 2 4 10 16 22
# 5: 3 3 1 5 11 17 23
# 6: 3 3 2 6 12 18 24
This relies entirely on the inference that row number within a prm1/prm2 group is meaningful.
If neither of these work with the real data, then either (a) there is a bit of post-merge filtering that needs to be done (contextual, so I don't know), or (b) we have a problem. The problem with the same merge but without n is that each group has more than one row on either or both sides, meaning there will be a cartesian expansion.
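To see that cartesian expansion concretely, here is a tiny made-up example where both sides have two rows per group:

```r
library(data.table)

dt1 <- data.table(prm1 = c(1, 1), prm2 = c(1, 1), obs1 = 1:2)
dt2 <- data.table(prm1 = c(1, 1), prm2 = c(1, 1), obs2 = 7:8)

# without the row-number column n, every row of dt1 matches every
# row of dt2 within the group: 2 x 2 = 4 rows instead of 2
nrow(merge(dt1, dt2, by = c("prm1", "prm2")))
# [1] 4
```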
Don't group_by your data; this is doing two things:

1. Changing the class from data.table to grouped_df, tbl_df. This is because group_by is a dplyr function that operates by adding an attribute to the frame that indicates which rows belong to each group. Only tbl_df-based functions honor this attribute, so it needs to change the class from data.table to tbl_df (with grouped_df, etc).

2. It is doing something for no reason, so it is wasting time (though really not much). The theory behind the merge/join is that the frames will be column-wise combined based on the keys. Your "grouping" is intended (I think) to ensure that only matching param* variables are joined, when in fact the way to think of it is in the parlance of joins, typically one of: full, semi, left, right, inner, or anti. All of these are natively supported in dplyr with well-named functions; most are enabled directly in base::merge and data.table::merge.

Two links are really good at explaining and visualizing what is going on with the various types of merge: How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272.
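You can see the class change directly in a small sketch (dt here is made-up data); note that the data.table class is dropped entirely after group_by:

```r
library(data.table)
library(dplyr)

dt <- data.table(g = c(1, 1, 2), x = 1:3)
class(dt)
# [1] "data.table" "data.frame"
class(group_by(dt, g))
# [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
```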
The default behavior of base::merge (and, though not documented as such, data.table::merge as well) in the absence of an explicit by= argument (or by.x/by.y) is to infer the columns based on intersect(names(x), names(y)), which is very often the desired behavior. I discourage this in programmatic use, though, as it can lead to mistakes when the data is not always shaped/named perfectly. (The dplyr join verbs all provide messages when inference is made.)
If we start with your original not-grouped (and therefore still-data.table) dt1 and dt2 objects, then we should be able to do one of the following, preserving the data.table class:
# inferential "by" columns, not great
merge(dt1, dt2)
# default behavior, now explicit
merge(dt1, dt2, by = intersect(names(dt1), names(dt2)))
# slightly better: it will error if any params are in x and not y, our assumption
merge(dt1, dt2, by = grep("^param", names(dt1), value = TRUE))
# data.table-esque, same left-join (no full-join in this format)
dt1[dt2, on = intersect(names(dt1), names(dt2))]
One good reference for data.table joins: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html