Join Two Data Tables and Use Only One Column from Second Dt

In order to perform a left join onto dt1 and add the H column from dt2, you can combine a binary join with the update-by-reference operator (:=):

setkey(setDT(dt1), A)  # convert to data.table and key by the join column A
dt1[dt2, H := i.H]     # i.H refers to dt2's H column

See here and here for a detailed explanation of how this works.


With the development version (v >= 1.9.5) we can make it even shorter by specifying the key within setDT (as pointed out by @Arun):

setDT(dt1, key = "A")[dt2, H := i.H]

Edit (24 Jul 2015):

You can now run a binary join using the new on argument, without setting keys:

setDT(dt1)[dt2, H := i.H, on = c(A = "E")]
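For a self-contained illustration, here is a hypothetical pair of tables (the column names A, E and H are assumptions chosen to match the snippets above):

library(data.table)
dt1 <- data.frame(A = c("a", "b", "c"), x = 1:3)  # plain data.frame, hence setDT
dt2 <- data.table(E = c("a", "c"), H = c(10L, 30L))

setDT(dt1)[dt2, H := i.H, on = c(A = "E")]  # update join: adds H in place
dt1
#    A x  H
# 1: a 1 10
# 2: b 2 NA
# 3: c 3 30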

How to combine two or more columns of a data.table into one column?

You may use tidyr::unite -

dt.test <- tidyr::unite(dt.test, date, year:hour, sep = '-')
dt.test

#             date   id type
#1: 2018-01-01-00 8750  ist
#2: 2018-01-02-01 3048 plan
#3: 2018-01-03-02 3593  ist
#4: 2018-01-04-03 8475 plan
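For reference, a dt.test consistent with this output could be built as follows (a reconstruction, so treat it as an assumption; the date parts need to be zero-padded character columns for the united string to match):

library(data.table)
dt.test <- data.table(
  year  = "2018",
  month = "01",
  day   = sprintf("%02d", 1:4),  # "01" .. "04"
  hour  = sprintf("%02d", 0:3),  # "00" .. "03"
  id    = c(8750L, 3048L, 3593L, 8475L),
  type  = c("ist", "plan", "ist", "plan")
)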

Combine two data tables in R by a condition referring to two columns

So, we have two solutions that work!
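For concreteness, both versions below assume input tables reconstructed from the final outputs (an assumption; your row order may differ). Note that Version 1 modifies dt2 by reference, so start Version 2 from fresh copies:

library(data.table)
dt1 <- data.table(col1 = c("aa", "bb", "cc", "dd"),
                  col2 = c("bb", "zz", "dd", "ff"),
                  x    = c(130, 29, 122, 85))
dt2 <- data.table(col1 = c("zz", "bb", "dd", "ff"),
                  col2 = c("bb", "aa", "cc", "dd"),
                  y    = c(34, 567, 56, 101))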

Version 1:
Adapted from Frank's comment above:

library(data.table)
library(dplyr)
dt2[col1 > col2, c("col1", "col2") := .(col2, col1)]  # order each pair so col1 <= col2 (modifies dt2 by reference)
final <- dt1[dt2, on = .(col1, col2)]                 # join on the ordered pairs
final <- select(final, col1, col2, x, y)              # select relevant columns
final
   col1 col2   x   y
1:   bb   zz  29  34
2:   aa   bb 130 567
3:   cc   dd 122  56
4:   dd   ff  85 101

Version 2: This is just a tweak of PritamJ's answer that simplifies a few things and makes the solution more applicable to large data tables. Hope it helps other people as well!

library(dplyr)
dt1$pairs <- paste(dt1$col1, dt1$col2)     # new column: col1 and col2 merged into one
dt2$pairs <- paste(dt2$col1, dt2$col2)     # same here
dt2$revpairs <- paste(dt2$col2, dt2$col1)  # new column with the reversed pairs

f1 <- merge(dt1, dt2, by = "pairs")        # merge by pairs as they are in dt1
f1 <- select(f1, col1.x, col2.x, x, y)     # select by name (easier for big dt)

f2 <- merge(dt1, dt2, by.x = "pairs", by.y = "revpairs")  # merge pairs against reversed pairs
colnames(f2)[ncol(f2)] <- "revpairs"  # rename last column; it duplicates the first, which can cause errors
f2 <- select(f2, col1.x, col2.x, x, y)

final <- bind_rows(f2, f1)                 # bind the two together
colnames(final)[1:2] <- c("col1", "col2")  # not strictly necessary, just for clarity
final
final
   col1 col2   x   y
1:   aa   bb 130 567
2:   bb   zz  29  34
3:   dd   ff  85 101
4:   cc   dd 122  56

data.table join (multiple) selected columns with new names

Updated answer based on Arun's recommendation:

cols_new <- c('a_new', 'b_new')  # the desired new names (example values)
cols_old <- c('i.a', 'i.b')      # the i. prefix refers to DT2's columns
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]

You could also generate cols_old from cols_new by doing:

paste0('i.', gsub('_new', '', cols_new, fixed = TRUE))
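Putting it together with hypothetical DT1/DT2 (names and values assumed for illustration):

library(data.table)
DT1 <- data.table(id = 1:3)
DT2 <- data.table(id = 1:3, a = c(10, 20, 30), b = c("x", "y", "z"))

cols_new <- c("a_new", "b_new")
cols_old <- paste0("i.", gsub("_new", "", cols_new, fixed = TRUE))  # "i.a" "i.b"
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]
DT1
#    id a_new b_new
# 1:  1    10     x
# 2:  2    20     y
# 3:  3    30     z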


r data.table Join In Place Multiple Columns

I should have looked at one more question, which linked to this awesome reference. All I needed to do was use the functional form of the := operator.
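For a self-contained run, here are inputs reconstructed from the printed result (an assumption), keyed on col1 so the join below works without an explicit on=:

library(data.table)
dt2 <- data.table(col1 = rep(c("a", "b", "c"), c(3, 4, 3)),
                  another_col = c(3, 8, 8, 2, 7, 10, 4, 4, 5, 8),
                  and_anouther = c(FALSE, TRUE, TRUE, TRUE, FALSE,
                                   TRUE, FALSE, TRUE, TRUE, TRUE))
dt1 <- data.table(col1 = c("a", "b", "c"),
                  col2 = 1:3,
                  col3 = c(TRUE, FALSE, FALSE))
setkey(dt2, col1)
setkey(dt1, col1)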

dt2[dt1, `:=`(col2 = i.col2,
              col3 = i.col3)]

dt2
    col1 another_col and_anouther col2  col3
 1:    a           3        FALSE    1  TRUE
 2:    a           8         TRUE    1  TRUE
 3:    a           8         TRUE    1  TRUE
 4:    b           2         TRUE    2 FALSE
 5:    b           7        FALSE    2 FALSE
 6:    b          10         TRUE    2 FALSE
 7:    b           4        FALSE    2 FALSE
 8:    c           4         TRUE    3 FALSE
 9:    c           5         TRUE    3 FALSE
10:    c           8         TRUE    3 FALSE

Data Table R: Merge selected columns from multiple data.table

Just change by = "ID" to by = c("ID", "FDR", "LogFC"), and the allow.cartesian argument should go inside the merge call.
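For illustration, a hypothetical dt.list of per-sample results (column names assumed):

library(data.table)
dt.list <- list(
  data.table(ID = c("g1", "g2"), FDR = c(0.01, 0.20), LogFC = c(1.5, -0.3), s1 = 1:2),
  data.table(ID = c("g1", "g3"), FDR = c(0.01, 0.05), LogFC = c(1.5, 2.0), s2 = 3:4)
)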

DT.comb <- Reduce(function(...) merge.data.table(..., by = c("ID", "FDR", "LogFC"),
                                                 all = TRUE, allow.cartesian = TRUE),
                  dt.list)

R data.table join two tables and keep all rows

This is a cross join: assign a new helper key to both tables, then merge on it.
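Assuming two small tables consistent with the output below (hypothetical):

library(data.table)
DT1 <- data.table(ID_1 = 1:2, val_1 = 1:2)
DT2 <- data.table(ID_2 = 3:4, val_2 = 3:4)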

DT1$Key <- 1  # same constant key in both tables
DT2$Key <- 1
DT3 <- merge(DT1, DT2, by = "Key", allow.cartesian = TRUE)  # allow.cartesian avoids the safety error on larger tables
DT3           # DT3$Key <- NULL removes the helper key afterwards
   Key ID_1 val_1 ID_2 val_2
1:   1    1     1    3     3
2:   1    1     1    4     4
3:   1    2     2    3     3
4:   1    2     2    4     4

Which data.table syntax for left join (one column) to prefer

I prefer the "update join" idiom for efficiency and maintainability**:

DT[WHERE, v := FROM[.SD, on=, x.v]]

It's an extension of what is shown in vignette("datatable-reference-semantics") under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.

This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].

It makes my code more readable since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.
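For instance, to fill only the missing values of v (a sketch; the table names and the join column id are placeholders):

DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]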


Compared with the other update join approach DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"),
  ord = 1:4,
  comment = c("big", "slow", "nice", "nooice")
)

# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu)
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu)
# Detected that j uses these columns: comment,i.comment
# Assigning to 4 row subset of 10 rows

# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
# Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)

    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>

In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative: it says there are four rows being updated but doesn't say that one row is being updated twice.


I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.

** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.
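These arguments can also resolve the duplicate-match situation shown above; for example, mult = "first" keeps only the first match per row (a sketch continuing the example above):

A[, comment3 := B2[A, on = .(id), mult = "first", x.comment]]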

Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row

Another related example: Left join using data.table

Join two tables with common column names but no related data

I think what you are looking for is a UNION query, like below:

select userid, username, incomeid, incomeamount, null as ExpenseID, null as expenseAmount
from table1
union
select userid, username, null as incomeid, null as incomeamount, ExpenseID, expenseAmount
from table2

R: Group and merge datatables without resulting in dataframe

Update:

Here are two ways to go from your dt1 and dt2 to your dt:
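(Here are dt1 and dt2 reconstructed from the outputs below; treat this as an assumption:)

library(data.table)
dt1 <- data.table(prm1 = rep(1:3, each = 2), prm2 = rep(1:3, each = 2),
                  obs1 = 1:6, obs2 = 7:12)
dt2 <- data.table(prm1 = rep(1:3, each = 2), prm2 = rep(1:3, each = 2),
                  obs3 = 13:18, obs4 = 19:24)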

cbind(dt1, dt2[, 3:4])
#    prm1 prm2 obs1 obs2 obs3 obs4
# 1:    1    1    1    7   13   19
# 2:    1    1    2    8   14   20
# 3:    2    2    3    9   15   21
# 4:    2    2    4   10   16   22
# 5:    3    3    5   11   17   23
# 6:    3    3    6   12   18   24

That is obviously very sensitive to the number and order of rows, and will break without much effort.

An alternative is to add a "row-number within a group" (using your group assumption), and include that in the join parameters.

dt1[, n := seq_len(.N), by = .(prm1, prm2)]
dt2[, n := seq_len(.N), by = .(prm1, prm2)]
merge(dt1, dt2, by = c("prm1", "prm2", "n"))
#    prm1 prm2 n obs1 obs2 obs3 obs4
# 1:    1    1 1    1    7   13   19
# 2:    1    1 2    2    8   14   20
# 3:    2    2 1    3    9   15   21
# 4:    2    2 2    4   10   16   22
# 5:    3    3 1    5   11   17   23
# 6:    3    3 2    6   12   18   24

This relies entirely on the inference that the row number within a prm1/prm2 group is meaningful.

If neither of these works with the real data, then either (a) there is a bit of post-merge filtering that needs to be done (contextual, so I don't know), or (b) we have a problem. The problem with the same merge but without n is that each group has more than one row on either or both sides, meaning there will be a cartesian expansion.


Don't group_by your data; doing so does two things:

  1. Changing the class from data.table to grouped_df, tbl_df. This is because group_by is a dplyr function that operates by adding an attribute to the frame that indicates which rows belong to each group. Only tbl_df-based functions honor this attribute, so it needs to change the class from data.table to tbl_df (with grouped_df, etc).

  2. It is doing something for no reason, so it is wasting time (though really not much). The theory behind the merge/join is that the frames will be column-wise combined based on the keys. Your "grouping" is intended (I think) to ensure that only matching prm* variables are joined, when in fact the way to think of it is in the parlance of joins, typically one of: full, semi, left, right, inner, or anti. All of these are natively supported in dplyr with well-named functions; most are enabled directly in base::merge and data.table::merge.

    Two links are really good at explaining and visualizing what is going on with the various types of merge: How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272.

The default behavior of base::merge (and, though not documented as such, data.table::merge as well) in the absence of an explicit by= argument (or by.x/by.y) is to infer the columns based on intersect(names(x), names(y)), which is very often the desired behavior. I discourage this in programmatic use, though, as it can lead to mistakes when data is not always shaped/named perfectly. (The dplyr join verbs all provide messages when inference is made.)

If we start with your original not-grouped (and therefore still-data.table) dt1 and dt2 objects, then we should be able to do one of the following, preserving the data.table class:

# inferential "by" columns, not great
merge(dt1, dt2)

# default behavior, now explicit
merge(dt1, dt2, by = intersect(names(dt1), names(dt2)))

# slightly better: it will error if any prm columns are in x and not y, our assumption
merge(dt1, dt2, by = grep("^prm", names(dt1), value = TRUE))

# data.table-esque, same left join (no full join in this format)
dt1[dt2, on = intersect(names(dt1), names(dt2))]

One good reference for data.table joins: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html.


