Merging data.tables Based on Column Names

How to merge two data.table by different column names?

OUTDATED

Use this operation:

X[Y]
#    area   id value price sales
# 1:   US c001   100   500    20
# 2:   UK c002   200   200    30
# 3:   EU c003   300   400    15

or this operation:

Y[X]
#      ID price sales area value
# 1: c001   500    20   US   100
# 2: c002   200    30   UK   200
# 3: c003   400    15   EU   300
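The two joins above rely on both tables being keyed on their id columns, a step the answer does not show. A minimal sketch of that setup follows. The contents of X and Y are reconstructed from the printed output, and the key choice (id on X, ID on Y) is an assumption.

```r
library(data.table)

# Tables reconstructed from the printed output above (assumed contents)
X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

# Key each table on its own id column; the key column names may differ
setkey(X, id)
setkey(Y, ID)

XY <- X[Y]  # look up the rows of Y in X; columns of X come first
YX <- Y[X]  # look up the rows of X in Y; columns of Y come first
```

The keys only need to be compatible in type and position, which is why X[Y] works here even though the columns are named id and ID.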

Edit: after you edited your question, I read Section 1.12 of the FAQ, "What is the difference between X[Y] and merge(X,Y)?", which led me to check out ?merge. I discovered there are two different merge functions, depending on which package you are using. The default is merge.data.frame, but data.table uses merge.data.table. Compare

merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") : 
# A non-empty vector of column names for `by` is required.

with

merge.data.frame(X, Y, by.x = "id", by.y = "ID")
#     id area value price sales
# 1 c001   US   100   500    20
# 2 c002   UK   200   200    30
# 3 c003   EU   300   400    15

Edit for completeness: based on a comment by @Michael Bernsteiner, it looks like the data.table team is planning to implement by.x and by.y in the merge.data.table function, but hasn't done so yet.

Merge two large data.tables based on column name of one table and column value of the other without melting

Using set():

setkey(DT1, "ID")
setkey(DT2, "ID")
for (k in names(DT1)[-1]) {
  rows <- which(DT2[["col"]] == k)
  set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}

   ID  col col_value
1:  A col1         1
2:  A col4        13
3:  B col2         6
4:  B col3        10
5:  C col1         3

Note: Setting the key up front speeds up the process but reorders the rows.
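If the row order of DT2 must be preserved, one alternative (not the approach above, and not taken from the original answer) is a matrix lookup with paired row and column indices. This is only a sketch: the contents of DT1 and DT2 are assumed, chosen to be consistent with the printed result.

```r
library(data.table)

# Assumed example data, consistent with the printed result above
DT1 <- data.table(ID = c("A", "B", "C"),
                  col1 = 1:3, col2 = 5:7, col3 = 9:11, col4 = 13:15)
DT2 <- data.table(ID  = c("A", "A", "B", "B", "C"),
                  col = c("col1", "col4", "col2", "col3", "col1"))

# Value columns of DT1 as a matrix, so one (row, column) pair selects one cell
value_cols <- setdiff(names(DT1), "ID")
m <- as.matrix(DT1[, value_cols, with = FALSE])

# Row index comes from ID, column index from the column name stored in DT2$col
idx <- cbind(match(DT2$ID, DT1$ID), match(DT2$col, value_cols))

# Assign by reference; no keys are set, so DT2 keeps its original row order
DT2[, col_value := m[idx]]
```

This assumes all value columns share one type, since as.matrix() would otherwise coerce them to a common type.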

R merging tables, with different column names and retaining all columns

Yes, that's possible:

second[first, on=c(i2="index", t2="type"), nomatch=0L, .(i2, t2, index, type, value, i.value)]

   i2 t2 index type value i.value
1:  a  1     a    1     5       3
2:  a  2     a    2     6       4
3:  b  3     b    3     7       5
4:  c  5     c    5     9       7

Merging 2, 1 row data.tables on column names

We get the intersecting column names and do the assignment:

nm1 <- intersect(names(dt1), names(dt2))
dt1[, (nm1) := dt2]

Or we can set the key

setkeyv(dt1, intersect(names(dt1), names(dt2)))
out <- dt1[dt2]
for(j in seq_along(out)) set(out, i = which(is.na(out[[j]])), j=j, value = 0)

merging tables with different column names

Using data.table's subset based joins along with the recently implemented on= argument and nomatch=0L, this is simply:

DT2[DT1, on=c(col5="col2", col4="col3"), nomatch=0L]

See the secondary indices vignette for more.
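To make the mapping in on= concrete, here is a small sketch with made-up data. The column names follow the answer, but the values are hypothetical. In on=, each name refers to a column of the outer table (DT2) and each value to a column of the table passed as i (DT1).

```r
library(data.table)

# Hypothetical data: DT1 is matched on (col2, col3), DT2 on (col5, col4)
DT1 <- data.table(col1 = c("x", "y", "z"),
                  col2 = c(1L, 2L, 3L),
                  col3 = c("a", "b", "c"))
DT2 <- data.table(col4 = c("a", "b", "q"),
                  col5 = c(1L, 2L, 3L),
                  col6 = c(10, 20, 30))

# names = columns of DT2, values = columns of DT1;
# nomatch = 0L drops rows of DT1 that have no match in DT2 (inner join)
res <- DT2[DT1, on = c(col5 = "col2", col4 = "col3"), nomatch = 0L]
```

With these values only the first two rows of DT1 find a partner, because the third pair (3, "c") does not occur in DT2.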

Alternatively, if the data.tables are keyed, you can skip the on= argument. However, the solution above is more idiomatic: it retains the row order of the original data.tables, and the code makes clear which columns are being looked up.

setkey(DT1, col2, col3)
setkey(DT2, col5, col4)
DT2[DT1, nomatch=0L]


merging data.tables based on columns names

Update: Since data.table v1.9.6 (released September 19, 2015), merge.data.table() does accept and nicely handles arguments by.x= and by.y=. Here's an updated link to the FR (now closed) referenced below.
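As a quick illustration of the by.x= and by.y= arguments mentioned in the update, the following sketch merges two small tables whose id columns are named differently. The table contents are assumed for illustration, and data.table v1.9.6 or later is required.

```r
library(data.table)

# Assumed example tables; the join columns are named id and ID
X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

# Dispatches to merge.data.table because X is a data.table;
# the join column in the result keeps the name from X
merged <- merge(X, Y, by.x = "id", by.y = "ID")
```

No keys need to be set beforehand, and neither input table is modified.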

Yes, this is a feature request that has not yet been implemented:

FR#2033 Add by.x and by.y to merge.data.table

There isn't anything preventing it. Just something that wasn't done. I very rarely need merge and was slow to realise its usefulness more generally. We've made good progress in bringing merge performance as fast as X[Y], and this feature request is at the highest priority. If you'd like it more quickly you are more than welcome to add those arguments to merge.data.table and commit the change yourself. We try to keep source code short and together in one function/file, so by looking at merge.data.table source hopefully you can follow it and see what needs to be done.

Joining tables based on different column names

Update: All the features listed below are implemented and available in the current stable version of data.table, v1.9.6, on CRAN.

There are at least these improvements possible for joins in data.tables.

merge.data.table gaining by.x and by.y arguments
Using secondary keys to join using both forms discussed above without need to set keys, but rather by specifying columns on x and i.

The simplest reason is that we've not managed to get to it yet.

merge two datasets same column name into one

You can use coalesce() to combine the values of column a from the two data frames.

library(dplyr)

table1 %>%
  inner_join(table2, by = c('id', 'date')) %>%
  mutate(a = coalesce(a.x, a.y)) %>%
  select(-a.x, -a.y)

#  id       date    b a
#1  1 2020-01-01 0.10 1
#2  2 2020-01-02 0.20 4
#3  3 2020-01-04 0.30 3
#4  4 2020-01-25 0.25 5

In base R that would be -

transform(merge(table1, table2, by = c('id', 'date')), 
                a = ifelse(is.na(a.x), a.y, a.x))[names(table2)]
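For readers staying within data.table, roughly the same result can be sketched with merge() and fcoalesce(). This is not part of the original answer; it assumes data.table v1.12.4 or later (where fcoalesce() is available) and uses the same example values as the data section.

```r
library(data.table)

# Same example values as the data frames in this answer, built as data.tables
table1 <- data.table(id = 1:4,
                     date = c("2020-01-01", "2020-01-02", "2020-01-04", "2020-01-25"),
                     a = c(1L, 4L, 3L, NA))
table2 <- data.table(id = 1:4,
                     date = c("2020-01-01", "2020-01-02", "2020-01-04", "2020-01-25"),
                     a = c(NA, NA, NA, 5L),
                     b = c(0.1, 0.2, 0.3, 0.25))

# The shared non-key column a gets the default suffixes .x and .y
merged <- merge(table1, table2, by = c("id", "date"))

# Take a.x where it is present, otherwise a.y, then drop the suffixed columns
merged[, a := fcoalesce(a.x, a.y)]
merged[, c("a.x", "a.y") := NULL]
```

fcoalesce() requires its arguments to have the same type, which holds here because both a columns are integer.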

data

It is easier to help if you provide data in a reproducible format -

table1 <- structure(list(id = 1:4, date = c("2020-01-01", "2020-01-02", 
"2020-01-04", "2020-01-25"), a = c(1L, 4L, 3L, NA)), row.names = c(NA, 
-4L), class = "data.frame")

table2 <- structure(list(id = 1:4, date = c("2020-01-01", "2020-01-02", 
"2020-01-04", "2020-01-25"), a = c(NA, NA, NA, 5L), b = c(0.1, 
0.2, 0.3, 0.25)), row.names = c(NA, -4L), class = "data.frame")

Merging two Datatables with different column names

The problem is probably caused by the fact that Merge uses the PrimaryKey of the table to find an existing record to update and, if it can't find one, adds the new record. If this is the case, you should disable the PrimaryKey info retrieved when you filled the table through the data adapter.

dataTable1.PrimaryKey = Nothing
dataTable2.PrimaryKey = Nothing
dataTable1.Merge(dataTable2, false, MissingSchemaAction.Add)
....

Now Merge cannot find the matches, and thus every record in dataTable2 is added to dataTable1. However, I should warn you to keep an eye on the performance and correctness of other operations on dataTable1.

There is now no PrimaryKey set, and this could be a source of problems when updating or deleting a row (if you have these operations, of course).