How to merge two data.table by different column names?
OUTDATED
Use this operation:
X[Y]
# area id value price sales
# 1: US c001 100 500 20
# 2: UK c002 200 200 30
# 3: EU c003 300 400 15
or this operation:
Y[X]
# ID price sales area value
# 1: c001 500 20 US 100
# 2: c002 200 30 UK 200
# 3: c003 400 15 EU 300
Edit: after you edited your question, I read Section 1.12 of the FAQ, "What is the difference between X[Y] and merge(X,Y)?", which led me to check out ?merge, where I discovered there are two different merge functions depending on which package you are using. The default is merge.data.frame, but data.table uses merge.data.table. Compare
merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") :
# A non-empty vector of column names for `by` is required.
with
merge.data.frame(X, Y, by.x = "id", by.y = "ID")
# id area value price sales
# 1 c001 US 100 500 20
# 2 c002 UK 200 200 30
# 3 c003 EU 300 400 15
Edit for completeness: based upon a comment by @Michael Bernsteiner, it looks like the data.table team is planning on implementing by.x and by.y in the merge.data.table function, but hasn't done so yet.
Merge two large data.tables based on column name of one table and column value of the other without melting
Using set():
setkey(DT1, "ID")
setkey(DT2, "ID")
for (k in names(DT1)[-1]) {
  rows <- which(DT2[["col"]] == k)
  set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}
ID col col_value
1: A col1 1
2: A col4 13
3: B col2 6
4: B col3 10
5: C col1 3
Note: Setting the key up front speeds up the process but reorders the rows.
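If the reordering is unwanted, one possible variant is to join with on= instead of setting keys. The sketch below is only an illustration: DT1 and DT2 are hypothetical tables reconstructed from the printed output above, and col_value is created up front so that set() only ever fills an existing column.

```r
library(data.table)

# Hypothetical inputs, guessed from the printed result above
DT1 <- data.table(ID = c("A", "B", "C", "D"),
                  col1 = 1:4, col2 = 5:8, col3 = 9:12, col4 = 13:16)
DT2 <- data.table(ID  = c("A", "A", "B", "B", "C"),
                  col = c("col1", "col4", "col2", "col3", "col1"))

# Pre-create the target column so set() only updates existing cells
DT2[, col_value := NA_integer_]

for (k in setdiff(names(DT1), "ID")) {
  rows <- which(DT2[["col"]] == k)
  # Look up column k of DT1 for the IDs found in those rows (no key needed)
  vals <- DT1[DT2[rows], on = "ID"][[k]]
  set(DT2, i = rows, j = "col_value", value = vals)
}
```

Because no key is set, DT2 keeps its original row order.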
R merging tables, with different column names and retaining all columns
Yes, that's possible:
second[first, on=c(i2="index", t2="type"), nomatch=0L, .(i2, t2, index, type, value, i.value)]
i2 t2 index type value i.value
1: a 1 a 1 5 3
2: a 2 a 2 6 4
3: b 3 b 3 7 5
4: c 5 c 5 9 7
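For anyone who wants to try this, here is a small self-contained sketch. The tables first and second are hypothetical, rebuilt from the output shown above plus one non-matching row. Inside j, the join columns of the table in i are available under their own names (index, type), and its clashing non-join column is reached with the i. prefix.

```r
library(data.table)

# Hypothetical inputs; the "d" row has no match and is dropped by nomatch = 0L
first  <- data.table(index = c("a", "a", "b", "c", "d"),
                     type  = c(1, 2, 3, 5, 9),
                     value = c(3, 4, 5, 7, 8))
second <- data.table(i2    = c("a", "a", "b", "c"),
                     t2    = c(1, 2, 3, 5),
                     value = c(5, 6, 7, 9))

# List every column explicitly in j to keep both sets of join columns
res <- second[first, on = c(i2 = "index", t2 = "type"), nomatch = 0L,
              .(i2, t2, index, type, value, i.value)]
```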
Merging 2, 1 row data.tables on column names
We get the intersecting column names (with intersect) and do the assignment
nm1 <- intersect(names(dt1), names(dt2))
dt1[, (nm1) := dt2]
Or we can set the key
setkeyv(dt1, intersect(names(dt1), names(dt2)))
out <- dt1[dt2]
for(j in seq_along(out)) set(out, i = which(is.na(out[[j]])), j=j, value = 0)
merging tables with different column names
Using data.table's subset-based joins along with the recently implemented on= argument and nomatch=0L, this is simply:
DT2[DT1, on=c(col5="col2", col4="col3"), nomatch=0L]
See the secondary indices vignette for more.
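A minimal reproducible sketch of that join, with made-up data (the column contents below are assumptions, only the column names come from the call above):

```r
library(data.table)

# Hypothetical tables: DT1 has col2/col3, DT2 has the matching col5/col4
DT1 <- data.table(col1 = c("x", "y", "z"),
                  col2 = c(1, 2, 3),
                  col3 = c("a", "b", "c"))
DT2 <- data.table(col4 = c("a", "b", "d"),
                  col5 = c(1, 2, 3),
                  col6 = c(10, 20, 30))

# Names on the left of "=" belong to DT2 (x), quoted names belong to DT1 (i)
res <- DT2[DT1, on = c(col5 = "col2", col4 = "col3"), nomatch = 0L]
```

The third row of DT1 (3, "c") has no partner in DT2, so nomatch=0L drops it and res has two rows.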
Alternatively, if you have keyed the data.tables, you can skip the on= argument. The solution above is the more idiomatic one, though: it retains the order of the original data.tables, and the code itself makes clear which columns are being looked up.
setkey(DT1, col2, col3)
setkey(DT2, col5, col4)
DT2[DT1, nomatch=0L]
See history for older versions.
merging data.tables based on columns names
Update: Since data.table v1.9.6 (released September 19, 2015), merge.data.table() accepts and nicely handles the arguments by.x= and by.y=. Here's an updated link to the FR (now closed) referenced below.
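As a quick check of the update, here is a small sketch with hypothetical tables X and Y (the values are invented for illustration; only the differing key names id and ID matter):

```r
library(data.table)

# Hypothetical tables whose key columns are named differently
X <- data.table(area  = c("US", "UK", "EU"),
                id    = c("c001", "c002", "c003"),
                value = c(100, 200, 300))
Y <- data.table(ID    = c("c001", "c002", "c003"),
                price = c(500, 200, 400),
                sales = c(20, 30, 15))

# Dispatches to merge.data.table; requires data.table >= 1.9.6
res <- merge(X, Y, by.x = "id", by.y = "ID")
```

The result keeps the key column under the name given in by.x (id).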
Yes this is a feature request not yet implemented :
FR#2033 Add by.x and by.y to merge.data.table
There isn't anything preventing it. Just something that wasn't done. I very rarely need merge and was slow to realise its usefulness more generally. We've made good progress in making merge as fast as X[Y], and this feature request is at the highest priority. If you'd like it more quickly, you are more than welcome to add those arguments to merge.data.table and commit the change yourself. We try to keep source code short and together in one function/file, so by looking at the merge.data.table source hopefully you can follow it and see what needs to be done.
Joining tables based on different column names
Update: All the features listed below are implemented and available in the current stable version of data.table, v1.9.6, on CRAN.
There are at least these improvements possible for joins in data.tables:
1. merge.data.table gaining by.x and by.y arguments.
2. Using secondary keys to join using both forms discussed above without the need to set keys, but rather by specifying columns on x and i.
The simplest reason is that we've not managed to get to it yet.
merge two datasets same column name into one
You can use coalesce to merge the a values in the dataframe.
library(dplyr)
table1 %>%
  inner_join(table2, by = c('id', 'date')) %>%
  mutate(a = coalesce(a.x, a.y)) %>%
  select(-a.x, -a.y)
# id date b a
#1 1 2020-01-01 0.10 1
#2 2 2020-01-02 0.20 4
#3 3 2020-01-04 0.30 3
#4 4 2020-01-25 0.25 5
In base R that would be -
transform(merge(table1, table2, by = c('id', 'date')),
a = ifelse(is.na(a.x), a.y, a.x))[names(table2)]
data
It is easier to help if you provide data in a reproducible format -
table1 <- structure(list(id = 1:4, date = c("2020-01-01", "2020-01-02",
"2020-01-04", "2020-01-25"), a = c(1L, 4L, 3L, NA)), row.names = c(NA,
-4L), class = "data.frame")
table2 <- structure(list(id = 1:4, date = c("2020-01-01", "2020-01-02",
"2020-01-04", "2020-01-25"), a = c(NA, NA, NA, 5L), b = c(0.1,
0.2, 0.3, 0.25)), row.names = c(NA, -4L), class = "data.frame")
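Since the rest of this page is about data.table, the same idea can also be sketched there. This is an assumed equivalent, not part of the original answer; it relies on fcoalesce(), which needs data.table 1.12.4 or later, and the names t1 and t2 are introduced here only for illustration.

```r
library(data.table)

# Same data as above, built directly as data.tables (t1, t2 are illustrative names)
t1 <- data.table(id = 1:4,
                 date = c("2020-01-01", "2020-01-02", "2020-01-04", "2020-01-25"),
                 a = c(1L, 4L, 3L, NA))
t2 <- data.table(id = 1:4,
                 date = c("2020-01-01", "2020-01-02", "2020-01-04", "2020-01-25"),
                 a = c(NA, NA, NA, 5L),
                 b = c(0.1, 0.2, 0.3, 0.25))

# The clashing column a comes back as a.x and a.y; take the first non-NA of the two
res <- merge(t1, t2, by = c("id", "date"))
res[, a := fcoalesce(a.x, a.y)]
res[, c("a.x", "a.y") := NULL]
```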
Merging two Datatables with different column names
The problem is probably caused by the fact that Merge uses the PrimaryKey of the table to find an existing record to update and, if it can't find one, adds the new record. If this is the case, you should remove the PrimaryKey info that was retrieved when you filled the table through the data adapter.
dataTable1.PrimaryKey = Nothing
dataTable2.PrimaryKey = Nothing
dataTable1.Merge(dataTable2, false, MissingSchemaAction.Add)
....
Now Merge cannot find any matches, so every record in dataTable2 is added to dataTable1. However, I should warn you to keep an eye on the performance and correctness of other operations on dataTable1.
With no PrimaryKey set, updating and deleting rows could become a source of problems (if you perform these operations, of course).