How to Merge Two Data.Table by Different Column Names

How to merge two data.table by different column names?

OUTDATED


Assuming the keys are set on the join columns (setkey(X, id); setkey(Y, ID)), use this join:

X[Y]
# area id value price sales
# 1: US c001 100 500 20
# 2: UK c002 200 200 30
# 3: EU c003 300 400 15

or this operation:

Y[X]
# ID price sales area value
# 1: c001 500 20 US 100
# 2: c002 200 30 UK 200
# 3: c003 400 15 EU 300
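For comparison, the same differently-named-key join can be sketched in pandas, with X and Y reconstructed from the outputs above:

```python
import pandas as pd

# X and Y reconstructed from the join outputs shown above
X = pd.DataFrame({"area": ["US", "UK", "EU"],
                  "id": ["c001", "c002", "c003"],
                  "value": [100, 200, 300]})
Y = pd.DataFrame({"ID": ["c001", "c002", "c003"],
                  "price": [500, 200, 400],
                  "sales": [20, 30, 15]})

# The differently named key columns are given via left_on / right_on
merged = pd.merge(X, Y, left_on="id", right_on="ID")
```

Unlike X[Y], pd.merge keeps both key columns (id and ID); drop one of them if you don't need it.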

Edit: after you edited your question, I read Section 1.12 of the FAQ ("What is the difference between X[Y] and merge(X, Y)?"), which led me to check out ?merge, where I discovered that there are two different merge methods depending on the class of the objects involved. The default is merge.data.frame, but data.table ships merge.data.table. Compare

merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") :
# A non-empty vector of column names for `by` is required.

with

merge.data.frame(X, Y, by.x = "id", by.y = "ID")
# id area value price sales
# 1 c001 US 100 500 20
# 2 c002 UK 200 200 30
# 3 c003 EU 300 400 15

Edit: for completeness, based on a comment by @Michael Bernsteiner, it looks like the data.table team is planning to implement by.x and by.y in merge.data.table, but hasn't done so yet. (Later data.table versions do support by.x and by.y in merge.)

R merging tables, with different column names and retaining all columns

Yes, that's possible:

second[first, on=c(i2="index", t2="type"), nomatch=0L, .(i2, t2, index, type, value, i.value)]

i2 t2 index type value i.value
1: a 1 a 1 5 3
2: a 2 a 2 6 4
3: b 3 b 3 7 5
4: c 5 c 5 9 7
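A pandas equivalent for comparison: first and second below are reconstructed from the result above (any unmatched rows dropped by nomatch=0L are omitted), and the suffix _first plays the role of data.table's i. prefix:

```python
import pandas as pd

# Reconstructed from the join result shown above
first = pd.DataFrame({"index": ["a", "a", "b", "c"],
                      "type": [1, 2, 3, 5],
                      "value": [3, 4, 5, 7]})
second = pd.DataFrame({"i2": ["a", "a", "b", "c"],
                       "t2": [1, 2, 3, 5],
                       "value": [5, 6, 7, 9]})

# Inner join on the differently named columns; suffixes disambiguate
# the two value columns (value = second's, value_first = first's)
res = pd.merge(second, first,
               left_on=["i2", "t2"], right_on=["index", "type"],
               suffixes=("", "_first"))
```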

merging tables with different column names

Using data.table's subset-based joins along with the on= argument and nomatch=0L, this is simply:

DT2[DT1, on=c(col5="col2", col4="col3"), nomatch=0L]

See the secondary indices vignette for more.


Alternatively, if you have the data.tables keyed, you can skip the on= argument. But the solution above is the idiomatic one, as it retains the order of the original data.tables and makes it clear from the code alone which columns are being looked up.

setkey(DT1, col2, col3)
setkey(DT2, col5, col4)
DT2[DT1, nomatch=0L]
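The keyed variant has a rough pandas analogue: index both frames on the join columns and join on the index. DT1 and DT2 below are hypothetical; only the column names come from the answer above:

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above
DT1 = pd.DataFrame({"col1": [1, 2, 3],
                    "col2": ["a", "b", "c"],
                    "col3": [10, 20, 30]})
DT2 = pd.DataFrame({"col4": [10, 20, 40],
                    "col5": ["a", "b", "d"],
                    "col6": ["x", "y", "z"]})

# Index both frames on the join columns (the analogue of setkey),
# align the index names, and inner-join (the analogue of nomatch=0L)
left = DT2.set_index(["col5", "col4"])
right = DT1.set_index(["col2", "col3"]).rename_axis(["col5", "col4"])
out = left.join(right, how="inner")
```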


Merging two Datatables with different column names

The problem is probably caused by the fact that Merge uses the table's PrimaryKey to find an existing record to update; only if it can't find one does it add a new record. If that is the case, you should clear the PrimaryKey information retrieved when you filled the table through the data adapter:

dataTable1.PrimaryKey = Nothing
dataTable2.PrimaryKey = Nothing
dataTable1.Merge(dataTable2, false, MissingSchemaAction.Add)
....

Now Merge cannot find any matches, and thus every record in dataTable2 is added to dataTable1. However, I should warn you to keep an eye on the performance and correctness of other operations on dataTable1.

With no PrimaryKey set, updating and deleting rows could become a source of problems (if you perform those operations, of course).

Merge two large data.tables based on column name of one table and column value of the other without melting

Using set():

setkey(DT1, "ID")
setkey(DT2, "ID")

# for each value column of DT1, find the rows of DT2 that reference it
# and fill col_value with the values looked up from DT1 by ID
for (k in names(DT1)[-1]) {
  rows <- which(DT2[["col"]] == k)
  set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}

ID col col_value
1: A col1 1
2: A col4 13
3: B col2 6
4: B col3 10
5: C col1 3

Note: Setting the key up front speeds up the process but reorders the rows.
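The same melt-free lookup can be sketched in pandas by stacking DT1 into a Series keyed by (ID, column name) and reindexing it with DT2's (ID, col) pairs. The data below is hypothetical: the cells that appear in the output above are taken from it, the rest are filler.

```python
import pandas as pd

# Hypothetical wide DT1; the looked-up cells match the output above,
# the remaining cells are filler
DT1 = pd.DataFrame({"ID": ["A", "B", "C"],
                    "col1": [1, 5, 3],
                    "col2": [2, 6, 4],
                    "col3": [9, 10, 11],
                    "col4": [13, 14, 15]}).set_index("ID")
DT2 = pd.DataFrame({"ID": ["A", "A", "B", "B", "C"],
                    "col": ["col1", "col4", "col2", "col3", "col1"]})

# Stack DT1 into a Series keyed by (ID, column name), then look up
# every (ID, col) pair of DT2 in one vectorised reindex
lookup = DT1.stack()
keys = pd.MultiIndex.from_frame(DT2[["ID", "col"]])
DT2["col_value"] = lookup.reindex(keys).to_numpy()
```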

Merge two different dataframes on different column names

Well, if you declare column A as index, it works:

Both_DFs = pd.merge(df1.set_index('A'), df2.set_index('A'), how='left', left_index=True, right_index=True).dropna().reset_index()

This results in:

    A    B   C  BB   CC  DD
0  A1  123  K0  B0  121  D0
1  A1  345  K1  B0  121  D0
2  A3  146  K1  B3  345  D1

EDIT

You just needed:

Both_DFs = pd.merge(df1,df2, how='left',left_on=['A','B'],right_on=['A','CC']).dropna()

Which gives:

    A    B   C  BB   CC  DD
0  A1  121  K0  B0  121  D0
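A self-contained version of that fix, with hypothetical df1/df2 chosen to be consistent with the outputs shown:

```python
import pandas as pd

# Hypothetical inputs consistent with the outputs shown above
df1 = pd.DataFrame({"A": ["A1", "A1", "A3"],
                    "B": [121, 345, 146],
                    "C": ["K0", "K1", "K1"]})
df2 = pd.DataFrame({"A": ["A1", "A3"],
                    "BB": ["B0", "B3"],
                    "CC": [121, 345],
                    "DD": ["D0", "D1"]})

# Merge on A and on B == CC; unmatched rows get NaN and are dropped
both = pd.merge(df1, df2, how="left",
                left_on=["A", "B"], right_on=["A", "CC"]).dropna()
```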

Join data table with different number of rows and column names

(Edited after discussion in the comments)

A dplyr solution would be something like

library(dplyr)

bind_rows(a, b) %>%
  mutate(Fz = coalesce(FzR, FzL)) %>%
  select(Fz, limb, time) %>%
  group_by(limb) %>%
  mutate(time = (seq_along(Fz) - 1) * 0.001)

In this way the newly created variable time will be a sequence from 0 up to the number of rows for each limb, multiplied by 0.001, so consecutive values are one millisecond apart. For both limbs, L and R, time starts at 0.

Output

# A tibble: 18 x 3
# Groups: limb [2]
Fz limb time
<dbl> <chr> <dbl>
1 131. L 0
2 131. L 0.001
3 131. L 0.002
4 131. L 0.003
5 132. L 0.004
6 132. L 0.005
7 132. L 0.006
8 132. L 0.007
9 133. L 0.008
10 133. L 0.009
11 135. R 0
12 131. R 0.001
13 134. R 0.002
14 135. R 0.003
15 136. R 0.004
16 136. R 0.005
17 135. R 0.006
18 135. R 0.007
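For completeness, a pandas sketch of the same approach, with hypothetical inputs a and b mirroring the FzR/FzL structure: combine_first plays the role of coalesce, and groupby().cumcount() builds the per-limb time.

```python
import pandas as pd

# Hypothetical per-limb inputs with differently named force columns
a = pd.DataFrame({"FzR": [135.0, 131.0, 134.0], "limb": "R"})
b = pd.DataFrame({"FzL": [131.0, 131.0, 132.0], "limb": "L"})

df = pd.concat([a, b], ignore_index=True)       # bind_rows(a, b)
df["Fz"] = df["FzR"].combine_first(df["FzL"])   # coalesce(FzR, FzL)
df = df[["Fz", "limb"]]
# 0, 0.001, 0.002, ... seconds within each limb
df["time"] = df.groupby("limb").cumcount() * 0.001
```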

