How to do a data.table merge operation
You are quoting the wrong part of documentation. If you have a look at the doc of [.data.table
you will read:
When i is a data.table, x must have a
key, meaning join i to x and return
the rows in x that match. An equi-join
is performed between each column in i
to each column in x’s key in order.
This is similar to base R
functionality of sub- setting a matrix
by a 2-column matrix, and in higher
dimensions subsetting an n-dimensional
array by an n-column matrix
I admit the description of the package (the part you quoted) is somewhat confusing, because it seems to say that the "["-operation can be used instead of merge. But I think what it says is: if x and y are both data.tables we use a join on an index (which is invoked like merge) instead of binary search.
One more thing:
The data.table library I installed via install.packages
was missing the merge.data.table method
, so using merge
would call merge.data.frame
. After installing the package from R-Forge R used the faster merge.data.table
method.
You can check if you have the merge.data.table method by checking the output of:
methods(generic.function="merge")
EDIT [Answer no longer valid]: This answer refers to data.table version 1.3. In version 1.5.3 the behaviour of data.table changed and x[y] returns the expected results. Thank you Matthew Dowle, author of data.table, for pointing this out in the comments.
Chaining multiple data.table::merge operations with data.tables
Multiple data.table joins with the on
argument can be chained. Note that without an update operator (":=") in j, this would be a right join, but with ":=" (i.e., adding columns), this becomes a left outer join. A useful post on left joins here Left join using data.table.
Example using example data above with a subset between joins:
dt4 <- dt1[dt2, on="food", `:=`(status = i.status)][
food == "apples"][dt3, on="food", rank := i.rank]
##> dt4
## food quantity status rank
##1: apples 1 good okay
Example adding new column between joins
dt4 <- dt1[dt2, on="food", `:=`(status = i.status)][
, new_col := NA][dt3, on="food", rank := i.rank]
##> dt4
## food quantity status new_col rank
##1: apples 1 good NA okay
##2: bananas 2 bad NA good
##3: carrots 3 rotten NA better
##4: dates 4 raw NA best
Example using merge
and magrittr pipes:
dt4 <- merge(dt1, dt2, by = "food") %>%
set( , "new_col", NA) %>%
merge(dt3, by = "food")
##> dt4
## food quantity status new_col rank
##1: apples 1 good NA okay
##2: bananas 2 bad NA good
##3: carrots 3 rotten NA better
##4: dates 4 raw NA best
R data.table: How can I merge a list of data.tables?
We can use join on
library(data.table)
na.omit(Reduce(function(x, y) x[y, on = .(V1)], dtl))
A merge indicator for R data.table?
You can use merge.data.table
with all=TRUE
for a full outer join:
library(data.table)
setDT(df1)
setDT(df2)
DT <- merge(df1[, r1 := .I], df2[, r2 := .I], by="key", all=TRUE)
DT[, merge_ := "both"][
is.na(r1), merge_ := "right_only"][
is.na(r2), merge_ := "left_only"]
output:
key v1 r1 v2 r2 merge_
1: a 1 1 4 1 both
2: b 2 2 5 2 both
3: c 3 3 NA NA left_only
4: d NA NA 6 3 right_only
data:
key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key=key1,v1)
df2 = data.frame(key=key2,v2)
As mentioned by Michael Chirico, with data.table_1.13.0
released on Jul 24, 2020, one can also use fcase
as follows:
DT[, merge_ := fcase(
is.na(r1), "right_only",
is.na(r2), "left_only",
default = "both"
)]
Data.Table Full Outer Join in R without key
What you are looking for is a cross join
setkey(dt1[,c(k=1,.SD)],k) [ dt2[,c(k=1,.SD)], allow.cartesian=TRUE ][,k:=NULL]
Merging two sets of data by data.table roll='nearest' function
Here is a step-by-step example based on the sample data you give:
# Sample data
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)
# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]
unique(set_merged[order(ID_b)], by = "time")
# ID_b b1 b2 source time_b time ID_a a1 a2 a3
# 1: 2 34.2 15.114 set1.csv.1 20.35750 20.01000 8 85640 5274.1 301.6041
# 2: 7 67.2 16.114 set1.csv.2 21.35778 21.00972 7 85697 5345.2 301.6043
# 3: 12 12.2 33.114 set1.csv.3 22.35806 22.00972 4 65694 9375.2 301.6049
# 4: 17 73.2 67.114 set2.csv.1 23.35833 23.00972 3 85694 9278.9 301.6051
# 5: 23 88.2 42.114 set2.csv.2 19.35861 19.00972 5 85653 4375.5 301.6047
# 6: 28 90.2 52.114 set3.csv.1 0.35889 0.00944 2 35694 5245.2 301.6053
# time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6: 0.00944
Two comments:
- We create a new
time
column to avoid losing one of the original time columns fromset_A
andset_B
. You can always remove thetime
column after the join if required. - We use
unique
to remove duplicatedtime
rows by order ofID_b
. You mention in your post that "it doesn't really matter which row will be merged" but in case that you do want to retain specific rows, you might need to adjust this line of code.
Update (thanks to @Henrik)
As @Henrik pointed out, what you're after is actually a rolling join of set_A
with respect to set_B
, in which case you don't need to deal with the duplicate rows.
That translates to
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
# ID_a a1 a2 a3 time_a time ID_b b1 b2 source
#1: 2 35694 5245.2 301.6053 0.00944 0.35889 28 90.2 52.114 set3.csv.1
#2: 3 85694 9278.9 301.6051 23.00972 23.35833 17 73.2 67.114 set2.csv.1
#3: 5 85653 4375.5 301.6047 19.00972 19.35861 23 88.2 42.114 set2.csv.2
#4: 6 12694 5236.3 301.6045 22.00972 22.35806 12 12.2 33.114 set1.csv.3
#5: 7 85697 5345.2 301.6043 21.00972 21.35778 7 67.2 16.114 set1.csv.2
#6: 9 30694 5279.0 301.6039 20.01000 20.35750 2 34.2 15.114 set1.csv.1
# time_b
#1: 0.35889
#2: 23.35833
#3: 19.35861
#4: 22.35806
#5: 21.35778
#6: 20.35750
Sample data
set_A <- read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = T)
set_B <- read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = T)
Pandas Merge Operation on two different dataTable?
I think problem is columns EBELN
and BSTNK
has different dtypes
in both DataFrames
so no match.
Check it by:
print (sellOrder['EBELN'].dtype)
print (purchaseOrder['BSTNK'].dtype)
So need convert both columns to int
or both to str
.
sellOrder['EBELN'] = sellOrder['EBELN'].astype(int)
purchaseOrder['BSTNK'] = purchaseOrder['BSTNK'].astype(int)
Or:
sellOrder['EBELN'] = sellOrder['EBELN'].astype(str)
purchaseOrder['BSTNK'] = purchaseOrder['BSTNK'].astype(str)
Merge two large data.tables based on column name of one table and column value of the other without melting
Using set()
:
setkey(DT1, "ID")
setkey(DT2, "ID")
for (k in names(DT1)[-1]) {
rows <- which(DT2[["col"]] == k)
set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}
ID col col_value
1: A col1 1
2: A col4 13
3: B col2 6
4: B col3 10
5: C col1 3
Note: Setting the key up front speeds up the process but reorders the rows.
data.table merge on partial match of different columns in R
Given the dimension of the problem (DT1 [(1:50,000), (1:25)] - DT2[(1:50,000,000), (1:55)]), it is probably infeasible to do a CJ of the IDs before doing a two-way grepl
.
Breaking down the different kind of matches/approx. matches, we can 1) first look for exact matches, 2) then approx. matches where substring in DT1 can be found in DT2 and then, 3) vice versa.
Finally, we row bind all the results and do a left join between original DT2 and the row-binded results to get desired output.
exactMatches <- DT1[DT2, on=c("ID1"="ID2"), nomatch=0L][,
ID2 := ID1]
substr1in2 <- DT2[, c(.SD, DT1[grepl(ID2, ID1) & ID1 != ID2]),
by=1:DT2[,.N]][!is.na(VAL1), -1L]
substr2in1 <- DT1[, c(.SD, DT2[grepl(ID1, ID2) & ID2 != ID1]),
by=1:DT1[,.N]][!is.na(VAL2), -1L]
binded <- rbindlist(list(exactMatches, substr1in2, substr2in1),
use.names=TRUE, fill=TRUE)
binded[DT2, on=.(ID2, VAL2)]
output:
ID1 VAL1 VAL2 ID2
1: BA 4 0 BA
2: ABC 5 5 ABC
3: ABC BC 3 5 ABC
4: AB 2 5 ABC
5: <NA> NA 2 DC
6: AA 1 7 AA
7: AB 2 1 AB
8: ABC 5 1 AB
9: ABC BC 3 1 AB
10: AB 2 0 R AB
I changed some of the column names to make the code more readable. Data:
DT1 <- data.table(ID1 = c("AA", "BA", "ABC", "ABC BC", "AB"),
VAL1 = c(1,4,5,3,2))
DT2 <- data.table(VAL2 = c(0,5,2,7,1,0),
ID2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))
Related Topics
Shift Legend into Empty Facets of a Faceted Plot in Ggplot2
How to Automatically Include All 2-Way Interactions in a Glm Model in R
How to Add an Inset (Subplot) to "Topright" of an R Plot
Does Roxygen2 Automatically Write Namespace Directives for "Imports:" Packages
Suppress Messages Displayed by "Print" Instead of "Message" or "Warning" in R
Using Legend with Stat_Function in Ggplot2
Number Formatting Axis Labels in Ggplot2
How to Extract Substring Between Patterns "_" and "." in R
Installation of Rodbc on Os X Yosemite
How to Remove Columns with Same Value in R
Get the Column Number in R Given the Column Name
R Not Finding Package Even After Package Installation
How to Compute Roc and Auc Under Roc After Training Using Caret in R
R List Files with Multiple Conditions
How Make 2 Column Layout in R Markdown When Rendering PDF
Using ':=' in Data.Table to Sum the Values of Two Columns in R, Ignoring Nas