data.table join (multiple) selected columns with new names
Updated answer based on Arun's recommendation:
cols_old <- c('i.a', 'i.b')
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]
you could also generate the cols_old
by doing:
paste0('i.', gsub('_new', '', cols_new, fixed = TRUE))
See history for the old answer.
How to merge two data.table by different column names?
OUTDATED
Use this operation:
X[Y]
# area id value price sales
# 1: US c001 100 500 20
# 2: UK c002 200 200 30
# 3: EU c003 300 400 15
or this operation:
Y[X]
# ID price sales area value
# 1: c001 500 20 US 100
# 2: c002 200 30 UK 200
# 3: c003 400 15 EU 300
Edit after you edited your question, I read Section 1.12 of the FAQ: "What is the didifference between X[Y] and merge(X,Y)?", which led me to checkout ?merge
and I discovered there are two different merge functions depending upon which package you are using. The default is merge.data.frame
but data.table uses merge.data.table
. Compare
merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") :
# A non-empty vector of column names for `by` is required.
with
merge.data.frame(X, Y, by.x = "id", by.y = "ID")
# id area value price sales
# 1 c001 US 100 500 20
# 2 c002 UK 200 200 30
# 3 c003 EU 300 400 15
Edit for completeness based upon a comment by @Michael Bernsteiner, it looks like the data.table
team is planning on implementing by.x
and by.y
into the merge.data.table
function, but hasn't done so yet.
R merging tables, with different column names and retaining all columns
Yes, that's possible:
second[first, on=c(i2="index", t2="type"), nomatch=0L, .(i2, t2, index, type, value, i.value)]
i2 t2 index type value i.value
1: a 1 a 1 5 3
2: a 2 a 2 6 4
3: b 3 b 3 7 5
4: c 5 c 5 9 7
Data Table R: Merge selected columns from multiple data.table
Just change the by = "ID"
to by = c("ID", "FDR", "logFC")
and the argument allow.cartesian
should be inside the merge
DT.comb <- Reduce(function(...) merge.data.table(...,
by= c("ID", "FDR", "LogFC"), all = TRUE, allow.cartesian=TRUE), dt.list)
Column name labelling in data.table joins
I ended up answering my own question.
data_table_tidy_join <- function(x,y, join_by){
x <- data.table(x)
y <- data.table(y)
# Determine single join names
single_join_names <- purrr::keep((stringr::str_split(join_by, "==|>=|<=")), ~length(.) == 1) %>% unlist()
# cols from x that won't require as matching in i
remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)
# names need to keep
x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
y_names_keep_raw <- names(y)
# cols that exist in both x and y, but not being equi joined on
cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]
#rename so indexing works
x_names_keep <- x_names_keep_raw
y_names_keep <- y_names_keep_raw
# give prefix to necessary vars
x_names_keep[x_names_keep %in% cols_rename_index] <- paste("x.",cols_rename_index, sep ="")
y_names_keep[y_names_keep %in% cols_rename_index] <- paste("i.",cols_rename_index, sep ="")
# implement data.table call, keeping required cols
joined_data <-
x[y, on = join_by,
mget(c(paste0("i.", y_names_keep_raw),paste0("x.", x_names_keep_raw))) %>% set_names(c(y_names_keep,x_names_keep)),
mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
as_tibble()
return(joined_data)
}
> x <- data.table(Id = c("A", "B", "C", "C"),
+ X1 = c(1L, 3L, 5L, 7L),
+ X2 = c(8L,12L,9L,18L),
+ XY = c("x2", "x4", "x6", "x8"))
>
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
ID Z1 Z2 X1 X2 XY
<chr> <int> <chr> <int> <int> <chr>
1 C 5 z5 5 9 x6
2 C 6 z6 5 9 x6
3 C 7 z7 5 9 x6
4 C 7 z7 7 18 x8
5 C 8 z8 5 9 x6
6 C 8 z8 7 18 x8
7 C 9 z9 5 9 x6
8 C 9 z9 7 18 x8
Select/Join multiple fields from different tables with same column name
After joining, you can use COALESCE
to get the non-null value from the table with a matching row.
$sql = "SELECT k.uaID, k.userID, k.key, k.appName,
COALESCE(d1.sector, d3.sector, '') AS sector,
COALESCE(d1.subSector, d3.subSector, '') AS subSector,
COALESCE(d1.topic, d2.topic, d3.topic, '') AS topic,
users.title, users.firstName, users.lastName, users.email
FROM keyTable AS k
LEFT OUTER JOIN dataTable01 AS d1 ON k.uaID = d1.uaID
LEFT OUTER JOIN dataTable02 AS d2 ON k.uaID = d2.uaID
LEFT OUTER JOIN dataTable03 AS d3 ON k.uaID = d3.uaID
LEFT OUTER JOIN users ON k.userID = users.userID
ORDER BY k.uaID";
Another way to merge the data from the datatablesNN
tables into the same column os tp use UNION
.
SELECT k.uaID, k.userID, k.key, k.appName, IFNULL(d.sector, '') AS sector, IFNULL(d.subSector, '') AS subSector, IFNULL(d.topic, '') AS topic,
u.title, u.firstName, u.lastName, u.email
FROM keyTable AS k
LEFT OUTER JOIN (
SELECT uaID, sector, subSector, topic
FROM dataTable01
UNION
SELECT uaID, NULL, NULL, topic
FROM datatable02
UNION
SELECT uaID, sector, subSector, topic
FROM datatable03) AS d
ON k.uaID = d.uaID
LEFT JOIN users AS u ON u.userID = k.userID
ORDER BY k.uaID
DEMO
Joining and selecting multiple tables and creating new column names
You simply JOIN the two tables, and do a group by on the extensionID. Also, add formulas to summarize and gather the info.
SELECT
`extensionID` AS `Extension`,
SUM(`Duration`) AS `Total Talk Time`,
COUNT(DISTINCT `UniqueCallID`) as `Total Calls`,
SUM(IF(`Answered` = 1,0,1)) AS `Missed Calls`
FROM `callSession` a
JOIN `callSum` b
ON a.`UniqueCallID` = b.`UniqueCallID`
GROUP BY a.`extensionID`
ORDER BY a.`extensionID`
Join in place multiple columns data.table
A possible solution with mget
:
cols <- colnames(b)[-1]
data[b,(cols) := mget(cols), on = "ref"][]
ref names radius height weight volume money
<int> <char> <num> <num> <num> <num> <num>
1: 1 a 0.9757437 -1.2441942 -0.19825867 0.7934811 0.6642228
2: 2 b 0.9752424 1.0112435 -1.27333411 0.9296093 -1.1942749
3: 3 c 0.6989610 0.8868158 -1.54361300 -0.4824725 -0.8368856
4: 4 d 0.4336850 0.8949459 1.45199656 -0.3262636 -0.5158351
5: 5 e 0.9989011 0.4237771 0.04365383 -1.4086310 0.7379102
6: 6 f 0.2640407 -2.0085267 -0.15621784 1.3278945 0.2616385
data.table joins - Select all columns in the i argument
How about constructing the j-expression
and just eval
'ing it?
nc = names(current)[-1L]
nn = paste0("i.", nc)
expr = lapply(nn, as.name)
setattr(expr, 'names', nc)
expr = as.call(c(quote(`:=`), expr))
> current[new[c(1,3)], eval(expr)]
> current
## id var var2
## 1: 1 11 11
## 2: 2 2 2
## 3: 3 13 13
## 4: 4 4 4
Joining tables based on different column names
Update: All the features listed below are implemented and is available in the current stable version of data.table v1.9.6
on CRAN.
There are at least these improvements possible for joins in data.tables.
merge.data.table
gainingby.x
andby.y
argumentsUsing secondary keys to join using both forms discussed above without need to set keys, but rather by specifying columns on
x
andi
.
The simplest reason is that we've not managed to get to it yet.
Related Topics
How to Get The R Shiny Downloadhandler Filename to Work
Rselenium on Docker: Where Are Files Downloaded
Initialize a List of Matrices in R
Summing Multiple Columns in an R Data-Frame Quickly
Prevent Selectinput from Wrapping Text
Conda Build R Package Fails at C Compiler Issue on Macos Mojave
Verify Object Existence Inside a Function in R
How to Plot Multiple Lines in R
Meaning of Error Using . Shorthand Inside Dplyr Function
Run R Interactively from Rscript
All Paths in Directed Tree Graph from Root to Leaves in Igraph R
Error in Xj[I]: Invalid Subscript Type 'List'
Find Max Per Group and Return Another Column
Standard Error of Variance Component from The Output of Lmer
How to Rotate 3D Plotly Continuous for R Shiny App