Data.Table Join (Multiple) Selected Columns with New Names

data.table join (multiple) selected columns with new names

Updated answer based on Arun's recommendation:

cols_old <- c('i.a', 'i.b')
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]

you could also generate the cols_old by doing:

paste0('i.', gsub('_new', '', cols_new, fixed = TRUE))

See history for the old answer.

How to merge two data.table by different column names?

OUTDATED


Use this operation:

X[Y]
# area id value price sales
# 1: US c001 100 500 20
# 2: UK c002 200 200 30
# 3: EU c003 300 400 15

or this operation:

Y[X]
# ID price sales area value
# 1: c001 500 20 US 100
# 2: c002 200 30 UK 200
# 3: c003 400 15 EU 300

Edit after you edited your question, I read Section 1.12 of the FAQ: "What is the didifference between X[Y] and merge(X,Y)?", which led me to checkout ?merge and I discovered there are two different merge functions depending upon which package you are using. The default is merge.data.frame but data.table uses merge.data.table. Compare

merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") :
# A non-empty vector of column names for `by` is required.

with

merge.data.frame(X, Y, by.x = "id", by.y = "ID")
# id area value price sales
# 1 c001 US 100 500 20
# 2 c002 UK 200 200 30
# 3 c003 EU 300 400 15

Edit for completeness based upon a comment by @Michael Bernsteiner, it looks like the data.table team is planning on implementing by.x and by.y into the merge.data.table function, but hasn't done so yet.

R merging tables, with different column names and retaining all columns

Yes, that's possible:

second[first, on=c(i2="index", t2="type"), nomatch=0L, .(i2, t2, index, type, value, i.value)]

i2 t2 index type value i.value
1: a 1 a 1 5 3
2: a 2 a 2 6 4
3: b 3 b 3 7 5
4: c 5 c 5 9 7

Data Table R: Merge selected columns from multiple data.table

Just change the by = "ID" to by = c("ID", "FDR", "logFC") and the argument allow.cartesian should be inside the merge

DT.comb <- Reduce(function(...) merge.data.table(...,
by= c("ID", "FDR", "LogFC"), all = TRUE, allow.cartesian=TRUE), dt.list)

Column name labelling in data.table joins

I ended up answering my own question.

data_table_tidy_join <- function(x,y, join_by){

x <- data.table(x)
y <- data.table(y)

# Determine single join names
single_join_names <- purrr::keep((stringr::str_split(join_by, "==|>=|<=")), ~length(.) == 1) %>% unlist()

# cols from x that won't require as matching in i
remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)

# names need to keep
x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
y_names_keep_raw <- names(y)

# cols that exist in both x and y, but not being equi joined on
cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]

#rename so indexing works
x_names_keep <- x_names_keep_raw
y_names_keep <- y_names_keep_raw

# give prefix to necessary vars
x_names_keep[x_names_keep %in% cols_rename_index] <- paste("x.",cols_rename_index, sep ="")
y_names_keep[y_names_keep %in% cols_rename_index] <- paste("i.",cols_rename_index, sep ="")

# implement data.table call, keeping required cols
joined_data <-
x[y, on = join_by,
mget(c(paste0("i.", y_names_keep_raw),paste0("x.", x_names_keep_raw))) %>% set_names(c(y_names_keep,x_names_keep)),
mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
as_tibble()

return(joined_data)

}

> x <- data.table(Id = c("A", "B", "C", "C"),
+ X1 = c(1L, 3L, 5L, 7L),
+ X2 = c(8L,12L,9L,18L),
+ XY = c("x2", "x4", "x6", "x8"))
>
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
ID Z1 Z2 X1 X2 XY
<chr> <int> <chr> <int> <int> <chr>
1 C 5 z5 5 9 x6
2 C 6 z6 5 9 x6
3 C 7 z7 5 9 x6
4 C 7 z7 7 18 x8
5 C 8 z8 5 9 x6
6 C 8 z8 7 18 x8
7 C 9 z9 5 9 x6
8 C 9 z9 7 18 x8

Select/Join multiple fields from different tables with same column name

After joining, you can use COALESCE to get the non-null value from the table with a matching row.

$sql = "SELECT k.uaID, k.userID, k.key, k.appName, 
COALESCE(d1.sector, d3.sector, '') AS sector,
COALESCE(d1.subSector, d3.subSector, '') AS subSector,
COALESCE(d1.topic, d2.topic, d3.topic, '') AS topic,
users.title, users.firstName, users.lastName, users.email
FROM keyTable AS k
LEFT OUTER JOIN dataTable01 AS d1 ON k.uaID = d1.uaID
LEFT OUTER JOIN dataTable02 AS d2 ON k.uaID = d2.uaID
LEFT OUTER JOIN dataTable03 AS d3 ON k.uaID = d3.uaID
LEFT OUTER JOIN users ON k.userID = users.userID
ORDER BY k.uaID";

Another way to merge the data from the datatablesNN tables into the same column os tp use UNION.

SELECT k.uaID, k.userID, k.key, k.appName, IFNULL(d.sector, '') AS sector, IFNULL(d.subSector, '') AS subSector, IFNULL(d.topic, '') AS topic,
u.title, u.firstName, u.lastName, u.email
FROM keyTable AS k
LEFT OUTER JOIN (
SELECT uaID, sector, subSector, topic
FROM dataTable01
UNION
SELECT uaID, NULL, NULL, topic
FROM datatable02
UNION
SELECT uaID, sector, subSector, topic
FROM datatable03) AS d
ON k.uaID = d.uaID
LEFT JOIN users AS u ON u.userID = k.userID
ORDER BY k.uaID

DEMO

Joining and selecting multiple tables and creating new column names

You simply JOIN the two tables, and do a group by on the extensionID. Also, add formulas to summarize and gather the info.

SELECT 
`extensionID` AS `Extension`,
SUM(`Duration`) AS `Total Talk Time`,
COUNT(DISTINCT `UniqueCallID`) as `Total Calls`,
SUM(IF(`Answered` = 1,0,1)) AS `Missed Calls`
FROM `callSession` a
JOIN `callSum` b
ON a.`UniqueCallID` = b.`UniqueCallID`
GROUP BY a.`extensionID`
ORDER BY a.`extensionID`

Join in place multiple columns data.table

A possible solution with mget :

cols <- colnames(b)[-1]
data[b,(cols) := mget(cols), on = "ref"][]

ref names radius height weight volume money
<int> <char> <num> <num> <num> <num> <num>
1: 1 a 0.9757437 -1.2441942 -0.19825867 0.7934811 0.6642228
2: 2 b 0.9752424 1.0112435 -1.27333411 0.9296093 -1.1942749
3: 3 c 0.6989610 0.8868158 -1.54361300 -0.4824725 -0.8368856
4: 4 d 0.4336850 0.8949459 1.45199656 -0.3262636 -0.5158351
5: 5 e 0.9989011 0.4237771 0.04365383 -1.4086310 0.7379102
6: 6 f 0.2640407 -2.0085267 -0.15621784 1.3278945 0.2616385

data.table joins - Select all columns in the i argument

How about constructing the j-expression and just eval'ing it?

nc = names(current)[-1L]
nn = paste0("i.", nc)
expr = lapply(nn, as.name)
setattr(expr, 'names', nc)
expr = as.call(c(quote(`:=`), expr))

> current[new[c(1,3)], eval(expr)]
> current
## id var var2
## 1: 1 11 11
## 2: 2 2 2
## 3: 3 13 13
## 4: 4 4 4

Joining tables based on different column names

Update: All the features listed below are implemented and is available in the current stable version of data.table v1.9.6 on CRAN.


There are at least these improvements possible for joins in data.tables.

  • merge.data.table gaining by.x and by.y arguments

  • Using secondary keys to join using both forms discussed above without need to set keys, but rather by specifying columns on x and i.

The simplest reason is that we've not managed to get to it yet.



Related Topics



Leave a reply



Submit