Join Matching Columns in a Data.Frame or Data.Table

Join matching columns in a data.frame or data.table

The type of merge you specify probably won't be possible using merge (with data frames), although saying that usually invites being proved wrong.

You also omit some details: will there always be a single unique non-NA value in each column for each id value? If so, this will work:

library(plyr)

ab <- rbind(a, b)                            # a and b are the two input data frames
colFun <- function(x) {x[which(!is.na(x))]}  # keep the (single) non-NA value
ddply(ab, .(id), function(x) {colwise(colFun)(x)})
#   id v1 v2
# 1  1  a  A
# 2  2  B  b
# 3  3  C  c

A similar strategy should work with data.tables as well:

library(data.table)

abDT <- data.table(ab, key = "id")
abDT[, list(colFun(v1), colFun(v2)), by = id]
#      id V1 V2
# [1,]  1  a  A
# [2,]  2  B  b
# [3,]  3  C  c
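
Note that colFun simply drops the NAs. If you would rather not name every column explicitly, an equivalent data.table idiom (again assuming exactly one non-NA value per column within each id) would be:

abDT[, lapply(.SD, function(x) x[!is.na(x)]), by = id]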

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
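
For concreteness, here is a small made-up example of the calls above (the CustomerId data is illustrative, not from the original question):

df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c("Alabama", "Alabama", "Ohio"))

merge(df1, df2, by = "CustomerId")                # inner join: CustomerId 2, 4, 6
merge(df1, df2, by = "CustomerId", all.x = TRUE)  # left outer: all of df1, NA State where unmatched
merge(df1, df2, by = "CustomerId", all = TRUE)    # full outer join
merge(df1, df2, by = NULL)                        # cross join: 6 x 3 = 18 rows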

R: Populate a column based on matching row values in two different data frames

Start with the detect column only in df2, then merge:

df1$detect = NULL
df2$detect = 1
result = merge(df1, unique(df2), all.x = TRUE)

This will create the detect column as 1s when there are exact matches and NAs when there are not. If you want, you can change the NAs to 0s.
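
For example, assuming the merged result above, the NAs can be recoded with:

result$detect[is.na(result$detect)] <- 0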

The same method can work with dplyr:

library(dplyr)

df1 %>%
  select(-detect) %>%
  left_join(
    df2 %>% mutate(detect = 1) %>% unique()
  )

Merge R data frame or data table and overwrite values of multiple columns

You can do this by using dplyr::coalesce, which will return the first non-missing value from vectors.

(EDIT: you can use dplyr::coalesce directly on the data frames also, no need to create the function below. Left it there just for completeness, as a record of the original answer.)

Credit where it's due: this code is mostly from this blog post. It builds a function that will take two data frames and do what you need (taking values from the x data frame if they are present).

coalesce_join <- function(x,
                          y,
                          by,
                          suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))

  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(
    to_coalesce,
    1,
    nchar(to_coalesce) - nchar(suffix_used)
  ))

  coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
    joined[[paste0(.x, suffix[1])]],
    joined[[paste0(.x, suffix[2])]]
  ))
  names(coalesced) <- to_coalesce

  dplyr::bind_cols(joined, coalesced)[cols]
}
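
A minimal usage sketch (with made-up data frames; the dplyr and purrr packages need to be installed):

x <- data.frame(id = 1:3, val = c("a", NA, "c"))
y <- data.frame(id = 1:3, val = c(NA, "b", NA))

coalesce_join(x, y, by = "id")
# returns id = 1:3 and val = c("a", "b", "c"):
# x's values are kept where present, y's fill in the gaps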

How do I combine two data-frames based on two columns?

See the documentation on ?merge, which states:

By default the data frames are merged on the columns with names they both have, 
but separate specifications of the columns can be given by by.x and by.y.

This clearly implies that merge will merge data frames based on more than one column. From the final example given in the documentation:

x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match

This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
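
If, say, the key columns were instead named key1 and key2 in y (hypothetical names, for illustration only), the equivalent call would be:

merge(x, y, by.x = c("k1", "k2"), by.y = c("key1", "key2"))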

Merging data frames based on multiple nearest matches in R

Without knowing exactly how you want the result formatted, you can do this with the data.table rolling join with roll="nearest" that you mentioned.

In this case I've melted both sets of data to long datasets so that the matching can be done in a single join.

library(data.table)
setDT(df1)
setDT(df2)

df1[
  match(
    melt(df1, id.vars = "julian")[
      melt(df2, measure.vars = names(df2)),
      on = c("variable", "value"), roll = "nearest"
    ]$julian,
    julian
  ),
]
# julian a b c d
#1: 9 12.02948 13.54714 7.659482 6.784113
#2: 20 28.74620 20.24871 18.523935 17.801711
#3: 10 13.00511 14.57352 8.296155 6.942622
#4: 24 30.26931 24.20554 20.253149 22.017714

If you want separate tables for each join instead you could do something like:

lapply(names(df2), \(var)  df1[df2, on=var, roll="nearest", .SD, .SDcols=names(df1)] )
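
If you want the resulting list labelled by the column each element was matched on, a small addition (not part of the original answer) is:

matches <- lapply(names(df2), \(var) df1[df2, on = var, roll = "nearest", .SD, .SDcols = names(df1)])
names(matches) <- names(df2)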

How to join data to only the first matching row with {data.table} in R

One way would be to turn the duplicated values to NA after the join:

library(data.table)

d3 <- d2[d1, on = c("a", "b")]
d3[, d := replace(d, seq_len(.N) != 1, NA), by = .(a, b)]
d3

#    a b    d c
# 1: 1 1 TRUE 4
# 2: 1 1   NA 8
# 3: 1 2   NA 2

Merge two data tables into one, with alternating columns in R

You can just cbind them and then re-order the columns:

library(data.table)

# odd_data and even_data are the two data.tables from the question
neworder <- order(c(2*(seq_along(odd_data) - 1) + 1,
                    2*seq_along(even_data)))
cbind(odd_data, even_data)[, neworder]
#    col_1 col_2 col_3 col_4
# 1:    11    12    13    14
# 2:    21    22    23    24
# 3:    31    32    33    34

Explanation:

### count by odds
2*(seq_along(odd_data) - 1) + 1
# [1] 1 3

### count by evens
2*seq_along(even_data)
# [1] 2 4

neworder
# [1] 1 3 2 4

This gives us the column order we want in the end: take the first column of the combined table (col_1), then the third (col_2, since even_data's columns come after all of odd_data's columns in the cbind), then the second (col_3), and so on.

To test, we can generate two asymmetric examples:

odd_data <- data.table(col_1 = c(11, 21, 31),
                       col_3 = c(13, 23, 33),
                       col_5 = c(15, 25, 35))
even_data <- data.table(col_2 = c(12, 22, 32),
                        col_4 = c(14, 24, 34))
neworder <- order(c(2*(seq_along(odd_data) - 1) + 1,
                    2*seq_along(even_data)))
cbind(odd_data, even_data)[, neworder]
#    col_1 col_2 col_3 col_4 col_5
# 1:    11    12    13    14    15
# 2:    21    22    23    24    25
# 3:    31    32    33    34    35

Next, 3 and 3:

odd_data <- data.table(col_1 = c(11, 21, 31),
                       col_3 = c(13, 23, 33),
                       col_5 = c(15, 25, 35))
even_data <- data.table(col_2 = c(12, 22, 32),
                        col_4 = c(14, 24, 34),
                        col_6 = c(16, 26, 36))

neworder <- order(c(2*(seq_along(odd_data) - 1) + 1,
                    2*seq_along(even_data)))
cbind(odd_data, even_data)[, neworder]
#    col_1 col_2 col_3 col_4 col_5 col_6
# 1:    11    12    13    14    15    16
# 2:    21    22    23    24    25    26
# 3:    31    32    33    34    35    36

Now if we want to try to mess up the system by having more evens than odds (which "shouldn't" happen):

odd_data <- data.table(col_1 = c(11, 21, 31),
                       col_3 = c(13, 23, 33),
                       col_5 = c(15, 25, 35))
even_data <- data.table(col_2 = c(12, 22, 32),
                        col_4 = c(14, 24, 34),
                        col_6 = c(16, 26, 36),
                        col_8 = c(18, 28, 38))

neworder <- order(c(2*(seq_along(odd_data) - 1) + 1,
                    2*seq_along(even_data)))
cbind(odd_data, even_data)[, neworder]
#    col_1 col_2 col_3 col_4 col_5 col_6 col_8
# 1:    11    12    13    14    15    16    18
# 2:    21    22    23    24    25    26    28
# 3:    31    32    33    34    35    36    38

So while col_8 is not technically the 8th column, the order of all other columns is still preserved.


