join matching columns in a data.frame or data.table
The type of merge you specify probably won't be possible using merge
(with data frames), although saying that usually invites being proved wrong.
You also omit some details: will there always be a single unique non-NA value in each column for each id value? If so, this will work:
library(plyr)
ab <- rbind(a, b)
# Keep only the non-NA values of a column
colFun <- function(x) x[!is.na(x)]
ddply(ab, .(id), function(x) colwise(colFun)(x))
#   id v1 v2
# 1  1  a  A
# 2  2  B  b
# 3  3  C  c
A similar strategy should work with data.tables as well:
library(data.table)
abDT <- data.table(ab, key = "id")
abDT[, list(colFun(v1), colFun(v2)), by = id]
#      id V1 V2
# [1,]  1  a  A
# [2,]  2  B  b
# [3,]  3  C  c
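The question's input data is not shown here, so the following is only a minimal base-R sketch of the same idea. The frames a and b are assumptions reconstructed from the output above, and aggregate() stands in for ddply(); it relies on each id having exactly one non-NA value per column.

```r
# Hypothetical inputs, reconstructed from the printed result (assumption)
a <- data.frame(id = 1:3, v1 = c("a", NA, NA), v2 = c(NA, "b", "c"),
                stringsAsFactors = FALSE)
b <- data.frame(id = 1:3, v1 = c(NA, "B", "C"), v2 = c("A", NA, NA),
                stringsAsFactors = FALSE)

ab <- rbind(a, b)

# Keep only the non-NA values of a column
colFun <- function(x) x[!is.na(x)]

# Apply colFun to v1 and v2 within each id
res <- aggregate(ab[c("v1", "v2")], by = ab["id"], FUN = colFun)
res
```

If an id can have more than one non-NA value in a column, colFun returns several values and the result is no longer one row per id, so that case needs a different rule (for example, taking the first value).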
How to join (merge) data frames (inner, outer, left, right)
By using the merge function and its optional parameters:
Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.
Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.
You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").
If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2", where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
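To see how the join types differ, here is a small sketch with made-up data. The contents of df1 and df2 are assumptions chosen so that each frame has at least one CustomerId the other lacks.

```r
# Hypothetical example data (assumption, not from the original question)
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2L, 4L, 7L),
                  State = c("Alabama", "Alabama", "Ohio"))

inner <- merge(df1, df2, by = "CustomerId")                # ids 2 and 4 only
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # ids 1 to 7
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # every row of df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # every row of df2
cross <- merge(df1, df2, by = NULL)                        # all 6 * 3 pairings

# Row counts per join type
sapply(list(inner = inner, outer = outer, left = left,
            right = right, cross = cross), nrow)
```

Rows without a partner get NA in the columns that come from the other frame, which is an easy way to spot unmatched keys after an outer join.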
R Populate column based on matching rows values in two different data frames
Start with the detect column only in df2, then merge:
df1$detect = NULL
df2$detect = 1
result = merge(df1, unique(df2), all.x = TRUE)
This will create the detect column as 1s when there are exact matches and NAs when there are not. If you want, you can change the NAs to 0s.
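A minimal sketch of that last step, using made-up df1 and df2 (the columns x and y are assumptions; the original frames are not shown):

```r
# Hypothetical inputs (assumption): df2 contains a duplicated row
df1 <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
df2 <- data.frame(x = c(1, 3, 3), y = c("a", "c", "c"))

df2$detect <- 1
result <- merge(df1, unique(df2), all.x = TRUE)

# Rows of df1 with no exact match in df2 come back as NA; recode them to 0
result$detect[is.na(result$detect)] <- 0
result
```

The unique() call matters: without it, the duplicated row in df2 would duplicate the matching row of df1 in the result.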
The same method can work with dplyr:
library(dplyr)
df1 %>%
  select(-detect) %>%
  left_join(
    df2 %>% mutate(detect = 1) %>% unique()
  )
Merge R data frame or data table and overwrite values of multiple columns
You can do this by using dplyr::coalesce, which will return the first non-missing value from vectors.
(EDIT: you can also use dplyr::coalesce directly on the data frames, with no need to create the function below. I've left it here for completeness, as a record of the original answer.)
Credit where it's due: this code is mostly from this blog post. It builds a function that will take two data frames and do what you need (taking values from the x data frame if they are present).
coalesce_join <- function(x,
                          y,
                          by,
                          suffix = c(".x", ".y"),
                          join = dplyr::full_join, ...) {
  joined <- join(x, y, by = by, suffix = suffix, ...)
  # names of desired output
  cols <- union(names(x), names(y))
  to_coalesce <- names(joined)[!names(joined) %in% cols]
  suffix_used <- suffix[ifelse(endsWith(to_coalesce, suffix[1]), 1, 2)]
  # remove suffixes and deduplicate
  to_coalesce <- unique(substr(
    to_coalesce,
    1,
    nchar(to_coalesce) - nchar(suffix_used)
  ))
  coalesced <- purrr::map_dfc(to_coalesce, ~dplyr::coalesce(
    joined[[paste0(.x, suffix[1])]],
    joined[[paste0(.x, suffix[2])]]
  ))
  names(coalesced) <- to_coalesce
  dplyr::bind_cols(joined, coalesced)[cols]
}
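For a single overlapping column, the same effect can be sketched without the helper: join first, then coalesce the two suffixed columns by hand. The frames x and y and the column v below are made up for illustration.

```r
library(dplyr)

# Hypothetical inputs (assumption): both frames carry a column v
x <- data.frame(id = 1:3, v = c(10, NA, 30))
y <- data.frame(id = 2:4, v = c(20, 99, 40))

joined <- full_join(x, y, by = "id", suffix = c(".x", ".y"))

# Prefer the value from x; fall back to y where x is missing
joined$v <- coalesce(joined$v.x, joined$v.y)
result <- joined[c("id", "v")]
result
```

Here id 3 keeps 30 from x even though y offers 99, while id 2 and id 4 are filled from y because x has no value for them.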
How do I combine two data-frames based on two columns?
See the documentation on ?merge, which states:
By default the data frames are merged on the columns with names they both have,
but separate specifications of the columns can be given by by.x and by.y.
This clearly implies that merge will merge data frames based on more than one column. From the final example given in the documentation:
x <- data.frame(k1=c(NA,NA,3,4,5), k2=c(1,NA,NA,4,5), data=1:5)
y <- data.frame(k1=c(NA,2,NA,4,5), k2=c(NA,NA,3,4,5), data=1:5)
merge(x, y, by=c("k1","k2")) # NA's match
This example was meant to demonstrate the use of incomparables, but it illustrates merging using multiple columns as well. You can also specify separate columns in each of x and y using by.x and by.y.
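As a short sketch of the by.x / by.y form with two key columns (the frames and the column names key1, key2, vx and vy are made up for illustration):

```r
# Hypothetical inputs (assumption): the key columns are named differently
x <- data.frame(k1 = c(1, 2, 3), k2 = c("a", "b", "c"), vx = 1:3)
y <- data.frame(key1 = c(1, 2, 4), key2 = c("a", "z", "c"), vy = 4:6)

# k1 is matched against key1 and k2 against key2; both must agree
m <- merge(x, y, by.x = c("k1", "k2"), by.y = c("key1", "key2"))
m
```

Only the pair (1, "a") appears in both frames, so the inner join keeps a single row. The row with k1 = 2 is dropped because its second key differs ("b" versus "z"). The result keeps the key names from x.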
merging data frames based on multiple nearest matches in R
Without knowing exactly how you want the result formatted, you can do this with the data.table rolling join with roll="nearest" that you mentioned.
In this case I've melted both sets of data to long datasets so that the matching can be done in a single join.
library(data.table)
setDT(df1)
setDT(df2)
df1[
  match(
    melt(df1, id.vars="julian")[
      melt(df2, measure.vars=names(df2)),
      on=c("variable","value"), roll="nearest"]$julian,
    julian),
]
# julian a b c d
#1: 9 12.02948 13.54714 7.659482 6.784113
#2: 20 28.74620 20.24871 18.523935 17.801711
#3: 10 13.00511 14.57352 8.296155 6.942622
#4: 24 30.26931 24.20554 20.253149 22.017714
If you want separate tables for each join instead you could do something like:
lapply(names(df2), \(var) df1[df2, on=var, roll="nearest", .SD, .SDcols=names(df1)] )
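If the rolling join itself is unfamiliar, a stripped-down sketch with one value column may help. The tables readings and targets are made up for illustration; only the roll = "nearest" mechanism is taken from the answer above.

```r
library(data.table)

# Hypothetical data (assumption): one measurement per julian day
readings <- data.table(julian = c(1, 5, 10), temp = c(10, 15, 20))
targets  <- data.table(temp = c(11, 19))

# For each target temp, pick the readings row whose temp is closest
res <- readings[targets, on = "temp", roll = "nearest"]
res
```

In the result the temp column holds the target values (11 and 19), while julian comes from the nearest row of readings: day 1 for 11 and day 10 for 19.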
How to join data to only the first matching row with {data.table} in R
One way would be to turn the values to NA after the join.
library(data.table)
d3 <- d2[d1, on = c("a", "b")]
d3[, d:= replace(d, seq_len(.N) != 1, NA), .(a, b)]
d3
# a b d c
#1: 1 1 TRUE 4
#2: 1 1 NA 8
#3: 1 2 NA 2
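The inputs d1 and d2 are not shown above, so here is a self-contained sketch with tables reconstructed from the printed result. Their contents are an assumption.

```r
library(data.table)

# Hypothetical inputs, reconstructed from the output above (assumption)
d1 <- data.table(a = c(1, 1, 1), b = c(1, 1, 2), c = c(4, 8, 2))
d2 <- data.table(a = 1, b = 1, d = TRUE)

# Join d2 onto every row of d1; both (1, 1) rows receive d = TRUE
d3 <- d2[d1, on = c("a", "b")]

# Within each (a, b) group, keep d on the first row only
d3[, d := replace(d, seq_len(.N) != 1, NA), by = .(a, b)]
d3
```

The replace() call blanks every row of a group except the first, so the joined value survives exactly once per key.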
merge two data table into one, with alternating columns in R
You can just cbind them and then re-order the columns:
neworder <- order(c(2*(seq_along(odd_data) - 1) + 1,
                    2*seq_along(even_data)))
# the .. prefix makes data.table treat neworder as column positions
cbind(odd_data, even_data)[, ..neworder]
# col_1 col_2 col_3 col_4
# 1: 11 12 13 14
# 2: 21 22 23 24
# 3: 31 32 33 34
Explanation:
### count by odds
2*(seq_along(odd_data) - 1) + 1
# [1] 1 3
### count by evens
2*seq_along(even_data)
# [1] 2 4
neworder
# [1] 1 3 2 4
This gives us the column order we want in the end: first column (col_1), third column (col_2, since it is after all columns of the first table), etc.
To test, we can generate two asymmetric examples:
odd_data = data.table(col_1 = c(11, 21, 31),
                      col_3 = c(13, 23, 33),
                      col_5 = c(15, 25, 35))
even_data = data.table(col_2 = c(12, 22, 32),
                       col_4 = c(14, 24, 34))
neworder <- order(c(2*(seq_along(odd_data) - 1) + 1,
                    2*seq_along(even_data)))
cbind(odd_data, even_data)[, ..neworder]
# col_1 col_2 col_3 col_4 col_5
# 1: 11 12 13 14 15
# 2: 21 22 23 24 25
# 3: 31 32 33 34 35
Next, 3 and 3:
odd_data = data.table(col_1 = c(11, 21, 31),
                      col_3 = c(13, 23, 33),
                      col_5 = c(15, 25, 35))
even_data = data.table(col_2 = c(12, 22, 32),
                       col_4 = c(14, 24, 34),
                       col_6 = c(16, 26, 36))
neworder <- order(c(2*(seq_along(odd_data) - 1) + 1,
                    2*seq_along(even_data)))
cbind(odd_data, even_data)[, ..neworder]
# col_1 col_2 col_3 col_4 col_5 col_6
# 1: 11 12 13 14 15 16
# 2: 21 22 23 24 25 26
# 3: 31 32 33 34 35 36
Now if we want to try to mess up the system by having more evens than odds (which "shouldn't" happen):
odd_data = data.table(col_1 = c(11, 21, 31),
                      col_3 = c(13, 23, 33),
                      col_5 = c(15, 25, 35))
even_data = data.table(col_2 = c(12, 22, 32),
                       col_4 = c(14, 24, 34),
                       col_6 = c(16, 26, 36),
                       col_8 = c(18, 28, 38))
neworder <- order(c(2*(seq_along(odd_data) - 1) + 1,
                    2*seq_along(even_data)))
cbind(odd_data, even_data)[, ..neworder]
# col_1 col_2 col_3 col_4 col_5 col_6 col_8
# 1: 11 12 13 14 15 16 18
# 2: 21 22 23 24 25 26 28
# 3: 31 32 33 34 35 36 38
So while col_8 is not technically the 8th column, the order of all other columns is still preserved.
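If you need this more than once, the reordering can be wrapped in a small helper. The name interleave_cols is hypothetical, and the sketch uses plain data frames to stay dependency-free; for data.tables the column selection would be written as [, neworder, with = FALSE] instead.

```r
# Hypothetical helper (not part of any package): interleave the columns of
# two data frames, starting with the first column of `odd`
interleave_cols <- function(odd, even) {
  neworder <- order(c(2 * (seq_along(odd) - 1) + 1,  # target slots 1, 3, 5, ...
                      2 * seq_along(even)))          # target slots 2, 4, 6, ...
  cbind(odd, even)[, neworder, drop = FALSE]
}

odd_df  <- data.frame(col_1 = c(11, 21, 31),
                      col_3 = c(13, 23, 33),
                      col_5 = c(15, 25, 35))
even_df <- data.frame(col_2 = c(12, 22, 32),
                      col_4 = c(14, 24, 34))

res <- interleave_cols(odd_df, even_df)
names(res)
```

As with the inline version, surplus columns in either input simply end up at the tail of the result in their original order.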