How to perform multiple left joins using dplyr in R
You can chain left_join calls:
library(dplyr)
left_join(x, y, by='Flag') %>%
left_join(., z, by='Flag')
Another option is to place all the datasets in a list and use merge from base R with Reduce:
Reduce(function(...) merge(..., by='Flag', all.x=TRUE), list(x,y,z))
Or we have join_all from plyr. Here we place the data frames in a list and use the argument type='left' for a left join:
library(plyr)
join_all(list(x,y,z), by='Flag', type='left')
As @JBGruber mentioned in the comments, it can also be done via purrr
library(purrr)
library(dplyr)
purrr::reduce(list(x,y,z), dplyr::left_join, by = 'Flag')
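The snippets above assume that x, y and z already exist. As a minimal sketch, here is a self-contained version with three hypothetical data frames (the column names x_val, y_val and z_val are made up for illustration) that checks the chained and the reduce() approach give the same result:

```r
library(dplyr)
library(purrr)

# Hypothetical data frames sharing a key column named Flag
x <- data.frame(Flag = c("a", "b", "c"), x_val = 1:3)
y <- data.frame(Flag = c("a", "c"), y_val = c(10, 30))
z <- data.frame(Flag = c("b", "c", "d"), z_val = c("B", "C", "D"))

# Chained left joins, one data frame at a time
chained <- x %>%
  left_join(y, by = "Flag") %>%
  left_join(z, by = "Flag")

# The same joins folded over a list
reduced <- reduce(list(x, y, z), left_join, by = "Flag")

# Both keep the three rows of x; the key "d" that only occurs in z is dropped
identical(chained, reduced)
reduced
```

The reduce() form is convenient when the number of data frames is not known in advance, since the list can be built programmatically.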
How to join multiple data frames using dplyr?
I know this is late; I only came across the unanswered questions section today. Sorry to bother.
Using left_join():
dfs <- list(
df1 = data.frame(b = c("a", "b", "c"), a = 1:3),
df2 = data.frame(d = c("a", "c", "d"), c = 4:6),
df3 = data.frame(b = c("b", "c", "e"), d = 7:9)
)
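library(dplyr)
# Join two data frames on their first columns, whatever those columns are named
func <- function(df1, df2) {
  col1 <- colnames(df1)[1]
  col2 <- colnames(df2)[1]
  # setNames(col2, col1) builds the named vector c(<col1> = "<col2>")
  left_join(df1, df2, by = setNames(col2, col1))
}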
Reduce( func, dfs)
# b a c d
#1 a 1 4 NA
#2 b 2 NA 7
#3 c 3 5 8
Using merge():
# Same idea with base R: merge on the first column of each data frame
func <- function(df1, df2) {
  col1 <- colnames(df1)[1]
  col2 <- colnames(df2)[1]
  # all.x = TRUE keeps every row of df1, i.e. a left join
  merge(df1, df2, by.x = col1, by.y = col2, all.x = TRUE)
}
Reduce( func, dfs)
# b a c d
#1 a 1 4 NA
#2 b 2 NA 7
#3 c 3 5 8
Left join with multiple conditions in R
You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:
value_lookup <- data.frame(
type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
id = c(1, 2, 1, 3, 1, 2),
value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)
Then you just merge on both type and id:
df <- left_join(df, value_lookup, by = c('type', 'id'))
Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.
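As a minimal sketch of the two-column join, assume a hypothetical df of survey responses (the respondent column and its values are invented for illustration). A row picks up a value only when both type and id agree with the lookup table:

```r
library(dplyr)

# Lookup table from the answer above
value_lookup <- data.frame(
  type  = c("q1", "q1", "q2", "q2", "q3", "q3"),
  id    = c(1, 2, 1, 3, 1, 2),
  value = c("yes", "no", "one hour", "more than two hours", "blue", "yellow")
)

# Hypothetical responses; the key columns are named as in the lookup table
df <- data.frame(
  respondent = c(101, 101, 102),
  type       = c("q1", "q3", "q2"),
  id         = c(2, 1, 3)
)

# Match on the combination of type and id
df <- left_join(df, value_lookup, by = c("type", "id"))
df$value
# "no" "blue" "more than two hours"
```

If the lookup table lives in a CSV file instead, read.csv() on a path of your choosing would load it in the same shape.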
How to use left_join on several data frames?
Use purrr::reduce to join multiple data frames, then dplyr::coalesce to take the first non-NA value:
library(dplyr)
library(purrr)
list(table_base, table_a, table_b) %>%
reduce(left_join, by = "cat") %>%
mutate(value = coalesce(!!!select(., starts_with("value")))) %>%
select(cat, value)
# cat value
# 1 a 1
# 2 b 2
# 3 c 9
# 4 d 10
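The input tables are not shown in the answer, so the following is a sketch with hypothetical table_base, table_a and table_b chosen to reproduce the output above. After the second join the two value columns are disambiguated with dplyr's default suffixes, value.x and value.y, and coalesce() picks the first non-NA entry per row:

```r
library(dplyr)
library(purrr)

# Hypothetical inputs: each lookup table covers only some categories
table_base <- data.frame(cat = c("a", "b", "c", "d"))
table_a    <- data.frame(cat = c("a", "b"), value = c(1, 2))
table_b    <- data.frame(cat = c("c", "d"), value = c(9, 10))

# After both joins the columns are cat, value.x and value.y
joined <- reduce(list(table_base, table_a, table_b), left_join, by = "cat")

# Splice every column starting with "value" into coalesce()
result <- joined %>%
  mutate(value = coalesce(!!!select(joined, starts_with("value")))) %>%
  select(cat, value)

result$value
# 1 2 9 10
```

Note that coalesce() expects the value columns to share a type, which is why both hypothetical tables store value as a double.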
Is there a clean dplyr-way of doing multiple left-(self)joins?
Following the suggestion by @zx8754, one option to achieve your desired result would be to do the left_joins via a recursive function which stops when there are no more matches:
library(dplyr)
library(tidyr)
test_hierarchie <- tribble(
~child, ~parent,
"A", "B",
"B", "C",
"D", "E"
)
left_join_recursive <- function(x, by) {
x <- left_join(x, test_hierarchie, by = setNames("child", by), suffix = c("", "_grant"))
byby <- paste0(by, "_grant")
if (!all(is.na(x[[byby]]))) {
left_join_recursive(x, byby)
} else {
x
}
}
test_hierarchie_transformed <- left_join_recursive(test_hierarchie, "parent") %>%
pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
filter(!is.na(parent))
test_hierarchie_transformed
#> # A tibble: 4 × 3
#> child relation parent
#> <chr> <chr> <chr>
#> 1 A parent B
#> 2 A parent_grant C
#> 3 B parent C
#> 4 D parent E
To check whether the approach works in a more general case I added another row to your example data:
test_hierarchie <- add_row(test_hierarchie, child = "C", parent = "D")
test_hierarchie_transformed <- left_join_recursive(test_hierarchie, "parent") %>%
pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
filter(!is.na(parent))
test_hierarchie_transformed
#> # A tibble: 10 × 3
#> child relation parent
#> <chr> <chr> <chr>
#> 1 A parent B
#> 2 A parent_grant C
#> 3 A parent_grant_grant D
#> 4 A parent_grant_grant_grant E
#> 5 B parent C
#> 6 B parent_grant D
#> 7 B parent_grant_grant E
#> 8 D parent E
#> 9 C parent D
#> 10 C parent_grant E
Merging two dataframe with dplyr left join?
The issue is that a left_join looks for exact matches; there is no way to say "match this or that". Hence, to achieve your desired result you could:
- unite Parent.MeSH.ID and Child.MeSH.ID into a new column MeSH_ID.
- Split the united column into separate IDs using e.g. tidyr::separate_rows. Doing so makes it possible to join the data frames by ID.
- Use a semi_join to keep only the rows in df1 that have a match in the newly created df3, then do a left_join to add the columns from df3. Or, if it doesn't matter whether you keep both HUGO_symbol and Gene.Name, you could achieve both steps with an inner_join.
df1 <- data.frame(
stringsAsFactors = FALSE,
HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
MeSH_ID = c("D000310", "D0002277", "D000230")
)
df2 <- data.frame(
stringsAsFactors = FALSE,
Gene.Name = c("P53", "HGA2", "ZZZ3"),
Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)
library(dplyr)
library(tidyr)
df3 <- df2 %>%
unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
separate_rows(MeSH_ID, sep = ", ")
semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>%
left_join(df3)
#> Joining, by = "MeSH_ID"
#> HUGO_symbol MeSH_ID Gene.Name Parent.MeSH.ID Child.MeSH.ID
#> 1 P53 D000310 P53 D000310 D015675, D006676
#> 2 ZZZ3 D000230 ZZZ3 D001163, D000230 D003451
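The inner_join alternative mentioned above is not shown in the answer. A minimal sketch, reusing the same example data, could look like this. Because Gene.Name is used as a join key, it is merged into HUGO_symbol and does not appear as a separate column in the result:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
  MeSH_ID = c("D000310", "D0002277", "D000230"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene.Name = c("P53", "HGA2", "ZZZ3"),
  Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
  Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451"),
  stringsAsFactors = FALSE
)

# One row per gene and individual MeSH ID
df3 <- df2 %>%
  unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
  separate_rows(MeSH_ID, sep = ", ")

# Keep only matching rows and add the columns of df3 in a single step
result <- inner_join(
  df1, df3,
  by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID" = "MeSH_ID")
)
result$HUGO_symbol
# "P53" "ZZZ3"
```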
How to perform left join using multiple columns where one data frame has missingness in one column?
An option using data.table:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on=.(id1, id2), mult="first", val2 := val2]
df1[is.na(val2), val2 :=
df2[.SD, on=.(id1), mult="first", val2]]
I have taken the liberty of using the first value if there are multiple matches (i.e. the combination of id1 and id2 in df2 is not unique).
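For comparison, the same two-step fallback can be sketched with dplyr instead of data.table. This is a swapped-in technique, not the code from the answer. The example data is hypothetical, and the sketch assumes each combination of id1 and id2 occurs at most once in df2, so the first join does not duplicate rows:

```r
library(dplyr)

# Hypothetical data: df2 has a missing id2 for id1 == 2
df1 <- data.frame(id1 = c(1, 1, 2), id2 = c("a", "b", "a"), val1 = c(10, 20, 30))
df2 <- data.frame(id1 = c(1, 2), id2 = c("a", NA), val2 = c("x", "y"))

# Step 1: exact match on both keys
exact <- left_join(df1, df2, by = c("id1", "id2"))

# Step 2: fallback table holding the first df2 row per id1
fallback <- df2 %>%
  distinct(id1, .keep_all = TRUE) %>%
  select(id1, val2_fallback = val2)

# Fill the rows left unmatched in step 1 from the id1-only lookup
result <- exact %>%
  left_join(fallback, by = "id1") %>%
  mutate(val2 = coalesce(val2, val2_fallback)) %>%
  select(-val2_fallback)

result$val2
# "x" "x" "y"
```

Unlike the data.table version, this builds a new data frame rather than updating df1 by reference.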