How to Perform Multiple Left Joins Using Dplyr in R

How to perform multiple left joins using dplyr in R

You can use a nested left_join

 library(dplyr)
left_join(x, y, by='Flag') %>%
left_join(., z, by='Flag')

Or another option would be to place all the datasets in a list and use merge from base R with Reduce

Reduce(function(...) merge(..., by='Flag', all.x=TRUE), list(x,y,z))

Or we have join_all from plyr. Here we place the dataframes in a list and use the argument type='left' for a left join.

library(plyr)
join_all(list(x,y,z), by='Flag', type='left')

As @JBGruber mentioned in the comments, it can also be done via purrr

library(purrr)
library(dplyr)
purrr::reduce(list(x,y,z), dplyr::left_join, by = 'Flag')

How to join multiple data frames using dplyr?

It's been too late i know....today I got introduced to the unanswered questions section. Sorry to bother.

Using left_join()

dfs <- list(
df1 = data.frame(b = c("a", "b", "c"), a = 1:3),
df2 = data.frame(d = c("a", "c", "d"), c = 4:6),
df3 = data.frame(b = c("b", "c", "e"), d = 7:9)
)

func <- function(...){
df1 = list(...)[[1]]
df2 = list(...)[[2]]
col1 = colnames(df1)[1]
col2 = colnames(df2)[1]
xxx = left_join(..., by = setNames(col2,col1))
return(xxx)
}
Reduce( func, dfs)
# b a c d
#1 a 1 4 NA
#2 b 2 NA 7
#3 c 3 5 8

Using merge() :

func <- function(...){
df1 = list(...)[[1]]
df2 = list(...)[[2]]
col1 = colnames(df1)[1]
col2 = colnames(df2)[1]
xxx=merge(..., by.x = col1, by.y = col2, , all.x = T)
return(xxx)
}

Reduce( func, dfs)
# b a c d
#1 a 1 4 NA
#2 b 2 NA 7
#3 c 3 5 8

Left join with multiple conditions in R

You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:

value_lookup <- data.frame(
type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
id = c(1, 2, 1, 3, 1, 2),
value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)

Then you just merge on both type and id:

df <- left_join(df, value_lookup, by = c('type', 'id'))  

Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.

How to use left_join on several data frames?

Using purrr::reduce to merge multiple dataframes, then use dplyr::coalesce to get first non-na value:

library(dplyr)
library(purrr)

list(table_base, table_a, table_b) %>%
reduce(left_join, by = "cat") %>%
mutate(value = coalesce(!!!select(., starts_with("value")))) %>%
select(cat, value)

# cat value
# 1 a 1
# 2 b 2
# 3 c 9
# 4 d 10

Is there a clean dplyr-way of doing multiple left-(self)joins?

Following the suggestion by @zx8754 one option to achieve your desired result would be to do the left_joins via a recursive function which stops when there are no more matches:

library(dplyr)
library(tidyr)

test_hierarchie <- tribble(
~child, ~parent,
"A", "B",
"B", "C",
"D", "E"
)

left_join_recursive <- function(x, by) {
x <- left_join(x, test_hierarchie, by = setNames("child", by), suffix = c("", "_grant"))
byby <- paste0(by, "_grant")
if (!all(is.na(x[[byby]]))) {
left_join_recursive(x, byby)
} else {
x
}
}

test_hierarchie_transformed <- left_join_recursive(test_hierarchie, "parent") %>%
pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
filter(!is.na(parent))

test_hierarchie_transformed
#> # A tibble: 4 × 3
#> child relation parent
#> <chr> <chr> <chr>
#> 1 A parent B
#> 2 A parent_grant C
#> 3 B parent C
#> 4 D parent E

To check wether the approach works in a more general case I added another row to your example data:

test_hierarchie <- add_row(test_hierarchie, child = "C", parent = "D")

test_hierarchie_transformed <- left_join_recursive(test_hierarchie, "parent") %>%
pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
filter(!is.na(parent))

test_hierarchie_transformed
#> # A tibble: 10 × 3
#> child relation parent
#> <chr> <chr> <chr>
#> 1 A parent B
#> 2 A parent_grant C
#> 3 A parent_grant_grant D
#> 4 A parent_grant_grant_grant E
#> 5 B parent C
#> 6 B parent_grant D
#> 7 B parent_grant_grant E
#> 8 D parent E
#> 9 C parent D
#> 10 C parent_grant E

Merging two dataframe with dplyr left join?

The issue is that a left_join looks for exact matches and there is nothing like "match this or that". Hence, to achieve your desired result you could

  1. unite Parent.MeSH.ID and Child.MeSH.ID into a new column MeSH_ID
  2. split the united columns in separate IDs using e.g. tidyr::separate_rows. Doing so makes it possible to join the df's by ID.
  3. Use an semi_join to filter out rows in df1 with matches in the newly created df3, finally do a left_join to add the columns from df3. Or if doesn't matter to keep both HUGO_symbol and Gene.Name you could achieve both steps with an inner_join.
df1 <- data.frame(
stringsAsFactors = FALSE,
HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
MeSH_ID = c("D000310", "D0002277", "D000230")
)

df2 <- data.frame(
stringsAsFactors = FALSE,
Gene.Name = c("P53", "HGA2", "ZZZ3"),
Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)

library(dplyr)
library(tidyr)

df3 <- df2 %>%
unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
separate_rows(MeSH_ID, sep = ", ")

semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>%
left_join(df3)
#> Joining, by = "MeSH_ID"
#> HUGO_symbol MeSH_ID Gene.Name Parent.MeSH.ID Child.MeSH.ID
#> 1 P53 D000310 P53 D000310 D015675, D006676
#> 2 ZZZ3 D000230 ZZZ3 D001163, D000230 D003451

How to perform left join using multiple columns where one data frame has missingness in one column?

An option using data.table:

library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on=.(id1, id2), mult="first", val2 := val2]
df1[is.na(val2), val2 :=
df2[.SD, on=.(id1), mult="first", val2]]

I have taken the liberty of using the first value if there are multiple joins (i.e. the combination of id1 and id2 in df2 are not unique).



Related Topics



Leave a reply



Submit