How to Perform Multiple Left Joins Using Dplyr in R

How to perform multiple left joins using dplyr in R

You can use a nested left_join

 library(dplyr)
 left_join(x, y, by='Flag') %>%
                left_join(., z, by='Flag')

Or another option would be to place all the datasets in a list and use merge from base R with Reduce

Reduce(function(...) merge(..., by='Flag', all.x=TRUE), list(x,y,z))

Or we have join_all from plyr. Here we place the dataframes in a list and use the argument type='left' for a left join.

library(plyr)
join_all(list(x,y,z), by='Flag', type='left')

As @JBGruber mentioned in the comments, it can also be done via purrr

library(purrr)
library(dplyr)
purrr::reduce(list(x,y,z), dplyr::left_join, by = 'Flag')

How to join multiple data frames using dplyr?

It's been too late i know....today I got introduced to the unanswered questions section. Sorry to bother.

Using left_join()

dfs <- list(
              df1 = data.frame(b = c("a", "b", "c"), a = 1:3),
              df2 = data.frame(d = c("a", "c", "d"), c = 4:6),
              df3 = data.frame(b = c("b", "c", "e"), d = 7:9)
         )

func <- function(...){
  df1 = list(...)[[1]]
  df2 = list(...)[[2]]
  col1 = colnames(df1)[1]
  col2 = colnames(df2)[1]
  xxx = left_join(..., by = setNames(col2,col1))
  return(xxx)
}
Reduce( func, dfs)
#  b a  c  d
#1 a 1  4 NA
#2 b 2 NA  7
#3 c 3  5  8

Using merge() :

func <- function(...){
  df1 = list(...)[[1]]
  df2 = list(...)[[2]]
  col1 = colnames(df1)[1]
  col2 = colnames(df2)[1]
  xxx=merge(..., by.x = col1, by.y = col2, , all.x = T)
  return(xxx)
}

Reduce( func, dfs)
#  b a  c  d
#1 a 1  4 NA
#2 b 2 NA  7
#3 c 3  5  8

Left join with multiple conditions in R

You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:

value_lookup <- data.frame(
  type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
  id = c(1, 2, 1, 3, 1, 2),
  value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)

Then you just merge on both type and id:

df <- left_join(df, value_lookup, by = c('type', 'id'))

Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.

How to use left_join on several data frames?

Using purrr::reduce to merge multiple dataframes, then use dplyr::coalesce to get first non-na value:

library(dplyr)
library(purrr)

list(table_base, table_a, table_b) %>% 
  reduce(left_join, by = "cat") %>% 
  mutate(value = coalesce(!!!select(., starts_with("value")))) %>% 
  select(cat, value)

#   cat value
# 1   a     1
# 2   b     2
# 3   c     9
# 4   d    10

Is there a clean dplyr-way of doing multiple left-(self)joins?

Following the suggestion by @zx8754 one option to achieve your desired result would be to do the left_joins via a recursive function which stops when there are no more matches:

library(dplyr)
library(tidyr)

test_hierarchie <- tribble(
  ~child, ~parent,
  "A", "B",
  "B", "C",
  "D", "E"
)

left_join_recursive <- function(x, by) {
  x <- left_join(x, test_hierarchie, by = setNames("child", by), suffix = c("", "_grant"))
  byby <- paste0(by, "_grant")
  if (!all(is.na(x[[byby]]))) {
    left_join_recursive(x, byby)  
  } else {
    x
  }
}

test_hierarchie_transformed <- left_join_recursive(test_hierarchie, "parent") %>%
  pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
  filter(!is.na(parent))

test_hierarchie_transformed
#> # A tibble: 4 × 3
#>   child relation     parent
#>   <chr> <chr>        <chr> 
#> 1 A     parent       B     
#> 2 A     parent_grant C     
#> 3 B     parent       C     
#> 4 D     parent       E

To check wether the approach works in a more general case I added another row to your example data:

test_hierarchie <- add_row(test_hierarchie, child = "C", parent = "D")

test_hierarchie_transformed <- left_join_recursive(test_hierarchie, "parent") %>%
  pivot_longer(names_to = "relation", cols = contains("parent"), values_to = "parent") %>%
  filter(!is.na(parent))

test_hierarchie_transformed
#> # A tibble: 10 × 3
#>    child relation                 parent
#>    <chr> <chr>                    <chr> 
#>  1 A     parent                   B     
#>  2 A     parent_grant             C     
#>  3 A     parent_grant_grant       D     
#>  4 A     parent_grant_grant_grant E     
#>  5 B     parent                   C     
#>  6 B     parent_grant             D     
#>  7 B     parent_grant_grant       E     
#>  8 D     parent                   E     
#>  9 C     parent                   D     
#> 10 C     parent_grant             E

Merging two dataframe with dplyr left join?

The issue is that a left_join looks for exact matches and there is nothing like "match this or that". Hence, to achieve your desired result you could

unite Parent.MeSH.ID and Child.MeSH.ID into a new column MeSH_ID
split the united columns in separate IDs using e.g. tidyr::separate_rows. Doing so makes it possible to join the df's by ID.
Use an semi_join to filter out rows in df1 with matches in the newly created df3, finally do a left_join to add the columns from df3. Or if doesn't matter to keep both HUGO_symbol and Gene.Name you could achieve both steps with an inner_join.

df1 <- data.frame(
  stringsAsFactors = FALSE,
       HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
           MeSH_ID = c("D000310", "D0002277", "D000230")
)

df2 <- data.frame(
  stringsAsFactors = FALSE,
         Gene.Name = c("P53", "HGA2", "ZZZ3"),
    Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
     Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)

library(dplyr)
library(tidyr)

df3 <- df2 %>%
  unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>% 
  separate_rows(MeSH_ID, sep = ", ")

semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>% 
  left_join(df3)
#> Joining, by = "MeSH_ID"
#>   HUGO_symbol MeSH_ID Gene.Name   Parent.MeSH.ID    Child.MeSH.ID
#> 1         P53 D000310       P53          D000310 D015675, D006676
#> 2        ZZZ3 D000230      ZZZ3 D001163, D000230          D003451

How to perform left join using multiple columns where one data frame has missingness in one column?

An option using data.table:

library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on=.(id1, id2), mult="first", val2 := val2]
df1[is.na(val2), val2 :=
    df2[.SD, on=.(id1), mult="first", val2]]

I have taken the liberty of using the first value if there are multiple joins (i.e. the combination of id1 and id2 in df2 are not unique).

How to Perform Multiple Left Joins Using Dplyr in R