How to Join (Merge) Data Frames (Inner, Outer, Left, Right)

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

~~Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable.~~ I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)

join / merge two data frames with flexible condition

left_join(x, y[-2], by = "id_1") %>% 
  left_join(y[-1], by = "id_2") %>% 
  mutate(region = coalesce(region.x, region.y)) %>% 
  select(-c(region.x, region.y))

  company      id_1      id_2 region
1   Apple 789879978  32132131    USA
2  Amazon      <NA> 987978987    USA
3     BMW  12312312      <NA>     EU
4    Audi      <NA>      <NA>   <NA>

R - Merge data from two data frames with different nrow if text condition is met

A solution with dplyr:

library(dplyr)

id1=c('1text','2text','3text')
df1=data.frame(id1)

id2=c('2text','3text','1text')
area2=c(11,22,33)
df2=data.frame(id1=id2,area2)

inner_join(df1,df2,by="id1")

How to join (merge) data frames by different variables for each observation

You can do this in two steps. First, left join df2 on df1 using X and Y. Second, replace any missing value in Z with the corresponding value in df1 based on X only. But this will only work if there are no duplicates in X.

df4 <- merge(df2, df1, all.x=TRUE); df4
#  X    Y  W  Z
#1 1 <NA> -1 NA # <-- this guy could not match, since Y was NA
#2 2    b -2 20
#3 2    c -3 30
#4 4    d -4 NA

ind <- df4$X[is.na(df4$Z)] # returns the indices 1 4
df4$Z[ind] <- df1$Z[ind]
df4
#  X    Y  W  Z
#1 1 <NA> -1 10
#2 2    b -2 20
#3 2    c -3 30
#4 4    d -4 NA

If df2$Y contains NA for any X that is duplicated, then the solution is indeterminate.

How to merge two dataframes with two matching columns in R

We could use left_join

library(dplyr)
df1 %>% 
  left_join(df2, by = c("year","companyID"))

Output:

   year companyID salary Turnover
  <dbl>     <dbl>  <dbl>    <dbl>
1  2009         1   1000    10000
2  2009         2   2000    20000
3  2010         1   1200    12000
4  2010         2   2200    22000
5  2011         3   1500    15000
6  2012         4   1100       NA

How to merge two data frames with missing values?

You may try to impute missing values in df1 with adjacent non-missings of df2. Then just merge, where "main", "main_cost", and "rating" columns will automatically be selected. Just "main" would be insufficient, because there are ties.

df1[3:4] <- lapply(names(df2)[3:4], \(z) 
                   mapply(\(x, y) el(na.omit(c(x, y))), df1[[z]], df2[[z]]))

(res <- merge(df1, df2))
#     main main_cost  rating          combo have_it distance_mi
# 1 burger         7    fine   burger_fries   FALSE          56
# 2 burger         8   great    burger_coke    TRUE          20
# 3  pizza        11   great      pizza_veg   FALSE          40
# 4  pizza        13     bad   pizza_bagels    TRUE          14
# 5  pizza         3    fine    pizza_rolls   FALSE          12
# 6  salad        10  decent salad_dressing    TRUE          78
# 7  salad         5   great    salad_fruit   FALSE          66
# 8  steak         4    okay   steak_cheese    TRUE          30
# 9  steak         7 awesome     steak_mash   FALSE          19

Note, that this probably only works if the data frames are of same size and row order, and values are successfully imputed so that the merging columns become identical. If NA's are left, say in the "rating" column, try to explicitly specify the merging columns using e.g. by=c("main", "main_cost") where you will end up with "rating.x" and "rating.y", though.

Data:

df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese", 
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls", 
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak", 
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L, 
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine", 
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9"))

df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, 
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak", 
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L, 
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine", 
"okay", "awesome", "decent", "great", "fine", "bad", "great"), 
    distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
    )), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9"))

Merging two dataframe with dplyr left join?

The issue is that a left_join looks for exact matches and there is nothing like "match this or that". Hence, to achieve your desired result you could

unite Parent.MeSH.ID and Child.MeSH.ID into a new column MeSH_ID
split the united columns in separate IDs using e.g. tidyr::separate_rows. Doing so makes it possible to join the df's by ID.
Use an semi_join to filter out rows in df1 with matches in the newly created df3, finally do a left_join to add the columns from df3. Or if doesn't matter to keep both HUGO_symbol and Gene.Name you could achieve both steps with an inner_join.

df1 <- data.frame(
  stringsAsFactors = FALSE,
       HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
           MeSH_ID = c("D000310", "D0002277", "D000230")
)

df2 <- data.frame(
  stringsAsFactors = FALSE,
         Gene.Name = c("P53", "HGA2", "ZZZ3"),
    Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
     Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)

library(dplyr)
library(tidyr)

df3 <- df2 %>%
  unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>% 
  separate_rows(MeSH_ID, sep = ", ")

semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>% 
  left_join(df3)
#> Joining, by = "MeSH_ID"
#>   HUGO_symbol MeSH_ID Gene.Name   Parent.MeSH.ID    Child.MeSH.ID
#> 1         P53 D000310       P53          D000310 D015675, D006676
#> 2        ZZZ3 D000230      ZZZ3 D001163, D000230          D003451

How to Join (Merge) Data Frames (Inner, Outer, Left, Right)