How to Join (Merge) Data Frames (Inner, Outer, Left, Right)

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)

join / merge two data frames with flexible condition

left_join(x, y[-2], by = "id_1") %>% 
left_join(y[-1], by = "id_2") %>%
mutate(region = coalesce(region.x, region.y)) %>%
select(-c(region.x, region.y))

company id_1 id_2 region
1 Apple 789879978 32132131 USA
2 Amazon <NA> 987978987 USA
3 BMW 12312312 <NA> EU
4 Audi <NA> <NA> <NA>

R - Merge data from two data frames with different nrow if text condition is met

A solution with dplyr:

library(dplyr)

id1=c('1text','2text','3text')
df1=data.frame(id1)

id2=c('2text','3text','1text')
area2=c(11,22,33)
df2=data.frame(id1=id2,area2)

inner_join(df1,df2,by="id1")

How to join (merge) data frames by different variables for each observation

You can do this in two steps. First, left join df2 on df1 using X and Y. Second, replace any missing value in Z with the corresponding value in df1 based on X only. But this will only work if there are no duplicates in X.

df4 <- merge(df2, df1, all.x=TRUE); df4
# X Y W Z
#1 1 <NA> -1 NA # <-- this guy could not match, since Y was NA
#2 2 b -2 20
#3 2 c -3 30
#4 4 d -4 NA

ind <- df4$X[is.na(df4$Z)] # returns the indices 1 4
df4$Z[ind] <- df1$Z[ind]
df4
# X Y W Z
#1 1 <NA> -1 10
#2 2 b -2 20
#3 2 c -3 30
#4 4 d -4 NA

If df2$Y contains NA for any X that is duplicated, then the solution is indeterminate.

How to merge two dataframes with two matching columns in R

We could use left_join

library(dplyr)
df1 %>%
left_join(df2, by = c("year","companyID"))

Output:

   year companyID salary Turnover
<dbl> <dbl> <dbl> <dbl>
1 2009 1 1000 10000
2 2009 2 2000 20000
3 2010 1 1200 12000
4 2010 2 2200 22000
5 2011 3 1500 15000
6 2012 4 1100 NA

How to merge two data frames with missing values?

You may try to impute missing values in df1 with adjacent non-missings of df2. Then just merge, where "main", "main_cost", and "rating" columns will automatically be selected. Just "main" would be insufficient, because there are ties.

df1[3:4] <- lapply(names(df2)[3:4], \(z) 
mapply(\(x, y) el(na.omit(c(x, y))), df1[[z]], df2[[z]]))

(res <- merge(df1, df2))
# main main_cost rating combo have_it distance_mi
# 1 burger 7 fine burger_fries FALSE 56
# 2 burger 8 great burger_coke TRUE 20
# 3 pizza 11 great pizza_veg FALSE 40
# 4 pizza 13 bad pizza_bagels TRUE 14
# 5 pizza 3 fine pizza_rolls FALSE 12
# 6 salad 10 decent salad_dressing TRUE 78
# 7 salad 5 great salad_fruit FALSE 66
# 8 steak 4 okay steak_cheese TRUE 30
# 9 steak 7 awesome steak_mash FALSE 19

Note, that this probably only works if the data frames are of same size and row order, and values are successfully imputed so that the merging columns become identical. If NA's are left, say in the "rating" column, try to explicitly specify the merging columns using e.g. by=c("main", "main_cost") where you will end up with "rating.x" and "rating.y", though.


Data:

df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese", 
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls",
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak",
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine",
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))

df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak",
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine",
"okay", "awesome", "decent", "great", "fine", "bad", "great"),
distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9"))

Merging two dataframe with dplyr left join?

The issue is that a left_join looks for exact matches and there is nothing like "match this or that". Hence, to achieve your desired result you could

  1. unite Parent.MeSH.ID and Child.MeSH.ID into a new column MeSH_ID
  2. split the united columns in separate IDs using e.g. tidyr::separate_rows. Doing so makes it possible to join the df's by ID.
  3. Use an semi_join to filter out rows in df1 with matches in the newly created df3, finally do a left_join to add the columns from df3. Or if doesn't matter to keep both HUGO_symbol and Gene.Name you could achieve both steps with an inner_join.
df1 <- data.frame(
stringsAsFactors = FALSE,
HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
MeSH_ID = c("D000310", "D0002277", "D000230")
)

df2 <- data.frame(
stringsAsFactors = FALSE,
Gene.Name = c("P53", "HGA2", "ZZZ3"),
Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)

library(dplyr)
library(tidyr)

df3 <- df2 %>%
unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
separate_rows(MeSH_ID, sep = ", ")

semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>%
left_join(df3)
#> Joining, by = "MeSH_ID"
#> HUGO_symbol MeSH_ID Gene.Name Parent.MeSH.ID Child.MeSH.ID
#> 1 P53 D000310 P53 D000310 D015675, D006676
#> 2 ZZZ3 D000230 ZZZ3 D001163, D000230 D003451


Related Topics



Leave a reply



Submit