How to join (merge) data frames (inner, outer, left, right)
By using the merge
function and its optional parameters:
Inner join: merge(df1, df2)
will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId")
to make sure that you were matching on only the fields you desired. You can also use the by.x
and by.y
parameters if the matching variables have different names in the different data frames.
Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.
You can merge on multiple columns by giving by
a vector, e.g., by = c("CustomerId", "OrderId")
.
If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2"
where CustomerId_in_df1
is the name of the column in the first data frame and CustomerId_in_df2
is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
join / merge two data frames with flexible condition
left_join(x, y[-2], by = "id_1") %>%
left_join(y[-1], by = "id_2") %>%
mutate(region = coalesce(region.x, region.y)) %>%
select(-c(region.x, region.y))
company id_1 id_2 region
1 Apple 789879978 32132131 USA
2 Amazon <NA> 987978987 USA
3 BMW 12312312 <NA> EU
4 Audi <NA> <NA> <NA>
R - Merge data from two data frames with different nrow if text condition is met
A solution with dplyr
:
library(dplyr)
id1=c('1text','2text','3text')
df1=data.frame(id1)
id2=c('2text','3text','1text')
area2=c(11,22,33)
df2=data.frame(id1=id2,area2)
inner_join(df1,df2,by="id1")
How to join (merge) data frames by different variables for each observation
You can do this in two steps. First, left join df2 on df1 using X and Y. Second, replace any missing value in Z with the corresponding value in df1 based on X only. But this will only work if there are no duplicates in X.
df4 <- merge(df2, df1, all.x=TRUE); df4
# X Y W Z
#1 1 <NA> -1 NA # <-- this guy could not match, since Y was NA
#2 2 b -2 20
#3 2 c -3 30
#4 4 d -4 NA
ind <- df4$X[is.na(df4$Z)] # returns the indices 1 4
df4$Z[ind] <- df1$Z[ind]
df4
# X Y W Z
#1 1 <NA> -1 10
#2 2 b -2 20
#3 2 c -3 30
#4 4 d -4 NA
If df2$Y contains NA for any X that is duplicated, then the solution is indeterminate.
How to merge two dataframes with two matching columns in R
We could use left_join
library(dplyr)
df1 %>%
left_join(df2, by = c("year","companyID"))
Output:
year companyID salary Turnover
<dbl> <dbl> <dbl> <dbl>
1 2009 1 1000 10000
2 2009 2 2000 20000
3 2010 1 1200 12000
4 2010 2 2200 22000
5 2011 3 1500 15000
6 2012 4 1100 NA
How to merge two data frames with missing values?
You may try to impute missing values in df1
with adjacent non-missings of df2
. Then just merge
, where "main"
, "main_cost"
, and "rating"
columns will automatically be selected. Just "main"
would be insufficient, because there are ties.
df1[3:4] <- lapply(names(df2)[3:4], \(z)
mapply(\(x, y) el(na.omit(c(x, y))), df1[[z]], df2[[z]]))
(res <- merge(df1, df2))
# main main_cost rating combo have_it distance_mi
# 1 burger 7 fine burger_fries FALSE 56
# 2 burger 8 great burger_coke TRUE 20
# 3 pizza 11 great pizza_veg FALSE 40
# 4 pizza 13 bad pizza_bagels TRUE 14
# 5 pizza 3 fine pizza_rolls FALSE 12
# 6 salad 10 decent salad_dressing TRUE 78
# 7 salad 5 great salad_fruit FALSE 66
# 8 steak 4 okay steak_cheese TRUE 30
# 9 steak 7 awesome steak_mash FALSE 19
Note, that this probably only works if the data frames are of same size and row order, and values are successfully imputed so that the merging columns become identical. If NA's are left, say in the "rating"
column, try to explicitly specify the merging columns using e.g. by=c("main", "main_cost")
where you will end up with "rating.x"
and "rating.y"
, though.
Data:
df1 <- structure(list(combo = c("burger_coke", "burger_fries", "steak_cheese",
"steak_mash", "salad_dressing", "salad_fruit", "pizza_rolls",
"pizza_bagels", "pizza_veg"), main = c("burger", "burger", "steak",
"steak", "salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, NA, NA, NA, 5L, 3L, 13L, NA), rating = c("great", "fine",
"okay", "awesome", NA, "great", "fine", NA, "great")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
df2 <- structure(list(have_it = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE), main = c("burger", "burger", "steak", "steak",
"salad", "salad", "pizza", "pizza", "pizza"), main_cost = c(8L,
7L, 4L, 7L, 10L, 5L, 3L, 13L, 11L), rating = c("great", "fine",
"okay", "awesome", "decent", "great", "fine", "bad", "great"),
distance_mi = c(20L, 56L, 30L, 19L, 78L, 66L, 12L, 14L, 40L
)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6", "7", "8", "9"))
Merging two dataframe with dplyr left join?
The issue is that a left_join looks for exact matches and there is nothing like "match this or that". Hence, to achieve your desired result you could
- unite
Parent.MeSH.ID
andChild.MeSH.ID
into a new columnMeSH_ID
- split the united columns in separate IDs using e.g.
tidyr::separate_rows
. Doing so makes it possible to join the df's by ID. - Use an
semi_join
to filter out rows in df1 with matches in the newly created df3, finally do aleft_join
to add the columns from df3. Or if doesn't matter to keep bothHUGO_symbol
andGene.Name
you could achieve both steps with aninner_join
.
df1 <- data.frame(
stringsAsFactors = FALSE,
HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
MeSH_ID = c("D000310", "D0002277", "D000230")
)
df2 <- data.frame(
stringsAsFactors = FALSE,
Gene.Name = c("P53", "HGA2", "ZZZ3"),
Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)
library(dplyr)
library(tidyr)
df3 <- df2 %>%
unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
separate_rows(MeSH_ID, sep = ", ")
semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>%
left_join(df3)
#> Joining, by = "MeSH_ID"
#> HUGO_symbol MeSH_ID Gene.Name Parent.MeSH.ID Child.MeSH.ID
#> 1 P53 D000310 P53 D000310 D015675, D006676
#> 2 ZZZ3 D000230 ZZZ3 D001163, D000230 D003451
Related Topics
Remove Specific Characters from Column Names in R
How to Loop Through List and Create Separate Dataframes in R
Ggplot2: Setting Geom_Bar Baseline to 1 Instead of Zero
Rstudio Does Not Display Any Output in Console After Entering Code
Pass a Data.Frame Column Name to a Function
Count Number of Rows in a Data Frame in R Based on Group
Increasing (Or Decreasing) the Memory Available to R Processes
Replace Na With Previous or Next Value, by Group, Using Dplyr
Count Number of Occurences For Each Unique Value
Selecting Multiple Odd or Even Columns/Rows for Dataframe
How to Delete Rows Where All the Columns Are Zero
How to Add a Diagonal Line to a Plot
Setting Individual Axis Limits With Facet_Wrap and Scales = "Free" in Ggplot2
Error in If/While (Condition) {: Missing Value Where True/False Needed
Relative Frequencies/Proportions With Dplyr
Why Is It Not Advisable to Use Attach() in R, and What Should I Use Instead