Merging Two Dataframes in R

How to join (merge) data frames (inner, outer, left, right)

By using the merge function and its optional parameters:

Inner join: merge(df1, df2) will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId") to make sure that you were matching on only the fields you desired. You can also use the by.x and by.y parameters if the matching variables have different names in the different data frames.

Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)

Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)

Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)

Cross join: merge(x = df1, y = df2, by = NULL)

I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.

You can merge on multiple columns by giving by a vector, e.g., by = c("CustomerId", "OrderId").

If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2" where CustomerId_in_df1 is the name of the column in the first data frame and CustomerId_in_df2 is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
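
To make the join types above concrete, here is a minimal sketch with two small made-up data frames. The Product and State columns, the CustId name, and all values are assumptions for illustration only.

```r
# Hypothetical example data; CustomerId is the only shared column.
df1 <- data.frame(CustomerId = 1:6,
                  Product = c("Toaster", "Toaster", "Toaster",
                              "Radio", "Radio", "Radio"))
df2 <- data.frame(CustomerId = c(2L, 4L, 6L),
                  State = c("Alabama", "Alabama", "Ohio"))

inner <- merge(df1, df2, by = "CustomerId")                # only matching ids
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)    # all ids from both
left  <- merge(df1, df2, by = "CustomerId", all.x = TRUE)  # all ids from df1
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)  # all ids from df2
cross <- merge(df1, df2, by = NULL)                        # every row pairing

# Same inner join when the key has a different name in the second frame.
df2_renamed <- setNames(df2, c("CustId", "State"))
inner_xy <- merge(df1, df2_renamed, by.x = "CustomerId", by.y = "CustId")
```

In the left join, customers without a match in df2 get NA in State, and the cross join has nrow(df1) * nrow(df2) rows.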

Merging two dataframes by keeping certain column values in r

We can use rows_update() from dplyr, which updates the rows of df2 that match df1 on the key columns:

library(dplyr)
rows_update(df2, df1, by = c("id", "item", "score"))

Output:

  id item score cat.a cat.b
1  1   11     1     A     a
2  2   22     0     B     a
3  3   33     1     C     b
4  4   44     1     D     b
5  5   55     1     E     c
6  6   66     0     F     f
7  7   77     1  <NA>  <NA>
8  8   88     1  <NA>  <NA>
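
As a smaller illustration of how rows_update() behaves, here is a sketch on made-up data. The frames x and y and their columns are assumptions, not the data from the question. Every key in y has to exist in x, and y may only contain columns that x already has.

```r
library(dplyr)

# Hypothetical data: x has gaps in "cat", y carries the replacement values.
x <- data.frame(id = 1:3,
                score = c(10, 20, 30),
                cat = c("a", NA, NA))
y <- data.frame(id = 2:3,
                cat = c("b", "c"))

# Rows of x whose id appears in y get their "cat" value overwritten.
updated <- rows_update(x, y, by = "id")
```

Columns that are absent from y (here score) are left untouched.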

R merge two dataframes with same columns without replacing values

Solution:

Thanks to @GregorThomas for providing the answer.

This problem was solved with the following command:

merge(data1, data2, all = TRUE)
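
A minimal sketch of why this works, on made-up data (the column names id and value are assumptions): when both frames have the same columns and no by is given, merge() matches on all of them, so all = TRUE keeps every distinct row from either side instead of overwriting anything.

```r
# Hypothetical data with identical column names in both frames.
data1 <- data.frame(id = c(1, 2), value = c("a", "b"))
data2 <- data.frame(id = c(2, 3), value = c("x", "c"))

# With no "by", merge() joins on every shared column (id and value),
# so id 2 appears twice: once with "b" and once with "x".
merged <- merge(data1, data2, all = TRUE)
```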

Use dplyr package to merge two dataframes into one but add values to specific columns of the merged dataset

Is this the result that you are looking for?

You can use dplyr::bind_rows in combination with tidyr::fill to get it. The column names need some cleaning up though.

library(dplyr)

# Clean up column names to add second dataset to the first using rename() to remove the numbers
original2 %>% 
  rename(id = id1,
         type = type1,
         city = city1,
         state = state1, 
         zip = zip1) %>% 
  # Add dataset 1 and cleaned up dataset 2 together
  bind_rows(original_1, .) %>% 
  # Fill NAs with data from dataset 2 using tidyr::fill()
  tidyr::fill(data_1, .direction = "up") %>% 
  tidyr::fill(data_2, .direction = "up") %>% 
  # Remove "type" column
  select(-type) %>% 
  # Artificially replace values "data1" and "data2" in "data" to row 9 and 11 respectively
  mutate(data = case_when(row_number() == 9 ~ "data1",
                          row_number() == 11 ~ "data2",
                          TRUE ~ NA_character_)) %>% 
  # Remove rows that do not contain a value for "id"
  filter(! is.na(id))

# A tibble: 20 x 7
      id city   state   zip   data  data_1            data_2           
   <dbl> <chr>  <chr>   <chr> <chr> <chr>             <chr>            
 1     1 city1  state1  zip1  NA    Non_changing_data Non_changing_data
 2     2 city2  state2  zip2  NA    Non_changing_data Non_changing_data
 3     3 city3  state3  zip3  NA    Non_changing_data Non_changing_data
 4     4 city4  state4  zip4  NA    Non_changing_data Non_changing_data
 5     5 city5  state5  zip5  NA    Non_changing_data Non_changing_data
 6     6 city6  state6  zip6  NA    Non_changing_data Non_changing_data
 7     7 city7  state7  zip7  NA    Non_changing_data Non_changing_data
 8     8 city8  state8  zip8  NA    Non_changing_data Non_changing_data
 9     9 city9  state9  zip9  data1 Non_changing_data Non_changing_data
10    10 city10 state10 zip10 NA    Non_changing_data Non_changing_data
11    11 city11 state11 zip11 data2 Non_changing_data Non_changing_data
12    12 city12 state12 zip12 NA    Non_changing_data Non_changing_data
13    13 city13 state13 zip13 NA    Non_changing_data Non_changing_data
14    14 city14 state14 zip14 NA    Non_changing_data Non_changing_data
15    15 city15 state15 zip15 NA    Non_changing_data Non_changing_data
16    16 city16 state16 zip16 NA    Non_changing_data Non_changing_data
17    17 city17 state17 zip17 NA    Non_changing_data Non_changing_data
18    18 city18 state18 zip18 NA    Non_changing_data Non_changing_data
19    19 city19 state19 zip19 NA    Non_changing_data Non_changing_data
20    20 city20 state20 zip20 NA    Non_changing_data Non_changing_data
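
The core of the approach (stack the rows, then fill the gaps upward) can be seen in a much smaller sketch. The frames a and b and the note column are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: b holds a value that should apply to every row of a.
a <- tibble(id = c(1, 2), city = c("x", "y"))
b <- tibble(id = NA_real_, note = "shared")

result <- bind_rows(a, b) %>%        # columns missing on either side become NA
  fill(note, .direction = "up") %>%  # copy "shared" upward into the NA cells
  filter(!is.na(id))                 # drop the helper row that came from b
```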

Merging two DataFrames matching rows/columns

You can subset y to the dimensions of x and assign x into that block (this assumes y has at least as many rows and columns as x):

y[1:nrow(x), 1:ncol(x)] <- x
y
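
A small sketch on made-up data, assuming y is at least as large as x in both dimensions. seq_len() is used here instead of 1:n only because it also behaves sensibly when a dimension is zero.

```r
# Hypothetical data: x is 2 x 2, y is 3 x 3 and filled with zeros.
x <- data.frame(a = 1:2, b = 3:4)
y <- data.frame(a = rep(0L, 3), b = rep(0L, 3), c = rep(0L, 3))

# Overwrite the top-left block of y with x; the rest of y is unchanged.
y[seq_len(nrow(x)), seq_len(ncol(x))] <- x
```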

Merging two dataframes by multiple columns without losing data

merge() has an all argument that controls whether unmatched rows are kept. With all = TRUE, every row from both x and y appears in the result, and columns that have no match are filled with NA.

total <- merge(df1, df2, by = c("id", "year"), all = TRUE)
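
Here is a sketch with made-up inputs to show what is kept. The sales and cost columns and all values are assumptions.

```r
# Hypothetical data keyed by id and year.
df1 <- data.frame(id = c(1, 1, 2),
                  year = c(2020, 2021, 2020),
                  sales = c(10, 20, 30))
df2 <- data.frame(id = c(1, 2, 3),
                  year = c(2020, 2020, 2021),
                  cost = c(5, 6, 7))

# Full outer join on both keys: unmatched rows survive with NA in the gaps.
total <- merge(df1, df2, by = c("id", "year"), all = TRUE)
```

The pair (1, 2021) only exists in df1 and gets NA for cost, while (3, 2021) only exists in df2 and gets NA for sales.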

How to merge two dataframes with two matching columns in R

We could use left_join() from dplyr, which keeps every row of df1 and adds the matching columns from df2:

library(dplyr)
df1 %>% 
  left_join(df2, by = c("year","companyID"))

Output:

   year companyID salary Turnover
  <dbl>     <dbl>  <dbl>    <dbl>
1  2009         1   1000    10000
2  2009         2   2000    20000
3  2010         1   1200    12000
4  2010         2   2200    22000
5  2011         3   1500    15000
6  2012         4   1100       NA
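
For comparison, the same join can be sketched in base R with merge() and all.x = TRUE. The input data frames below are assumptions, reconstructed only so that the join reproduces the values printed above; the original inputs are not shown.

```r
# Hypothetical inputs chosen to be consistent with the output shown above.
df1 <- data.frame(year = c(2009, 2009, 2010, 2010, 2011, 2012),
                  companyID = c(1, 2, 1, 2, 3, 4),
                  salary = c(1000, 2000, 1200, 2200, 1500, 1100))
df2 <- data.frame(year = c(2009, 2009, 2010, 2010, 2011),
                  companyID = c(1, 2, 1, 2, 3),
                  Turnover = c(10000, 20000, 12000, 22000, 15000))

# all.x = TRUE keeps every row of df1, like left_join().
joined <- merge(df1, df2, by = c("year", "companyID"), all.x = TRUE)
```

One difference to keep in mind: merge() sorts the result by the key columns by default, whereas left_join() preserves the row order of df1.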

How can I merge two dataframes together with some conditional requirements?

Does this work for you?

library(dplyr)
library(data.table)
merge(x = df1, 
      y = df2) %>% 
  filter(TestDate %between% list(Date1, Date2))
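
If you'd rather avoid the extra packages, the same idea can be sketched in base R. The data below is made up, and the shared key column ID is an assumption; the comparison is inclusive on both ends, which matches the default behaviour of %between%.

```r
# Hypothetical data: one test date per ID and one allowed window per ID.
df1 <- data.frame(ID = c(1, 2, 3),
                  TestDate = as.Date(c("2021-01-15", "2021-03-01", "2021-06-10")))
df2 <- data.frame(ID = c(1, 2, 3),
                  Date1 = as.Date(c("2021-01-01", "2021-01-01", "2021-06-10")),
                  Date2 = as.Date(c("2021-01-31", "2021-02-01", "2021-06-30")))

# Join on the shared column (here ID), then keep rows inside the window.
merged <- merge(df1, df2)
in_window <- merged[merged$TestDate >= merged$Date1 &
                    merged$TestDate <= merged$Date2, ]
```

Here ID 2 is dropped because its TestDate falls after Date2, and ID 3 is kept because its TestDate equals Date1.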

How to merge two dataframes specifying specific columns? (R)

We can use dplyr::left_join to merge df1 with a version of df2 that contains only "ID" and "var3". Then mutate the "var" columns to replace NA (missing) values with 0.

library(dplyr)
library(tidyr)  # provides replace_na()

df3 <- df1 %>% 
  left_join(select(df2, ID, var3), by = 'ID') %>% 
  mutate(across(-ID, ~replace_na(., 0)))

     ID  var1  var2  var3
  <dbl> <dbl> <dbl> <dbl>
1  1001     1     0     1
2  1002     0     1     1
3  1003     1     1     0
4  1004     0     0     0

There are several valid ways to select the "var" columns within across. Here I've used -ID. One could also use starts_with('var') or even everything(), though the latter assumes no NA values in "ID".
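
As a sketch of the starts_with('var') variant, with made-up inputs (the original df1 and df2 are not shown, so the values and the extra var4 column here are assumptions chosen to give the output above):

```r
library(dplyr)
library(tidyr)

# Hypothetical inputs; df2 has an extra column (var4) that we do not want.
df1 <- tibble(ID = c(1001, 1002, 1003, 1004),
              var1 = c(1, NA, 1, NA),
              var2 = c(NA, 1, 1, NA))
df2 <- tibble(ID = c(1001, 1002),
              var3 = c(1, 1),
              var4 = c(9, 9))

df3 <- df1 %>%
  left_join(select(df2, ID, var3), by = "ID") %>%        # bring in var3 only
  mutate(across(starts_with("var"), ~ replace_na(.x, 0)))  # NA -> 0 in var columns
```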