How to Replace NAs When Joining Two Data Frames with dplyr

Can I replace NAs when joining two data frames with dplyr?

coalesce() is likely what you need. It fills each NA in the first vector with the value at the same position in the second vector:

library(dplyr)

df1 %>%
  left_join(df2, by = "fruit") %>%
  mutate(var2 = coalesce(var2.x, var2.y)) %>%
  select(-var2.x, -var2.y)

# fruit var1 var3 var2
# 1 apples 1 NA 3
# 2 oranges 2 7 5
# 3 bananas 3 NA 6
# 4 grapes 4 8 6
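
Since df1 and df2 are not shown here, the following is a minimal self-contained sketch of the same pattern. The data frames `left` and `right` and the column `score` are made up for illustration and are not the original data.

```r
library(dplyr)

# Hypothetical data: `score` has gaps in `left` that `right` can fill.
left  <- data.frame(id = c("a", "b", "c"), score = c(10, NA, NA))
right <- data.frame(id = c("b", "c", "d"), score = c(20, 30, 40))

filled <- left %>%
  left_join(right, by = "id") %>%                  # shared column becomes score.x / score.y
  mutate(score = coalesce(score.x, score.y)) %>%   # keep score.x, fall back to score.y
  select(-score.x, -score.y)

filled
```

Because coalesce() prefers its first argument, values already present in `left` are never overwritten. Swapping the argument order would let the second table win instead.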

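Alternatively, use data.table, which updates df1 by reference (in place). Note that this assigns var2 and var3 for every matching row, not only where df1 has an NA: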

library(data.table)
setDT(df1)[setDT(df2), on = "fruit", `:=` (var2 = i.var2, var3 = i.var3)]
df1
# fruit var1 var2 var3
# 1: apples 1 3 NA
# 2: oranges 2 5 7
# 3: bananas 3 6 NA
# 4: grapes 4 6 8
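
If only the missing values should be filled, one possible variation is to wrap the assignment in data.table's fcoalesce(). This is a sketch on made-up tables (dt1 and dt2 are hypothetical), assuming a data.table version that provides fcoalesce():

```r
library(data.table)

# Hypothetical tables: dt1 has one known value and two gaps in var2.
dt1 <- data.table(fruit = c("apples", "oranges", "bananas"), var2 = c(3, NA, NA))
dt2 <- data.table(fruit = c("apples", "oranges"), var2 = c(99, 5))

# Update join: for rows of dt1 matched by dt2, keep var2 where it is present,
# otherwise take the value from dt2 (referred to as i.var2).
dt1[dt2, on = "fruit", var2 := fcoalesce(var2, i.var2)]

dt1
```

Here the existing value for apples is kept, oranges is filled from dt2, and bananas stays NA because dt2 has no matching row.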

Replace NAs in dataframe with values from second dataframe based on multiple criteria

Build a composite key from the matching columns (A and B), then use match() to fill the NAs in df2$C from df1:

unique_key1 <- paste(df1$A, df1$B)
unique_key2 <- paste(df2$A, df2$B)
inds <- is.na(df2$C)
df2$C[inds] <- df1$C[match(unique_key2[inds], unique_key1)]
df2

# A B C E
#1 20210901 15:00 74 A 74
#2 20210903 17:00 27 C 27
#3 20210904 18:00 60 D 60
#4 20210906 20:00 7 F 7
#5 20210907 21:00 96 G 96
#6 20210908 22:00 98 H 98
#7 20210909 23:00 38 I 38
#8 20210910 00:00 89 J 89
#9 20210912 02:00 69 L 69
#10 20210913 03:00 72 M 72
#11 20210914 04:00 76 N 76
#12 20210915 05:00 63 O 63
#13 20210916 06:00 13 P 13
#14 20210918 08:00 25 R 25
#15 20210919 09:00 92 S 92
#16 20210920 10:00 21 T 21
#17 20210921 11:00 79 U 79
#18 20210922 12:00 41 V 41
#19 20210924 14:00 97 X 97
#20 20210925 15:00 16 Y 16
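
As a self-contained illustration of the key-and-match idea, here is a small sketch with invented data (the values are not the original df1 and df2). It also passes an explicit separator to paste(), an added precaution that assumes the separator never occurs in the key columns:

```r
# Invented data: df2 is missing some C values that df1 has for the same (A, B).
df1 <- data.frame(A = c(1, 1, 2), B = c("x", "y", "x"), C = c(10, 20, 30))
df2 <- data.frame(A = c(1, 2, 1), B = c("y", "x", "x"), C = c(NA, 5, NA))

# Composite keys; "|" is assumed not to appear in A or B.
key1 <- paste(df1$A, df1$B, sep = "|")
key2 <- paste(df2$A, df2$B, sep = "|")

# Fill only the missing C values, looking each key up in df1.
inds <- is.na(df2$C)
df2$C[inds] <- df1$C[match(key2[inds], key1)]

df2
```

Values of C that were already present in df2 are left alone, because only the positions flagged by `inds` are assigned.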

Data

cbind() creates a matrix; use data.frame() to create data frames.

df1 <- data.frame(A, B, C, D)
df2 <- data.frame(A, B, C, E)

Merging two dataframes with left_join produces NAs in 'right' columns

You need to trim the County variable in the households data frame. It contains trailing spaces, so its values fail to match those in the crops data frame. For example:

"Kenya   "
"Mombasa "

Adding this extra line before the left_join fixes it:

households$County <- stringr::str_trim(households$County)
df <- left_join(households, crops)
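
To see why the trailing spaces matter, here is a small sketch with invented households and crops tables (the columns members and farms are hypothetical). It uses base R's trimws(), which serves the same purpose here as stringr::str_trim():

```r
library(dplyr)

# Invented data: County in `households` carries trailing spaces.
households <- data.frame(County = c("Kenya   ", "Mombasa "), members = c(4, 6))
crops      <- data.frame(County = c("Kenya", "Mombasa"), farms = c(5, 7))

# "Kenya   " is not equal to "Kenya", so nothing matches and farms is all NA.
before <- left_join(households, crops, by = "County")

# Strip leading and trailing whitespace, then join again.
households$County <- trimws(households$County)
after <- left_join(households, crops, by = "County")

after
```

Naming the key explicitly with by = "County" also avoids relying on the automatic choice of join columns.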

How to replace NAs in multiple columns with dplyr

You were very close with across(). The approach you want is:

df %>%
  mutate(across(starts_with("v"), coalesce, x))

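Notice that coalesce goes inside across(), and that x (the second argument to coalesce()) can be supplied as a third argument to across(). Be aware that passing extra arguments through across() this way is deprecated as of dplyr 1.1.0. Result: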

  v1 v2 v3 x
1 7 3 5 7
2 1 8 6 8
3 2 4 9 9

If you prefer something closer to your approach with coalesce(., x), you can also pass that as an anonymous function with a ~:

df %>%
  mutate(across(starts_with("v"), ~ coalesce(., x)))

In other situations, this can be more flexible (for instance, if . is not the first argument to the function).
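
For a runnable version, the sketch below rebuilds an input that is consistent with the result shown above. The original df is not given, so the positions of the NAs are an assumption:

```r
library(dplyr)

# Assumed input: one NA per v column, chosen to reproduce the result above.
df <- data.frame(
  v1 = c(NA, 1, 2),
  v2 = c(3, NA, 4),
  v3 = c(5, 6, NA),
  x  = c(7, 8, 9)
)

# Anonymous-function form: .x is the current column, x is the fallback column.
res <- df %>%
  mutate(across(starts_with("v"), ~ coalesce(.x, x)))

res
```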

After full_join() how to replace NAs in one source with data from other source

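You could also do it in three lines with dplyr and the zoo package. na.locf() carries the last non-NA value forward, so the result depends on the row order set by arrange():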

library(dplyr)
library(zoo)

df3 <- dplyr::full_join(df1, df2)
df3 %>%
  arrange(id) %>%
  do(na.locf(.))
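
One caveat with the snippet above is that na.locf() fills straight down the whole table, so a value can be carried from one id into the next. A possible variation, sketched here with invented data and with tidyr::fill() swapped in for zoo, fills within each id only:

```r
library(dplyr)
library(tidyr)

# Invented data: val for id 1 is missing in df1 but present in df2.
df1 <- data.frame(id = c(1, 2), val = c(NA, 5))
df2 <- data.frame(id = c(1, 3), val = c(4, 6))

df3 <- full_join(df1, df2, by = c("id", "val")) %>%
  group_by(id) %>%
  fill(val, .direction = "downup") %>%   # borrow values only within the same id
  ungroup() %>%
  distinct() %>%                         # collapse rows that are now identical
  arrange(id)

df3
```

This is not the original answer's method, only an alternative under the assumption that each id should end up with a single row.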

Replace NAs in one column with the values of another in dplyr

Tidyverse solution:

library(tidyverse)
dat %>%
transmute(id = coalesce(id_1, id_2))

Base R solution:

dat <- within(dat, {id <- ifelse(is.na(id_1), id_2, id_1); rm(id_1); rm(id_2)})
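
Note that transmute() returns only the new id column. If dat has other columns worth keeping, one option is mutate() with .keep = "unused", which drops just the columns consumed by the expression. A sketch with a made-up dat (the value column is hypothetical), assuming dplyr 1.0.0 or later:

```r
library(dplyr)

# Made-up data: `value` is an extra column that transmute() would drop.
dat <- data.frame(
  id_1  = c("a", NA, "c"),
  id_2  = c(NA, "b", "z"),
  value = c(1, 2, 3)
)

res <- dat %>%
  mutate(id = coalesce(id_1, id_2), .keep = "unused")  # drops id_1 and id_2

res
```

In the third row both ids are present, and coalesce() keeps the one from id_1.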

How to replace only NAs by matching IDs from another dataframe in R

You can use match():

inds <- is.na(df1$End_record_date)
df1$End_record_date[inds] <- df2$Record_date[match(df1$PID[inds], df2$PID)]
df1

# PID End_record_date
#1 123 13-10-2018
#2 123 15-08-2020
#3 234 14-07-2019
#4 234 19-07-2020
#5 345 20-08-2020

Another option is to join the two data frames and take the first non-NA value from the two columns.

In base R:

transform(merge(df1, df2, by = 'PID'),
          End_record_date = ifelse(is.na(End_record_date), Record_date, End_record_date))

Or in dplyr:

library(dplyr)

inner_join(df1, df2, by = 'PID') %>%
  mutate(End_record_date = coalesce(End_record_date, Record_date)) %>%
  select(PID, End_record_date)
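
Both merge() with its defaults and inner_join() keep only the PIDs that appear in both data frames. If df1 may contain PIDs that are absent from df2, a left_join() keeps those rows as well. A sketch with invented data (the PID 345 row has no counterpart in df2):

```r
library(dplyr)

# Invented data: PID 345 does not occur in df2.
df1 <- data.frame(
  PID = c(123, 234, 345),
  End_record_date = c(NA, "19-07-2020", "20-08-2020")
)
df2 <- data.frame(
  PID = c(123, 234),
  Record_date = c("15-08-2020", "14-07-2019")
)

res <- left_join(df1, df2, by = "PID") %>%
  mutate(End_record_date = coalesce(End_record_date, Record_date)) %>%
  select(PID, End_record_date)

res
```

This assumes each PID occurs at most once in df2; duplicated PIDs there would multiply the matching rows of df1.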

Merging two dataframes with dplyr left join?

The issue is that left_join() looks for exact matches; there is no way to say "match this or that". To achieve your desired result you could:

  1. Unite Parent.MeSH.ID and Child.MeSH.ID into a new column, MeSH_ID.
  2. Split the united column into separate IDs, e.g. with tidyr::separate_rows. This makes it possible to join the data frames by ID.
  3. Use a semi_join to keep the rows of df1 that have matches in the newly created df3, then do a left_join to add the columns from df3. Or, if it doesn't matter whether you keep both HUGO_symbol and Gene.Name, you could achieve both steps with a single inner_join.

df1 <- data.frame(
  stringsAsFactors = FALSE,
  HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
  MeSH_ID = c("D000310", "D0002277", "D000230")
)

df2 <- data.frame(
  stringsAsFactors = FALSE,
  Gene.Name = c("P53", "HGA2", "ZZZ3"),
  Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
  Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)

library(dplyr)
library(tidyr)

df3 <- df2 %>%
  unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
  separate_rows(MeSH_ID, sep = ", ")

semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>%
  left_join(df3)
#> Joining, by = "MeSH_ID"
#> HUGO_symbol MeSH_ID Gene.Name Parent.MeSH.ID Child.MeSH.ID
#> 1 P53 D000310 P53 D000310 D015675, D006676
#> 2 ZZZ3 D000230 ZZZ3 D001163, D000230 D003451

