Can I replace NAs when joining two data frames with dplyr?
coalesce() might be what you need. It fills the NAs in the first vector with the values from the second vector at the corresponding positions:
library(dplyr)
df1 %>%
  left_join(df2, by = "fruit") %>%
  mutate(var2 = coalesce(var2.x, var2.y)) %>%
  select(-var2.x, -var2.y)
# fruit var1 var3 var2
# 1 apples 1 NA 3
# 2 oranges 2 7 5
# 3 bananas 3 NA 6
# 4 grapes 4 8 6
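As a minimal sketch of what coalesce() does on its own, here it is applied to two plain vectors. The vectors x and y are made-up values, not the data from the question:

```r
library(dplyr)

# Hypothetical vectors, for illustration only
x <- c(1, NA, 3, NA)
y <- c(9, 2, 9, NA)

# Take x where it is not NA, otherwise fall back to y at the same position
filled <- coalesce(x, y)
filled
```

Positions where both inputs are NA stay NA, so a third fallback vector can be passed to coalesce() if needed.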
Or use data.table, which replaces the values in place:
library(data.table)
setDT(df1)[setDT(df2), on = "fruit", `:=` (var2 = i.var2, var3 = i.var3)]
df1
# fruit var1 var2 var3
# 1: apples 1 3 NA
# 2: oranges 2 5 7
# 3: bananas 3 6 NA
# 4: grapes 4 6 8
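Note that the update join above overwrites var2 and var3 for every matched row, not only where df1 has an NA. If only the NA entries should be replaced, one possible sketch wraps the assignment in fcoalesce(), which is available in recent versions of data.table. The two small tables dt1 and dt2 are invented for illustration:

```r
library(data.table)

# Hypothetical tables, for illustration only
dt1 <- data.table(fruit = c("apples", "oranges", "bananas"), var2 = c(3, NA, NA))
dt2 <- data.table(fruit = c("oranges", "bananas"), var2 = c(5, 6))

# Update join: keep dt1's var2 where present, otherwise take dt2's value (i.var2)
dt1[dt2, on = "fruit", var2 := fcoalesce(var2, i.var2)]
dt1
```

fcoalesce() expects both inputs to have the same type, so this assumes var2 is numeric in both tables.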
Replace NAs in dataframe with values from second dataframe based on multiple criteria
You can create a unique key to update df2.
unique_key1 <- paste(df1$A, df1$B)
unique_key2 <- paste(df2$A, df2$B)
inds <- is.na(df2$C)
df2$C[inds] <- df1$C[match(unique_key2[inds], unique_key1)]
df2
# A B C E
#1 20210901 15:00 74 A 74
#2 20210903 17:00 27 C 27
#3 20210904 18:00 60 D 60
#4 20210906 20:00 7 F 7
#5 20210907 21:00 96 G 96
#6 20210908 22:00 98 H 98
#7 20210909 23:00 38 I 38
#8 20210910 00:00 89 J 89
#9 20210912 02:00 69 L 69
#10 20210913 03:00 72 M 72
#11 20210914 04:00 76 N 76
#12 20210915 05:00 63 O 63
#13 20210916 06:00 13 P 13
#14 20210918 08:00 25 R 25
#15 20210919 09:00 92 S 92
#16 20210920 10:00 21 T 21
#17 20210921 11:00 79 U 79
#18 20210922 12:00 41 V 41
#19 20210924 14:00 97 X 97
#20 20210925 15:00 16 Y 16
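A self-contained sketch of the same key-matching idea on made-up data (the values below are not from the question). An explicit separator is passed to paste() here as an assumption, to lower the chance that two different key pairs collapse into the same string:

```r
# Hypothetical data: df2 has NAs in C that df1 can fill, matched on A and B together
df1 <- data.frame(A = c(1, 1, 2), B = c("a", "b", "a"), C = c(10, 20, 30),
                  stringsAsFactors = FALSE)
df2 <- data.frame(A = c(1, 2, 1), B = c("b", "a", "a"), C = c(NA, 99, NA),
                  stringsAsFactors = FALSE)

# Build one composite key per row from the two matching columns
key1 <- paste(df1$A, df1$B, sep = "|")
key2 <- paste(df2$A, df2$B, sep = "|")

# Only touch the rows of df2 where C is missing
inds <- is.na(df2$C)
df2$C[inds] <- df1$C[match(key2[inds], key1)]
df2
```

Rows of df2 whose key has no counterpart in df1 would simply remain NA, because match() returns NA for them.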
data
cbind() creates a matrix; use data.frame() to create data frames.
df1 <- data.frame(A, B, C, D)
df2 <- data.frame(A, B, C, E)
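The difference matters because a matrix can hold only one type. A small sketch with two hypothetical columns of mixed type:

```r
# Hypothetical columns of mixed type
A <- c(20210901, 20210903)
B <- c("15:00", "17:00")

m <- cbind(A, B)        # matrix: every value is coerced to character
d <- data.frame(A, B)   # data frame: each column keeps its own type

typeof(m)
class(d$A)
```

With the matrix version, the numeric column becomes character, so a later is.na() or numeric comparison may not behave as expected.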
Merging two dataframes with left_join produces NAs in 'right' columns
You need to trim the County variable in the households df: it contains trailing spaces, so its values fail to match the ones in the crops df. E.g.:
"Kenya "
"Mombasa "
Adding this extra line before the left_join fixes it:
households$County <- stringr::str_trim(households$County)
df <- left_join(households, crops)
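A reproducible sketch of the problem and the fix, using two tiny invented data frames. The size and crop columns are hypothetical, and base R's trimws() is used here as a counterpart to stringr::str_trim():

```r
library(dplyr)

# Hypothetical data: the keys in households carry a trailing space
households <- data.frame(County = c("Kenya ", "Mombasa "), size = c(4, 6),
                         stringsAsFactors = FALSE)
crops <- data.frame(County = c("Kenya", "Mombasa"), crop = c("maize", "tea"),
                    stringsAsFactors = FALSE)

# Keys present in households but absent from crops reveal the mismatch
setdiff(households$County, crops$County)

before <- left_join(households, crops, by = "County")  # no key matches, crop is all NA

households$County <- trimws(households$County)         # strip surrounding whitespace
after <- left_join(households, crops, by = "County")   # keys match, crop is filled
```

Checking the unmatched keys with setdiff() (or anti_join()) before joining is a quick way to spot this kind of mismatch.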
How to replace NAs in multiple columns with dplyr
You were very close with across(). The approach you want is:
df %>%
  mutate(across(starts_with("v"), coalesce, x))
Notice that coalesce goes inside across(), and that x (the second argument to coalesce()) can be provided as a third argument. Result:
v1 v2 v3 x
1 7 3 5 7
2 1 8 6 8
3 2 4 9 9
If you prefer something closer to your approach with coalesce(., x), you can also pass that as an anonymous function with a ~:
df %>%
  mutate(across(starts_with("v"), ~ coalesce(., x)))
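In other situations, this can be more flexible (for instance, if . is not the first argument to the function). It is also the form to prefer in dplyr 1.1.0 and later, where passing extra arguments through across() is deprecated.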
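To try this end to end, here is a sketch on invented data. The data frame below is an assumption, built only so that each v column has one NA and x holds the fallback values:

```r
library(dplyr)

# Hypothetical data: NAs scattered over the v* columns, x holds the fallback values
df <- data.frame(v1 = c(NA, 1, 2), v2 = c(3, NA, 4), v3 = c(5, 6, NA), x = c(7, 8, 9))

# An explicit anonymous function keeps the column (col) and the fallback (x) distinct
filled <- df %>%
  mutate(across(starts_with("v"), function(col) coalesce(col, x)))
filled
```

Writing the function out in full is equivalent to the ~ shorthand, and it avoids any confusion between the placeholder and the column named x.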
After full_join() how to replace NAs in one source with data from other source
You could also do it in three lines with dplyr and the zoo package.
library(dplyr)
library(zoo)
df3 <- dplyr::full_join(df1, df2)
df3 %>%
  arrange(id) %>%
  do(na.locf(.))
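na.locf() carries the last non-NA value forward down each column, so the result of the approach above depends on row order. As an alternative sketch that fills per key instead, assuming both data frames share an id column and a column named value (both names are hypothetical), the overlapping columns can be coalesced after the join:

```r
library(dplyr)

# Hypothetical sources that overlap on some ids
df1 <- data.frame(id = c(1, 2, 3), value = c(10, NA, 30))
df2 <- data.frame(id = c(2, 3, 4), value = c(20, 99, 40))

df3 <- full_join(df1, df2, by = "id", suffix = c(".x", ".y")) %>%
  # Prefer df1's value; use df2's only where df1 has NA or no row at all
  mutate(value = coalesce(value.x, value.y)) %>%
  select(id, value) %>%
  arrange(id)
df3
```

Swapping the order of the arguments to coalesce() would make df2 the preferred source instead.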
Replace NAs in one column with the values of another in dplyr
Tidyverse solution:
library(tidyverse)
dat %>%
  transmute(id = coalesce(id_1, id_2))
Base R solution:
dat <- within(dat, {id <- ifelse(is.na(id_1), id_2, id_1); rm(id_1); rm(id_2)})
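A quick sketch on made-up data suggests that the two solutions agree. The contents of dat below are an assumption; only the column names id_1 and id_2 come from the answer:

```r
library(dplyr)

# Hypothetical data: each row has an id in at least one of the two columns
dat <- data.frame(id_1 = c("a", NA, "c"), id_2 = c(NA, "b", "z"),
                  stringsAsFactors = FALSE)

# Tidyverse: first non-NA of id_1, id_2
tidy_res <- dat %>%
  transmute(id = coalesce(id_1, id_2))

# Base R: same rule, then drop the two source columns
base_res <- within(dat, {id <- ifelse(is.na(id_1), id_2, id_1); rm(id_1); rm(id_2)})
```

In both versions id_1 wins when both columns are filled, as in the third row.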
How to replace only NAs by matching ids from another dataframe in r
You can use match:
inds <- is.na(df1$End_record_date)
df1$End_record_date[inds] <- df2$Record_date[match(df1$PID[inds], df2$PID)]
df1
# PID End_record_date
#1 123 13-10-2018
#2 123 15-08-2020
#3 234 14-07-2019
#4 234 19-07-2020
#5 345 20-08-2020
Another option is to join the two data frames and take the first non-NA value of the two columns.
This can be done in base R as:
transform(merge(df1, df2, by = 'PID'), End_record_date =
            ifelse(is.na(End_record_date), Record_date, End_record_date))
Or in dplyr:
library(dplyr)
inner_join(df1, df2, by = 'PID') %>%
  mutate(End_record_date = coalesce(End_record_date, Record_date)) %>%
  select(PID, End_record_date)
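One thing to keep in mind with the join-based versions: inner_join() (like merge() with its defaults) drops rows of df1 whose PID has no match in df2. If those rows should be kept, a left_join() is one option. A sketch on invented data, where PID 999 is a hypothetical id missing from df2:

```r
library(dplyr)

# Hypothetical data: PID 999 exists only in df1
df1 <- data.frame(PID = c(123, 234, 999),
                  End_record_date = c(NA, "14-07-2019", NA),
                  stringsAsFactors = FALSE)
df2 <- data.frame(PID = c(123, 234),
                  Record_date = c("15-08-2020", "19-07-2020"),
                  stringsAsFactors = FALSE)

# left_join keeps every row of df1; unmatched rows simply stay NA
kept <- left_join(df1, df2, by = "PID") %>%
  mutate(End_record_date = coalesce(End_record_date, Record_date)) %>%
  select(PID, End_record_date)

# inner_join keeps only the PIDs found in both data frames
dropped <- inner_join(df1, df2, by = "PID")
```

The match() approach behaves like the left_join() here, since it never removes rows from df1.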
Merging two dataframe with dplyr left join?
The issue is that a left_join looks for exact matches and there is nothing like "match this or that". Hence, to achieve your desired result you could
- unite Parent.MeSH.ID and Child.MeSH.ID into a new column MeSH_ID
- split the united column into separate IDs using e.g. tidyr::separate_rows. Doing so makes it possible to join the df's by ID.
- use a semi_join to filter out rows in df1 with matches in the newly created df3, then do a left_join to add the columns from df3. Or, if it doesn't matter to keep both HUGO_symbol and Gene.Name, you could achieve both steps with an inner_join.
df1 <- data.frame(
  stringsAsFactors = FALSE,
  HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
  MeSH_ID = c("D000310", "D0002277", "D000230")
)
df2 <- data.frame(
  stringsAsFactors = FALSE,
  Gene.Name = c("P53", "HGA2", "ZZZ3"),
  Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
  Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)
library(dplyr)
library(tidyr)
df3 <- df2 %>%
  unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
  separate_rows(MeSH_ID, sep = ", ")
semi_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID")) %>%
  left_join(df3)
#> Joining, by = "MeSH_ID"
#> HUGO_symbol MeSH_ID Gene.Name Parent.MeSH.ID Child.MeSH.ID
#> 1 P53 D000310 P53 D000310 D015675, D006676
#> 2 ZZZ3 D000230 ZZZ3 D001163, D000230 D003451
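The inner_join variant mentioned in the last step could look roughly like the sketch below. It reuses the data from the answer and is only one way to write it; the name res is introduced here for illustration:

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  stringsAsFactors = FALSE,
  HUGO_symbol = c("P53", "A1BG", "ZZZ3"),
  MeSH_ID = c("D000310", "D0002277", "D000230")
)
df2 <- data.frame(
  stringsAsFactors = FALSE,
  Gene.Name = c("P53", "HGA2", "ZZZ3"),
  Parent.MeSH.ID = c("D000310", "D031031", "D001163, D000230"),
  Child.MeSH.ID = c("D015675, D006676", "D002277", "D003451")
)

# One row per individual ID, keeping the original parent/child columns
df3 <- df2 %>%
  unite("MeSH_ID", Parent.MeSH.ID, Child.MeSH.ID, sep = ", ", remove = FALSE) %>%
  separate_rows(MeSH_ID, sep = ", ")

# Filter and add columns in one step; Gene.Name is merged into HUGO_symbol
res <- inner_join(df1, df3, by = c("HUGO_symbol" = "Gene.Name", "MeSH_ID"))
res
```

Because Gene.Name is used as a join key, it does not appear as a separate column in the result, which is the trade-off the answer describes.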