How to Merge Two Data Frame Based on Partial String Match with R

R: Merging data with partial matches

Get the data in long format using separate_rows splitting on '|' and for each ID1 summarise the values in one concatenated string.

library(dplyr)
library(tidyr)

df1 %>%
separate_rows(ID2, sep = '\\|') %>%
left_join(df2, by = "ID2") %>%
group_by(ID1) %>%
summarise(across(c(ID2, ID3), ~paste0(na.omit(.), collapse = '|')))

# ID1 ID2 ID3
# <chr> <chr> <chr>
#1 A1 B1|B2 C1|C2
#2 A2 B1 C1
#3 A3 B3 C3
#4 A4 B6|B4 C4
#5 A5 B0|B6|B3 C3

If for every ID it is guaranteed that there would be at least 1 match in df2 as in the example you may use inner_join and drop na.omit.

Merging data with partial match

Using dplyr and fuzzyjoin:

library(dplyr)
# library(fuzzyjoin) # regex_left_join
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
group_by(SampleID) %>%
summarize(Gene = Gene.x[1], Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA

The first step converts df2$Gene[2] from CLDN5;STK10 to \\b(CLDN5|STK10)\\b, a pattern that allows a match on any of its ;-delimited values (inferred from your expected output).


Edit: if you have a lot of other columns, you may be able to add them to the grouping such that you don't need to explicitly summarize them (with [1]). For example, the above might be rewritten as:

df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
rename(Gene = Gene.x) %>%
group_by(across(SampleID:Gene)) %>%
summarize(Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA

(Renaming from Gene.x to Gene is not necessary but looked nice :-)

This method assumes that all columns that you want to keep are either consecutive (allowing for fromcolumn:tocolumn use of :-ranges) or not difficult to add individually.

How to merge two data frames with specific string match in columns in R?

Here's a simple solution:

library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)

data1 %<>% mutate(lname = str_extract(ProfName, "[A-Za-z\\-]+$"))
data2 %<>% mutate(lname = str_extract(ProfName, "^[A-Za-z\\-]+"))

df <- merge(data1, data2, all.y = TRUE, by = "lname")

head(df)

# lname ProfName.x Title Profession # ProfName.y
# 1 Attaway Alan Attaway PhD Professor Attaway, A
# 2 Barr-Pulliam Dereck Barr-Pulliam PhD Assistant Professor Barr-Pulliam, D
# 3 Blandford <NA> <NA> <NA> Blandford, K
# 4 Blum Lisa M. Blum LLM Instructor Blum, L
# 5 Callahan Carolyn M. Callahan PhD Brown-Forman Professor of Accountancy Callahan, C
# 6 Foster Benjamin P. Foster PhD Professor Foster, B

Join dataframes based on partial string-match between columns

Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:

s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'

res = df2[df2['FILM'].isin(s)]

print(res)

FILM VOTES
4 Max Steel (2016) 560

match two dataframes for partial strings in R

you can do this relatively simple by creating a column of partially matched dataframes and then just unnest that.

library(tibble)
library(dplyr)
library(tidyr)

df1 <- tibble(
Object = c('house','car','window','garden'),
price = c(20,50,40,100),
feature = c('205','710','open','green')
)

df2 <- tibble(
person = c('Johnson','Lewis','Lewis','Hill'),
feature2 = c('1122056','4Dl-open75','ltkgreen','111710D')
)

result <- df1 %>%
mutate(
df2 = lapply(feature, function(x){df2 %>% filter(grepl(x, feature2))})
) %>%
unnest(df2)


Related Topics



Leave a reply



Submit