R: Merging data with partial matches
Get the data in long format using separate_rows
splitting on '|'
and for each ID1
summarise the values in one concatenated string.
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(ID2, sep = '\\|') %>%
left_join(df2, by = "ID2") %>%
group_by(ID1) %>%
summarise(across(c(ID2, ID3), ~paste0(na.omit(.), collapse = '|')))
# ID1 ID2 ID3
# <chr> <chr> <chr>
#1 A1 B1|B2 C1|C2
#2 A2 B1 C1
#3 A3 B3 C3
#4 A4 B6|B4 C4
#5 A5 B0|B6|B3 C3
If for every ID it is guaranteed that there would be at least 1 match in df2
as in the example you may use inner_join
and drop na.omit
.
Merging data with partial match
Using dplyr
and fuzzyjoin
:
library(dplyr)
# library(fuzzyjoin) # regex_left_join
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
group_by(SampleID) %>%
summarize(Gene = Gene.x[1], Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA
The first step converts df2$Gene[2]
from CLDN5;STK10
to \\b(CLDN5|STK10)\\b
, a pattern that allows a match on any of its ;
-delimited values (inferred from your expected output).
Edit: if you have a lot of other columns, you may be able to add them to the grouping such that you don't need to explicitly summarize them (with [1]
). For example, the above might be rewritten as:
df2 %>%
mutate(Gene = sapply(strsplit(Gene, ";"), function(z) paste0("\\b(", paste(z, collapse = "|"), ")\\b"))) %>%
fuzzyjoin::regex_left_join(df1, ., by = "Gene") %>%
rename(Gene = Gene.x) %>%
group_by(across(SampleID:Gene)) %>%
summarize(Operation = na.omit(Operation)[1], .groups = "drop")
# # A tibble: 6 x 3
# SampleID Gene Operation
# <int> <chr> <chr>
# 1 1 ARF5;ARG1 Y
# 2 2 AP3B1 NA
# 3 3 CLDN5 Y
# 4 4 XPO1;STX7 Y
# 5 5 ABCC4 Y
# 6 6 FLOT1 NA
(Renaming from Gene.x
to Gene
is not necessary but looked nice :-)
This method assumes that all columns that you want to keep are either consecutive (allowing for fromcolumn:tocolumn
use of :
-ranges) or not difficult to add individually.
How to merge two data frames with specific string match in columns in R?
Here's a simple solution:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
data1 %<>% mutate(lname = str_extract(ProfName, "[A-Za-z\\-]+$"))
data2 %<>% mutate(lname = str_extract(ProfName, "^[A-Za-z\\-]+"))
df <- merge(data1, data2, all.y = TRUE, by = "lname")
head(df)
# lname ProfName.x Title Profession # ProfName.y
# 1 Attaway Alan Attaway PhD Professor Attaway, A
# 2 Barr-Pulliam Dereck Barr-Pulliam PhD Assistant Professor Barr-Pulliam, D
# 3 Blandford <NA> <NA> <NA> Blandford, K
# 4 Blum Lisa M. Blum LLM Instructor Blum, L
# 5 Callahan Carolyn M. Callahan PhD Brown-Forman Professor of Accountancy Callahan, C
# 6 Foster Benjamin P. Foster PhD Professor Foster, B
Join dataframes based on partial string-match between columns
Given input dataframes df1
and df2
, you can use Boolean indexing via pd.Series.isin
. To align the format of the movie strings you need to first concatenate movie and year from df1
:
s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
res = df2[df2['FILM'].isin(s)]
print(res)
FILM VOTES
4 Max Steel (2016) 560
match two dataframes for partial strings in R
you can do this relatively simple by creating a column of partially matched dataframes and then just unnest that.
library(tibble)
library(dplyr)
library(tidyr)
df1 <- tibble(
Object = c('house','car','window','garden'),
price = c(20,50,40,100),
feature = c('205','710','open','green')
)
df2 <- tibble(
person = c('Johnson','Lewis','Lewis','Hill'),
feature2 = c('1122056','4Dl-open75','ltkgreen','111710D')
)
result <- df1 %>%
mutate(
df2 = lapply(feature, function(x){df2 %>% filter(grepl(x, feature2))})
) %>%
unnest(df2)
Related Topics
How to Render Custom Map Tiles Created with Gdal2Tiles in Leaflet for R
R - Pivoting Duplicate Rows into Multiple Column with Unknown Number of Columns
Chi Square Test for Each Row in Data Frame
"Dims [Product Xx] Do Not Match the Length of Object [Xx]" Error in Using R Function 'Outer'
Non-Equi-Joins in R with Data.Table - Backticked Column Name Trouble
Tls V1.1/Tls V1.2 Support in Rcurl
Reshape Data for Values in One Column
Out of Order Text Labels on Stack Bar Plot (Ggplot)
Rvest Not Recognizing CSS Selector
Changing Line Color in Ggplot Based on Slope
Looping Over Combinations of Regression Model Terms
Http Error 400 on Google_Elevation() Call
In R Data.Frame, Promote Rownames to Actual Column
Why Does Nls Function Not Work in Ggplot2