Conditionally mutate on partial string match across data frames of different lengths R/dplyr/stringr
library(dplyr)
JobDF %>%
mutate(inNameDF = ifelse(str_detect(occupation, paste0(NameDF$names, collapse = "|")),"yes","no"))
The output of paste0
is this:
paste0(NameDF$names, collapse = "|")
[1] "Jane|John"
Which is a regular expression to match "Jane" or "John". If you need the match to be case insensitive then wrap it in stringr::regex
like so: regex(paste0(...), ignore_case = T)
.
If the name was always the first word then you could also do something like:
ifelse(sapply(str_split(JobDF$occupation, " "), `[[`, 1) %in% NameDF$names, "yes", "no")
[1] "no" "yes" "yes" "no" "yes"
Which extracts the first word from occupation
and checks if it is %in%
names
.
Is there any way to do partial String matching in R?
I have used regular expression and merging of the two dataframes as shown below:
library(stringr)
library(dplyr)
df2$ID <- str_trim(str_extract(df2$Text, pattern = "Q\\S*|A\\S*"))
df <- left_join(df1, df2, by = "ID")
Join data frames based fuzzy matching of strings
Perhaps this is what you're looking for?
library(dplyr)
library(fuzzyjoin)
library(stringr)
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = str_detect)
## A tibble: 2 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900
If you wanted to ignore case, you could define your own str_detect
.
my_str_detect <- function(x,y){str_detect(x,regex(y, ignore_case = TRUE))}
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_str_detect)
## A tibble: 3 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900
#3 FedEx USA Commercial 4 FedEX USA 900
For bonus points you can use agrepl
from this question.
You can modify the max.distance =
argument and potentially add cost =
. See help(agrepl)
for more.
my_match_fun <- Vectorize(function(x,y) agrepl(x, y, ignore.case=TRUE, max.distance = 0.7, useBytes = TRUE))
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_match_fun)
## A tibble: 4 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900
#3 FedEx USA Commercial 4 FedEX USA 900
#4 FedEx International 5 FedEX USA 900
fast partial match checking in R (or Python or Julia)
This is an [r] option aimed at reducing the number of times you are calling str_detect()
(i.e., your example is slow because the function is called several thousand times; and for not using fixed()
or fixed = TRUE
as jpiversen already pointed out). Answer explained in comments in the code; I will try to jump on tomorrow to explain a bit more.
This should scale reasonably well and be more memory efficient than the current approach too because reduces the rowwise computations to an absolute minimum.
Benchmarks:
n = 2000
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 6.67s 6.67s 0.150 31.95MB 0.300 1
2 using_fixed() 496.54ms 496.54ms 2.01 61.39MB 4.03 1
3 using_map_fixed() 493.35ms 493.35ms 2.03 60.27MB 6.08 1
4 andrew_fun() 167.78ms 167.78ms 5.96 1.59MB 0 1
n = 4000
Note: I am not sure if you need the answer to scale; but the approach of reducing the memory-intensive part does seem to do just that (although the time difference is negligible for n = 4000 for 1 iteration, IMO).
# A tibble: 4 × 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 original() 26.63s 26.63s 0.0376 122.33MB 0.150 1
2 using_fixed() 1.91s 1.91s 0.525 243.96MB 3.67 1
3 using_map_fixed() 1.87s 1.87s 0.534 236.62MB 3.20 1
4 andrew_fun() 674.36ms 674.36ms 1.48 7.59MB 0 1
Code w/ comments:
# This is so we do not retain the strings with the max number of
# characters in our pattern because we are checking with %in% already
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
# Creating large patterns (excluding values w/ max number of characters)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
# First checking using %in%
idx_a = data_set_A$name %in% data_set_B$name_2
# Next, IDing when a(string) matches b(pattern)
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
# IDing a(pattern) matches b(string) so we do not run every row of
# a(as a pattern) against all of b
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
# Using unmatched values of a as a pattern for the reduced set for b
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
# A tibble: 237 × 2
name ID_A
<chr> <int>
1 wknrsauuj 2
2 lyw 7
3 igwsvrzpk 16
4 zozxjpu 18
5 cgn 22
6 oqo 45
7 gkritbe 47
8 uuq 92
9 lhwfyksz 94
10 tuw 100
# … with 227 more rows
Reproducible R code for benchmarks
The following code is largely taken from jpiversen who provided a great answer:
library(dplyr)
library(stringr)
n = 2000
set.seed(1)
data_set_A <- tibble(name = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_A = 1:n())
set.seed(2)
data_set_B <- tibble(name_2 = unique(replicate(n, paste(sample(letters, runif(1, 3, 10), replace = T), collapse = "")))) %>%
mutate(ID_B = 1:n())
original <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, data_set_B$name_2)) | any(str_detect(data_set_B$name_2, name))) %>%
ungroup()
}
using_fixed <- function() {
data_set_A %>%
rowwise() %>%
filter(any(str_detect(name, fixed(data_set_B$name_2))) | any(str_detect(data_set_B$name_2, fixed(name)))) %>%
ungroup()
}
using_map_fixed <- function() {
logical_vec <- data_set_A$name %>%
purrr::map_lgl(
~any(stringr::str_detect(.x, fixed(data_set_B$name_2))) ||
any(stringr::str_detect(data_set_B$name_2, fixed(.x)))
)
data_set_A[logical_vec, ]
}
andrew_fun = function() {
nchar_a = nchar(data_set_A$name)
nchar_b = nchar(data_set_B$name_2)
pattern_a = str_c(unique(data_set_A$name[nchar_a != max(nchar_a, na.rm = TRUE)]), collapse = "|")
pattern_b = str_c(unique(data_set_B$name_2[nchar_b != max(nchar_b, na.rm = TRUE)]), collapse = "|")
idx_a = data_set_A$name %in% data_set_B$name_2
idx_a[!idx_a] = str_detect(data_set_A$name[!idx_a], pattern_b)
b_to_check = data_set_B$name_2[str_detect(data_set_B$name_2, pattern_a)]
idx_a[!idx_a] = vapply(data_set_A$name[!idx_a], function(name) {
any(grepl(name, b_to_check, fixed = TRUE))
}, logical(1L), USE.NAMES = FALSE)
data_set_A[idx_a, ]
}
bm = bench::mark(
original(),
using_fixed(),
using_map_fixed(),
andrew_fun(),
iterations = 1
)
How do I match on partial field matches in two R data sets using dplyr
We could paste
the elements of 'other_cd' separated by |
for matching any of the elements
library(dplyr)
library(stringr)
tbl %>%
filter(str_detect(code, str_c(other_cd$other_cd, collapse="|"))) %>%
summarise(count = n_distinct(id))
Update
In the updated post, OP wants to create a new column from the other_cd. In that case, we can use str_extract
tbl %>%
mutate(other_cd = str_extract(code, str_c(other_cd$other_cd, collapse="|")))
# id code other_cd
#1 1 a1231 a123
#2 2 b3211 b321
#3 3 c9871985 c987
Or if the number of rows are the same
tbl %>%
filter(str_detect(code, as.character(other_cd$other_cd)))
Join tables via multiple partial matching
A double-join can work.
Notes: You don't appear to be using segment
, so I'm discarding it here, but this might be adapted if needed. Also, I added stringsAsFactors=FALSE
to your data, since otherwise combining vectors of factor
s can be problematic.)
library(dplyr)
tracksmod <- bind_rows(
select(tracks, trackID, sta=from),
select(tracks, trackID, sta=to)
)
head(tracksmod)
# trackID sta
# 1 A station_11
# 2 A station_12
# 3 A station_13
# 4 A station_14
# 5 B station_15
# 6 B station_16
sightings %>%
left_join(select(tracksmod, trackID, from=sta), by="from") %>%
left_join(select(tracksmod, trackID2=trackID, to=sta), by="to") %>%
mutate(trackID = if_else(trackID == trackID2, trackID, NA_character_)) %>%
select(-trackID2)
# from to trackID
# 1 station_24 station_14 A
# 2 station_28 station_16 B
# 3 station_14 station_25 <NA>
I did not assume that directionality was important. That is, I'm not assuming that a station listed in from
must always be in the from
column. This is why I converted tracks
to tracksmod
, in order to identify a station with an id regardless of direction.
Related Topics
Check If the Number Is Integer
How to Make Consistent-Width Plots in Ggplot (With Legends)
How to Get a Vertical Geom_Vline to an X-Axis of Class Date
Change Bar Plot Colour in Geom_Bar With Ggplot2 in R
How to Subtract/Add Days From/To a Date
Why Do I Get "Warning Longer Object Length Is Not a Multiple of Shorter Object Length"
Format Numbers With Million (M) and Billion (B) Suffixes
Index Values from a Matrix Using Row, Col Indices
Count Nas Per Row in Dataframe
Rcpp Pass by Reference Vs. by Value
As.Date With Dates in Format M/D/Y in R
Get Specific Object from Rdata File
Subsetting R Data Frame Results in Mysterious Na Rows
Putting Mathematical Symbols and Subscripts Mixed With Regular Letters
How to Read a CSV File in R With Different Number of Columns
How Does Predict.Lm() Compute Confidence Interval and Prediction Interval