How can I match fuzzy match strings from two datasets?
The solution depends on the desired cardinality of your matching a
to b
. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim
.
Not familiar with AGREP but here's example using stringdist
for your distance matrix.
library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)
# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
# 1 for already assigned, -1 for unassigned and unassignable
while(any(x==0)){
min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
a_sel <- a[d==min_d & x==0][1]
b_sel <- b[d==min_d & a == a_sel & x==0][1]
x[a==a_sel & b == b_sel] <- 1
x[x==0 & (a==a_sel|b==b_sel)] <- -1
}
cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))
Produces the assignment:
a b d
1 Ace Co Ace Co. 0.04762
2 Bayes Bayes Inc. 0.16667
3 asd asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
a_name b_name dist
1 Ace Co Ace Co. 0.04762
11 Baes Bayes Inc. 0.20000
8 Bayes Bayes Inc. 0.16667
12 Bays Bayes Inc. 0.20000
10 Bcy Bayes Inc. 0.37778
15 asd asdf 0.08333
Edit: use method="jw"
to produce desired results. See help("stringdist-package")
Very Fast string fuzzy matching in R
You could try stringsdist
-package.
It's written in C, uses parallel processing and offers various distance metrics, including levenshtein-distance.
library(stringdist)
a<-as.character(c("hello","allo","hola"))
b<-as.character(c("hello","allo","hola"))
start_time <- Sys.time()
res <- stringdistmatrix(a,b, method = "lv")
end_time <- Sys.time()
> end_time - start_time
Time difference of 0.006981134 secs
> res
[,1] [,2] [,3]
[1,] 0 2 3
[2,] 2 0 3
[3,] 3 3 0
diag(res) <- NA
apply(res, 1, FUN = min, na.rm = T)
[1] 2 2 3
How to fuzzy match by words (not letters) in R?
This is a tidyverse way to do the join. It basically finds full_name from B that has the highest number of common words with A.
library(tidyverse)
A1 <- tibble(
nombre_completo = c("martin gallardo", "raul gimenez")
) %>%
mutate(
id_A = row_number()
)
B1 <- tibble(
nombre_completo=c("martin ricardo gallardo", "gimenez raul"),
other_data=c("A", "B")
) %>%
mutate(
id_B = row_number()
)
A2 <- A1 %>%
mutate(
name_words = str_split(nombre_completo, pattern = " ")
) %>%
unnest(cols = c(name_words))
B2 <- B1 %>%
mutate(
name_words = str_split(nombre_completo, pattern = " ")
) %>%
unnest(cols = c(name_words)) %>%
select(name_words, id_B )
left_join(A2, B2, by = "name_words") %>%
group_by(nombre_completo, id_A, id_B) %>%
count() %>% ungroup() %>%
group_by(nombre_completo, id_A) %>%
slice_max(order_by = n) %>%
select("nombre_completo_A" = nombre_completo, id_A, id_B) %>%
left_join(B1, by = "id_B")
Matching strings with abbreviations; fuzzy matching
Changing the dissimilarity measure to the Jaro distance or Jaro-Winkler
distance works for the example provided in your question.
library(stringdist)
vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")
amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = 0) # Jaro
#> [1] 1 2 3
amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = .2) # Jaro-Winkler
#> [1] 1 2 3
Fuzzy matching strings within a single column and documenting possible matches
Do you need something like this?
dt$is.match <- sapply(dt$Title,function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)
dt
# A tibble: 16 x 2
# Title is.match
# <chr> <chr>
# 1 Community reinforcement approach in the treatment of opiate addicts 1
# 2 Therapeutic justice: Life inside drug court 2, 3
# 3 Therapeutic justice: Life inside drug court 2, 3
# 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4
# 5 An ecosystem for improving the quality of personal health records 5
# 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders 6
# 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders 7, 8
# 8 A model for the assessment of static and dynamic factors in sexual offenders 7, 8
# 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9
#10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy 10
#11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11
#12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure 12, 13
#13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure 12, 13
#14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14
#15 Diagnosis of active and latent tuberculosis: summary of NICE guidance 15
#16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium 16
Fuzzy Join with Partial String Match in R
We can use fuzzyjoin
. Do a regex_left_join
after getting the substring from the 'course' columns in both dataset (to make it more matchable)
library(fuzzyjoin)
library(dplyr)
library(stringr)
df2 %>%
mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
regex_left_join(df1 %>%
mutate(grp = toupper(str_remove(course,
"\\s+grade$")), course = NULL), by = c('student_id', "grp")) %>%
select(student_id = student_id.x, course, grade)
-output
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C
OP's expected output is
df_final
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C
Fuzzy Match Across Columns in R
There is in the package stringdist
a function stingsim
which gives you a number between 0 and 1 for similarities between strings.
Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")
library(stringdist)
df1 <- data.frame(Name.1, Name.2)
df1$similar <- stringsim(Name.1, Name.2)
df1
#> Name.1 Name.2 similar
#> 1 gonzalez gonzalezsoldevilla 0.4444444
#> 2 wassermanschultz schultz 0.4375000
#> 3 athanasopoulos anthanasopoulos 0.9333333
#> 4 armato strain 0.1666667
fuzzy version of stringr::str_detect for filtering dataframe
You can use agrepl
for Approximate String Matching (Fuzzy Matching) which is in base.
example_data[agrep(paste(search_terms, collapse = "|"),
example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]
# project disease_phase startdate
#1 A111 Diabetes, Preclinical 01DEC2018
#2 A123 Lipid lowering, Perlcinical 17-OKT-2017
Or using Reduce
instead of |
in the regex.
example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
ignore.case=TRUE), search_terms, FALSE),]
# project disease_phase startdate
#1 A111 Diabetes, Preclinical 01DEC2018
#2 A123 Lipid lowering, Perlcinical 17-OKT-2017
An alternative might be adist
, also in base, which calculates a distance matrix - so it might not be recommended for larger vectors, as the matrix can get large. Here I also choose that a mismatch by 2 characters will be OK.
example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
partial=TRUE) < 3) > 0,]
# project disease_phase startdate
#1 A111 Diabetes, Preclinical 01DEC2018
#2 A123 Lipid lowering, Perlcinical 17-OKT-2017
In case only single words are compared it might be more efficient so split the disease_phase
into words using strsplit
also in base.
. <- strsplit(example_data$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],
2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
# tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative
# project disease_phase startdate
#2 A123 Lipid lowering, Perlcinical 17-OKT-2017
#1 A111 Diabetes, Preclinical 01DEC2018
Some simpler examples using agrep
:
#Allow 1 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1] TRUE TRUE FALSE
#Allow 2 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE
#Use boundaries to match words
agrepl("\\bpreclinical\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1] TRUE FALSE
How much difference will be allowed can be set with max.distance:
max.distance: Maximum distance allowed for a match. Expressed either
as integer, or as a fraction of the _pattern_ length times
the maximal transformation cost (will be replaced by the
smallest integer not less than the corresponding fraction),
or a list with possible components
‘cost’: maximum number/fraction of match cost (generalized
Levenshtein distance)
‘all’: maximal number/fraction of _all_ transformations
(insertions, deletions and substitutions)
‘insertions’: maximum number/fraction of insertions
‘deletions’: maximum number/fraction of deletions
‘substitutions’: maximum number/fraction of substitutions
And also a Benchmark based on @JBGruber:
system.time({ #Libraries needed for method of JBGruber
library(dplyr);
library(stringdist);
library(Rfast);
library(tidytext)
})
# User System verstrichen
# 1.008 0.040 1.046
set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)
stringdist_detect <- function(a, b, method = "osa", thres = 2) {
Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}
bench::mark(check = FALSE,
stringdist_detect = {
example_large %>%
tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>%
filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
},
GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
tolower(search_terms), FALSE)], FALSE, FALSE)),]
})
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 stringdist_detect 17.42ms 18.65ms 52.8 7.15MB 19.4 19 7
#2 GKi 5.64ms 6.04ms 165. 869.08KB 6.27 79 3
Also much time could be saved when there is only one, right written, variant of the words of interest in search_terms
.
Related Topics
Insert Missing Time Rows into a Dataframe
How to Change the Default Directory in Rstudio (Or R)
Enclosing Variables Within for Loop
Contrast Between Label and Background: Determine If Color Is Light or Dark
From Long to Wide Data with Multiple Columns
Combine/Merge Columns While Avoiding Na
How to Modify Unexported Object in a Package
Split or Separate Uneven/Unequal Strings with No Delimiter
Adding Multiple Lag Variables Using Dplyr and for Loops
Breaks for Scale_X_Date in Ggplot2 and R
Generally Disable Dimension Dropping for Matrices
Are Data Tables with More Than 2^31 Rows Supported in R with the Data Table Package Yet
How to Figure Third Friday of a Month in R