Fuzzy Matching in R

How can I fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching of a to b. If it's one-to-one, you will get the three closest matches below; if it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before, I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an optimal solution, you'd be better off with a proper assignment solver such as the Hungarian algorithm (e.g. clue::solve_LSAP; see the sketch after the greedy example below).

I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.

library(stringdist)

d <- expand.grid(a$name, b$name) # Distance matrix in long form
names(d) <- c("a_name", "b_name")
d$dist <- stringdist(d$a_name, d$b_name, method = "jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (your favorite heuristic here)
greedyAssign <- function(a, b, d) {
  x <- numeric(length(a)) # assignment state: 0 = unassigned but assignable,
                          # 1 = already assigned, -1 = unassigned and unassignable
  while (any(x == 0)) {
    min_d <- min(d[x == 0]) # identify closest pair, arbitrarily selecting the 1st if multiple pairs
    a_sel <- a[d == min_d & x == 0][1]
    b_sel <- b[d == min_d & a == a_sel & x == 0][1]
    x[a == a_sel & b == b_sel] <- 1
    x[x == 0 & (a == a_sel | b == b_sel)] <- -1
  }
  cbind(a = a[x == 1], b = b[x == 1], d = d[x == 1])
}
data.frame(greedyAssign(as.character(d$a_name), as.character(d$b_name), d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
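If you want the provably optimal one-to-one assignment instead of the greedy heuristic, the Hungarian algorithm works directly on the distance matrix. A minimal sketch, assuming the clue package and the same a and b data frames (the clue dependency is an addition, not part of the original answer):

library(clue)
library(stringdist)

# Wide a-by-b distance matrix; solve_LSAP requires nrow <= ncol
m <- stringdistmatrix(a$name, b$name, method = "jw")
sol <- as.integer(solve_LSAP(m)) # Hungarian algorithm: minimizes total distance
data.frame(a = a$name, b = b$name[sol], d = m[cbind(seq_along(sol), sol)])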

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333
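An equivalent dplyr formulation of the many-to-one step, which keeps ties just like the base version (a sketch, assuming the d data frame built above):

library(dplyr)
d %>%
  group_by(a_name) %>%
  slice_min(dist, n = 1) %>%
  ungroup()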

Edit: use method = "jw" to produce the desired results. See help("stringdist-package").

Very Fast string fuzzy matching in R

You could try the stringdist package.

It's written in C, uses parallel processing, and offers various distance metrics, including the Levenshtein distance.

library(stringdist)

a <- c("hello", "allo", "hola")
b <- c("hello", "allo", "hola")

start_time <- Sys.time()
res <- stringdistmatrix(a, b, method = "lv")
end_time <- Sys.time()

> end_time - start_time
Time difference of 0.006981134 secs
> res
     [,1] [,2] [,3]
[1,]    0    2    3
[2,]    2    0    3
[3,]    3    3    0

diag(res) <- NA # ignore self-matches
apply(res, 1, FUN = min, na.rm = TRUE)
[1] 2 2 3
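To recover which element of b is closest (not just the distance), a small extension of the same example; which.min skips the NA diagonal:

nearest <- apply(res, 1, which.min) # index of the closest non-self match per row
data.frame(a = a,
           closest = b[nearest],
           dist = res[cbind(seq_along(nearest), nearest)])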

How to fuzzy match by words (not letters) in R?

This is a tidyverse way to do the join. It basically finds the name from B that has the highest number of words in common with each name in A.

library(tidyverse)

A1 <- tibble(
  nombre_completo = c("martin gallardo", "raul gimenez")
) %>%
  mutate(id_A = row_number())

B1 <- tibble(
  nombre_completo = c("martin ricardo gallardo", "gimenez raul"),
  other_data = c("A", "B")
) %>%
  mutate(id_B = row_number())

A2 <- A1 %>%
  mutate(name_words = str_split(nombre_completo, pattern = " ")) %>%
  unnest(cols = c(name_words))

B2 <- B1 %>%
  mutate(name_words = str_split(nombre_completo, pattern = " ")) %>%
  unnest(cols = c(name_words)) %>%
  select(name_words, id_B)

left_join(A2, B2, by = "name_words") %>%
  group_by(nombre_completo, id_A, id_B) %>%
  count() %>%
  ungroup() %>%
  group_by(nombre_completo, id_A) %>%
  slice_max(order_by = n) %>%
  select("nombre_completo_A" = nombre_completo, id_A, id_B) %>%
  left_join(B1, by = "id_B")
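The same word-overlap idea can be expressed in base R without the unnest/join machinery. A minimal sketch, assuming the A1 and B1 tibbles defined above:

# Count shared words for every A/B pair, then pick the best B per A
words_a <- strsplit(A1$nombre_completo, " ")
words_b <- strsplit(B1$nombre_completo, " ")
overlap <- sapply(words_b, \(wb) sapply(words_a, \(wa) length(intersect(wa, wb))))
best <- apply(overlap, 1, which.max) # rows = A, columns = B
data.frame(A = A1$nombre_completo, B = B1$nombre_completo[best])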

Matching strings with abbreviations; fuzzy matching

Changing the dissimilarity measure to the Jaro distance or Jaro-Winkler
distance works for the example provided in your question.

library(stringdist)

vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = 0) # Jaro
#> [1] 1 2 3
amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = .2) # Jaro-Winkler
#> [1] 1 2 3
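Note that Jaro-Winkler distances lie in [0, 1], so maxDist = 1 accepts the nearest candidate no matter how dissimilar it is. Tightening the cutoff makes amatch return NA for weak matches (a sketch with a hypothetical non-matching input):

amatch("zzz", vec.b, maxDist = 0.3, method = "jw", p = 0) # no candidate within 0.3
#> [1] NA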

Fuzzy matching strings within a single column and documenting possible matches

Do you need something like this?

dt$is.match <- sapply(dt$Title, function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)

dt
# A tibble: 16 x 2
# Title is.match
# <chr> <chr>
# 1 Community reinforcement approach in the treatment of opiate addicts 1
# 2 Therapeutic justice: Life inside drug court 2, 3
# 3 Therapeutic justice: Life inside drug court 2, 3
# 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4
# 5 An ecosystem for improving the quality of personal health records 5
# 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders 6
# 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders 7, 8
# 8 A model for the assessment of static and dynamic factors in sexual offenders 7, 8
# 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9
#10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy 10
#11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11
#12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure 12, 13
#13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure 12, 13
#14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14
#15 Diagnosis of active and latent tuberculosis: summary of NICE guidance 15
#16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium 16
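Since every title matches itself, rows whose is.match string contains a comma are the ones with at least one other (near-)duplicate. A small follow-up, assuming the dt built above:

dt[grepl(",", dt$is.match), ] # titles that matched something besides themselves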

Fuzzy Join with Partial String Match in R

We can use fuzzyjoin. Do a regex_left_join after extracting the comparable substring from the 'course' columns in both datasets (to make them more matchable).

library(fuzzyjoin)
library(dplyr)
library(stringr)

df2 %>%
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
  regex_left_join(df1 %>%
                    mutate(grp = toupper(str_remove(course, "\\s+grade$")),
                           course = NULL),
                  by = c("student_id", "grp")) %>%
  select(student_id = student_id.x, course, grade)

Output:

# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A
2 001        5th ELA            A
3 001        5th Mathematics    A
4 002        6th Social Studies B
5 002        6th ELA            B
6 002        6th Mathematics    B
7 003        8th Social Studies C
8 003        8th ELA            C
9 003        8th Mathematics    C

This matches the OP's expected output (df_final).
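If you prefer distance-based matching over regex, fuzzy_left_join accepts one match function per join column (exact on student_id, fuzzy on the cleaned course name). A sketch assuming the same df1/df2 and libraries as above; the distance cutoff of 1 is an assumption, not from the original answer:

df2 %>%
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
  fuzzy_left_join(df1 %>%
                    mutate(grp = toupper(str_remove(course, "\\s+grade$")),
                           course = NULL),
                  by = c("student_id", "grp"),
                  match_fun = list(`==`, \(a, b) stringdist::stringdist(a, b) <= 1)) %>%
  select(student_id = student_id.x, course, grade)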

Fuzzy Match Across Columns in R

The stringdist package has a function stringsim, which gives you a number between 0 and 1 measuring the similarity between two strings.

library(stringdist)

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")

df1 <- data.frame(Name.1, Name.2)
df1$similar <- stringsim(Name.1, Name.2)
df1
#>             Name.1             Name.2   similar
#> 1         gonzalez gonzalezsoldevilla 0.4444444
#> 2 wassermanschultz            schultz 0.4375000
#> 3   athanasopoulos    anthanasopoulos 0.9333333
#> 4           armato             strain 0.1666667
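stringsim defaults to method = "osa"; for name data a different metric may separate matches better. A one-line variant (the Jaro-Winkler choice here is an assumption, not from the original answer):

df1$similar_jw <- stringsim(Name.1, Name.2, method = "jw") # Jaro-Winkler similarity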

fuzzy version of stringr::str_detect for filtering dataframe

You can use agrepl for approximate string matching (fuzzy matching), which is in base R.

example_data[agrep(paste(search_terms, collapse = "|"),
                   example_data$disease_phase, 2,
                   ignore.case = TRUE, fixed = FALSE), ]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

Or using Reduce instead of | in the regex.

example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
                                       ignore.case = TRUE),
                    search_terms, FALSE), ]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

An alternative might be adist, also in base R, which calculates a distance matrix; it might not be recommended for larger vectors, as the matrix can get large. Here I also allow a mismatch of up to 2 characters.

example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
                           partial = TRUE) < 3) > 0, ]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

In case only single words are compared, it might be more efficient to split disease_phase into words using strsplit, also in base R.

. <- strsplit(example_data$disease_phase, "[ ,;]+")           # split phrases into words
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.))) # map each word to its row indices
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y], 2)),
                                    tolower(search_terms), logical(length(.)))],
                            FALSE, FALSE)), ]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
#                             tolower(search_terms), FALSE)], FALSE, FALSE)), ] # Alternative
#  project               disease_phase   startdate
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017
#1    A111       Diabetes, Preclinical   01DEC2018

Some simpler examples using agrepl:

#Allow 1 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1] TRUE TRUE FALSE

#Allow 2 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE

#Use boundaries to match words
agrepl("\\bpreclinical\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1] TRUE FALSE

How much difference is allowed can be set with max.distance (from help(agrep)):

max.distance: Maximum distance allowed for a match. Expressed either
    as an integer, or as a fraction of the pattern length times the
    maximal transformation cost (will be replaced by the smallest
    integer not less than the corresponding fraction), or a list
    with possible components

    ‘cost’: maximum number/fraction of match cost (generalized
        Levenshtein distance)

    ‘all’: maximal number/fraction of all transformations
        (insertions, deletions and substitutions)

    ‘insertions’: maximum number/fraction of insertions

    ‘deletions’: maximum number/fraction of deletions

    ‘substitutions’: maximum number/fraction of substitutions
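For example, a fractional max.distance is scaled by the pattern length, so 0.1 on the 11-character pattern below allows ceiling(1.1) = 2 edits, which is why this matches:

agrepl("preclinical", "prelcinical", max.distance = 0.1)
#[1] TRUE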

And here is also a benchmark based on @JBGruber's answer:

system.time({ # Libraries needed for the method of JBGruber
  library(dplyr)
  library(stringdist)
  library(Rfast)
  library(tidytext)
})
#   user  system elapsed
#  1.008   0.040   1.046

set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

bench::mark(check = FALSE,
  stringdist_detect = {
    example_large %>%
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>%
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  GKi = {
    . <- strsplit(example_large$disease_phase, "[ ,;]+")
    . <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
    example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
                                         tolower(search_terms), FALSE)], FALSE, FALSE)), ]
  })
#  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 stringdist_detect  17.42ms  18.65ms      52.8    7.15MB    19.4     19     7
#2 GKi                 5.64ms   6.04ms     165.   869.08KB     6.27    79     3

Much time could also be saved if search_terms contained only one, correctly spelled, variant of each word of interest.


