Fuzzy Matching in R

How can I match fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.

Not familiar with AGREP but here's example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable, 
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1] 
    b_sel <- b[d==min_d & a == a_sel & x==0][1] 
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")

Very Fast string fuzzy matching in R

You could try stringsdist-package.

It's written in C, uses parallel processing and offers various distance metrics, including levenshtein-distance.

library(stringdist)

a<-as.character(c("hello","allo","hola"))
b<-as.character(c("hello","allo","hola"))

start_time <- Sys.time()
res <- stringdistmatrix(a,b, method = "lv")
end_time <- Sys.time()

> end_time - start_time
Time difference of 0.006981134 secs
> res
     [,1] [,2] [,3]
[1,]    0    2    3
[2,]    2    0    3
[3,]    3    3    0

diag(res) <- NA
apply(res, 1, FUN = min, na.rm = T)
[1] 2 2 3

How to fuzzy match by words (not letters) in R?

This is a tidyverse way to do the join. It basically finds full_name from B that has the highest number of common words with A.
library(tidyverse)

A1 <- tibble(
  nombre_completo = c("martin gallardo", "raul gimenez")
  ) %>%
  mutate(
    id_A = row_number()
  )

B1 <- tibble(
  nombre_completo=c("martin ricardo gallardo", "gimenez raul"),
  other_data=c("A", "B")
  ) %>%
  mutate(
    id_B = row_number()
  )

A2 <- A1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words))

B2 <- B1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words)) %>%
  select(name_words, id_B )

left_join(A2, B2, by = "name_words") %>%
  group_by(nombre_completo, id_A, id_B) %>%
  count() %>% ungroup() %>%
  group_by(nombre_completo, id_A) %>%
  slice_max(order_by = n) %>%
  select("nombre_completo_A" = nombre_completo, id_A, id_B) %>%
  left_join(B1, by = "id_B")

Matching strings with abbreviations; fuzzy matching

Changing the dissimilarity measure to the Jaro distance or Jaro-Winkler
distance works for the example provided in your question.

library(stringdist)

vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = 0) # Jaro
#> [1] 1 2 3
amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = .2) # Jaro-Winkler
#> [1] 1 2 3

Fuzzy matching strings within a single column and documenting possible matches

Do you need something like this?

dt$is.match <- sapply(dt$Title,function(x) toString(agrep(x, dt$Title)), USE.NAMES = FALSE)

dt
# A tibble: 16 x 2
#   Title                                                                                                    is.match
#   <chr>                                                                                                    <chr>   
# 1 Community reinforcement approach in the treatment of opiate addicts                                      1       
# 2 Therapeutic justice: Life inside drug court                                                              2, 3    
# 3 Therapeutic justice: Life inside drug court                                                              2, 3    
# 4 Tuberculosis screening in a novel substance abuse treatment center in Malaysia: Implications for a comp… 4       
# 5 An ecosystem for improving the quality of personal health records                                        5       
# 6 Patterns of attachment and alcohol abuse in sexual and violent non-sexual offenders                      6       
# 7 A Model for the Assessment of Static and Dynamic Factors in Sexual Offenders                             7, 8    
# 8 A model for the assessment of static and dynamic factors in sexual offenders                             7, 8    
# 9 The problem of co-occurring disorders among jail detainees: Antisocial disorder, alcoholism, drug abuse… 9       
#10 Co-occurring disorders among mentally ill jail detainees. Implications for public policy                 10      
#11 Comorbidity and Continuity of Psychiatric Disorders in Youth After Detention: A Prospective Longitudina… 11      
#12 Behavioral Health and Adult Milestones in Young Adults With Perinatal HIV Infection or Exposure          12, 13  
#13 Behavioral health and adult milestones in young adults with perinatal HIV infection or exposure          12, 13  
#14 Revising the paradigm for jail diversion for people with mental and substance use disorders: Intercept 0 14      
#15 Diagnosis of active and latent tuberculosis: summary of NICE guidance                                    15      
#16 Towards tackling tuberculosis in vulnerable groups in the European Union: the E-DETECT TB consortium     16

Fuzzy Join with Partial String Match in R

We can use fuzzyjoin. Do a regex_left_join after getting the substring from the 'course' columns in both dataset (to make it more matchable)

library(fuzzyjoin)
library(dplyr)
library(stringr)
df2 %>% 
   mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>% 
   regex_left_join(df1 %>%
       mutate(grp = toupper(str_remove(course, 
     "\\s+grade$")), course = NULL), by = c('student_id', "grp")) %>% 
   select(student_id = student_id.x, course, grade)

-output

# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C

OP's expected output is

 df_final
# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C

Fuzzy Match Across Columns in R

There is in the package stringdist a function stingsim which gives you a number between 0 and 1 for similarities between strings.

Name.1 <- c("gonzalez", "wassermanschultz", "athanasopoulos", "armato")
Name.2 <- c("gonzalezsoldevilla", "schultz", "anthanasopoulos", "strain")
library(stringdist)

df1 <- data.frame(Name.1, Name.2)
df1$similar <- stringsim(Name.1, Name.2)
df1
#>             Name.1             Name.2   similar
#> 1         gonzalez gonzalezsoldevilla 0.4444444
#> 2 wassermanschultz            schultz 0.4375000
#> 3   athanasopoulos    anthanasopoulos 0.9333333
#> 4           armato             strain 0.1666667

fuzzy version of stringr::str_detect for filtering dataframe

You can use agrepl for Approximate String Matching (Fuzzy Matching) which is in base.

example_data[agrep(paste(search_terms, collapse = "|"),
  example_data$disease_phase, 2, ignore.case=TRUE, fixed=FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

Or using Reduce instead of | in the regex.

example_data[Reduce(\(y, x) y | agrepl(x, example_data$disease_phase, 2,
  ignore.case=TRUE), search_terms, FALSE),]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

An alternative might be adist, also in base, which calculates a distance matrix - so it might not be recommended for larger vectors, as the matrix can get large. Here I also choose that a mismatch by 2 characters will be OK.

example_data[colSums(adist(unique(search_terms), example_data$disease_phase,
                           partial=TRUE) < 3) > 0,]
#  project               disease_phase   startdate
#1    A111       Diabetes, Preclinical   01DEC2018
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017

In case only single words are compared it might be more efficient so split the disease_phase into words using strsplit also in base.

. <- strsplit(example_data$disease_phase, "[ ,;]+")
. <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
example_data[unique(unlist(.[Reduce(\(y, x) `[<-`(y, !y, agrepl(x, names(.)[!y],
   2)), tolower(search_terms), logical(length(.)))], FALSE, FALSE)),]
#example_data[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
#   tolower(search_terms), FALSE)], FALSE, FALSE)),] #Alternative
#  project               disease_phase   startdate
#2    A123 Lipid lowering, Perlcinical 17-OKT-2017
#1    A111       Diabetes, Preclinical   01DEC2018

Some simpler examples using agrep:

#Allow 1 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 1)
#[1]  TRUE  TRUE FALSE

#Allow 2 character difference to make match
agrepl("preclinical", c("precinical", "precinicalxyz", "prelcinical"), 2)
#[1] TRUE TRUE TRUE

#Use boundaries to match words
agrepl("\\bpreclinical\\b", c("xyz precinical xyz", "xyzpreclinicalxyz"), 1, fixed=FALSE)
#[1]  TRUE FALSE

How much difference will be allowed can be set with max.distance:

max.distance: Maximum distance allowed for a match.  Expressed either
          as integer, or as a fraction of the _pattern_ length times
          the maximal transformation cost (will be replaced by the
          smallest integer not less than the corresponding fraction),
          or a list with possible components

          ‘cost’: maximum number/fraction of match cost (generalized
              Levenshtein distance)

          ‘all’: maximal number/fraction of _all_ transformations
              (insertions, deletions and substitutions)

          ‘insertions’: maximum number/fraction of insertions

          ‘deletions’: maximum number/fraction of deletions

          ‘substitutions’: maximum number/fraction of substitutions

And also a Benchmark based on @JBGruber:

system.time({  #Libraries needed for method of JBGruber
library(dplyr);
library(stringdist);
library(Rfast);
library(tidytext)
})
#       User      System verstrichen 
#      1.008       0.040       1.046 

set.seed(42)
example_large <- example_data %>% sample_n(5000, replace = TRUE)

stringdist_detect <- function(a, b, method = "osa", thres = 2) {
  Rfast::rowMins(stringdist::stringdistmatrix(a, b, method = method), value = TRUE) <= thres
}

bench::mark(check = FALSE,
  stringdist_detect = {
     example_large %>% 
      tidytext::unnest_tokens(output = "word", input = "disease_phase", drop = FALSE) %>% 
      filter(stringdist_detect(word, tolower(search_terms), method = "lv"))
  },
  GKi ={. <- strsplit(example_large$disease_phase, "[ ,;]+")
   . <- split(rep(seq_along(.), lengths(.)), tolower(unlist(.)))
   example_large[unique(unlist(.[Reduce(\(y, x) y | agrepl(x, names(.), 2),
     tolower(search_terms), FALSE)], FALSE, FALSE)),]
})
#  expression             min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr>        <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 stringdist_detect  17.42ms  18.65ms      52.8    7.15MB    19.4     19     7
#2 GKi                 5.64ms   6.04ms     165.   869.08KB     6.27    79     3

Also much time could be saved when there is only one, right written, variant of the words of interest in search_terms.