R: String Fuzzy Matching Using Jarowinkler

R: String Fuzzy Matching using jarowinkler

Using a toy example:

library(RecordLinkage)
library(dplyr)

ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala','bear','fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow','cat','horse')

wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)
wordlist %>% group_by(words) %>% mutate(match_score = jarowinkler(words, ref)) %>%
summarise(match = match_score[which.max(match_score)], matched_to = ref[which.max(match_score)])

gives

 words     match matched_to
1   cat 1.0000000        cat
2   cow 1.0000000        cow
3   dog 1.0000000        dog
4   emu 0.5277778       bear
5 horse 1.0000000      horse
6  kiwi 0.5350000      koala
7   pig 1.0000000        pig
8 sheep 1.0000000      sheep

Edit: As a response to the OP's comment, the last command uses the pipeline approach from dplyr, and groups every combination of the raw words and references by the raw words, adds a column match_score with the jarowinkler score, and returns only a summary of the highest match score (indexed by which.max(match_score)), as well as the reference which also is indexed by the maximum match_score.

Text Mining using Jaro-Winkler fuzzy matching in R

Your data.frames:

notes<-data.frame(NoteID=c("a1","b2","c3","d4","e5","a1","b2","c3","d4","e5"),
                  word=c("hit","hot","shirt","than","thought","hat","get","shirt","that","tough"))
terms<-data.frame(Category=c("a","b","c","d","e"),
                  word=c("hot","got","shot","that","though"))

Use stringdistmatrix (package stringdist) with method "jw" (jarowinkler)

library(stringdist)
dist<-stringdistmatrix(notes$word,terms$word,method = "jw")
row.names(dist)<-as.character(notes$word)
colnames(dist)<-as.character(terms$word)

Now you have all distances:

dist
              hot       got       shot       that     though
hit     0.2222222 0.4444444 0.27777778 0.27777778 0.50000000
hot     0.0000000 0.2222222 0.08333333 0.27777778 0.33333333
shirt   0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
than    0.4722222 1.0000000 0.50000000 0.16666667 0.38888889
thought 0.3571429 0.5158730 0.40476190 0.40476190 0.04761905
hat     0.2222222 0.4444444 0.27777778 0.08333333 0.50000000
get     0.4444444 0.2222222 0.47222222 0.47222222 0.50000000
shirt   0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
that    0.2777778 0.4722222 0.33333333 0.00000000 0.38888889
tough   0.4888889 0.4888889 0.51666667 0.51666667 0.05555556

Find the word more close to notes

output<-cbind(notes,word_close=terms[as.numeric(apply(dist, 1, which.min)),"word"],dist_min=apply(dist, 1, min))
output
       NoteID    word word_close   dist_min
    1      a1     hit        hot 0.22222222
    2      b2     hot        hot 0.00000000
    3      c3   shirt       shot 0.21666667
    4      d4    than       that 0.16666667
    5      e5 thought     though 0.04761905
    6      a1     hat       that 0.08333333
    7      b2     get        got 0.22222222
    8      c3   shirt       shot 0.21666667
    9      d4    that       that 0.00000000
    10     e5   tough     though 0.05555556

If you want have just in word_close the words under a certain distance threshold (in this case 0.1), you can do this:

output[output$dist_min>=0.1,c("word_close","dist_min")]<-NA
output
   NoteID    word word_close   dist_min
1      a1     hit       <NA>         NA
2      b2     hot        hot 0.00000000
3      c3   shirt       <NA>         NA
4      d4    than       <NA>         NA
5      e5 thought     though 0.04761905
6      a1     hat       that 0.08333333
7      b2     get       <NA>         NA
8      c3   shirt       <NA>         NA
9      d4    that       that 0.00000000
10     e5   tough     though 0.05555556

Fuzzy string matching in r

What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates. Any zeros would mean the same release date.

dataset-1$title.match <- ifelse(jarowinkler(dataset-1$title, dataset_2$title) > 0.85, dataset-1$title, NA)

JaroWinkler Method -- Identifying Character/Numeric spots in a string

As suggested in the comment above, I would do an exact string matching. Only uncertainty for now is what do you mean with "characters"? Only letters from A-Z or als e.g. punctuations? If only letters, see the code below.

library(tidyverse)

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")

str_detect(words, "[[:digit:]]{3}(?=[[:alpha:]]{6})")

which gives:

[1]  TRUE  TRUE FALSE  TRUE

Updating the answer to reflect the TOs changed pattern

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF", "660A7CDEF")

str_detect(words, "[[:digit:]]{3}(?=[[:alpha:]]{1})(?=[[:digit:]]{1}|[[:alpha:]]{1})(?=[[:alpha:]]{5})")

gives:

[1]  TRUE  TRUE FALSE  TRUE  TRUE

How can I match fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.

Not familiar with AGREP but here's example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable, 
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1] 
    b_sel <- b[d==min_d & a == a_sel & x==0][1] 
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")

Matching strings with abbreviations; fuzzy matching

Changing the dissimilarity measure to the Jaro distance or Jaro-Winkler
distance works for the example provided in your question.

library(stringdist)

vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = 0) # Jaro
#> [1] 1 2 3
amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = .2) # Jaro-Winkler
#> [1] 1 2 3

Jaro-Winkler's difference between packages

Tucked away in the documentation for stringdist is the following:

The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d − l · p · d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a penalty factor, which in the work of Winkler is often chosen 0.1.

However, in stringdist::stringdist, p = 0 by default. Hence:

1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), 
               method = "jw", p = .1)
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667

In fact that value is hard-coded in the source of RecordLinkage::jarowinkler.

Fuzzy string matching of a list of character vectors to a character vector

using purrr package

mylist <- setNames(mylist, c('a', 'b', 'c'))

library(purrr)

map_dfr(charvec,
    function(wrd, vec_list){
      setNames(map_df(vec_list, ~max(jarowinkler(wrd, .x))),
               names(vec_list)
      )

    },
    mylist)

# A tibble: 10 x 3
       a     b     c
   <dbl> <dbl> <dbl>
 1 0.911 0.580 0.603
 2 0.85  0.713 0.603
 3 0.842 0.557 0.515
 4 0.657 0.490 0.409
 5 0.912 0.489 0.659
 6 0.538 0.546 0.801
 7 0.716 0.547 0.740
 8 0.591 0.524 0.856
 9 0.675 0.509 0.821
10 0.619 0.587 0.630

If you'd like it wide:

map_dfc(charvec,
         function(wrd, vec_list) {
          set_names(list(map_dbl(vec_list, ~max(jarowinkler(wrd, .x)))),
                    wrd)
         },
        mylist
)

# A tibble: 3 x 10
  `brown dog` `lazy cat` `white dress` `I know that` `excuse me plea~ `tall person` `new building` `good example`
        <dbl>      <dbl>         <dbl>         <dbl>            <dbl>         <dbl>          <dbl>          <dbl>
1       0.911      0.85          0.842         0.657            0.912         0.538          0.716          0.591
2       0.580      0.713         0.557         0.490            0.489         0.546          0.547          0.524
3       0.603      0.603         0.515         0.409            0.659         0.801          0.740          0.856
# ... with 2 more variables: `green with envy` <dbl>, `zebra crossing` <dbl>

R: String Fuzzy Matching Using Jarowinkler