R: String Fuzzy Matching Using Jarowinkler

R: String Fuzzy Matching using jarowinkler

Using a toy example:

library(RecordLinkage)
library(dplyr)

ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala','bear','fish')
words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow','cat','horse')

wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)
wordlist %>% group_by(words) %>% mutate(match_score = jarowinkler(words, ref)) %>%
summarise(match = match_score[which.max(match_score)], matched_to = ref[which.max(match_score)])

gives

 words     match matched_to
1 cat 1.0000000 cat
2 cow 1.0000000 cow
3 dog 1.0000000 dog
4 emu 0.5277778 bear
5 horse 1.0000000 horse
6 kiwi 0.5350000 koala
7 pig 1.0000000 pig
8 sheep 1.0000000 sheep

Edit: As a response to the OP's comment, the last command uses the pipeline approach from dplyr, and groups every combination of the raw words and references by the raw words, adds a column match_score with the jarowinkler score, and returns only a summary of the highest match score (indexed by which.max(match_score)), as well as the reference which also is indexed by the maximum match_score.

Text Mining using Jaro-Winkler fuzzy matching in R

Your data.frames:

notes<-data.frame(NoteID=c("a1","b2","c3","d4","e5","a1","b2","c3","d4","e5"),
word=c("hit","hot","shirt","than","thought","hat","get","shirt","that","tough"))
terms<-data.frame(Category=c("a","b","c","d","e"),
word=c("hot","got","shot","that","though"))

Use stringdistmatrix (package stringdist) with method "jw" (jarowinkler)

library(stringdist)
dist<-stringdistmatrix(notes$word,terms$word,method = "jw")
row.names(dist)<-as.character(notes$word)
colnames(dist)<-as.character(terms$word)

Now you have all distances:

dist
hot got shot that though
hit 0.2222222 0.4444444 0.27777778 0.27777778 0.50000000
hot 0.0000000 0.2222222 0.08333333 0.27777778 0.33333333
shirt 0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
than 0.4722222 1.0000000 0.50000000 0.16666667 0.38888889
thought 0.3571429 0.5158730 0.40476190 0.40476190 0.04761905
hat 0.2222222 0.4444444 0.27777778 0.08333333 0.50000000
get 0.4444444 0.2222222 0.47222222 0.47222222 0.50000000
shirt 0.4888889 1.0000000 0.21666667 0.36666667 0.54444444
that 0.2777778 0.4722222 0.33333333 0.00000000 0.38888889
tough 0.4888889 0.4888889 0.51666667 0.51666667 0.05555556

Find the word more close to notes

output<-cbind(notes,word_close=terms[as.numeric(apply(dist, 1, which.min)),"word"],dist_min=apply(dist, 1, min))
output
NoteID word word_close dist_min
1 a1 hit hot 0.22222222
2 b2 hot hot 0.00000000
3 c3 shirt shot 0.21666667
4 d4 than that 0.16666667
5 e5 thought though 0.04761905
6 a1 hat that 0.08333333
7 b2 get got 0.22222222
8 c3 shirt shot 0.21666667
9 d4 that that 0.00000000
10 e5 tough though 0.05555556

If you want have just in word_close the words under a certain distance threshold (in this case 0.1), you can do this:

output[output$dist_min>=0.1,c("word_close","dist_min")]<-NA
output
NoteID word word_close dist_min
1 a1 hit <NA> NA
2 b2 hot hot 0.00000000
3 c3 shirt <NA> NA
4 d4 than <NA> NA
5 e5 thought though 0.04761905
6 a1 hat that 0.08333333
7 b2 get <NA> NA
8 c3 shirt <NA> NA
9 d4 that that 0.00000000
10 e5 tough though 0.05555556

Fuzzy string matching in r

What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates. Any zeros would mean the same release date.

dataset-1$title.match <- ifelse(jarowinkler(dataset-1$title, dataset_2$title) > 0.85, dataset-1$title, NA)

JaroWinkler Method -- Identifying Character/Numeric spots in a string

As suggested in the comment above, I would do an exact string matching. Only uncertainty for now is what do you mean with "characters"? Only letters from A-Z or als e.g. punctuations? If only letters, see the code below.

library(tidyverse)

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF")

str_detect(words, "[[:digit:]]{3}(?=[[:alpha:]]{6})")

which gives:

[1]  TRUE  TRUE FALSE  TRUE

Updating the answer to reflect the TOs changed pattern

words <-c("456GHIJKL","123ABCDEF","78D78DAA2","660ABCDEF", "660A7CDEF")

str_detect(words, "[[:digit:]]{3}(?=[[:alpha:]]{1})(?=[[:digit:]]{1}|[[:alpha:]]{1})(?=[[:alpha:]]{5})")

gives:

[1]  TRUE  TRUE FALSE  TRUE  TRUE

How can I match fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.

Not familiar with AGREP but here's example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
# 1 for already assigned, -1 for unassigned and unassignable
while(any(x==0)){
min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
a_sel <- a[d==min_d & x==0][1]
b_sel <- b[d==min_d & a == a_sel & x==0][1]
x[a==a_sel & b == b_sel] <- 1
x[x==0 & (a==a_sel|b==b_sel)] <- -1
}
cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d
1 Ace Co Ace Co. 0.04762
2 Bayes Bayes Inc. 0.16667
3 asd asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1 Ace Co Ace Co. 0.04762
11 Baes Bayes Inc. 0.20000
8 Bayes Bayes Inc. 0.16667
12 Bays Bayes Inc. 0.20000
10 Bcy Bayes Inc. 0.37778
15 asd asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")

Matching strings with abbreviations; fuzzy matching

Changing the dissimilarity measure to the Jaro distance or Jaro-Winkler
distance works for the example provided in your question.

library(stringdist)

vec.a <- c("ce", "amer", "principl")
vec.b <- c("ceo", "american", "principal")

amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = 0) # Jaro
#> [1] 1 2 3
amatch(vec.a, vec.b, maxDist = 1, method = "jw", p = .2) # Jaro-Winkler
#> [1] 1 2 3

Jaro-Winkler's difference between packages

Tucked away in the documentation for stringdist is the following:

The Jaro-Winkler distance (method=jw, 0<p<=0.25) adds a correction term to the Jaro-distance. It is defined as d − l · p · d, where d is the Jaro-distance. Here, l is obtained by counting, from the start of the input strings, after how many characters the first character mismatch between the two strings occurs, with a maximum of four. The factor p is a penalty factor, which in the work of Winkler is often chosen 0.1.

However, in stringdist::stringdist, p = 0 by default. Hence:

1 - stringdist("advil", c("advi", "advill", "advil", "dvil", "sdvil"), 
method = "jw", p = .1)
# [1] 0.9600000 0.9666667 1.0000000 0.9333333 0.8666667

In fact that value is hard-coded in the source of RecordLinkage::jarowinkler.

Fuzzy string matching of a list of character vectors to a character vector

using purrr package

mylist <- setNames(mylist, c('a', 'b', 'c'))

library(purrr)

map_dfr(charvec,
function(wrd, vec_list){
setNames(map_df(vec_list, ~max(jarowinkler(wrd, .x))),
names(vec_list)
)

},
mylist)

# A tibble: 10 x 3
a b c
<dbl> <dbl> <dbl>
1 0.911 0.580 0.603
2 0.85 0.713 0.603
3 0.842 0.557 0.515
4 0.657 0.490 0.409
5 0.912 0.489 0.659
6 0.538 0.546 0.801
7 0.716 0.547 0.740
8 0.591 0.524 0.856
9 0.675 0.509 0.821
10 0.619 0.587 0.630

If you'd like it wide:

map_dfc(charvec,
function(wrd, vec_list) {
set_names(list(map_dbl(vec_list, ~max(jarowinkler(wrd, .x)))),
wrd)
},
mylist
)

# A tibble: 3 x 10
`brown dog` `lazy cat` `white dress` `I know that` `excuse me plea~ `tall person` `new building` `good example`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.911 0.85 0.842 0.657 0.912 0.538 0.716 0.591
2 0.580 0.713 0.557 0.490 0.489 0.546 0.547 0.524
3 0.603 0.603 0.515 0.409 0.659 0.801 0.740 0.856
# ... with 2 more variables: `green with envy` <dbl>, `zebra crossing` <dbl>


Related Topics



Leave a reply



Submit