How to Fuzzy Match Strings from Two Datasets

How can I fuzzy match strings from two datasets?

The solution depends on the desired cardinality of the match between a and b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six, one for each name in a.

One-to-one case (requires assignment algorithm):

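When I've had to do this before, I've treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution, you'd be better off with an exact assignment solver (for example the Hungarian algorithm, available as solve_LSAP in the clue package) than with optim, which is a general-purpose continuous optimizer.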

I'm not familiar with agrep, but here's an example using stringdist to build your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # Jaro-Winkler string distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assignment status: 0 = unassigned but assignable,
  # 1 = already assigned, -1 = unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1] 
    b_sel <- b[d==min_d & a == a_sel & x==0][1] 
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
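
For readers who want to compare the greedy heuristic with an exhaustive search, here is a minimal Python sketch of both. It is not a port of the R code above: the names in a and b are read off the printed results in this answer, difflib's SequenceMatcher ratio stands in for the Jaro-Winkler distance (so the distances differ from the ones shown), and the helper names dist, greedy_assign and optimal_assign are made up for this example. It also assumes the names are unique and that b is no longer than a.

```python
from difflib import SequenceMatcher
from itertools import permutations

# Names read off the printed results in this answer.
a = ["Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"]
b = ["Ace Co.", "Bayes Inc.", "asdf"]


def dist(s, t):
    """Dissimilarity in [0, 1]; a stand-in for stringdist(), not Jaro-Winkler."""
    return 1.0 - SequenceMatcher(None, s, t).ratio()


def greedy_assign(a, b):
    """Repeatedly take the closest pair whose members are both still free."""
    pairs = sorted((dist(x, y), x, y) for x in a for y in b)
    used_a, used_b, out = set(), set(), []
    for d, x, y in pairs:
        if x not in used_a and y not in used_b:
            out.append((x, y, round(d, 3)))
            used_a.add(x)
            used_b.add(y)
    return out


def optimal_assign(a, b):
    """Exhaustive search over all one-to-one assignments (small inputs only)."""
    best = min(
        permutations(a, len(b)),
        key=lambda p: sum(dist(x, y) for x, y in zip(p, b)),
    )
    return [(x, y, round(dist(x, y), 3)) for x, y in zip(best, b)]


greedy = greedy_assign(a, b)
optimal = optimal_assign(a, b)
print(greedy)
print(optimal)
```

On this small example both approaches pick the same three pairs. The exhaustive search grows factorially with the input size, so it is only useful as a check on small data; for anything larger a proper assignment solver is the better choice.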

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")
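
The many-to-one idea (each name in a simply takes its nearest name in b, with no constraint on reuse) can be sketched in Python as follows. This is an illustration under assumptions, not a translation of the R one-liner: the names are read off the printed result above, difflib's SequenceMatcher ratio replaces the Jaro-Winkler distance, and similarity and nearest are hypothetical helper names.

```python
from difflib import SequenceMatcher

# Names read off the printed results in this answer.
a = ["Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"]
b = ["Ace Co.", "Bayes Inc.", "asdf"]


def similarity(s, t):
    """Similarity in [0, 1]; a stand-in for 1 - stringdist(..., method="jw")."""
    return SequenceMatcher(None, s, t).ratio()


def nearest(a, b):
    """Map every name in a to its most similar name in b (b may be reused)."""
    return {x: max(b, key=lambda y: similarity(x, y)) for x in a}


matches = nearest(a, b)
for name, match in matches.items():
    print(name, "->", match)
```

Note one difference in behavior: max() keeps only the first candidate when several are tied, whereas the R code keeps every row that attains the minimum distance.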

Approximate string matching in R between two datasets

You can do the fuzzy matching with agrepl(), the variant of agrep() that returns a logical vector. Here I call it once per title with lapply(), which gives, for each title, a logical vector of matches across the Text values. I then apply() row-wise over a data.frame built from those vectors to paste together the matched titles for each Text.

You can tweak the max.distance value, but this worked just fine on your example.

dt1 <- data.frame(
  title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
  genre = c("Dramas", "Action & Adventure", "Kids' TV"),
  stringsAsFactors = FALSE
)

dt2 <- data.frame(
  id = 1:5,
  Text = c(
    "I really liked V for Vendetta",
    "Bottersnikes & Gumbles was a great film .... ",
    "In any case, in my opinion bottersnikes &gumbles was a great film ...",
    "@thewitcher was an interesting series",
    "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
  ),
  stringsAsFactors = FALSE
)

match_titles <- function(target, titles) {
  matches <- lapply(titles, agrepl, target,
    max.distance = 0.3,
    ignore.case = TRUE, fixed = TRUE
  )
  matched_titles <- apply(
    data.frame(matches), 1,
    function(y) paste(titles[y], collapse = ",")
  )
  matched_titles
}

dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
##   id                                                                  Text
## 1  1                                         I really liked V for Vendetta
## 2  2                         Bottersnikes & Gumbles was a great film .... 
## 3  3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4  4                                 @thewitcher was an interesting series
## 5  5     Secret in Their Eye is a terrible film! but I Like V per Vendetta
##                                titles
## 1                      V for Vendetta
## 2              Bottersnikes & Gumbles
## 3              Bottersnikes & Gumbles
## 4                                    
## 5 Secret in Their Eyes,V for Vendetta
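
To make the idea behind approximate substring matching more concrete, here is a small Python sketch. It is not how agrepl() is implemented; it uses the classic dynamic-programming variant of Levenshtein distance in which a match may start anywhere in the text. The helper names min_substring_distance and approx_contains are invented for this example, and turning the 0.3 fraction into ceil(0.3 * pattern length) is my reading of the agrep documentation, so treat it as an assumption.

```python
import math


def min_substring_distance(pattern, text):
    """Smallest Levenshtein distance between pattern and any substring of text."""
    # Row for the empty pattern: it matches at every position at cost 0.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        cur = [i]  # first i pattern characters against an empty substring
        for j, tc in enumerate(text, start=1):
            cur.append(min(
                prev[j] + 1,               # drop a pattern character
                cur[j - 1] + 1,            # skip an extra text character
                prev[j - 1] + (pc != tc),  # match or substitute
            ))
        prev = cur
    return min(prev)


def approx_contains(pattern, text, max_fraction=0.3):
    """Case-insensitive test: is pattern found in text within the edit budget?"""
    limit = math.ceil(max_fraction * len(pattern))
    return min_substring_distance(pattern.lower(), text.lower()) <= limit


titles = ["Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"]
text = "Secret in Their Eye is a terrible film! but I Like V per Vendetta"
print([t for t in titles if approx_contains(t, text)])
```

The function can be applied to every (title, Text) pair in the same way the R code loops over titles with lapply().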

Fuzzy Matching with different fuzz ratios

Here is one way to do it:

import pandas as pd
from fuzzywuzzy import fuzz

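# Setup: df1 and df2 are the data frames from the question.
# Prefix df1's columns so they stay distinguishable from df2's after the merge.
df1.columns = [f"df1_{col}" for col in df1.columns]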

# Add new columns
df1["fuzz_ratio_lname"] = (
    df1["df1_lname"]
    .apply(
        lambda x: max(
            [(value, fuzz.ratio(x, value)) for value in df2["lname"]],
            key=lambda x: x[1],
        )
    )
    .apply(lambda x: x if x[1] > 75 else pd.NA)
)

df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
    df1["fuzz_ratio_lname"].tolist(), index=df1.index
)
df1 = (
    pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
    .drop(columns="lname")
    .rename(columns={"fname": "df2_fname"})
)

df1["df2_fname"] = df1["df2_fname"].fillna(value="")
for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
    ratio = fuzz.ratio(x, value)
    df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA

# Cleanup
df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
df1 = df1[
    [
        "df1_ein",
        "df1_ein_name",
        "df1_lname",
        "df1_fname",
        "fuzz_ratio_lname",
        "fuzz_ratio_fname",
        "df2_lname",
        "df2_fname",
        "score",
    ]
]

print(df1)
# Output
   df1_ein         df1_ein_name df1_lname df1_fname  fuzz_ratio_lname  \
0     1001       H for Humanity    Cooper   Bradley              83.0   
1     1500          Labor Union    Cruise    Thomas             100.0   
2     3000  Something something      Pitt      Brad               NaN   

  fuzz_ratio_fname df2_lname df2_fname  score  
0             62.0    Couper    M Brad    2.5  
1             67.0    Cruise       Tom    3.5  
2             <NA>      <NA>      <NA>    NaN
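
The core step in the code above, "take the best-scoring candidate and keep it only if the score clears a cutoff", can be isolated in a few lines. The sketch below uses only the standard library: difflib's SequenceMatcher ratio scaled to 0-100 stands in for fuzz.ratio (the two should be roughly comparable, but that is an assumption, and results may differ depending on how fuzzywuzzy is installed). The names are taken from the printed output above; ratio and best_match are hypothetical helper names.

```python
from difflib import SequenceMatcher


def ratio(s, t):
    """Similarity on a 0-100 scale; a stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, s, t).ratio())


def best_match(name, candidates, cutoff):
    """Return (candidate, score) for the best candidate, or (None, None) if too weak."""
    best = max(candidates, key=lambda c: ratio(name, c))
    score = ratio(name, best)
    return (best, score) if score > cutoff else (None, None)


df2_lnames = ["Couper", "Cruise"]  # last names seen in the printed output
for lname in ["Cooper", "Cruise", "Pitt"]:
    print(lname, best_match(lname, df2_lnames, cutoff=75))
```

Using a different cutoff per column (75 for last names, 60 for first names in the code above) is then just a matter of passing a different cutoff value.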

Join two large datasets in R using both exact and fuzzy matching

Maybe something like the following will do what the question asks for. It uses package stringdist, not fuzzyjoin.

First, merge by Postcode only, since those matches are exact. Then compute the similarity between the two Name columns. Keep the rows whose similarity is at or above a predetermined threshold.

thresh <- 0.75

df_final <- merge(df2[c(1, 6:7)], df1[c(1, 6)], by = "Postcode", suffixes = c("",".y"))
i <- apply(df_final[c(2, 4)], 1, function(x) {stringdist::stringsim(x[1], x[2], method = 'jw')}) >= thresh

df_final <- df_final[i, c(2, 1, 3)]

df_final 
#               Name Postcode RatingValue
#1      BETA Limited  BN1 6LD           3
#2 Giga Incorporated   G2 8LY           5
#3              ACME  PA4 8QU           1
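
The same two-stage idea (exact join on one key, then a fuzzy filter on another column) can be sketched in plain Python. Everything in this example is illustrative: the records are hypothetical because the original data frames are not shown here, difflib's SequenceMatcher ratio replaces the Jaro-Winkler similarity, and name_similarity is a made-up helper name. Only the 0.75 threshold is taken from the R code.

```python
from collections import defaultdict
from difflib import SequenceMatcher

THRESH = 0.75  # same cutoff as in the R code

# Hypothetical records for illustration only.
ratings = [
    {"name": "BETA Limited", "postcode": "BN1 6LD", "rating": 3},
    {"name": "ACME", "postcode": "PA4 8QU", "rating": 1},
    {"name": "Giga Incorporated", "postcode": "G2 8LY", "rating": 5},
]
registry = [
    {"name": "Beta Limitd", "postcode": "BN1 6LD"},        # same postcode, similar name
    {"name": "Zenith Holdings", "postcode": "PA4 8QU"},    # same postcode, different name
    {"name": "Giga Incorporated", "postcode": "XX9 9XX"},  # same name, different postcode
]


def name_similarity(s, t):
    """Case-insensitive similarity in [0, 1]; a stand-in for stringsim(method='jw')."""
    return SequenceMatcher(None, s.lower(), t.lower()).ratio()


# Stage 1: exact join on postcode via a lookup table.
by_postcode = defaultdict(list)
for row in registry:
    by_postcode[row["postcode"]].append(row)

# Stage 2: keep joined pairs whose names are similar enough.
kept = [
    (r["name"], r["postcode"], r["rating"])
    for r in ratings
    for other in by_postcode.get(r["postcode"], [])
    if name_similarity(r["name"], other["name"]) >= THRESH
]
print(kept)
```

Joining on the exact key first keeps the number of string comparisons small, which matters when both datasets are large.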
