How can I match fuzzy match strings from two datasets?
The solution depends on the desired cardinality of your matching a
to b
. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim
.
Not familiar with AGREP but here's example using stringdist
for your distance matrix.
library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)
# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
# 1 for already assigned, -1 for unassigned and unassignable
while(any(x==0)){
min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
a_sel <- a[d==min_d & x==0][1]
b_sel <- b[d==min_d & a == a_sel & x==0][1]
x[a==a_sel & b == b_sel] <- 1
x[x==0 & (a==a_sel|b==b_sel)] <- -1
}
cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))
Produces the assignment:
a b d
1 Ace Co Ace Co. 0.04762
2 Bayes Bayes Inc. 0.16667
3 asd asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
a_name b_name dist
1 Ace Co Ace Co. 0.04762
11 Baes Bayes Inc. 0.20000
8 Bayes Bayes Inc. 0.16667
12 Bays Bayes Inc. 0.20000
10 Bcy Bayes Inc. 0.37778
15 asd asdf 0.08333
Edit: use method="jw"
to produce desired results. See help("stringdist-package")
Approximate string matching in R between two datasets
You can do the fuzzy matching via agrep()
, which here I've used for each title with lapply()
to generate a logical vector of matches for each Text, and then used an apply()
across a data.frame from this match to create the vector of matched titles.
You can tweak the max.distance
value but this worked just fine on your example.
dt1 <- data.frame(
title = c("Secret in Their Eyes", "V for Vendetta", "Bottersnikes & Gumbles"),
genre = c("Dramas", "Action & Adventure", "Kids' TV"),
stringsAsFactors = FALSE
)
dt2 <- data.frame(
id = 1:5,
Text = c(
"I really liked V for Vendetta",
"Bottersnikes & Gumbles was a great film .... ",
"In any case, in my opinion bottersnikes &gumbles was a great film ...",
"@thewitcher was an interesting series",
"Secret in Their Eye is a terrible film! but I Like V per Vendetta"
),
stringsAsFactors = FALSE
)
match_titles <- function(target, titles) {
matches <- lapply(titles, agrepl, target,
max.distance = 0.3,
ignore.case = TRUE, fixed = TRUE
)
matched_titles <- apply(
data.frame(matches), 1,
function(y) paste(titles[y], collapse = ",")
)
matched_titles
}
dt2$titles <- match_titles(dt2$Text, dt1$title)
dt2
## id Text
## 1 1 I really liked V for Vendetta
## 2 2 Bottersnikes & Gumbles was a great film ....
## 3 3 In any case, in my opinion bottersnikes &gumbles was a great film ...
## 4 4 @thewitcher was an interesting series
## 5 5 Secret in Their Eye is a terrible film! but I Like V per Vendetta
## titles
## 1 V for Vendetta
## 2 Bottersnikes & Gumbles
## 3 Bottersnikes & Gumbles
## 4
## 5 Secret in Their Eyes,V for Vendetta
Fuzzy Matching with different fuzz ratios
Here is one way to do it:
import pandas as pd
from fuzzywuzzy import fuzz
# Setup
df1.columns = [f"df1_{col}" for col in df1.columns]
# Add new columns
df1["fuzz_ratio_lname"] = (
df1["df1_lname"]
.apply(
lambda x: max(
[(value, fuzz.ratio(x, value)) for value in df2["lname"]],
key=lambda x: x[1],
)
)
.apply(lambda x: x if x[1] > 75 else pd.NA)
)
df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
df1["fuzz_ratio_lname"].tolist(), index=df1.index
)
df1 = (
pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
.drop(columns="lname")
.rename(columns={"fname": "df2_fname"})
)
df1["df2_fname"] = df1["df2_fname"].fillna(value="")
for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
ratio = fuzz.ratio(x, value)
df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA
# Cleanup
df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
df1 = df1[
[
"df1_ein",
"df1_ein_name",
"df1_lname",
"df1_fname",
"fuzz_ratio_lname",
"fuzz_ratio_fname",
"df2_lname",
"df2_fname",
"score",
]
]
print(df1)
# Output
df1_ein df1_ein_name df1_lname df1_fname fuzz_ratio_lname \
0 1001 H for Humanity Cooper Bradley 83.0
1 1500 Labor Union Cruise Thomas 100.0
2 3000 Something something Pitt Brad NaN
fuzz_ratio_fname df2_lname df2_fname score
0 62.0 Couper M Brad 2.5
1 67.0 Cruise Tom 3.5
2 <NA> <NA> <NA> NaN
Join two large datasets in R using both exact and fuzzy matching
Maybe something like the following will do what the question asks for. It uses package stringdist
, not fuzzyjoin
.
First, merge
by Postcode
only, since the matches are exact. Then get similarities between Name
's. If they are above a predetermined threshold, keep those rows.
thresh <- 0.75
df_final <- merge(df2[c(1, 6:7)], df1[c(1, 6)], by = "Postcode", suffixes = c("",".y"))
i <- apply(df_final[c(2, 4)], 1, function(x) {stringdist::stringsim(x[1], x[2], method = 'jw')}) >= thresh
df_final <- df_final[i, c(2, 1, 3)]
df_final
# Name Postcode RatingValue
#1 BETA Limited BN1 6LD 3
#2 Giga Incorporated G2 8LY 5
#3 ACME PA4 8QU 1
Related Topics
Select Groups With More Than One Distinct Value
How to Put Labels Over Geom_Bar in R With Ggplot2
Convert Column With Pipe Delimited Data into Dummy Variables
Error in ≪My Code≫: Target of Assignment Expands to Non-Language Object
How to Save Plots That Are Made in a Shiny App
Control Ggplot2 Legend Look Without Affecting the Plot
How to Subtract/Add Days From/To a Date
Get "Embedded Nul(S) Found in Input" When Reading a CSV Using Read.Csv()
Dynamically Add Plots to Web Page Using Shiny
How to Load Packages in R Automatically
Painless Way to Install a New Version of R
Why Is the Parallel Package Slower Than Just Using Apply
Collapsing Rows Where Some Are All Na, Others Are Disjoint With Some Nas
R Shiny Passing Reactive to Selectinput Choices
R - Concatenate Two Dataframes
Dplyr: How to Use Group_By Inside a Function
Adding Minor Tick Marks to the X Axis in Ggplot2 (With No Labels)