Joining Two Datasets Using Fuzzy Logic

Joining two datasets using fuzzy logic

I was able to join the two datasets, using the fuzzyjoin package:

library(fuzzyjoin)
stringdist_inner_join(Dataset1, Data2,
by ="Name", distance_col = NULL)

Join two large datasets in R using both exact and fuzzy matching

Maybe something like the following will do what the question asks for. It uses package stringdist, not fuzzyjoin.

First, merge by Postcode only, since the matches are exact. Then get similarities between Name's. If they are above a predetermined threshold, keep those rows.

thresh <- 0.75

df_final <- merge(df2[c(1, 6:7)], df1[c(1, 6)], by = "Postcode", suffixes = c("",".y"))
i <- apply(df_final[c(2, 4)], 1, function(x) {stringdist::stringsim(x[1], x[2], method = 'jw')}) >= thresh

df_final <- df_final[i, c(2, 1, 3)]

df_final
# Name Postcode RatingValue
#1 BETA Limited BN1 6LD 3
#2 Giga Incorporated G2 8LY 5
#3 ACME PA4 8QU 1

is it possible to do fuzzy match merge with python pandas?

Similar to @locojay suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]:
letter
one a
two b
three c
four d
five e

In [31]: df1.join(df2)
Out[31]:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e

.

If these were columns, in the same vein you could apply to the column then merge:

df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)

How can I match fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.

Not familiar with AGREP but here's example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable,
# 1 for already assigned, -1 for unassigned and unassignable
while(any(x==0)){
min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
a_sel <- a[d==min_d & x==0][1]
b_sel <- b[d==min_d & a == a_sel & x==0][1]
x[a==a_sel & b == b_sel] <- 1
x[x==0 & (a==a_sel|b==b_sel)] <- -1
}
cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))

Produces the assignment:

       a          b       d
1 Ace Co Ace Co. 0.04762
2 Bayes Bayes Inc. 0.16667
3 asd asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1 Ace Co Ace Co. 0.04762
11 Baes Bayes Inc. 0.20000
8 Bayes Bayes Inc. 0.16667
12 Bays Bayes Inc. 0.20000
10 Bcy Bayes Inc. 0.37778
15 asd asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")



Related Topics



Leave a reply



Submit