R: I have to do Softmatch in String
agrep is definitely a quick and easy base R solution if you have just a bit of data. If this is just a toy example of a larger data frame, you may be interested in a more durable tool. In the past month, learning about the Levenshtein distance noted by @PaulHiemstra (also in these different questions) led me to the RecordLinkage package. The vignettes leave me wanting more examples of "soft" or "fuzzy" matches, particularly across more than one field, but the basic answer to your question could be something like:
library(RecordLinkage)
col <- data.frame(names1 = c("John Collingson","J Collingson","Dummy Name1","Dummy Name2"))
inputText <- data.frame(names2 = c("J Collingson"))
g1 <- compare.linkage(inputText, col, strcmp = TRUE)
g2 <- epiWeights(g1)
getPairs(g2, min.weight=0.6)
#   id          names2 Weight
# 1  1    J Collingson
# 2  2    J Collingson  1.000
# 3
# 4  1    J Collingson
# 5  1 John Collingson  0.815
inputText2 <- data.frame(names2 = c("Jon Collinson"))
g1 <- compare.linkage(inputText2, col, strcmp = TRUE)
g2 <- epiWeights(g1)
getPairs(g2, min.weight=0.6)
#   id          names2    Weight
# 1  1   Jon Collinson
# 2  1 John Collingson 0.9644444
# 3
# 4  1   Jon Collinson
# 5  2    J Collingson 0.7924825
Start with compare.linkage() or compare.dedup(); for large data sets, use RLBigDataLinkage() or RLBigDataDedup() instead. Hope this helps.
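For comparison, here's a minimal sketch of the base R agrep route mentioned at the top, run on the same candidate names. The max.distance = 2 cutoff is my own assumption for this toy data, not something from the question, so tune it for real data. Keep in mind that agrep looks for an approximate match of the pattern anywhere inside each string, so it behaves more like a fuzzy substring search than a whole-string comparison.

```r
# Candidate names from the example above
candidates <- c("John Collingson", "J Collingson", "Dummy Name1", "Dummy Name2")

# Allow up to two edits between the pattern and some part of each candidate.
# The cutoff of 2 is an assumed value for illustration only.
hits <- agrep("J Collingson", candidates, max.distance = 2, value = TRUE)
hits
```

Unlike the RecordLinkage output, this gives no weight per match, only the strings that fall within the cutoff.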
Merging two Data Frames using Fuzzy/Approximate String Matching in R
Approximate string matching is not a good idea here, since a single incorrect match would invalidate the whole analysis. If the names from each source are the same each time, then building indexes seems the best option to me too. This is easily done in R:
Suppose you have the data:
a<-data.frame(name=c('Ace','Bayes'),price=c(10,13))
b<-data.frame(name=c('Ace Co.','Bayes Inc.'),qty=c(9,99))
Build an index of names for each source one time, perhaps using pmatch etc. as a starting point and then validating manually.
a.idx<-data.frame(name=c('Ace','Bayes'),idx=c(1,2))
b.idx<-data.frame(name=c('Ace Co.','Bayes Inc.'), idx=c(1,2))
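If you'd rather not type the index by hand, here's a rough sketch of the pmatch starting point mentioned above. The object names (a_names, b_names, draft) are made up for illustration, and the result is only a draft to check manually: pmatch returns NA whenever a name has no unique partial match.

```r
a_names <- c("Ace", "Bayes")
b_names <- c("Ace Co.", "Bayes Inc.")

# For each short name, find the position of the unique longer name it starts
hit <- pmatch(a_names, b_names)

# Draft lookup table to review by hand before using it as an index
draft <- data.frame(a_name = a_names,
                    b_name = b_names[hit],
                    idx    = seq_along(a_names),
                    stringsAsFactors = FALSE)
draft
```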
Then for each run merge using:
a.rich<-merge(a,a.idx,by="name")
b.rich<-merge(b,b.idx,by="name")
merge(a.rich,b.rich,by="idx")
Which would give us:
  idx name.x price     name.y qty
1   1    Ace    10    Ace Co.   9
2   2  Bayes    13 Bayes Inc.  99
How to merge two data frames using (parts of) text values?
Try the RecordLinkage package.
Here is a possible solution where the merge works based on generally how "close" the two "words" match:
library(reshape2)
library(RecordLinkage)
set.seed(16)
l <- LETTERS[1:10]
ex1 <- data.frame(lets = paste(l, l, l, sep = ""), nums = 1:10)
ex2 <- data.frame(lets = paste(sample(l), sample(l), sample(l), sep = ""),
                  nums = 11:20)
ex1
# lets nums
# 1 AAA 1
# 2 BBB 2
# 3 CCC 3
# 4 DDD 4
# 5 EEE 5
# 6 FFF 6
# 7 GGG 7
# 8 HHH 8
# 9 III 9
# 10 JJJ 10
ex2
# lets nums
# 1 GDJ 11
# 2 CFH 12
# 3 DBE 13
# 4 BED 14
# 5 FJB 15
# 6 JHG 16
# 7 AII 17
# 8 ICC 18
# 9 EGF 19
# 10 HAA 20
lets <- melt(outer(ex1$lets, ex2$lets, FUN = "levenshteinDist"))
lets <- lets[lets$value < 2, ] # adjust the "< 2" as necessary
cbind(ex1[lets$Var1, ], ex2[lets$Var2, ])
# lets nums lets nums
# 9 III 9 AII 17
# 3 CCC 3 ICC 18
# 1 AAA 1 HAA 20
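If you don't want the extra packages, the same threshold idea can be sketched with base R's adist(), which returns a Levenshtein distance matrix directly. The two small data frames below are hypothetical toy data (not the sampled ex2 above), and the "< 2" cutoff is carried over from the answer as an assumption.

```r
# Hypothetical toy data, fixed so the result does not depend on sampling
x <- data.frame(lets = c("AAA", "BBB", "CCC"), nums = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(lets = c("AAB", "XYZ", "CCC"), nums = 11:13,
                stringsAsFactors = FALSE)

# Rows correspond to x$lets, columns to y$lets
d <- adist(x$lets, y$lets)

# Keep pairs closer than the cutoff (adjust the "< 2" as necessary)
hits <- which(d < 2, arr.ind = TRUE)

matched <- data.frame(x[hits[, "row"], ], y[hits[, "col"], ], dist = d[hits])
matched
```

Here "AAA" pairs with "AAB" (one substitution) and "CCC" pairs with itself, while "BBB" finds no partner within the cutoff.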
How can I match fuzzy match strings from two datasets?
The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.
One-to-one case (requires assignment algorithm):
When I've had to do this before, I've treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution you'd be better off with optim.
I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.
library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # Jaro-Winkler distance (use your favorite method here)
# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a,b,d){
  x <- numeric(length(a)) # assignment variable: 0 for unassigned but assignable,
  # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x==0)){
    min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d==min_d & x==0][1]
    b_sel <- b[d==min_d & a == a_sel & x==0][1]
    x[a==a_sel & b == b_sel] <- 1
    x[x==0 & (a==a_sel|b==b_sel)] <- -1
  }
  cbind(a=a[x==1],b=b[x==1],d=d[x==1])
}
data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))
Produces the assignment:
       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333
I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.
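For what it's worth, here is one way the same greedy idea might look when it works on a distance matrix instead of the long-form table. This is only a sketch: greedy_match is a made-up name, it uses base R's adist() (Levenshtein) rather than stringdist's "jw" so it runs without extra packages, and the input names are my reconstruction from the printed results, not the question's actual data. Distances will therefore differ from the output above.

```r
greedy_match <- function(a, b, dist_fun = adist) {
  d <- dist_fun(a, b)               # length(a) x length(b) distance matrix
  storage.mode(d) <- "double"       # so used cells can be marked with Inf
  ia <- integer(0); ib <- integer(0); dd <- numeric(0)
  while (any(is.finite(d))) {
    # Closest remaining pair; the first one wins if there are ties
    k <- which(d == min(d), arr.ind = TRUE)[1, ]
    ia <- c(ia, k[["row"]])
    ib <- c(ib, k[["col"]])
    dd <- c(dd, d[k[["row"]], k[["col"]]])
    d[k[["row"]], ] <- Inf          # this element of a is now taken
    d[, k[["col"]]] <- Inf          # this element of b is now taken
  }
  data.frame(a = a[ia], b = b[ib], d = dd, stringsAsFactors = FALSE)
}

# Assumed inputs for illustration
a_names <- c("Ace Co", "Bayes", "asd", "Bcy")
b_names <- c("Ace Co.", "Bayes Inc.", "asdf")
res <- greedy_match(a_names, b_names)
res
```

Each name in b is used at most once, so "Bcy" is left unmatched once "Bayes Inc." has been claimed by "Bayes".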
Many-to-one case (not an assignment problem):
do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
Produces the result:
   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333
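The many-to-one case can also be sketched in matrix form: for each row of the distance matrix, pick the column with the smallest distance. The example below uses base R's adist() and hypothetical fruit names of my own choosing, purely to show the mechanics; with stringdist you would build the matrix with its own functions instead.

```r
# Hypothetical inputs for illustration
a_names <- c("apple", "appel", "banana", "bananna")
b_names <- c("apple", "banana")

d <- adist(a_names, b_names)          # rows: a_names, columns: b_names

# Index of the nearest b for every a (first one wins on ties)
nearest <- apply(d, 1, which.min)

many_to_one <- data.frame(a_name = a_names,
                          b_name = b_names[nearest],
                          dist   = d[cbind(seq_along(a_names), nearest)],
                          stringsAsFactors = FALSE)
many_to_one
```

Every name in a gets a partner here, however poor the match, so in practice you may want to drop rows whose distance exceeds some cutoff.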
Edit: use method="jw" to produce the desired results. See help("stringdist-package").