R: I Have to Do Softmatch in String

agrep is definitely a quick and easy base R solution if you have just a bit of data. If this is just a toy example of a larger data frame, you may be interested in a more durable tool. In the past month, learning about the Levenshtein distance noted by @PaulHiemstra (also in these other questions) led me to the RecordLinkage package. The vignettes leave me wanting more examples of "soft" or "fuzzy" matches, particularly across more than one field, but the basic answer to your question could be something like:

library(RecordLinkage)
col <- data.frame(names1 = c("John Collingson","J Collingson","Dummy Name1","Dummy Name2"))
inputText <- data.frame(names2 = c("J Collingson"))
g1 <- compare.linkage(inputText, col, strcmp = TRUE)
g2 <- epiWeights(g1)
getPairs(g2, min.weight=0.6)
#   id          names2 Weight
# 1  1    J Collingson
# 2  2    J Collingson  1.000
# 3
# 4  1    J Collingson
# 5  1 John Collingson  0.815

inputText2 <- data.frame(names2 = c("Jon Collinson"))
g1 <- compare.linkage(inputText2, col, strcmp = TRUE)
g2 <- epiWeights(g1)
getPairs(g2, min.weight=0.6)
#   id          names2    Weight
# 1  1   Jon Collinson
# 2  1 John Collingson 0.9644444
# 3
# 4  1   Jon Collinson
# 5  2    J Collingson 0.7924825

Start with compare.linkage() or compare.dedup(), or with RLBigDataLinkage() or RLBigDataDedup() for large data sets. Hope this helps.
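For a quick base R alternative on a small data set, the agrep() route mentioned at the top can be sketched like this (the max.distance threshold of 3 is my own choice for this sketch, not from the question, so tune it for your data):

```r
# agrep() does approximate (edit-distance) matching in base R.
candidates <- c("John Collingson", "J Collingson", "Dummy Name1", "Dummy Name2")

# max.distance = 3 allows up to three insertions/deletions/substitutions
# when matching the pattern anywhere inside each candidate.
agrep("J Collingson", candidates, max.distance = 3)
# [1] 1 2

# value = TRUE returns the matched strings instead of their indices.
agrep("J Collingson", candidates, max.distance = 3, value = TRUE)
# [1] "John Collingson" "J Collingson"
```

No extra packages are needed, but unlike getPairs() above, agrep() gives you a yes/no match rather than a ranked weight.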

Merging two Data Frames using Fuzzy/Approximate String Matching in R

Approximate string matching is not a good idea, since an incorrect match would invalidate the whole analysis. If the names from each source are the same each time, then building indexes seems the best option to me too. This is easily done in R:

Suppose you have the data:

a <- data.frame(name = c('Ace', 'Bayes'), price = c(10, 13))
b <- data.frame(name = c('Ace Co.', 'Bayes Inc.'), qty = c(9, 99))

Build an index of names for each source one time, perhaps using pmatch etc. as a starting point and then validating manually.

a.idx <- data.frame(name = c('Ace', 'Bayes'), idx = c(1, 2))
b.idx <- data.frame(name = c('Ace Co.', 'Bayes Inc.'), idx = c(1, 2))
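The pmatch() first pass mentioned above might look like the following (base R; pmatch only resolves unambiguous prefixes, so the result still needs the manual validation step):

```r
# pmatch() matches each short name to the unique long name it is a prefix of,
# returning NA when there is no match or the prefix is ambiguous.
short_names <- c("Ace", "Bayes")
long_names  <- c("Ace Co.", "Bayes Inc.")

idx <- pmatch(short_names, long_names)
idx
# [1] 1 2

# Candidate index table to inspect by hand before merging:
data.frame(name = short_names, idx = idx)
```

This only works here because each short name is literally a prefix of exactly one long name; anything fuzzier falls back to the manual index.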

Then for each run merge using:

a.rich <- merge(a, a.idx, by = "name")
b.rich <- merge(b, b.idx, by = "name")
merge(a.rich, b.rich, by = "idx")

Which would give us:

  idx name.x price     name.y qty
1   1    Ace    10    Ace Co.   9
2   2  Bayes    13 Bayes Inc.  99

How to merge two data frames using (parts of) text values?

Try the RecordLinkage package.

Here is a possible solution, where the merge works based on how closely the two "words" match:

library(reshape2)
library(RecordLinkage)
set.seed(16)
l <- LETTERS[1:10]
ex1 <- data.frame(lets = paste(l, l, l, sep = ""), nums = 1:10)
ex2 <- data.frame(lets = paste(sample(l), sample(l), sample(l), sep = ""),
                  nums = 11:20)
ex1
#    lets nums
# 1   AAA    1
# 2   BBB    2
# 3   CCC    3
# 4   DDD    4
# 5   EEE    5
# 6   FFF    6
# 7   GGG    7
# 8   HHH    8
# 9   III    9
# 10  JJJ   10
ex2
#    lets nums
# 1   GDJ   11
# 2   CFH   12
# 3   DBE   13
# 4   BED   14
# 5   FJB   15
# 6   JHG   16
# 7   AII   17
# 8   ICC   18
# 9   EGF   19
# 10  HAA   20
lets <- melt(outer(ex1$lets, ex2$lets, FUN = "levenshteinDist"))
lets <- lets[lets$value < 2, ] # adjust the "< 2" as necessary
cbind(ex1[lets$Var1, ], ex2[lets$Var2, ])
#   lets nums lets nums
# 9  III    9  AII   17
# 3  CCC    3  ICC   18
# 1  AAA    1  HAA   20
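If you'd rather avoid the package dependency for the distance step, base R's adist() computes the same Levenshtein distance matrix that outer(..., FUN = "levenshteinDist") builds above (shown here just on the three matching triplets):

```r
# adist() returns the matrix of edit distances between two character vectors.
ex1_lets <- c("AAA", "CCC", "III")
ex2_lets <- c("HAA", "ICC", "AII")

d <- adist(ex1_lets, ex2_lets)
d
#      [,1] [,2] [,3]
# [1,]    1    3    2
# [2,]    3    1    3
# [3,]    3    2    1

# Pairs within one edit of each other (row index into ex1, column into ex2);
# here rows 1, 2, 3 pair off with columns 1, 2, 3.
which(d < 2, arr.ind = TRUE)
```

The `< 2` threshold mirrors the melt() filter above; the arr.ind = TRUE indices play the same role as Var1/Var2 in the cbind() step.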

How can I match fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

When I've had to do this before, I treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution you'd be better off with optim.

I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.

library(stringdist)
d <- expand.grid(a$name, b$name) # Distance matrix in long form
names(d) <- c("a_name", "b_name")
d$dist <- stringdist(d$a_name, d$b_name, method = "jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (Your favorite heuristic here)
greedyAssign <- function(a, b, d){
  x <- numeric(length(a)) # assignment variable: 0 for unassigned but assignable,
                          # 1 for already assigned, -1 for unassigned and unassignable
  while(any(x == 0)){
    min_d <- min(d[x == 0]) # identify closest pair, arbitrarily selecting the 1st if there are multiple pairs
    a_sel <- a[d == min_d & x == 0][1]
    b_sel <- b[d == min_d & a == a_sel & x == 0][1]
    x[a == a_sel & b == b_sel] <- 1
    x[x == 0 & (a == a_sel | b == b_sel)] <- -1
  }
  cbind(a = a[x == 1], b = b[x == 1], d = d[x == 1])
}
data.frame(greedyAssign(as.character(d$a_name), as.character(d$b_name), d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333

Edit: use method = "jw" to produce the desired results; see help("stringdist-package").


