Merging Through Fuzzy Matching of Variables in R

Merging through fuzzy matching of variables in R

The agrep function (part of base R), which does approximate string matching using the generalized Levenshtein edit distance, is probably worth trying. Without knowing what your data looks like, I can't offer a tested solution, so treat the following as a sketch. It records matches in a separate list; if there are multiple equally good matches, these are recorded as well. Let's say that your data.frame is called df:

l <- vector('list', nrow(df))
matches <- list(mother = l, father = l)
for(i in seq_len(nrow(df))){
  ## first look for a unique exact match
  father_id <- with(df, which(student_name[i] == father_name))
  if(length(father_id) == 1){
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## otherwise tighten the tolerance step by step until the match is unique
    for(m in 10:1){ ## m is the maximum distance
      father_id <- with(df, agrep(student_name[i], father_name, max.distance = m))
      if(length(father_id) == 1 || (m == 1 && length(father_id) > 0)){
        ## if we find a unique match or if we are in our last round, then stop
        matches[['father']][[i]] <- father_id
        break
      } else if(length(father_id) == 0 && length(old_father_id) > 0) {
        ## if we can't do better than multiple matches, then record them anyway
        matches[['father']][[i]] <- old_father_id
        break
      } else if(length(father_id) == 0 && length(old_father_id) == 0) {
        ## if the nearest match is more than 10 different from the current pattern, then stop
        break
      }
      ## remember this round's matches before tightening the tolerance further
      old_father_id <- father_id
    }
  }
}

The code for the mother_name would be basically the same. You could even put them together in a loop, but this example is just for the purpose of illustration.
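
As a rough sketch of that idea, the search can be wrapped in a helper and applied to both parent columns. The helper name find_matches, its max_dist argument, and the toy data below are made up for illustration; real data may need extra handling (for example for missing names).

```r
# Hypothetical helper: look for a unique exact match first, then tighten the
# agrep tolerance from max_dist down to 1 until the match is unique.
find_matches <- function(pattern, candidates, max_dist = 10) {
  exact <- which(candidates == pattern)
  if (length(exact) == 1) return(exact)
  previous <- integer(0)
  for (m in max_dist:1) {
    current <- agrep(pattern, candidates, max.distance = m)
    if (length(current) == 1) return(current)   # unique match
    if (length(current) == 0) return(previous)  # the looser round was the best available
    previous <- current
  }
  previous  # several candidates remain even at distance 1
}

# Toy data, invented for this example
df <- data.frame(
  student_name = c("John Smith", "Mary Jones", "Peter Brown"),
  father_name  = c("Peter Brown", "Jon Smith", "Mary Jones"),
  mother_name  = c("Mary Jnes", "Peter Brown", "John Smith"),
  stringsAsFactors = FALSE
)

# One list of matches per parent column, one entry per student
matches <- lapply(df[c("father_name", "mother_name")], function(col) {
  lapply(df$student_name, find_matches, candidates = col)
})
names(matches) <- c("father", "mother")
```

Each entry of matches$father and matches$mother then holds the row index (or indices) of the best candidates for the corresponding student, or an empty vector if nothing was close enough.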

Fuzzy merge on multiple variables (all but one with no misspellings)

stringdist_join is a wrapper around fuzzy_join, and fuzzy_join has a match_fun argument that can be either a single function or a list of functions of the same length as your by argument. So we can use fuzzy_full_join (which is just fuzzy_join with mode = "full"):

library(fuzzyjoin)
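res <- fuzzy_full_join(dataset_1, dataset_2,
                by = c("var_1","var_2","var_3"),
                # exact match on var_1 and var_2, fuzzy match on var_3;
                # note that stringdist's "soundex" distance is only ever 0 or 1,
                # so a threshold of 2 accepts every pair of var_3 values
                match_fun = list(`==`, `==`, function(x,y) stringdist::stringdist(x,y, "soundex") <= 2))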
res
#   var_1.x var_2.x var_3.x var_1.y var_2.y var_3.y var_4
# 1    1995      AA    AAAA    1995      AA    AAAA     A
# 2    1996      AA    AAAA    1996      AA    AAAA     B
# 3    1995      BB    BBBB    1995      BB    BBBB     C
# 4    1996      BB    BBBB    1996      BB    BBBC     D

Because of the nature of fuzzy matching, values are generally not the same on the lhs and rhs, so we end up with two sets of by columns. If you want to keep only the lhs columns, you can do:

library(dplyr)
res %>% 
  select(-ends_with(".y")) %>%
  rename_all(~sub("\\.x$","",.))

#   var_1 var_2 var_3 var_4
# 1  1995    AA  AAAA     A
# 2  1996    AA  AAAA     B
# 3  1995    BB  BBBB     C
# 4  1996    BB  BBBB     D
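
To make the role of the match_fun list more concrete, here is a base-R sketch of the same idea: build every pair of rows, then keep the pairs for which each column's function returns TRUE. This is only an illustration under assumptions, not fuzzyjoin's actual implementation. The toy data are reconstructed from the printed result, the sketch performs an inner rather than a full join, and it swaps the soundex comparison for a plain edit distance computed with adist from base R. The names edit_distance, match_funs and res_sketch are hypothetical.

```r
# Toy data, reconstructed from the printed result (an assumption)
dataset_1 <- data.frame(var_1 = c(1995, 1996, 1995, 1996),
                        var_2 = c("AA", "AA", "BB", "BB"),
                        var_3 = c("AAAA", "AAAA", "BBBB", "BBBB"),
                        stringsAsFactors = FALSE)
dataset_2 <- data.frame(var_1 = c(1995, 1996, 1995, 1996),
                        var_2 = c("AA", "AA", "BB", "BB"),
                        var_3 = c("AAAA", "AAAA", "BBBB", "BBBC"),
                        var_4 = c("A", "B", "C", "D"),
                        stringsAsFactors = FALSE)

by <- c("var_1", "var_2", "var_3")

# Element-wise Levenshtein distance between two equal-length character vectors
edit_distance <- function(x, y) {
  mapply(function(a, b) adist(a, b)[1, 1], x, y, USE.NAMES = FALSE)
}

# One comparison function per column in `by`
match_funs <- list(`==`, `==`, function(x, y) edit_distance(x, y) <= 1)

# All pairs of row indices (i from dataset_1, j from dataset_2)
pairs <- expand.grid(i = seq_len(nrow(dataset_1)),
                     j = seq_len(nrow(dataset_2)))

# A pair survives only if every column's function accepts it
keep <- rep(TRUE, nrow(pairs))
for (k in seq_along(by)) {
  keep <- keep & match_funs[[k]](dataset_1[[by[k]]][pairs$i],
                                 dataset_2[[by[k]]][pairs$j])
}

res_sketch <- data.frame(x = dataset_1[pairs$i[keep], ],
                         y = dataset_2[pairs$j[keep], ],
                         row.names = NULL)
```

Comparing every pair of rows is quadratic in the number of rows, so this is only practical for small inputs.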

How to fuzzy join 2 dataframes on 2 variables with differing fuzzy logic?

You can create a Cartesian product of the two data frames using merge with by = NULL, and then keep only the rows that satisfy the required conditions.

subset(merge(a, b, by = NULL), abs(KW.x - KW.y) <= 1 & 
                               abs(price.x - price.y) <= 0.02)

#  name.x   KW.x price.x   KW.y price.y name.y
#1      A 201902    1.99 201903    1.98      a
#5      B 201904    3.02 201904    3.00      b
#9      C 201905    5.00 201904    5.00      c
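
For a runnable version, the sketch below rebuilds a and b from the printed output, so the data are an assumption rather than the original inputs. It also adds a small tolerance, tol (a hypothetical name and value), to the price comparison: in floating-point arithmetic 3.02 - 3.00 comes out slightly larger than 0.02, so a strict comparison against 0.02 could drop a row that sits exactly on the boundary.

```r
# Toy inputs, reconstructed from the printed result (an assumption)
a <- data.frame(name = c("A", "B", "C"),
                KW = c(201902, 201904, 201905),
                price = c(1.99, 3.02, 5.00),
                stringsAsFactors = FALSE)
b <- data.frame(KW = c(201903, 201904, 201904),
                price = c(1.98, 3.00, 5.00),
                name = c("a", "b", "c"),
                stringsAsFactors = FALSE)

tol <- 1e-9  # hypothetical tolerance to absorb floating-point rounding

# by = NULL makes merge() return every combination of rows from a and b
all_pairs <- merge(a, b, by = NULL)

# Keep pairs whose KW differs by at most 1 and whose price differs by at most 0.02
close_pairs <- subset(all_pairs, abs(KW.x - KW.y) <= 1 &
                                 abs(price.x - price.y) <= 0.02 + tol)
```

As with any cross join, all_pairs has nrow(a) * nrow(b) rows, so this approach suits small to medium inputs.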

R: Fuzzy merge using agrep and data.table

A possible solution using 'fuzzyjoin':

library(fuzzyjoin)
library(dplyr)  # provides %>% and filter(), both used below
f <- Vectorize(function(x,y) agrepl(x, y,
                                   ignore.case=TRUE,
                                   max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
#          Name.x A          Name.y B
#1   ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2  ABN AMRO GROUP q

NOTE: The main problem, which you encountered too, is that agrep and agrepl expect a single pattern as their first argument, not a vector. That's why I wrapped the call in Vectorize.

This method can be used together with an equi-join (mind the order of columns in the by!):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)
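
To see what Vectorize changes, the base-R sketch below calls the wrapped function on two equal-length vectors, which is the form in which a match_fun receives its arguments. The sample names come from the example data plus one deliberately mismatched pair; f_demo, left, right and is_match are hypothetical names, and USE.NAMES = FALSE is added here only to keep the result unnamed.

```r
# Same idea as f: agrepl() takes one pattern at a time, so Vectorize()
# applies it pair by pair across two vectors
f_demo <- Vectorize(function(x, y) agrepl(x, y,
                                          ignore.case = TRUE,
                                          max.distance = 0.05),
                    USE.NAMES = FALSE)

left  <- c("ASML HOLDING", "ABN AMRO GROUP", "ASML HOLDING")
right <- c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP")

# One logical value per pair: is left[k] approximately contained in right[k]?
is_match <- f_demo(left, right)
```

The match is directional: the left value is the pattern searched for inside the right value, so swapping the two arguments can change the result.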
