most efficient way to group search results by string similarity
If you can get hold of a suitable thesaurus/ontology, it basically provides the best clustering possible: words are leaves in a concept tree, and distance in the tree is the semantic distance between words. Thus cat and dog aren't nearly as close as tabby and calico (both cats), but they're substantially closer than cat and banana, which in turn are closer than cat (n.) and jump (v.).
Allowing for small spelling errors (by looking up similarly spelled words that are in the thesaurus when a word isn't found) could increase robustness, but it could also produce unexpected results because of homonyms.
As for doing it in the database or in code, do it in code. To the extent that you can cache, that will be faster.
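As a toy sketch of the tree-distance idea (the taxonomy below is entirely made up for illustration; a real system would use WordNet or a domain ontology), the distance between two words is the number of edges on the path through their lowest common ancestor:

```python
# Toy concept tree: each word/concept maps to its parent (made-up taxonomy).
parents = {
    "tabby": "cat", "calico": "cat",
    "cat": "feline", "dog": "canine",
    "feline": "mammal", "canine": "mammal",
    "mammal": "animal",
    "banana": "fruit", "fruit": "plant",
    "animal": "entity", "plant": "entity",
}

def path_to_root(word):
    # word, its parent, its grandparent, ... up to the root
    path = [word]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def tree_distance(a, b):
    # number of edges from a to b via their lowest common ancestor
    up_a = path_to_root(a)
    for steps_b, node in enumerate(path_to_root(b)):
        if node in up_a:
            return up_a.index(node) + steps_b
    return float("inf")  # no common ancestor

print(tree_distance("tabby", "calico"))  # 2 (siblings under "cat")
print(tree_distance("cat", "dog"))       # 4 (meet at "mammal")
print(tree_distance("cat", "banana"))    # 7 (meet only at the root)
```

Clustering then falls out naturally: words within a small tree distance of each other belong to the same group.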
Efficient way to handle string similarity?
The following reproduces your expected output
library(dplyr)
library(purrr)
df %>%
mutate(Date = as.Date(Date)) %>%
mutate_if(is.factor, as.character) %>%
mutate(dist = map2_dbl(SerialNumber, SubSerialID, adist)) %>%
group_by(SubSerialID) %>%
filter(all(dist > 5) | Date == max(Date)) %>%
ungroup()
# A tibble: 5 x 4
# SerialNumber SubSerialID Date dist
# <chr> <chr> <date> <dbl>
#1 AGCC0775CFNDA10407EC AVCC0775CFNDA1040 2018-03-17 4
#2 CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019-01-07 15
#3 2658358 BGDC0865HFNKG1043 2018-08-09 15
#4 MT765E~C0765KFNCD1044 C0765KFNCD10 2015-04-07 9
#5 187A126 C0765KFNCD10 2017-11-30 11
The idea is to keep all entries (per SubSerialID) if all Levenshtein distances between SubSerialID and SerialNumber are greater than 5. If at least one distance is <= 5, only keep the row with the largest Date. I've kept the dist column for debugging; you can remove it with select(-dist).
I'm not sure how generalisable this is. You will have to play around with the Levenshtein distance threshold (which I set to 5 in this case).
Sample data
df <- read.table(text =
"SerialNumber SubSerialID Date
AGCC0775CFNDA1040TMT775 AVCC0775CFNDA1040 2018/01/08
AGCC0775CFNDA1040 AVCC0775CFNDA1040 2015/12/28
AGCC0775CFNDA10407EC AVCC0775CFNDA1040 2018/03/17
CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019/01/07
2658358 BGDC0865HFNKG1043 2018/08/09
MT765E~C0765KFNCD1044 C0765KFNCD10 2015/04/07
187A126 C0765KFNCD10 2017/11/30", header = T)
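For readers outside R, the same keep/filter rule can be sketched in plain Python with a hand-rolled Levenshtein function (adist does the distance computation in the R version; the rows below are just the first SubSerialID group of the sample data):

```python
from collections import defaultdict
from datetime import date

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

rows = [  # (SerialNumber, SubSerialID, Date) -- first group of the sample data
    ("AGCC0775CFNDA1040TMT775", "AVCC0775CFNDA1040", date(2018, 1, 8)),
    ("AGCC0775CFNDA1040",       "AVCC0775CFNDA1040", date(2015, 12, 28)),
    ("AGCC0775CFNDA10407EC",    "AVCC0775CFNDA1040", date(2018, 3, 17)),
]

by_id = defaultdict(list)
for serial, sub, day in rows:
    by_id[sub].append((serial, sub, day, levenshtein(serial, sub)))

kept = []
for group in by_id.values():
    if all(dist > 5 for *_, dist in group):
        kept.extend(group)                           # no close match: keep everything
    else:
        kept.append(max(group, key=lambda r: r[2]))  # keep only the newest row
```

For this group the distances are 7, 1, and 4; since some are <= 5, only the newest row survives, matching row 1 of the expected output above.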
Finding groups of similar strings in a large set of strings
Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.
Here's an article about using the Jaccard index (and a couple of other methods) to solve a problem like yours:
http://matpalm.com/resemblance/
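As a minimal sketch, the Jaccard index of two strings can be computed over their sets of character k-grams (shingles); the shingle size k = 2 here is an arbitrary choice:

```python
def shingles(s, k=2):
    # the set of character k-grams of a string
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a, b, k=2):
    # |intersection| / |union| of the two shingle sets
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # treat two empty/too-short strings as identical
    return len(sa & sb) / len(sa | sb)

print(jaccard("night", "nacht"))    # 1/7: only the bigram "ht" is shared
print(jaccard("banana", "banana"))  # 1.0
```

Strings whose pairwise Jaccard index exceeds a chosen threshold can then be put in the same group.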
Tips for efficient string matching (and indexing) for large data in R?
Here is a tidyverse alternative. One possible shortcoming is that IDs that don't exist in the data won't be included in the output, although this could be worked around if necessary.
library(tidyverse)
groups.with.id <- groups %>%
enframe() %>%
separate_rows(value, sep = " ") %>%
group_by(value) %>%
summarise(ids = list(name)) %>%
mutate(ids = set_names(ids, value)) %>%
pull(ids)
If you also want to know which IDs are not present in any of the groups you can use:
setdiff(ids, names(groups.with.id))
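In Python terms the same inversion looks roughly like this (the shape of groups, a group name mapped to a space-separated string of IDs, is inferred from the code above; the sample values are hypothetical):

```python
# Hypothetical input mirroring the R `groups` object
groups = {"g1": "id1 id2 id3", "g2": "id2 id4"}

# Invert the mapping: each ID -> list of group names containing it
groups_with_id = {}
for name, ids in groups.items():
    for id_ in ids.split(" "):
        groups_with_id.setdefault(id_, []).append(name)

print(groups_with_id["id2"])  # ['g1', 'g2']
```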
Benchmarking indicates this approach is more than 100 times faster on the example dataset than using grep with fixed = TRUE.
microbenchmark::microbenchmark(original = sapply(ids, grep, groups),
original_fixed = sapply(ids, grep, groups, fixed = TRUE),
separate_summarise_groups = {groups %>%
enframe() %>%
separate_rows(value, sep = " ") %>%
group_by(value) %>%
summarise(ids = list(name)) %>%
mutate(ids = set_names(ids, value)) %>%
pull(ids)}, times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval cld
original 685.0922 695.7236 382.0759 641.2018 290.30233 188.40790 10 c
original_fixed 199.8912 209.1225 115.5693 199.9749 85.89842 59.26886 10 b
separate_summarise_groups 1.0000 1.0000 1.0000 1.0000 1.00000 1.00000 10 a
Grouping similar strings together from a list
You can use a dictionary to form groups progressively, considering only the cities that have not yet been grouped.
Note that I don't have fuzzywuzzy installed, so I created a quick-and-dirty ratio calculator to test the solution. I also removed the accented characters to make this easier (my objective was not to create a good string comparison function):
from collections import Counter
stripJunk = str.maketrans("","","- ")
def getRatio(a, b):
    a = a.lower().translate(stripJunk)
    b = b.lower().translate(stripJunk)
    total = len(a) + len(b)
    counts = (Counter(a) - Counter(b)) + (Counter(b) - Counter(a))
    return 100 - 100 * sum(counts.values()) / total
Here is the grouping logic (you can replace my custom getRatio() function with the one from fuzzywuzzy):
data = ['MONTREAL EDUCATION BOARD', 'Ile de Montreal', 'Montreal',
'Ville de Montreal', 'MONTREAL CITY', 'Monrteal', 'Mont-real',
'Toronto', 'Toronto city', 'Tornoto', 'What is this', 'Bananasplit',
'Banana', 'StLouis', 'St Louis', 'Saint Louis']
threshold = 75
minGroupSize = 1
from itertools import combinations
paired = {c: {c} for c in data}
for a, b in combinations(data, 2):
    if getRatio(a, b) < threshold: continue
    paired[a].add(b)
    paired[b].add(a)
groups = list()
ungrouped = set(data)
while ungrouped:
    bestGroup = set()
    for city in ungrouped:
        g = paired[city] & ungrouped
        for c in g.copy():
            g &= paired[c]
        if len(g) > len(bestGroup):
            bestGroup = g
    if len(bestGroup) < minGroupSize: break  # to terminate grouping early, raise minGroupSize (e.g. to 3)
    ungrouped -= bestGroup
    groups.append(bestGroup)
The groups variable is a list that will contain sets of city names (the groups). Each city will appear in only one group.
# With a threshold of 75%:
{'MONTREAL CITY', 'Montreal', 'Monrteal', 'Mont-real'}
{'St Louis', 'StLouis', 'Saint Louis'}
{'Toronto', 'Toronto city', 'Tornoto'}
{'Ville de Montreal', 'Ile de Montreal'}
{'MONTREAL EDUCATION BOARD'}
{'Bananasplit'}
{'Banana'}
{'What is this'}
With a lower threshold (or a better comparison function), you will get fewer groups:
# With a threshold of 65%:
{'Monrteal', 'Montreal', 'Ville de Montreal', 'MONTREAL CITY', 'Mont-real', 'Ile de Montreal'}
{'Toronto', 'Toronto city', 'Tornoto'}
{'Saint Louis', 'StLouis', 'St Louis'}
{'Banana', 'Bananasplit'}
{'What is this'}
{'MONTREAL EDUCATION BOARD'}
From a performance perspective this will produce results in reasonable time for relatively small data sets. It took 83 seconds to group 1600 cities. Because of the O(N^2) nature of the combinations() loop, this will probably become impractical when you get to 15,000 items in the list.
The grouping loop starts with the largest groups and takes up about half of the processing time. You could probably save some of that time by stopping it once the groups become small enough, if you don't need a bazillion one- or two-city groups. When I stopped the grouping loop once the group size fell below 3, the 1600 cities were processed in 48 seconds (a significant saving for my simulated data). You may not get as much of a performance boost with actual data, though.
What string similarity algorithms are there?
It seems you need some kind of fuzzy matching. Here is a Java implementation of a set of string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. And here is a more detailed survey of string metrics: http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf. Which one to choose depends on how fuzzy and how fast your implementation must be.
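If you just need something quick in Python, the standard library's difflib already ships one such metric (Ratcliff/Obershelp-style matching), so you can experiment before committing to a heavier library:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # fraction of matching characters, between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio()

print(similarity("string", "strong"))  # 0.833... (10 of 12 characters match)
```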