most efficient way to group search results by string similarity
If you can get hold of a suitable thesaurus/ontology, it basically provides the best clustering possible: words are leaves in a concept tree, and distance in the tree is the semantic distance between words. Thus cat and dog aren't nearly as close as tabby and calico (both cats), but they're substantially closer than cat and banana, which in turn are closer than cat (n.) and jump (v.).
Allowing for small spelling errors (by looking up similarly spelled words that are in the thesaurus when a word isn't found) could increase robustness, but it could also produce unexpected results because of homonyms.
As for doing it in the database or in code, do it in code. To the extent that you can cache, that will be faster.
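As a toy sketch of the tree-distance idea (the taxonomy below is entirely made up for illustration; a real system would use WordNet or a domain ontology), the distance between two words is the number of edges on the path through their lowest common ancestor:

```python
# Toy concept tree: each word/concept maps to its parent (made-up taxonomy).
parents = {
    "tabby": "cat", "calico": "cat",
    "cat": "feline", "dog": "canine",
    "feline": "mammal", "canine": "mammal",
    "mammal": "animal",
    "banana": "fruit", "fruit": "plant",
    "animal": "entity", "plant": "entity",
}

def path_to_root(word):
    # word, its parent, its grandparent, ... up to the root
    path = [word]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path

def tree_distance(a, b):
    # number of edges from a to b via their lowest common ancestor
    up_a = path_to_root(a)
    for steps_b, node in enumerate(path_to_root(b)):
        if node in up_a:
            return up_a.index(node) + steps_b
    return float("inf")  # no common ancestor

print(tree_distance("tabby", "calico"))  # 2 (siblings under "cat")
print(tree_distance("cat", "dog"))       # 4 (meet at "mammal")
print(tree_distance("cat", "banana"))    # 7 (meet only at the root)
```

Clustering then falls out naturally: words within a small tree distance of each other belong to the same group.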
Efficient way to handle string similarity?
The following reproduces your expected output
library(dplyr)
library(purrr)
df %>%
mutate(Date = as.Date(Date)) %>%
mutate_if(is.factor, as.character) %>%
mutate(dist = map2_dbl(SerialNumber, SubSerialID, adist)) %>%
group_by(SubSerialID) %>%
filter(all(dist > 5) | Date == max(Date)) %>%
ungroup()
# A tibble: 5 x 4
# SerialNumber SubSerialID Date dist
# <chr> <chr> <date> <dbl>
#1 AGCC0775CFNDA10407EC AVCC0775CFNDA1040 2018-03-17 4
#2 CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019-01-07 15
#3 2658358 BGDC0865HFNKG1043 2018-08-09 15
#4 MT765E~C0765KFNCD1044 C0765KFNCD10 2015-04-07 9
#5 187A126 C0765KFNCD10 2017-11-30 11
The idea is to keep all entries (per SubSerialID) if all Levenshtein distances between SubSerialID and SerialNumber are greater than 5. If at least one distance is <= 5, only keep the row with the largest Date. I've kept the dist column for debugging; you can remove it with select(-dist).
I'm not sure how generalisable this is. You will have to play around with the Levenshtein distance threshold (which I set to 5 in this case).
Sample data
df <- read.table(text =
"SerialNumber SubSerialID Date
AGCC0775CFNDA1040TMT775 AVCC0775CFNDA1040 2018/01/08
AGCC0775CFNDA1040 AVCC0775CFNDA1040 2015/12/28
AGCC0775CFNDA10407EC AVCC0775CFNDA1040 2018/03/17
CH~MT765E~C0765HFNCC1056 BGDC0865HFNKG1043 2019/01/07
2658358 BGDC0865HFNKG1043 2018/08/09
MT765E~C0765KFNCD1044 C0765KFNCD10 2015/04/07
187A126 C0765KFNCD10 2017/11/30", header = T)
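For readers outside R, the same keep/filter rule can be sketched in plain Python with a hand-rolled Levenshtein function (adist does the distance computation in the R version; the rows below are just the first SubSerialID group of the sample data):

```python
from collections import defaultdict
from datetime import date

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

rows = [  # (SerialNumber, SubSerialID, Date) -- first group of the sample data
    ("AGCC0775CFNDA1040TMT775", "AVCC0775CFNDA1040", date(2018, 1, 8)),
    ("AGCC0775CFNDA1040",       "AVCC0775CFNDA1040", date(2015, 12, 28)),
    ("AGCC0775CFNDA10407EC",    "AVCC0775CFNDA1040", date(2018, 3, 17)),
]

by_id = defaultdict(list)
for serial, sub, day in rows:
    by_id[sub].append((serial, sub, day, levenshtein(serial, sub)))

kept = []
for group in by_id.values():
    if all(dist > 5 for *_, dist in group):
        kept.extend(group)                           # no close match: keep everything
    else:
        kept.append(max(group, key=lambda r: r[2]))  # keep only the newest row
```

For this group the distances are 7, 1, and 4; since some are <= 5, only the newest row survives, matching row 1 of the expected output above.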
Finding groups of similar strings in a large set of strings
Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.
Here's an article about using the Jaccard index (and a couple of other methods) to solve a problem like yours:
http://matpalm.com/resemblance/
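As a minimal sketch, the Jaccard index of two strings can be computed over their sets of character k-grams (shingles); the shingle size k = 2 here is an arbitrary choice:

```python
def shingles(s, k=2):
    # the set of character k-grams of a string
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def jaccard(a, b, k=2):
    # |intersection| / |union| of the two shingle sets
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # treat two empty/too-short strings as identical
    return len(sa & sb) / len(sa | sb)

print(jaccard("night", "nacht"))    # 1/7: only the bigram "ht" is shared
print(jaccard("banana", "banana"))  # 1.0
```

Strings whose pairwise Jaccard index exceeds a chosen threshold can then be put in the same group.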
Tips for efficient string matching (and indexing) for large data in R?
Here is a tidyverse alternative. One possible shortcoming is that IDs that don't exist in the data won't be included in the output, although this could be worked around if necessary.
library(tidyverse)
groups.with.id <- groups %>%
enframe() %>%
separate_rows(value, sep = " ") %>%
group_by(value) %>%
summarise(ids = list(name)) %>%
mutate(ids = set_names(ids, value)) %>%
pull(ids)
If you also want to know which IDs are not present in any of the groups you can use:
setdiff(ids, names(groups.with.id))
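In Python terms the same inversion looks roughly like this (the shape of groups, a group name mapped to a space-separated string of IDs, is inferred from the code above; the sample values are hypothetical):

```python
# Hypothetical input mirroring the R `groups` object
groups = {"g1": "id1 id2 id3", "g2": "id2 id4"}

# Invert the mapping: each ID -> list of group names containing it
groups_with_id = {}
for name, ids in groups.items():
    for id_ in ids.split(" "):
        groups_with_id.setdefault(id_, []).append(name)

print(groups_with_id["id2"])  # ['g1', 'g2']
```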
Benchmarking indicates this approach is more than 100 times faster on the example dataset than using grep with fixed = TRUE.
microbenchmark::microbenchmark(original = sapply(ids, grep, groups),
original_fixed = sapply(ids, grep, groups, fixed = TRUE),
separate_summarise_groups = {groups %>%
enframe() %>%
separate_rows(value, sep = " ") %>%
group_by(value) %>%
summarise(ids = list(name)) %>%
mutate(ids = set_names(ids, value)) %>%
pull(ids)}, times = 10, unit = "relative")
Unit: relative
expr min lq mean median uq max neval cld
original 685.0922 695.7236 382.0759 641.2018 290.30233 188.40790 10 c
original_fixed 199.8912 209.1225 115.5693 199.9749 85.89842 59.26886 10 b
separate_summarise_groups 1.0000 1.0000 1.0000 1.0000 1.00000 1.00000 10 a
Grouping similar strings together from a list
You can use a dictionary to form groups progressively, considering only the cities that have not yet been grouped.
Note that I don't have fuzzywuzzy installed, so I created a quick-and-dirty ratio calculator to test the solution. I also removed the accented characters to make this easier (my objective was not to create a good string comparison function):
from collections import Counter
stripJunk = str.maketrans("","","- ")
def getRatio(a, b):
    a = a.lower().translate(stripJunk)
    b = b.lower().translate(stripJunk)
    total = len(a) + len(b)
    counts = (Counter(a) - Counter(b)) + (Counter(b) - Counter(a))
    return 100 - 100 * sum(counts.values()) / total
Here is the grouping logic (you can replace my custom getRatio() function with the one from fuzzywuzzy):
data = ['MONTREAL EDUCATION BOARD', 'Ile de Montreal', 'Montreal',
'Ville de Montreal', 'MONTREAL CITY', 'Monrteal', 'Mont-real',
'Toronto', 'Toronto city', 'Tornoto', 'What is this', 'Bananasplit',
'Banana', 'StLouis', 'St Louis', 'Saint Louis']
threshold = 75
minGroupSize = 1
from itertools import combinations
paired = {c: {c} for c in data}
for a, b in combinations(data, 2):
    if getRatio(a, b) < threshold: continue
    paired[a].add(b)
    paired[b].add(a)
groups = list()
ungrouped = set(data)
while ungrouped:
    bestGroup = set()
    for city in ungrouped:
        g = paired[city] & ungrouped
        for c in g.copy():
            g &= paired[c]
        if len(g) > len(bestGroup):
            bestGroup = g
    if len(bestGroup) < minGroupSize: break  # to terminate grouping early, raise minGroupSize (e.g. to 3)
    ungrouped -= bestGroup
    groups.append(bestGroup)
The groups variable is a list that will contain sets of city names (the groups). Each city will appear in only one group.
# With a threshold of 75%:
{'MONTREAL CITY', 'Montreal', 'Monrteal', 'Mont-real'}
{'St Louis', 'StLouis', 'Saint Louis'}
{'Toronto', 'Toronto city', 'Tornoto'}
{'Ville de Montreal', 'Ile de Montreal'}
{'MONTREAL EDUCATION BOARD'}
{'Bananasplit'}
{'Banana'}
{'What is this'}
With a lower threshold (or a better comparison function), you will get fewer groups:
# With a threshold of 65%:
{'Monrteal', 'Montreal', 'Ville de Montreal', 'MONTREAL CITY', 'Mont-real', 'Ile de Montreal'}
{'Toronto', 'Toronto city', 'Tornoto'}
{'Saint Louis', 'StLouis', 'St Louis'}
{'Banana', 'Bananasplit'}
{'What is this'}
{'MONTREAL EDUCATION BOARD'}
From a performance perspective this will produce results in reasonable time for relatively small data sets. It took 83 seconds to group 1600 cities. Because of the O(N^2) nature of the combinations() loop, this will probably become impractical when you get to 15,000 items in the list.
The grouping loop starts with the largest groups and takes up about half of the processing time. You could probably save some of that time by stopping it once the groups become small enough, if you don't need a bazillion one- or two-city groups. When I stopped the grouping loop once the group size fell below 3, the 1600 cities were processed in 48 seconds (a significant saving for my simulated data). You may not get as much of a performance boost with actual data, though.
What string similarity algorithms are there?
It seems you need some kind of fuzzy matching. Here is a Java implementation of a set of string similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. And here is a more detailed survey of string metrics: http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf. Which one to choose depends on how fuzzy and how fast your implementation must be.
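If you just need something quick in Python, the standard library's difflib already ships one such metric (Ratcliff/Obershelp-style matching), so you can experiment before committing to a heavier library:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # fraction of matching characters, between 0.0 and 1.0
    return SequenceMatcher(None, a, b).ratio()

print(similarity("string", "strong"))  # 0.833... (10 of 12 characters match)
```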