Longest common substring from more than two strings
These paired functions will find the longest common string in any arbitrary array of strings:
def long_substr(data):
substr = ''
if len(data) > 1 and len(data[0]) > 0:
for i in range(len(data[0])):
for j in range(len(data[0])-i+1):
if j > len(substr) and is_substr(data[0][i:i+j], data):
substr = data[0][i:i+j]
return substr
def is_substr(find, data):
if len(data) < 1 and len(find) < 1:
return False
for i in range(len(data)):
if find not in data[i]:
return False
return True
print long_substr(['Oh, hello, my friend.',
'I prefer Jelly Belly beans.',
'When hell freezes over!'])
No doubt the algorithm could be improved and I've not had a lot of exposure to Python, so maybe it could be more efficient syntactically as well, but it should do the job.
EDIT: in-lined the second is_substr function as demonstrated by J.F. Sebastian. Usage remains the same. Note: no change to algorithm.
def long_substr(data):
substr = ''
if len(data) > 1 and len(data[0]) > 0:
for i in range(len(data[0])):
for j in range(len(data[0])-i+1):
if j > len(substr) and all(data[0][i:i+j] in x for x in data):
substr = data[0][i:i+j]
return substr
Hope this helps,
Jason.
Longest Common Substring (more than 2 arguments)
Although, it might be a bit too late to answer now, I think, I might have found the issue you have.
It is with the function call.
printLCS(myArgs[0],myArgs[1]);
You are specifying the third and fourth argument, while perhaps a better way would be to spread it all like this.
printLCS(...myArgs);
How to find the longest common substring of multiple strings?
This is a relatively optimised naïve algorithm. You first transform each sequence into a set of all its ngrams. Then you intersect all sets and find the longest ngram in the intersection.
from functools import partial, reduce
from itertools import chain
from typing import Iterator
def ngram(seq: str, n: int) -> Iterator[str]:
return (seq[i: i+n] for i in range(0, len(seq)-n+1))
def allngram(seq: str) -> set:
lengths = range(len(seq))
ngrams = map(partial(ngram, seq), lengths)
return set(chain.from_iterable(ngrams))
sequences = ["brownasdfoersjumps",
"foxsxzxasis12sa[[#brown",
"thissasbrownxc-34a@s;"]
seqs_ngrams = map(allngram, sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len) # -> brown
While this might get you through short sequences, this algorithm is extremely inefficient on long sequences. If your sequences are long, you can add a heuristic to limit the largest possible ngram length (i.e. the longest possible common substring). One obvious value for such a heuristic may be the shortest sequence's length.
def allngram(seq: str, minn=1, maxn=None) -> Iterator[str]:
lengths = range(minn, maxn) if maxn else range(minn, len(seq))
ngrams = map(partial(ngram, seq), lengths)
return set(chain.from_iterable(ngrams))
sequences = ["brownasdfoersjumps",
"foxsxzxasis12sa[[#brown",
"thissasbrownxc-34a@s;"]
maxn = min(map(len, sequences))
seqs_ngrams = map(partial(allngram, maxn=maxn), sequences)
intersection = reduce(set.intersection, seqs_ngrams)
longest = max(intersection, key=len) # -> brown
This may still take too long (or make your machine run out of RAM), so you might want to read about some optimal algorithms (see the link I left in my comment to your question).
Update
To count the number of strings wherein each ngram occurs
from collections import Counter
sequences = ["brownasdfoersjumps",
"foxsxzxasis12sa[[#brown",
"thissasbrownxc-34a@s;"]
seqs_ngrams = map(allngram, sequences)
counts = Counter(chain.from_iterable(seqs_ngrams))
Counter
is a subclass of dict
, so its instances have similar interfaces:
print(counts)
Counter({'#': 1,
'#b': 1,
'#br': 1,
'#bro': 1,
'#brow': 1,
'#brown': 1,
'-': 1,
'-3': 1,
'-34': 1,
'-34a': 1,
'-34a@': 1,
'-34a@s': 1,
'-34a@s;': 1,
...
You can filter the counts to leave substrings occurring in at least n
strings: {string: count for string, count in counts.items() if count >= n}
Longest common substring for more than two strings in PowerShell?
$arr = "qdfbsqds", "fbsqdt", "bsqda"
$arr | %{
$substr = for ($s = 0; $s -lt $_.length; $s++) {
for ($l = 1; $l -le ($_.length - $s); $l++) {
$_.substring($s, $l);
}
}
$substr | %{$_.toLower()} | select -unique
} | group | ?{$_.count -eq $arr.length} | sort {$_.name.length} | select -expand name -l 1
# returns bsqd
- produce a list of all the unique substrings of the inputstrings
- filter for substrings that occur #inputstrings times (i.e. in all input strings)
- sort these filtered substrings based on the length of the substring
- return the last (i.e. longest) of this list
How to calculate longest common substring anywhere in two strings
One approach might be to look at the transformation sequence produced by adist()
and count the characters in the longest contiguous match:
trafos <- attr(adist(string1, vec1, counts = TRUE), "trafos")
sapply(gregexpr("M+", trafos), function(x) max(0, attr(x, "match.length")))
[1] 3 8 1 3 5
Longest common sequence of words from more than two strings
The key is to modify to search by whole-word subsequences.
from itertools import islice
def is_sublist(source, target):
slen = len(source)
return any(all(item1 == item2 for (item1, item2) in zip(source, islice(target, i, i+slen))) for i in range(len(target) - slen + 1))
def long_substr_by_word(data):
subseq = []
data_seqs = [s.split(' ') for s in data]
if len(data_seqs) > 1 and len(data_seqs[0]) > 0:
for i in range(len(data_seqs[0])):
for j in range(len(data_seqs[0])-i+1):
if j > len(subseq) and all(is_sublist(data_seqs[0][i:i+j], x) for x in data_seqs):
subseq = data_seqs[0][i:i+j]
return ' '.join(subseq)
Demo:
>>> data = ['commercial van for movers',
... 'partial van for movers',
... 'commercial van for moving']
>>> long_substr_by_word(data)
'van for'
>>>
>>> data = ['a bx bx z', 'c bx bx zz']
>>> long_substr_by_word(data)
'bx bx'
Related Topics
How to Escape Curly-Brackets in F-Strings
Target Wsgi Script Cannot Be Loaded as Python Module
Fill Between Two Vertical Lines in Matplotlib
How to Use Win32 API with Python
How to Compare Dates in Django Templates
Python Requests.Get() Returns Improperly Decoded Text Instead of Utf-8
How to Re Import an Updated Package While in Python Interpreter
How to Access the Real Value of a Cell Using the Openpyxl Module for Python
How to Call Function That Takes an Argument in a Django Template
Getting the Indices of Several Elements in a Numpy Array at Once
How to Write Tests for the Argparse Portion of a Python Module
How to Get Tkinter Canvas to Dynamically Resize to Window Width
Access Memory Address in Python
No Module Named 'Pandas._Libs.Tslibs.Timedeltas' in Pyinstaller
Display the Output of the Program on Gui with Tkinter
How to Use a Custom Comparison Function in Python 3