Find longest repetitive sequence in a string
This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).
That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.
Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.
There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.
Hope this helps!
How to find the longest repeated sequence of a string
To be precise you are looking for the longest 3-fold substring.
A k-fold substring of a string s is a repeated substring of s that appears at least k times in s
There was a similar python question a few years ago that would give you a lot of good information. Specifically take a look at Finding the Longest Multiple Repeat. There is a solution on GitHub but it is also in Python.
Longest Repeating Character In String - Javascript
Here is my solution:
function longestRepetition (str) {
if (str.length === 0) {
return ['', 0]
}
let longest = '';
let chunk = '';
for (let i = 0; i < str.length; i++) {
if (i === 0) {
if (str[i] === str[i + 1]) {
chunk += str[i];
}
}
if (i > 0) {
if (str[i] === str[i - 1]) {
chunk += str[i];
console.log('chunk**', chunk);
}
if (str[i] !== str[i - 1]) {
chunk = str[i];
}
if (chunk.length > longest.length) {
longest = chunk;
}
}
}
return [longest[0], longest.length];
}
console.log(longestRepetition("bbbaaabaaaa"))
The longest repeating sequence of byte
Here's my hacked together solution I wrote yesterday...
Basically it checks if input.charAt(i) == input.charAt(i + 1)
and if so, runs a second loop until they don't match, all the while appending to a String
, and adds to a List
. And repeat.
Then check the List
for the highest occurrence (shamelessly stolen from here)
public static void addToList(String input) {
String temp;
List<String> l = new ArrayList<>();
for (int i = 0; i < input.length() - 1; i++) {
if (input.charAt(i) == input.charAt(i + 1)) {
temp = String.valueOf(input.charAt(i));
for (int j = i; j < input.length() - 1; j++) {
if (input.charAt(j) == input.charAt(j + 1)) {
temp += String.valueOf(input.charAt(j + 1));
if (j == input.length() - 2) {
i = j;
if (!temp.isEmpty()) {
l.add(temp);
}
break;
}
} else {
i = j - 1;
if (!temp.isEmpty()) {
l.add(temp);
}
break;
}
}
}
}
System.out.println(getHighestOccurences(l));
}
public static String getHighestOccurences(List<String> list) {
int max = 0;
int curr;
String currKey = null;
Set<String> unique = new HashSet<>(list);
for (String key : unique) {
curr = Collections.frequency(list, key);
if (max < curr) {
max = curr;
currKey = key;
}
}
return currKey;
}
With your input being String input = "7888885466662716666";
and calling addToList(input);
gives an output of:
6666
.
Online Demo
Longest substring repeating atleast k times
There was large discussion in comments, I think it's better write an answer to sum up. TL;DR Longest substring repeating atleast k times
There is less efficient method, but it's really easier to understand than suffix trees: all you need to know is polynomial hashing and binary search.
1. String polynomial hash
Read about it here https://cp-algorithms.com/string/string-hashing.html. Below is short description of this technique.
Let's say we have string s
, integers p
and mod
. Now we can define hash function:
hash(s) = (ord(s[0])*p^(len(s)-1) + ord(s[1])*p^(len(s)-2) + ... + ord(s[len(s)-1])*p^0) % mod
where ord
is a function returning an integer by character (let's say it's an ASCII-code of a character). Polynomial hash can be easily calculated for each prefix of string in O(len(s)):
# h[i] is a hash of prefix of length i.
# For example s = "abacaba",
# h[0] = hash("") = 0
# h[1] = hash("a")
# h[2] = hash("ab")
# ...
# h[7] = hash("abacaba")
h[0] = 0
for i in 1..n:
h[i] = (h[i-1] * p + ord(s[i-1])) % mod
Also let's precalculate p^0 % mod, p^1 % mod, ..., p^len(s) % mod in array pow
:
# pow[i] is (p^i) % mod
pow[0] = 1
for i in 1..n:
pow[i] = (pow[i-1] * p) % mod
Using arrays h
and pow
we can easily calculate hash of any substring of the string s
:
# get_substring_hash returns hash(s[l] + s[l+1] + ... + s[r-1]).
def get_substring_hash(s, l, r):
value = h[r] - h[l]*pow[r-l] # line a
return (value%mod + mod) % mod # line b
Let's understand why code above works.
h[r] = (ord(s[r-1])*p^0 + ord(s[r-2])*p^1 + ... + ord(s[l-1])*p^(r-l) + ord(s[l-2])*p^(r-l+1) + ...) % mod
h[l] = ( ord(s[l-1])*p^0 + ord(s[l-2])*p^1 + ...) % mod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As you can see we need only ^^^
-part from h[r]
so we have to get rid of ~~~
-part. ~~~
-part in h[r]
p^(r-l)
times larger than in h[l]
and this explains line a.
Line b is kinda magic when operating with % mod
, value
after line a can be negative, so value%mod + mod
definitely makes is positive. At the same time if value
was positive after line a value%mod + mod
is larger than mod
, so (value%mod + mod) % mod
will definitely return value in range 0, 1, ..., mod-1.
Finally, mod
is a large prime number like 10^9+7 and value
is a small number, but larger than any possible ASCII-code like 239 (read in article why so).
Some notes:
- Hashes may collide, because we have only
mod
possible values of hash but number of strings is infinite. How to deal with it read in article. - Doing operations like
h[r] - h[l]*pow[r-l]
may require using of 64-bit types of integers.
2. Binary search
Just read about it on Wikipedia, there's nothing specific https://en.wikipedia.org/wiki/Binary_search_algorithm.
3. Longest substring repeating at least k times solution
Let's say we precalculated arrays h
and pow
. Let's do binary search to find maximum length ans
of string such that there are k
or more such substrings in the given string s
.
Why binary search works here? Because if there's some length x
such as there are k
or more equal substrings in s
of length x
, then there's definitely k
or more equal substrings in s
of length x-1
(just remove last letter from our matches).
How will binary search work? Let's say we're currently testing if there are k
or more equal substrings of length mid
. We're going to calculate all hashes of length mid
(using get_substring_hash
) and we'll make a decision of changing borders of binary search if there is a k
equal hashes or not.
For example: s = "abcabcdefgdefgdefgdefg", k = 3. Answer is "defgdefg":
abcabcdefgdefgdefgdefg
^^^^^^^^ occurence 1
^^^^^^^^ occurence 2
^^^^^^^^ occurence 3
Binary search iterations:
l = 1, r = 22, mid = 11. No substring of length 11 satisfy.
l = 1, r = 10, mid = 5. There should be hash("defgd") be seen 3 times.
l = 5, r = 10, mid = 7. There should be hash("defgdef") be seen 3 times.
l = 7, r = 10, mid = 8. There should be hash("defgdefg") be seen 3 times.
l = 8, r = 10, mid = 9. No substring of length 9 satisfy.
l = 8, r = 8. That means answer is 8.
As you can see, complexity is O(n log n): round(log n) binary search iterations and O(n) complexity per iteration if you use something like std::unordered_map
to check if there's a hash with >= k occurrences or not.
I really hope everything is clear now.
Find the longest repeated subsequence of max length N in Python
Here is a simpler and more efficient solution:
def longest_subsequence(events, limit, sep='->'):
events = list(enumerate(events.split(sep)))
output = {}
seen = {}
for n in range(limit, 0, -1):
for combination in itertools.combinations(events, n):
indexes, key = zip(*combination)
if key in seen:
if key not in output and seen[key].isdisjoint(indexes):
output[key] = sep.join(key)
else:
seen[key] = set(indexes)
if output:
break
return list(output.values())
This looks at the longest matches first and terminates early if one is found. It eliminates self-overlapping repeated subsequences by saving the indexes of the last match and comparing them with the current candidate.
Demo:
samples = (
'A->B->E->D->A->C->B->D',
'A->B->C->A->B',
'A->B->A',
'A->B->E->D->A->C->B->E->D->A',
'B->B->B->C->C',
'A->B->A->B->C->C',
'A',
'',
)
for index, sample in enumerate(samples, 1):
result = longest_subsequence(sample, 4)
print('(%s) %r\n%s\n' % (index, sample, result))
Output:
(1) 'A->B->E->D->A->C->B->D'
['A->B->D']
(2) 'A->B->C->A->B'
['A->B']
(3) 'A->B->A'
['A']
(4) 'A->B->E->D->A->C->B->E->D->A'
['A->B->E->D', 'B->E->D->A']
(5) 'B->B->B->C->C'
['B->C']
(6) 'A->B->A->B->C->C'
['A->B->C']
(7) 'A'
[]
(8) ''
[]
Match longest repeating sequence (that is not made of a repeating sequence)
You can use two lookaheads, the first one searches the longest pattern and the second searches the smallest. The second pattern repeated has to end (at least) at the same position or after the first pattern repeated. To check that, you have to capture the end of the string in the first lookahead and to use a backreference to this capture in the second lookahead.
(?=(.+)\1+(.*))(?=(.+?)\3+\2$)\3+
demo
The result is in the group 3
also:
(?=(.+)\1+(.*))(.+?)\3+(?=\2$)\3*
Note that these two regex patterns give the result for one position in a string. If you want to know what is the shortest pattern that is the longest substring once repeated for all the string, you have to find all of them and to select the longest susbstring with code. You can use a pattern for overlapping results to do that:
(?=(.+)\1+(.*))(?=(.+?)\3+\2$)(?=(\3+))
(using the group 4)
Related Topics
Why Does This Not Work as an Array Membership Test
Too Many Different Python Versions on My System and Causing Problems
Nonlocal Keyword in Python 2.X
Opencv Real Time Streaming Video Capture Is Slow. How to Drop Frames or Get Synced with Real Time
Pycharm Current Working Directory
How to Flatten Lists Without Splitting Strings
How to Get the Current Time in Milliseconds in Python
Combining Conda Environment.Yml with Pip Requirements.Txt
Find the Closest Date to a Given Date
Differencebetween .Quit and .Quit in Pygame
Why Can a Python Dict Have Multiple Keys with the Same Hash
Using Requests with Tls Doesn't Give Sni Support
How to Request a Url in Python and Not Follow Redirects
What Does the Term "Broadcasting" Mean in Pandas Documentation