Find Longest Repetitive Sequence in a String

Find longest repetitive sequence in a string

This problem is a variant of the longest repeated substring problem and there is an O(n)-time algorithm for solving it that uses suffix trees. The idea (as suggested by Wikipedia) is to construct a suffix tree (time O(n)), annotate all the nodes in the tree with the number of descendants (time O(n) using a DFS), and then to find the deepest node in the tree with at least three descendants (time O(n) using a DFS). This overall algorithm takes time O(n).

That said, suffix trees are notoriously hard to construct, so you would probably want to find a Python library that implements suffix trees for you before attempting this implementation. A quick Google search turns up this library, though I'm not sure whether this is a good implementation.

Another option would be to use suffix arrays in conjunction with LCP arrays. You can iterate over pairs of adjacent elements in the LCP array, taking the minimum of each pair, and store the largest number you find this way. That will correspond to the length of the longest string that repeats at least three times, and from there you can then read off the string itself.

There are several simple algorithms for building suffix arrays (the Manber-Myers algorithm runs in time O(n log n) and isn't too hard to code up), and Kasai's algorithm builds LCP arrays in time O(n) and is fairly straightforward to code up.

Hope this helps!

How to find the longest repeated sequence of a string

To be precise you are looking for the longest 3-fold substring.

A k-fold substring of a string s is a repeated substring of s that appears at least k times in s

There was a similar python question a few years ago that would give you a lot of good information. Specifically take a look at Finding the Longest Multiple Repeat. There is a solution on GitHub but it is also in Python.

Longest Repeating Character In String - Javascript

Here is my solution:

function longestRepetition (str) {
if (str.length === 0) {
return ['', 0]
}
let longest = '';
let chunk = '';
for (let i = 0; i < str.length; i++) {
if (i === 0) {
if (str[i] === str[i + 1]) {
chunk += str[i];
}
}
if (i > 0) {
if (str[i] === str[i - 1]) {
chunk += str[i];
console.log('chunk**', chunk);
}
if (str[i] !== str[i - 1]) {
chunk = str[i];
}
if (chunk.length > longest.length) {
longest = chunk;
}
}
}
return [longest[0], longest.length];
}

console.log(longestRepetition("bbbaaabaaaa"))

The longest repeating sequence of byte

Here's my hacked together solution I wrote yesterday...

Basically it checks if input.charAt(i) == input.charAt(i + 1) and if so, runs a second loop until they don't match, all the while appending to a String, and adds to a List. And repeat.

Then check the List for the highest occurrence (shamelessly stolen from here)

public static void addToList(String input) {
String temp;
List<String> l = new ArrayList<>();
for (int i = 0; i < input.length() - 1; i++) {
if (input.charAt(i) == input.charAt(i + 1)) {
temp = String.valueOf(input.charAt(i));
for (int j = i; j < input.length() - 1; j++) {
if (input.charAt(j) == input.charAt(j + 1)) {
temp += String.valueOf(input.charAt(j + 1));
if (j == input.length() - 2) {
i = j;
if (!temp.isEmpty()) {
l.add(temp);
}
break;
}
} else {
i = j - 1;
if (!temp.isEmpty()) {
l.add(temp);
}
break;
}
}
}
}
System.out.println(getHighestOccurences(l));
}

public static String getHighestOccurences(List<String> list) {
int max = 0;
int curr;
String currKey = null;
Set<String> unique = new HashSet<>(list);
for (String key : unique) {
curr = Collections.frequency(list, key);
if (max < curr) {
max = curr;
currKey = key;
}
}
return currKey;
}

With your input being String input = "7888885466662716666"; and calling addToList(input); gives an output of:

6666

.

Online Demo

Longest substring repeating atleast k times

There was large discussion in comments, I think it's better write an answer to sum up. TL;DR Longest substring repeating atleast k times

There is less efficient method, but it's really easier to understand than suffix trees: all you need to know is polynomial hashing and binary search.

1. String polynomial hash

Read about it here https://cp-algorithms.com/string/string-hashing.html. Below is short description of this technique.

Let's say we have string s, integers p and mod. Now we can define hash function:

hash(s) = (ord(s[0])*p^(len(s)-1) + ord(s[1])*p^(len(s)-2) + ... + ord(s[len(s)-1])*p^0) % mod 

where ord is a function returning an integer by character (let's say it's an ASCII-code of a character). Polynomial hash can be easily calculated for each prefix of string in O(len(s)):

# h[i] is a hash of prefix of length i.
# For example s = "abacaba",
# h[0] = hash("") = 0
# h[1] = hash("a")
# h[2] = hash("ab")
# ...
# h[7] = hash("abacaba")

h[0] = 0
for i in 1..n:
h[i] = (h[i-1] * p + ord(s[i-1])) % mod

Also let's precalculate p^0 % mod, p^1 % mod, ..., p^len(s) % mod in array pow:

# pow[i] is (p^i) % mod
pow[0] = 1
for i in 1..n:
pow[i] = (pow[i-1] * p) % mod

Using arrays h and pow we can easily calculate hash of any substring of the string s:

# get_substring_hash returns hash(s[l] + s[l+1] + ... + s[r-1]).
def get_substring_hash(s, l, r):
value = h[r] - h[l]*pow[r-l] # line a
return (value%mod + mod) % mod # line b

Let's understand why code above works.

h[r] = (ord(s[r-1])*p^0 + ord(s[r-2])*p^1 + ... + ord(s[l-1])*p^(r-l) + ord(s[l-2])*p^(r-l+1) + ...) % mod
h[l] = ( ord(s[l-1])*p^0 + ord(s[l-2])*p^1 + ...) % mod
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As you can see we need only ^^^-part from h[r] so we have to get rid of ~~~-part. ~~~-part in h[r] p^(r-l) times larger than in h[l] and this explains line a.

Line b is kinda magic when operating with % mod, value after line a can be negative, so value%mod + mod definitely makes is positive. At the same time if value was positive after line a value%mod + mod is larger than mod, so (value%mod + mod) % mod will definitely return value in range 0, 1, ..., mod-1.

Finally, mod is a large prime number like 10^9+7 and value is a small number, but larger than any possible ASCII-code like 239 (read in article why so).

Some notes:

  • Hashes may collide, because we have only mod possible values of hash but number of strings is infinite. How to deal with it read in article.
  • Doing operations like h[r] - h[l]*pow[r-l] may require using of 64-bit types of integers.

2. Binary search

Just read about it on Wikipedia, there's nothing specific https://en.wikipedia.org/wiki/Binary_search_algorithm.

3. Longest substring repeating at least k times solution

Let's say we precalculated arrays h and pow. Let's do binary search to find maximum length ans of string such that there are k or more such substrings in the given string s.

Why binary search works here? Because if there's some length x such as there are k or more equal substrings in s of length x, then there's definitely k or more equal substrings in s of length x-1 (just remove last letter from our matches).

How will binary search work? Let's say we're currently testing if there are k or more equal substrings of length mid. We're going to calculate all hashes of length mid (using get_substring_hash) and we'll make a decision of changing borders of binary search if there is a k equal hashes or not.

For example: s = "abcabcdefgdefgdefgdefg", k = 3. Answer is "defgdefg":

abcabcdefgdefgdefgdefg
^^^^^^^^ occurence 1
^^^^^^^^ occurence 2
^^^^^^^^ occurence 3

Binary search iterations:

l =  1, r = 22, mid = 11. No substring of length 11 satisfy.
l = 1, r = 10, mid = 5. There should be hash("defgd") be seen 3 times.
l = 5, r = 10, mid = 7. There should be hash("defgdef") be seen 3 times.
l = 7, r = 10, mid = 8. There should be hash("defgdefg") be seen 3 times.
l = 8, r = 10, mid = 9. No substring of length 9 satisfy.
l = 8, r = 8. That means answer is 8.

As you can see, complexity is O(n log n): round(log n) binary search iterations and O(n) complexity per iteration if you use something like std::unordered_map to check if there's a hash with >= k occurrences or not.

I really hope everything is clear now.

Find the longest repeated subsequence of max length N in Python

Here is a simpler and more efficient solution:

def longest_subsequence(events, limit, sep='->'):
events = list(enumerate(events.split(sep)))
output = {}
seen = {}
for n in range(limit, 0, -1):
for combination in itertools.combinations(events, n):
indexes, key = zip(*combination)
if key in seen:
if key not in output and seen[key].isdisjoint(indexes):
output[key] = sep.join(key)
else:
seen[key] = set(indexes)
if output:
break
return list(output.values())

This looks at the longest matches first and terminates early if one is found. It eliminates self-overlapping repeated subsequences by saving the indexes of the last match and comparing them with the current candidate.

Demo:

samples = (
'A->B->E->D->A->C->B->D',
'A->B->C->A->B',
'A->B->A',
'A->B->E->D->A->C->B->E->D->A',
'B->B->B->C->C',
'A->B->A->B->C->C',
'A',
'',
)

for index, sample in enumerate(samples, 1):
result = longest_subsequence(sample, 4)
print('(%s) %r\n%s\n' % (index, sample, result))

Output:

(1) 'A->B->E->D->A->C->B->D'
['A->B->D']

(2) 'A->B->C->A->B'
['A->B']

(3) 'A->B->A'
['A']

(4) 'A->B->E->D->A->C->B->E->D->A'
['A->B->E->D', 'B->E->D->A']

(5) 'B->B->B->C->C'
['B->C']

(6) 'A->B->A->B->C->C'
['A->B->C']

(7) 'A'
[]

(8) ''
[]

Match longest repeating sequence (that is not made of a repeating sequence)

You can use two lookaheads, the first one searches the longest pattern and the second searches the smallest. The second pattern repeated has to end (at least) at the same position or after the first pattern repeated. To check that, you have to capture the end of the string in the first lookahead and to use a backreference to this capture in the second lookahead.

(?=(.+)\1+(.*))(?=(.+?)\3+\2$)\3+

demo

The result is in the group 3

also:

(?=(.+)\1+(.*))(.+?)\3+(?=\2$)\3*

Note that these two regex patterns give the result for one position in a string. If you want to know what is the shortest pattern that is the longest substring once repeated for all the string, you have to find all of them and to select the longest susbstring with code. You can use a pattern for overlapping results to do that:

(?=(.+)\1+(.*))(?=(.+?)\3+\2$)(?=(\3+))

(using the group 4)



Related Topics



Leave a reply



Submit