﻿ Finding a Subsequence in Longer Sequence - ITCodar

# Finding a Subsequence in Longer Sequence

## How to find the longest continuous subsequence whose reverse is also a subsequence

Actually this is the longest common substring problem applied to the sequence and its reverse: http://en.wikipedia.org/wiki/Longest_common_substring_problem

This is distinct from longest common subsequence: http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence

## Longest common sub-sequence with a certain property?

First let's reduce the number of states: Let f(i, j, d) be the length of the longest common zig-zag sequence starting at position i in the first string and position j in the second string and starting with direction d (up or down).

We have the recurrence

f(i, j, up) >= MAX(i' > i, j' > j : f(i', j', up))
if s1[i] = s2[j]:
f(i, j, up) >= MAX(i' > i, j' > j, s1[i'] > x : f(i', j', down))

an similar for the down direction. Solving this in a straightforward way
will lead to a runtime of something like O(n4 · W) where W is the range of integers in the array. W is not polynomially bounded, so we definitely want to get rid of this factor, and ideally a couple of n factors along the way.

To solve the first part, you have to find the maximum f(i', j', up) with
i' > i and j' > j. This is a standard standard 2-d orthogonal range maximum query.

For the second case, you need to find the maximum (i', j', down) with i' > i, j' > j and s1[i'] > s1[i]. That is a range maximum query in the rectangle (i, ∞) x (j, ∞) x (s1[i], ∞).

Now having 3 dimensions here looks scary. However, if we process the states in say, decreasing order of i, then we can get rid of one dimension.
We thus reduced the problem to a range query in the rectangle (j, ∞) x (s1[i], ∞). Coordinate compression gets the dimension of values down to O(n).

You can use a 2-d data structure such as a range tree or binary-indexed tree to solve both kinds of range queries in O(log2 n). The total runtime will be O(n2 · log2 n).

You can get rid of one log factor using fractional cascading, but that is associated with a high constant factor. The runtime is then only one log-factor short of that for finding the longest common subsequence, which seems like a lower-bound for our problem.

## Longest Common Sub-sequence of N sequences (for diff purposes)

Yes, you're missing the correctness: there is no guarantee that the LCS of a pair of strings will have any overlap whatsoever with the LCS of the set overall. Consider this example:

aaabb1xyz
aaabb2xyz
cccdd1xyz
cccdd2xyz

If you pair these in the given order, you'll get LCSs of aaabb and cccdd, missing the xyz for the set.

If, as you say, the strings are almost all identical, perhaps the differences aren't a problem for you. If the not-identical strings are very similar to the "median" string, then your incremental solution will work well enough for your purposes.

Another possibility is to do LCS on random pairs of strings until that median string emerges; then you start from that common point, and you should have a "good enough" solution.