How to find the longest continuous subsequence whose reverse is also a subsequence
Actually this is the longest common substring problem applied to the sequence and its reverse: http://en.wikipedia.org/wiki/Longest_common_substring_problem
This is distinct from longest common subsequence: http://en.wikipedia.org/wiki/Subsequence#Substring_vs._subsequence
Longest common sub-sequence with a certain property?
First let's reduce the number of states: Let f(i, j, d)
be the length of the longest common zig-zag sequence starting at position i in the first string and position j in the second string and starting with direction d (up or down).
We have the recurrence
f(i, j, up) >= MAX(i' > i, j' > j : f(i', j', up))
if s1[i] = s2[j]:
f(i, j, up) >= MAX(i' > i, j' > j, s1[i'] > x : f(i', j', down))
an similar for the down
direction. Solving this in a straightforward way
will lead to a runtime of something like O(n4 · W) where W is the range of integers in the array. W is not polynomially bounded, so we definitely want to get rid of this factor, and ideally a couple of n factors along the way.
To solve the first part, you have to find the maximum f(i', j', up) with
i' > i and j' > j. This is a standard standard 2-d orthogonal range maximum query.
For the second case, you need to find the maximum (i', j', down) with i' > i, j' > j and s1[i'] > s1[i]. That is a range maximum query in the rectangle (i, ∞) x (j, ∞) x (s1[i], ∞).
Now having 3 dimensions here looks scary. However, if we process the states in say, decreasing order of i, then we can get rid of one dimension.
We thus reduced the problem to a range query in the rectangle (j, ∞) x (s1[i], ∞). Coordinate compression gets the dimension of values down to O(n).
You can use a 2-d data structure such as a range tree or binary-indexed tree to solve both kinds of range queries in O(log2 n). The total runtime will be O(n2 · log2 n).
You can get rid of one log factor using fractional cascading, but that is associated with a high constant factor. The runtime is then only one log-factor short of that for finding the longest common subsequence, which seems like a lower-bound for our problem.
Longest Common Sub-sequence of N sequences (for diff purposes)
Yes, you're missing the correctness: there is no guarantee that the LCS of a pair of strings will have any overlap whatsoever with the LCS of the set overall. Consider this example:
aaabb1xyz
aaabb2xyz
cccdd1xyz
cccdd2xyz
If you pair these in the given order, you'll get LCSs of aaabb
and cccdd
, missing the xyz
for the set.
If, as you say, the strings are almost all identical, perhaps the differences aren't a problem for you. If the not-identical strings are very similar to the "median" string, then your incremental solution will work well enough for your purposes.
Another possibility is to do LCS on random pairs of strings until that median string emerges; then you start from that common point, and you should have a "good enough" solution.
Related Topics
Is the C# Compiler Smart Enough to Optimize This Code
Task<> Does Not Contain a Definition for 'Getawaiter'
How Can Xml Documentation for Web API Include Documentation from Beyond the Main Project
Private Inner Classes in C# - Why Aren't They Used More Often
Dependency Injection VS Service Location
In C#, How to Cast a List<Child> to List<Parent>
Configurationmanager.Appsettings - How to Modify and Save
Having the Output of a Console Application in Visual Studio Instead of the Console
How to Pass an Object to Httpclient.Postasync and Serialize as a JSON Body
.Net Core 3.0: Razor Views Don't Automatically Recompile on Change
Copying Files into the Application Folder at Compile Time
Linq Order By, Group by and Order by Each Group