Find Common Substring Between Two Strings

Find common substring between two strings

Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

def longestSubstringFinder(string1, string2):
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
match = ""
for j in range(len2):
if (i + j < len1 and string1[i + j] == string2[j]):
match += string2[j]
else:
if (len(match) > len(answer)): answer = match
match = ""
return answer

print(longestSubstringFinder("apple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("bapples", "cappleses"))

Output

apple pie
apples
apples

Finding all the common substrings of given two strings

You would be better off with a proper algorithm for the task rather than a brute-force approach. Wikipedia describes two common solutions to the longest common substring problem: suffix-tree and dynamic-programming.

The dynamic programming solution takes O(n m) time and O(n m) space. This is pretty much a straightforward Java translation of the Wikipedia pseudocode for the longest common substring:

public static Set<String> longestCommonSubstrings(String s, String t) {
int[][] table = new int[s.length()][t.length()];
int longest = 0;
Set<String> result = new HashSet<>();

for (int i = 0; i < s.length(); i++) {
for (int j = 0; j < t.length(); j++) {
if (s.charAt(i) != t.charAt(j)) {
continue;
}

table[i][j] = (i == 0 || j == 0) ? 1
: 1 + table[i - 1][j - 1];
if (table[i][j] > longest) {
longest = table[i][j];
result.clear();
}
if (table[i][j] == longest) {
result.add(s.substring(i - longest + 1, i + 1));
}
}
}
return result;
}

Now, you want all of the common substrings, not just the longest. You can enhance this algorithm to include shorter results. Let's examine the table for the example inputs eatsleepnightxyz and eatsleepabcxyz:

  e a t s l e e p a b c x y z
e 1 0 0 0 0 1 1 0 0 0 0 0 0 0
a 0 2 0 0 0 0 0 0 1 0 0 0 0 0
t 0 0 3 0 0 0 0 0 0 0 0 0 0 0
s 0 0 0 4 0 0 0 0 0 0 0 0 0 0
l 0 0 0 0 5 0 0 0 0 0 0 0 0 0
e 1 0 0 0 0 6 1 0 0 0 0 0 0 0
e 1 0 0 0 0 1 7 0 0 0 0 0 0 0
p 0 0 0 0 0 0 0 8 0 0 0 0 0 0
n 0 0 0 0 0 0 0 0 0 0 0 0 0 0
i 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0
h 0 0 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 0 0 0 0 0 0 0 0 0 0 0
x 0 0 0 0 0 0 0 0 0 0 0 1 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 2 0
z 0 0 0 0 0 0 0 0 0 0 0 0 0 3
  • The eatsleep result is obvious: that's the 12345678 diagonal streak at the top-left.
  • The xyz result is the 123 diagonal at the bottom-right.
  • The a result is indicated by the 1 near the top (second row, ninth column).
  • The t result is indicated by the 1 near the bottom left.

What about the other 1s at the left, the top, and next to the 6 and 7? Those don't count because they appear within the rectangle formed by the 12345678 diagonal — in other words, they are already covered by eatsleep.

I recommend doing one pass doing nothing but building the table. Then, make a second pass, iterating backwards from the bottom-right, to gather the result set.

Function to find all common substrings in two strings not giving correct output

  1. Don't break
  2. Check the length of the string before adding it to avoid results like "A"
  3. return the function result instead of printing inside the function

Like this:

def substringFinder(string1, string2):
answer = ""
anslist=[]
len1, len2 = len(string1), len(string2)
for i in range(len1):
match = ""
for j in range(len2):
if (i + j < len1 and string1[i + j] == string2[j]):
match += string2[j]
else:
#if (len(match) > len(answer)):
answer = match
if answer != '' and len(answer) > 1:
anslist.append(answer)
match = ""

if match != '':
anslist.append(match)
# break
return anslist

print substringFinder("AHAMMAD", "AHAMAD")

result:
['AHAM', 'MAD']

All Common Substrings Between Two Strings

    public static string [] CommonString(string left, string right)
{
List<string> result = new List<string>();
string [] rightArray = right.Split();
string [] leftArray = left.Split();

result.AddRange(rightArray.Where(r => leftArray.Any(l => l.StartsWith(r))));

// must check other way in case left array contains smaller words than right array
result.AddRange(leftArray.Where(l => rightArray.Any(r => r.StartsWith(l))));

return result.Distinct().ToArray();
}

How can I find the Largest Common Substring between two strings in PHP?

I have since found a relevant wikipedia article. It is not a NP complete problem, it can be done in O(mn) time using a dynamic programming algorithm.

In PHP I found the similar_text function very useful. Here's a code sample to retrieve a series of text emails and loop through them and find ones that are 90% similar to each other. Note: Something like this is NOT scalable:

<?php
// Gather all messages by a user into two identical associative arrays
$getMsgsRes = mysql_query(SELECT * FROM email_messages WHERE from = '$someUserID');
while($msgInfo = mysql_fetch_assoc($getMsgsRes))
{
$msgsInfo1[] = $msgInfo;
$msgsInfo2[] = $msgInfo;
}

// Loop over msgs and compare each one to every other
foreach ($msgsInfo1 as $msg1)
foreach ($msgsInfo2 as $msg2)
similar_text($msg1['msgTxt'],$msg2['msgTxt'],$similarity_pst);
if ($similarity_pst > 90)
echo "{$msg1['msgID']} is ${similarity_pst}% to {$msg2['msgID']}\n";
?>


Related Topics



Leave a reply



Submit