How to Find the Largest Common Substring Between Two Strings in PHP

How can I find the Largest Common Substring between two strings in PHP?

I have since found a relevant wikipedia article. It is not a NP complete problem, it can be done in O(mn) time using a dynamic programming algorithm.

In PHP I found the similar_text function very useful. Here's a code sample to retrieve a series of text emails and loop through them and find ones that are 90% similar to each other. Note: Something like this is NOT scalable:

<?php
// Gather all messages by a user into two identical associative arrays
$getMsgsRes = mysql_query(SELECT * FROM email_messages WHERE from = '$someUserID');
while($msgInfo = mysql_fetch_assoc($getMsgsRes))
{
$msgsInfo1[] = $msgInfo;
$msgsInfo2[] = $msgInfo;
}

// Loop over msgs and compare each one to every other
foreach ($msgsInfo1 as $msg1)
foreach ($msgsInfo2 as $msg2)
similar_text($msg1['msgTxt'],$msg2['msgTxt'],$similarity_pst);
if ($similarity_pst > 90)
echo "{$msg1['msgID']} is ${similarity_pst}% to {$msg2['msgID']}\n";
?>

How to get a substring between two strings in PHP?

If the strings are different (ie: [foo] & [/foo]), take a look at this post from Justin Cook.
I copy his code below:

function get_string_between($string, $start, $end){
$string = ' ' . $string;
$ini = strpos($string, $start);
if ($ini == 0) return '';
$ini += strlen($start);
$len = strpos($string, $end, $ini) - $ini;
return substr($string, $ini, $len);
}

$fullstring = 'this is my [tag]dog[/tag]';
$parsed = get_string_between($fullstring, '[tag]', '[/tag]');

echo $parsed; // (result = dog)

Print ALL common parts between 2 strings

I modified your function to get longest matches in array

<?php
function getLongestMatchingSubstring($str1, $str2)
{
$len_1 = strlen($str1);
$longest = array();
for($i = 0; $i < $len_1; $i++){
for($j = $len_1 - $i; $j > 0; $j--){
$sub = substr($str1, $i, $j);
if (strpos($str2, $sub) !== false && strlen($sub) > 1){
$longest[] = $sub;
$i = strpos($str1, $sub) + (strlen($sub)-1);
break;
}
}
}
return $longest;
}

$string1 = 'asbaMbaeMATCH1asjdndreyMATCH22r5g7jdg3MATCH33';
$string2 = '1241M342gtMAThgMATCH1m1g5mMATCH2cghdiMATCH33';
print_r(getLongestMatchingSubstring($string1, $string2));

the output would be

Array([0] => M [1] => MATCH1 [2] => MATCH2 [3] => MATCH33 )

Find common substring between two strings

Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

def longestSubstringFinder(string1, string2):
answer = ""
len1, len2 = len(string1), len(string2)
for i in range(len1):
match = ""
for j in range(len2):
if (i + j < len1 and string1[i + j] == string2[j]):
match += string2[j]
else:
if (len(match) > len(answer)): answer = match
match = ""
return answer

print(longestSubstringFinder("apple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("bapples", "cappleses"))

Output

apple pie
apples
apples

Find common substrings from 2 separate sets

As Voitcus said before, you have to clean your data first before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For the numbers which do not match the pattern try to adjust them to it. Then you have the ability to look for duplicates.

Morevover you should do data-cleaning before persisting it, maybe in a seperate column. You then dont have to care for that when looking for duplicates ... just to avoid performance peaks.

Algorithms like levenshtein or similar_text() in php, doesnt fit to that use-case quite well.

PHP: find biggest overlap between multiple strings

Update

I've completely misunderstood the question; the aim was to find the biggest overlap between an array of strings:

$array = array('abc123', 'ac123', 'tbc123', '1ac123');

function overlap($a, $b)
{
if (!strlen($b)) {
return '';
}

if (strpos($a, $b) !== false) {
return $b;
}

$left = overlap($a, substr($b, 1));
$right = overlap($a, substr($b, 0, -1));

return strlen($left) > strlen($right) ? $left : $right;
}

$biggest = null;
foreach ($array as $item) {
if ($biggest === null) {
$biggest = $item;
}
if (($biggest = overlap($biggest, $item)) === '') {
break;
}
}

echo "Biggest match = $biggest\n";

I'm not great at recursion, but I believe this should work ;-)

Old answer

I would probably use preg_grep() for that; it returns an array with the matches it found based on your search string:

$matches = preg_grep('/' . preg_quote($find, '/') . '/', $array);

Alternatively, you could use array_filter():

$matches = array_filter($array, function($item) use ($find) {
return strpos($item, $find) !== false;
});

I need to extract the value "c123" like it is the biggest match for all strings in array

I think what you would want to do here is then sort the above output based on string length (i.e. smallest string length first) and then take the first item:

if ($matches) {
usort($matches, function($a, $b) {
return strlen($a) - strlen($b);
});
echo current($matches); // take first one: ac123
}

Let me know if I'm wrong about that.


If you're just after knowing whether $find matches an element exactly:

$matching_keys = array_keys($array, $find, true); // could be empty array

Or:

$matching_key = array_search($find, $array, true); // could be false

Or event:

$have_value = in_array($find, $array, true);


Related Topics



Leave a reply



Submit