Algorithms for string similarities (better than Levenshtein and similar_text)? PHP, JS
Here's a solution that I've come up with. It's based on Tim's suggestion of comparing the order of subsequent characters. Some results:
- jonas / jonax : 0.8
- jonas / sjona : 0.68
- jonas / sjonas : 0.66
- jonas / asjon : 0.52
- jonas / xxjon : 0.36
I'm sure it isn't perfect and that it could be optimized, but it nevertheless seems to produce the results that I'm after.
One weak spot is that when the strings have different lengths, it produces a different result when the values are swapped.
static public function string_compare($str_a, $str_b)
{
    $length = strlen($str_a);
    $length_b = strlen($str_b);
    $i = 0;
    $segmentcount = 0;
    $segmentsinfo = array();
    $segment = '';
    // Walk through $str_a and grow segments that also occur in $str_b
    while ($i < $length)
    {
        $char = substr($str_a, $i, 1);
        if (strpos($str_b, $char) !== FALSE)
        {
            $segment = $segment.$char;
            if (strpos($str_b, $segment) !== FALSE)
            {
                // Score the segment by how far it has moved and how long it is;
                // a longer match overwrites the shorter one in the same slot
                $segmentpos_a = $i - strlen($segment) + 1;
                $segmentpos_b = strpos($str_b, $segment);
                $positiondiff = abs($segmentpos_a - $segmentpos_b);
                $posfactor = ($length - $positiondiff) / $length_b; // <-- ?
                $lengthfactor = strlen($segment)/$length;
                $segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
            }
            else
            {
                // The extended segment is not in $str_b: start a new segment
                // and revisit the current character
                $segment = '';
                $i--;
                $segmentcount++;
            }
        }
        else
        {
            // The character does not occur in $str_b at all: close the segment
            $segment = '';
            $segmentcount++;
        }
        $i++;
    }
    // PHP 5.3 lambda in array_map
    $totalscore = array_sum(array_map(function($v) { return $v['score']; }, $segmentsinfo));
    return $totalscore;
}
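One possible way around the weak spot mentioned above (a different score when the arguments are swapped) is to average the score over both argument orders. The sketch below is only an illustration: symmetric_score() and char_coverage() are hypothetical names, and char_coverage() is a deliberately simple stand-in scorer, not the string_compare() method above.

```php
<?php
// Hypothetical helper: averages a scorer over both argument orders, so
// swapping the inputs cannot change the result.
function symmetric_score(callable $scorer, string $a, string $b): float
{
    return ($scorer($a, $b) + $scorer($b, $a)) / 2;
}

// Stand-in asymmetric scorer (an assumption for this demo): the share of
// characters of $a that occur anywhere in $b.
function char_coverage(string $a, string $b): float
{
    if ($a === '') {
        return 0.0;
    }
    $hits = 0;
    foreach (str_split($a) as $char) {
        if (strpos($b, $char) !== false) {
            $hits++;
        }
    }
    return $hits / strlen($a);
}

echo char_coverage('jonas', 'jon'), "\n";                     // 0.6
echo char_coverage('jon', 'jonas'), "\n";                     // 1
echo symmetric_score('char_coverage', 'jonas', 'jon'), "\n";  // 0.8
echo symmetric_score('char_coverage', 'jon', 'jonas'), "\n";  // 0.8
```

Any callable with the same two-argument shape could be passed in instead, for example array('SomeClass', 'string_compare') for a static method, where SomeClass is a placeholder for whatever class holds it.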
How to improve PHP string match with similar_text()?
Levenshtein distance: http://php.net/manual/en/function.levenshtein.php
It works in reverse to similar_text(): it returns an edit distance rather than a percentage, so 0 means there is no difference.
// <!-- Overcast, Rain or Showers compared Overcast, Rain or Showers is 0 -->
// <!-- Overcast, Risk of Rain or Showers compared Overcast, Rain or Showers is 8 -->
// <!-- Overcast, Chance of Rain or Showers compared Overcast, Rain or Showers is 10 -->
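Because levenshtein() returns a raw edit count, it can help to scale it by the length of the longer string when you want something comparable to a similarity score. A minimal sketch, assuming the default edit costs; levenshtein_similarity() is a hypothetical helper name, not a built-in:

```php
<?php
// Hypothetical helper: maps levenshtein() onto a 0..1 similarity score,
// where 1 means identical strings.
function levenshtein_similarity(string $a, string $b): float
{
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 1.0; // two empty strings are identical
    }
    // With default costs the distance never exceeds the longer length,
    // so the result stays between 0 and 1.
    return 1.0 - levenshtein($a, $b) / $longest;
}

echo levenshtein('jonas', 'jonax'), "\n";            // 1
echo levenshtein_similarity('jonas', 'jonax'), "\n"; // 0.8
```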
PHP check similarity of multiple strings
Well, this doesn't actually seem to be a problem, because different users can have email IDs that differ only slightly.
How can you tell that the users with email IDs nike1@gmail.com and nike2@gmail.com are the same person as nike@gmail.com?
However, if you still want to check:
1) Remove the trailing numbers using a regex or something similar.
2) Then check whether the original email ID exists in your database.
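The two steps above could be sketched roughly as follows. The function name strip_trailing_digits() and the $known_emails array are assumptions made for illustration; the array stands in for a real database lookup.

```php
<?php
// Hypothetical helper: drops digits that sit directly before the "@",
// e.g. "nike1@gmail.com" becomes "nike@gmail.com".
function strip_trailing_digits(string $email): string
{
    // (?<=\D) leaves all-numeric local parts such as "123@example.com" alone.
    return preg_replace('/(?<=\D)\d+(?=@)/', '', $email);
}

// Stand-in for the database: a real application would query its users table.
$known_emails = array('nike@gmail.com');

foreach (array('nike1@gmail.com', 'nike2@gmail.com') as $candidate) {
    $base = strip_trailing_digits($candidate);
    if (in_array($base, $known_emails, true)) {
        echo "$candidate may be a variant of $base\n";
    }
}
```

As noted above, a match only suggests that the addresses might belong to the same person; they can just as well belong to different users.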
PHP String Comparison and similarity index
See similar_text(). If you want to exclude spaces, simply call str_replace(' ', '', $string) beforehand.
echo similar_text ( 'LEGENDARY' , 'BARNEYSTINSON', $percent); // outputs 3
echo $percent; // outputs 27.272727272727
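Stripping the spaces first can be wrapped in a small helper. This is only a sketch: similarity_percent() is a hypothetical name, and the lowercasing step is an extra assumption for case-insensitive comparison rather than something similar_text() does by itself.

```php
<?php
// Hypothetical helper: similar_text() percentage after removing spaces
// and lowercasing both inputs.
function similarity_percent(string $a, string $b): float
{
    $a = strtolower(str_replace(' ', '', $a));
    $b = strtolower(str_replace(' ', '', $b));
    similar_text($a, $b, $percent); // $percent is filled in by reference
    return (float) $percent;
}

echo similarity_percent('Barney Stinson', 'BARNEYSTINSON'), "\n"; // 100
```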
Here's another way using only unique characters
<?php
function unique_chars($string) {
    return count_chars(strtolower(str_replace(' ', '', $string)), 3);
}
function compare_strings($a, $b) {
    $index = similar_text(unique_chars($a), unique_chars($b), $percent);
    return array('index' => $index, 'percent' => $percent);
}
print_r( compare_strings('LEGENDARY', 'BARNEY STINSON') );
// outputs:
?>
Array
(
[index] => 5
[percent] => 55.555555555556
)