How Does Similar_Text Work

How does similar_text work?

It would indeed seem the function uses different logic depending of the parameter order. I think there are two things at play.

First, see this example:

echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2

It seems to be that it is testing "how many times any distinct char on param1 is found in param2", and thus result would be different if you swap the params around. It has been reported as a bug, which has been closed as "working as expected".

Now, the above is the same for both PHP and javascript implementations - paremeter order has an impact, so saying that JS code wouldn't do this is wrong. This is argued in the bug entry as intended behaviour.

Second - what doesn't seem correct is the MYSQL/PHP word example. With that, javascript version gives 3 irrelevant of the order of params, whereas PHP gives 2 and 3 (and due to that, percentage is equally different). Now, the phrases "PHP IS GREAT" and "WITH MYSQL" should have 5 characters in common, irrelevant of which way you compare: H, I, S and T, one each, plus one for empty space. In order they have 3 characters, 'H', ' ' and 'S', so if you look at the ordering, correct answer should be 3 both ways. I modified the C code to a runnable version, and added some output, so one can see what is happening there (codepad link):

#include<stdio.h>

/* {{{ php_similar_str
*/
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
char *p, *q;
char *end1 = (char *) txt1 + len1;
char *end2 = (char *) txt2 + len2;
int l;

*max = 0;
for (p = (char *) txt1; p < end1; p++) {
for (q = (char *) txt2; q < end2; q++) {
for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
if (l > *max) {
*max = l;
*pos1 = p - txt1;
*pos2 = q - txt2;
}
}
}
}
/* }}} */

/* {{{ php_similar_char
*/
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
int sum;
int pos1, pos2, max;

php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);

if ((sum = max)) {
if (pos1 && pos2) {
printf("txt here %s,%s\n", txt1, txt2);
sum += php_similar_char(txt1, pos1,
txt2, pos2);
}
if ((pos1 + max < len1) && (pos2 + max < len2)) {
printf("txt here %s,%s\n", txt1+ pos1 + max, txt2+ pos2 + max);
sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
txt2 + pos2 + max, len2 - pos2 - max);
}
}

return sum;
}
/* }}} */
int main(void)
{
printf("Found %d similar chars\n",
php_similar_char("PHP IS GREAT", 12, "WITH MYSQL", 10));
printf("Found %d similar chars\n",
php_similar_char("WITH MYSQL", 10,"PHP IS GREAT", 12));
return 0;
}

the result is output:

txt here PHP IS GREAT,WITH MYSQL
txt here P IS GREAT, MYSQL
txt here IS GREAT,MYSQL
txt here IS GREAT,MYSQL
txt here GREAT,QL
Found 3 similar chars
txt here WITH MYSQL,PHP IS GREAT
txt here TH MYSQL,S GREAT
Found 2 similar chars

So one can see that on the first comparison, the function found 'H', ' ' and 'S', but not 'T', and got the result of 3. The second comparison found 'I' and 'T' but not 'H', ' ' or 'S', and thus got the result of 2.

The reason for these results can be seen from the output: algorithm takes the first letter in the first string that second string contains, counts that, and throws away the chars before that from the second string. That is why it misses the characters in-between, and that's the thing causing the difference when you change the character order.

What happens there might be intentional or it might not. However, that's not how javascript version works. If you print out the same things in the javascript version, you get this:

txt here: PHP, WIT
txt here: P IS GREAT, MYSQL
txt here: IS GREAT, MYSQL
txt here: IS, MY
txt here: GREAT, QL
Found 3 similar chars
txt here: WITH, PHP
txt here: W, P
txt here: TH MYSQL, S GREAT
Found 3 similar chars

showing that javascript version does it in a different way. What the javascript version does is that it finds 'H', ' ' and 'S' being in the same order in the first comparison, and the same 'H', ' ' and 'S' also on the second one - so in this case the order of params doesn't matter.

As the javascript is meant to duplicate the code of PHP function, it needs to behave identically, so I submitted bug report based on analysis of @Khez and the fix, which has been merged now.

How do I make the PHP similar_text() function work for Japanese Characters (Kanji, Katakana and Hiragana)?

You will want to handle the possible multibytes that are forming the Kanji characters. I am not 100% confident but I suspect similar_text does not support mb and you need a similar solution that can.

This links show peoples attempts at handling mb char similar to the php function.

https://gist.github.com/soderlind/74a06f9408306cfc5de9

https://github.com/antalaron/mb-similar-text

I have not personally tested this but the approach could be right or inspire you to write a custom function.

Also covered in this other post:

how to use similar text php code in arabic

How to improve PHP string match with similar_text()?

Levenshtein distance: http://php.net/manual/en/function.levenshtein.php

It's reverse to similar_text(), so 0% means there is no difference.

// <!-- Overcast, Rain or Showers compared Overcast, Rain or Showers is 0 -->
// <!-- Overcast, Risk of Rain or Showers compared Overcast, Rain or Showers is 11 -->
// <!-- Overcast, Chance of Rain or Showers compared Overcast, Rain or Showers is 13 -->

PHP's similar_text() and in_array() not working as it seems like they should

Change your add to line to

function addTo($line){
return strtolower(trim($line));
}

and change you input to

$input = strtolower(trim($_GET['checkSpelling']));

The file command has a nasty habit of leaving the trailing newline character so you probably aren't matching based on that ... the trim should take care of that. The other changes are just there to make it case insensitive.

Algorithms for string similarities (better than Levenshtein, and similar_text)? Php, Js

Here's a solution that I've come up to. It's based on Tim's suggestion of comparing the order of subsequent charachters. Some results:

  • jonas / jonax : 0.8
  • jonas / sjona : 0.68
  • jonas / sjonas : 0.66
  • jonas / asjon : 0.52
  • jonas / xxjon : 0.36

I'm sure i isn't perfect, and that it could be optimized, but nevertheless it seems to produce the results that I'm after...
One weak spot is that when strings have different length, it produces different result when the values are swapped...

static public function string_compare($str_a, $str_b) 
{
$length = strlen($str_a);
$length_b = strlen($str_b);

$i = 0;
$segmentcount = 0;
$segmentsinfo = array();
$segment = '';
while ($i < $length)
{
$char = substr($str_a, $i, 1);
if (strpos($str_b, $char) !== FALSE)
{
$segment = $segment.$char;
if (strpos($str_b, $segment) !== FALSE)
{
$segmentpos_a = $i - strlen($segment) + 1;
$segmentpos_b = strpos($str_b, $segment);
$positiondiff = abs($segmentpos_a - $segmentpos_b);
$posfactor = ($length - $positiondiff) / $length_b; // <-- ?
$lengthfactor = strlen($segment)/$length;
$segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
}
else
{
$segment = '';
$i--;
$segmentcount++;
}
}
else
{
$segment = '';
$segmentcount++;
}
$i++;
}

// PHP 5.3 lambda in array_map
$totalscore = array_sum(array_map(function($v) { return $v['score']; }, $segmentsinfo));
return $totalscore;
}

What is C# full analog PHP function similar_text()?

First, you have to implement edit distance, say, Levenstein one; let's do it as more generic as we can:

See Detect differences between two strings

for edit sequence and more details.

Code:

using System.Linq;

...

private static double EditDistance<T>(IEnumerable<T> from,
IEnumerable<T> to,
Func<T, double> insertCost,
Func<T, double> deleteCost,
Func<T, T, double> editCost) {
T[] source = from.ToArray();
T[] target = to.ToArray();

// Minimum cost so far
double[][] D = Enumerable
.Range(0, source.Length + 1)
.Select(line => new double[target.Length + 1])
.ToArray();

// Edge: all removes
double sum = 0.0;

for (int i = 1; i <= source.Length; ++i)
D[i][0] = (sum += deleteCost(source[i - 1]));

// Edge: all inserts
sum = 0.0;

for (int i = 1; i <= target.Length; ++i)
D[0][i] = (sum += insertCost(target[i - 1]));

// Having fit N - 1, K - 1 characters let's fit N, K
for (int i = 1; i <= source.Length; ++i)
for (int j = 1; j <= target.Length; ++j) {
// here we choose the operation with the least cost
double insert = D[i][j - 1] + insertCost(target[j - 1]);
double delete = D[i - 1][j] + deleteCost(source[i - 1]);
double edit = D[i - 1][j - 1] + editCost(source[i - 1], target[j - 1]);

D[i][j] = Math.Min(Math.Min(insert, delete), edit);
}

return D[source.Length][target.Length];
}

Now it's easy to write Similar_text

Routine:

public static double Similar_text(string left, string right) {
left = left ?? "";
right = right ?? "";

return left.Equals(right)
? 1.0
: 1.0 - EditDistance(left, right,
insert => 1.0,
delete => 1.0,
(from, to) => from == to ? 0.0 : 2.0)
/ Math.Max(left.Length, right.Length);
}

Time complexity is O(left.Length * right.Length) or O(N**2) if both strings have roughly equal lengths (N)

similar_text not giving expected result

similar_text() uses an algorithm that takes the first letter in the first string that the second string contains, counts that, and throws away the chars before that from the second string.
This is the reason why we get different results.

Iteration for first example

  'abcd' vs 'abcdefg' - (1) // 'a' match with 'a' 
'bcd' vs 'bcdefg' - (1) // 'b' match with 'b'
'cd' vs 'cdefg' - (1) // 'c' match with 'c'
'd' vs 'defg' - (1) // 'd' match with 'd'
'' vs 'efg' - (0) // no match
Result = 4

Iteration for second example

  'bacd' vs 'abcdefg'  - (0) // b not match a
'bacd' vs 'bcdefg' - (1) // b match b
'acd' vs 'cdefg' - (0) // a not match c
'cd' vs 'cdefg' - (1) // c match c
'd' vs 'defg' - (1) // d match d
'' vs 'efg' - (0) // not match with any elemennt
Result = 3


Related Topics



Leave a reply



Submit