How to Find Similar Results and Sort by Similarity

How to find similar results and sort by similarity?

I have found that the Levenshtein distance works well when you compare one full string against another full string, but when you are looking for keywords within a string, it sometimes fails to return the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than English, so it is quite limited. You could get away with LIKE, but it is really only meant for basic searches. You may want to look into other search methods for what you want to achieve. For example:
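To illustrate the point about full strings versus keywords, here is a minimal Python sketch (not part of any database API). It compares a keyword against a whole title and then against each word of the title separately. The helper name `best_word_distance` and the sample title are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_word_distance(keyword: str, text: str) -> int:
    """Smallest distance between the keyword and any single word of the text."""
    return min(levenshtein(keyword.lower(), word.lower()) for word in text.split())


# Hypothetical sample data, for illustration only.
title = "Optimizing MySQL queries for beginners"

# Against the full string, the distance is dominated by the length difference.
full_distance = levenshtein("mysql", title.lower())

# Against individual words, the exact keyword match is found.
word_distance = best_word_distance("mysql", title)

print(full_distance, word_distance)  # prints: 33 0
```

The full-string distance is large even though the keyword appears verbatim in the title, which is why a plain Levenshtein comparison ranks keyword searches poorly.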

You may use Lucene as the search base for your projects. It is implemented in most major programming languages and it is quite fast and versatile. This method is probably the best, as it not only searches for substrings, but also handles letter transposition, prefixes and suffixes (all combined). However, you need to keep a separate index (using cron to update it from an independent script once in a while works, though).

Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM (InnoDB only supports FULLTEXT indexes as of MySQL 5.6), you can create a temporary table, then perform your fulltext search:

CREATE TABLE IF NOT EXISTS `tests`.`data_table` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` varchar(2000) CHARACTER SET latin1 NOT NULL,
  `description` text CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;

Use a data generator to generate some random data if you don't want to bother creating it yourself...

** NOTE ** : the column collation should be latin1_bin to perform a case-sensitive search, instead of the case-insensitive search you get with the default latin1 collation. For Unicode strings, I would recommend utf8_bin for case-sensitive and utf8_general_ci for case-insensitive searches.

DROP TABLE IF EXISTS `tests`.`data_table_temp`;
CREATE TEMPORARY TABLE `tests`.`data_table_temp`
   SELECT * FROM `tests`.`data_table`;

ALTER TABLE `tests`.`data_table_temp`  ENGINE = MYISAM;

ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` (
  `title` ,
  `description`
);

SELECT *,
       MATCH (`title`,`description`)
       AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score`
  FROM `tests`.`data_table_temp`
 WHERE MATCH (`title`,`description`)
       AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE)
 ORDER BY `score` DESC;

DROP TABLE `tests`.`data_table_temp`;

Read more about it from the MySQL API reference page

The downside to this is that it will not look for letter transposition or "similar, sounds like" words.
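If letter transposition matters, one option is to post-process candidate words in application code. The following is a minimal Python sketch of the optimal string alignment distance (a restricted form of Damerau-Levenshtein), which counts a swap of two adjacent letters as a single edit. It is not something MySQL FULLTEXT does for you, and the function name is chosen for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Two adjacent letters swapped: count it as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("lorem", "lroem"))  # prints: 1
```

With plain Levenshtein the same pair would cost 2 edits, so a transposition-aware distance ranks simple typos closer to the intended word.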

** UPDATE **

If you use Lucene for your search, you simply need to create a cron job (most web hosts offer this feature) that executes a PHP script (e.g. "cd /path/to/script; php searchindexer.php") to update the indexes. The reason is that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, and doing it ahead of time ensures that all searches are performed as fast as possible. Therefore, you may want to create a delayed job to be run by the server. It may run overnight, or within the next hour; this is up to you. The PHP script should look something like this:

$indexer = Zend_Search_Lucene::create('/path/to/lucene/data');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
  // change this option for your need
  new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

$rowSet = getDataRowSet();  // perform your SQL query to fetch whatever you need to index
foreach ($rowSet as $row) {
   $doc = new Zend_Search_Lucene_Document();
   $doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8'))
       ->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8'))
       ->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable))
       ->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8'))
  ;
  $indexer->addDocument($doc);
}

// ... you can get as many $rowSet as you want and create as many documents
// as you wish... each document doesn't necessarily need the same fields...
// Lucene is pretty flexible on this

$indexer->optimize();  // do this every time you add more data to your indexer...
$indexer->commit();    // finalize the process

Then, this is basically how you search (basic search) :

$index = Zend_Search_Lucene::open('/path/to/lucene/data');

// same search options
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
   new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

$query = 'php +field1:foo';  // search for the word 'php' in any field,
                                 // +search for 'foo' in field 'field1'

$hits = $index->find($query);

$numHits = count($hits);
foreach ($hits as $hit) {
   $score = $hit->score;  // the hit weight
   $field1 = $hit->field1;
   // etc.
}

Here are great sites about Lucene in Java, PHP, and .Net.

In conclusion, each search method has its own pros and cons:

You mentioned Sphinx search and it looks very good, as long as you can make the daemon run on your web host.
Zend Lucene requires a cron job to re-index the database. While it is quite transparent to the user, this means that any new data (or deleted data!) is not always in sync with the data in your database and therefore won't show up right away on user search.
MySQL FULLTEXT search is good and fast, but will not give you all the power and flexibility of the first two.

Please feel free to comment if I have forgotten/missed anything.

How to match and sort by similarity in MySQL?

I am not sure if LIKE is the right way to do this. If you need to search inside your text for keywords and sort results by relevancy score, you should use a MySQL full-text index and the MySQL full-text search functions. Sorry if this leads you away from what you are actually trying to do, but I do recommend taking a look at it. Some quotes from the MySQL reference manual:

1) How to create full text index on multiple columns of a table

mysql> CREATE TABLE articles (
    ->   id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    ->   title VARCHAR(200),
    ->   body TEXT,
    ->   FULLTEXT (title,body)
    -> );

2) Sample data

mysql> INSERT INTO articles (title,body) VALUES
    -> ('MySQL Tutorial','DBMS stands for DataBase ...'),
    -> ('How To Use MySQL Well','After you went through a ...'),
    -> ('Optimizing MySQL','In this tutorial we will show ...'),
    -> ('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
    -> ('MySQL vs. YourSQL','In the following database comparison ...'),
    -> ('MySQL Security','When configured properly, MySQL ...');

3) Sample query that searches multiple columns for keywords and displays result + the score:

mysql> SELECT id, body, MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root') AS score
    -> FROM articles WHERE MATCH (title,body) AGAINST
    -> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body                                | score           |
+----+-------------------------------------+-----------------+
|  4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
|  6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+

Getting most similar rows in MySQL table and order them by similarity

As in my table currently I have only around 5k rows and they are slowly growing, I decided to actually use the following simple approach (it came to me just after I wrote the question).

The seed, let's say, is a Honda Accord (model_id 456), 2004, gasoline, 2.0L, 155hp, sedan, with auto-increment ID 123.

SELECT vehicles.*,  
    (IF(`fuel_type`='gasoline', 3, 0) + 
     IF(`body_style`='sedan', 1, 0) + 
     IF(`year` > 2001 AND `year` < 2007, 2, 0) + 
     IF(`engine_size` >= 1.8 AND `engine_size` <= 2.2, 1, 0) + 
     IF(`engine_power`=155, 3, IF(`engine_power`>124 AND `engine_power`<186, 1, 0))) AS `rank`
FROM vehicles
WHERE vehicle_id!=123 AND model_id=456
ORDER BY `rank` DESC
LIMIT 3

It will work as long as I don't have too many rows. If the table grows to 50-100k rows, I will probably have to switch to something like Lucene.
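The same weighted scoring can also be expressed in application code, which makes the weights easier to tune and test. This is a rough Python sketch under stated assumptions: the `Vehicle` class and the function names are invented for the example, and the year, engine size and engine power windows (plus or minus 2 years, 0.2 L, 30 hp) are derived from the hard-coded ranges in the query above.

```python
from dataclasses import dataclass


@dataclass
class Vehicle:
    vehicle_id: int
    model_id: int
    year: int
    fuel_type: str
    engine_size: float   # litres
    engine_power: int    # hp
    body_style: str


def rank(candidate: Vehicle, seed: Vehicle) -> int:
    """Weighted similarity score; the weights mirror the SQL query above."""
    score = 0
    if candidate.fuel_type == seed.fuel_type:
        score += 3
    if candidate.body_style == seed.body_style:
        score += 1
    if abs(candidate.year - seed.year) <= 2:
        score += 2
    # Small tolerance so that 2.2 vs 2.0 still counts despite float rounding.
    if abs(candidate.engine_size - seed.engine_size) <= 0.2 + 1e-9:
        score += 1
    power_gap = abs(candidate.engine_power - seed.engine_power)
    if power_gap == 0:
        score += 3
    elif power_gap <= 30:
        score += 1
    return score


def most_similar(vehicles, seed, limit=3):
    """Same model, excluding the seed itself, best score first."""
    same_model = [v for v in vehicles
                  if v.vehicle_id != seed.vehicle_id and v.model_id == seed.model_id]
    return sorted(same_model, key=lambda v: rank(v, seed), reverse=True)[:limit]
```

Like the SQL version, this scores every candidate row, so it stays practical only while the candidate set is small.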

Sort results by search term similarity

Unfortunately, MongoDB doesn't support full text search ranking by default.

First of all, you will need an algorithm to calculate the similarity between strings. See the following links:

String similarity algorithims?

String similarity -> Levenshtein distance

Then you need to write a javascript function using the algorithm to compare two strings to pass it in your query. See the following link to see how to achieve that:

Mongo complex sorting?
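An alternative is to fetch the candidate documents and sort them in application code. The sketch below is in Python rather than JavaScript and uses the standard library's `difflib.SequenceMatcher` ratio as the similarity measure. That is a swapped-in technique, not Levenshtein distance and not a MongoDB feature. The documents are plain dictionaries standing in for query results, and the `name` field is an assumption.

```python
from difflib import SequenceMatcher


def similarity(term: str, candidate: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical (ignoring case)."""
    return SequenceMatcher(None, term.lower(), candidate.lower()).ratio()


def sort_by_similarity(term, documents, field="name"):
    """Return the documents ordered from most to least similar to the term."""
    return sorted(documents,
                  key=lambda doc: similarity(term, doc[field]),
                  reverse=True)


# Hypothetical documents, as they might come back from a find() call.
documents = [
    {"name": "Honda Civic"},
    {"name": "Toyota Corolla"},
    {"name": "Honda Accord"},
]

for doc in sort_by_similarity("honda acord", documents):
    print(doc["name"])
```

This only scales to result sets small enough to load into memory, so it is best combined with a coarse filter in the query itself.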

how to compute similarity between two strings in MYSQL

you can use this function (cop^H^H^Hadapted from http://www.artfulsoftware.com/infotree/queries.php#552):

CREATE FUNCTION `levenshtein`( s1 text, s2 text) RETURNS int(11)
    DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT; 
    DECLARE s1_char CHAR; 
    DECLARE cv0, cv1 text; 
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0; 
    IF s1 = s2 THEN 
      RETURN 0; 
    ELSEIF s1_len = 0 THEN 
      RETURN s2_len; 
    ELSEIF s2_len = 0 THEN 
      RETURN s1_len; 
    ELSE 
      WHILE j <= s2_len DO 
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1; 
      END WHILE; 
      WHILE i <= s1_len DO 
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1; 
        WHILE j <= s2_len DO 
          SET c = c + 1; 
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE; 
        SET cv1 = cv0, i = i + 1; 
      END WHILE; 
    END IF; 
    RETURN c; 
  END

and for getting it as XX% use this function

CREATE FUNCTION `levenshtein_ratio`( s1 text, s2 text ) RETURNS int(11)
    DETERMINISTIC
BEGIN 
    DECLARE s1_len, s2_len, max_len INT; 
    -- CHAR_LENGTH counts characters (LENGTH counts bytes), matching levenshtein() above
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2); 
    IF s1_len > s2_len THEN  
      SET max_len = s1_len;  
    ELSE  
      SET max_len = s2_len;  
    END IF; 
    RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100); 
  END
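To sanity-check the percentage formula outside the database, here is a small Python sketch of the same calculation: one minus the distance divided by the longer length, times 100, rounded. Returning 100 for two empty strings is an assumption made here to avoid a division by zero. Note that Python's `round` and MySQL's `ROUND` may differ on exact .5 values.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using a single rolling row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        diagonal, row[0] = row[0], i
        for j, cb in enumerate(b, start=1):
            diagonal, row[j] = row[j], min(row[j] + 1,             # deletion
                                           row[j - 1] + 1,         # insertion
                                           diagonal + (ca != cb))  # substitution
    return row[-1]


def levenshtein_ratio(a: str, b: str) -> int:
    """Similarity as a whole-number percentage of the longer string's length."""
    max_len = max(len(a), len(b))
    if max_len == 0:
        return 100  # assumption: two empty strings count as identical
    return round((1 - levenshtein(a, b) / max_len) * 100)


# "kitten" -> "sitting" needs 3 edits; the longer string has 7 characters.
print(levenshtein_ratio("kitten", "sitting"))  # prints: 57
```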
