PHP (Fuzzy) Search Matching

PHP / SQL - Improving search feature / fuzzy search

You are looking for Full-Text Searches WITH QUERY EXPANSION

MySQL supports text searching by using the LIKE operator and regular expression. However, when the text column is large and the number of rows in a table is increased, using these methods has some limitations:

  • Performance: MySQL has to scan the whole table to find the exact text based on a pattern in the LIKE statement or pattern in the regular expressions.
  • Flexible search: with the LIKE operator and regular expression searches, it is difficult to have a flexible search query e.g., to find product whose description contains car but not classic.
  • Relevance ranking: there is no way to specify which row in the result set is more relevant to the search terms.

Because of these limitations, MySQL extended a very nice feature so-called full-text search. Technically, MySQL creates an index from the words of the enabled full-text search columns and performs searches on this index. MySQL uses a sophisticated algorithm to determine the rows matched against the search query.

To do that, the columns that will be used for search must be in TEXT type and index of type FULLTEXT, index can be given using ALTER TABLE or CREATE INDEX and if you are using phpMyAdmin to manage your databases, you can do that by going to the Structure of that table, then click on More under Action of that column and choose Fulltext.

After that you can performe a search using MATCH AGAINST syntax. MATCH() takes the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform.

Full-Text Searches WITH QUERY EXPANSION:

In some cases, users want to search for information based on the knowledge that they have. Users use their experience to define keywords to search for information, and typically those keywords are too short.

To help users to find information based on the too-short keywords, MySQL full-text search engine introduces a concept called query expansion.

The query expansion is used to widen the search result of the full-text searches based on automatic relevance feedback (or blind query expansion). Technically, MySQL full-text search engine performs the following steps when the query expansion is used:

  • First, MySQL full-text search engine looks for all rows that match the search query.
  • Second, it checks all rows in the search result and finds the relevant words.
  • Third, it performs a search again based on the relevant words instead of the original keywords provided by the users.

The following example shows you how to search for a product whose product name or meta contains at least one word (shirt tshirt).

SELECT * FROM products WHERE MATCH(product_name,product_meta) AGAINST('shirt tshirt' WITH QUERY EXPANSION)

You can read more info in MYSQL document (the link at the beginning of the answer) and here

Also don't miss How Fine-Tuning MySQL Full-Text Search

How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?

You can start with using SOUNDEX(), this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).

The drawbacks of SOUNDEX() are:

  • its inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
  • the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
  • for MySQL, at least according to the docs, SOUNDEX is broken for unicode input

Example:

SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')

/* all of these return 'M262' */

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.

Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.

In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).

search through array of strings for fuzzy string match

Map $arr1 to a function that calculates the string-edit-distance to the keys in $arr2, and then returns the closest match. Take a look at this Levenshtein distance function. Or, you could simply do a startsWith comparison in your mapping function.

You'll likely have something that looks like this:

$stringEditDistanceThreshold = 5; // greater than this means rejected

// define the mapping function
function findClosestMatchingString($s) {
$closestDistanceThusFar = $stringEditDistanceThreshold + 1;
$closestMatchValue = null;

foreach ($arr2 as $key => $value) {
$editDistance = levenshtein($key, $s);

// exact match
if ($editDistance == 0) {
return $value;

// best match thus far, update values to compare against/return
} elseif ($editDistance < $closestDistanceThusFar) {
$closestDistanceThusFar = $editDistance;
$closestMatchValue = $value;
}
}

return $closestMatch; // possible to return null if threshold hasn't been met
}

// do the mapping
$matchingValues = array_map('findClosestMatchingString', $arr1);

You'll likely have to tune the $stringEditDistanceThreshold until you get values that you're happy with. Or you could use a startsWith function, which would greatly simplify what findClosestMatchingString has to do.

Finally, this isn't very efficient. It's effectively an ugly nested loop. You may be able to do some pruning or something else clever, but I suspect if the arrays are fairly small, you may not care.

EDIT: As stated by @Ohgodwhy in the comment below, preg_grep will likely work even better for you. In that case, your map function will look something like the following:

function findFirstMatchingString($s) {
$matchingKeys = preg_grep($s, array_keys($arr2));

if (!empty($matchingKeys) {
// return the value of the first match
return $arr2[$matches[0]];
}

return null;
}

Best way to improve (fuzzy) search results for similarities?

This will be very hard to solve in the way you are trying. You need a full text search engine like Sphinx or elastic search. That has support for languages and fuzzy search. More info about it here: https://en.wikipedia.org/wiki/Full-text_search

I recommend Sphinx:
http://sphinxsearch.com/docs/sphinx3.html#features-overview

The documentation is however very heavy. Another option is elasticsearch that has very good documentation.

Try to invent this your self will be very tricky and you will take time to get good result.

Searching a single MySQL text column with fuzzy matching

Full Text Search (FTS) is the terminology for the database functionality you desire. There's:

  • Native MySQL support (requires that the table be MyISAM)

    WHERE MATCH(column) 
    AGAINST('Rose', 'Crown')
  • Sphinx (3rd party)

  • Lucene/SOLR (3rd party)

Elastic search fuzzy match with exact matches showing first

I ended up not using fuzzy matching to solve my problem and went with using ngram's.

/**
* Map - Create a new index with property mapping
*/
public function map()
{
$params['index'] = self::INDEX;

$params['body']['settings'] = array(
'index' => array(
'analysis' => array(
'analyzer' => array(
'product_analyzer' => array(
'type' => 'custom',
'tokenizer' => 'whitespace',
'filter' => array('lowercase', 'product_ngram'),
),
),
'filter' => array(
'product_ngram' => array(
'type' => 'nGram',
'min_gram' => 3,
'max_gram' => 5,
),
)
),

)
);

//all the beans
$mapping = array(
'_source' => array(
'enabled' => true
),
'properties' => array(
'id' => array(
'type' => 'string',
),
'name' => array(
'type' => 'string',
'analyzer' => 'product_analyzer',
'boost' => '10',
),
'brand' => array(
'type' => 'string',
'analyzer' => 'product_analyzer',
'boost' => '5',
),
'description' => array(
'type' => 'string',
),
'barcodes' => array(
'type' => 'string'
),
),
);

$params['body']['mappings'][self::TYPE] = $mapping;

$this->_client->indices()->create($params);
}

public function search($query)
{
$return = $this->_client->search(
array(
'index' => self::INDEX,
'type' => self::TYPE,
'body' => array(
'query' => array(
'multi_match' => array(
'query' => $query,
'fields' => array('id', 'name', 'brand', 'description', 'barcodes'),
),
),
'size' => '5000',
),
)
);

$productIds = array();

if (!empty($return['hits']['hits'])) {
foreach ($return['hits']['hits'] as $hit) {
$productIds[] = $hit['_id'];
}
}

return $productIds;
}

The result is exactly what I was looking for. It constructs matches based on how many ngram part the search query has within it.

Fuzzy Text Search: Regex Wildcard Search Generator?

String distance functions are useless when you don't know what the right word is. I'd suggest pspell functions:

$p = pspell_new("en");
print_r(pspell_suggest($p, "crazzy"));

http://www.php.net/manual/en/function.pspell-suggest.php



Related Topics



Leave a reply



Submit