Find SQL Records Containing Similar Strings

Find sql records containing similar strings

If you really want to define similarity in the exact way that you have formulated in your question, then you would - as you say - have to implement the Levensthein Distance calculation. Either in code calculated on each row retrieved by a DataReader or as a SQL Server function.

The problem stated is actually more tricky than it may appear at first sight, because you cannot assume to know what the mutually shared elements between two strings may be.

So in addition to Levensthein Distance you probably also want to specify a minimum number of consecutive characters that actually have to match (in order for sufficient similarity to be concluded).

In sum: It sounds like an overly complicated and time consuming/slow approach.

Interestingly, in SQL Server 2008 you have the DIFFERENCE function which may be used for something like this.

It evaluates the phonetic value of two strings and calculates the difference. I'm unsure if you will get it to work properly for multi-word expressions such as movie titles since it doesn't deal well with spaces or numbers and puts too much emphasis on the beginning of the string, but it is still an interesting predicate to be aware of.

If what you are actually trying to describe is some sort of search feature, then you should look into the Full Text Search capabilities of SQL Server 2008. It provides built-in Thesaurus support, fancy SQL predicates and a ranking mechanism for "best matches"

EDIT: If you are looking to eliminate duplicates maybe you could look into SSIS Fuzzy Lookup and Fuzzy Group Transformation. I have not tried this myself, but it looks like a promising lead.

EDIT2: If you don't want to dig into SSIS and still struggle with the performance of the Levensthein Distance algorithm, you could perhaps try this algorithm which appears to be less complex.

How to find strings which are similar to given string in SQL server?

Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?

--This part searches for a string you want

declare @MyString varchar(max)

set @MyString = (Select column from table
where **LOGIC TO FIND THE STRING GOES HERE**)

--This part searches for that string

select searchColumn, ABS(Len(searchColumn) - Len(@MyString)) as Similarity
from table where data LIKE '%' + @MyString + '%'
Order by Similarity, searchColumn

The similarity part is something like the thing you posted. If the strings are "more similar" meaning that they have a similar length, they will be higher on the results query.
The absolute part can be avoided obviously but I did it just in case.

Hope that helps =-)

Sql select rows containing part of string

SELECT *
FROM myTable
WHERE URL = LEFT('mysyte.com/?id=2®ion=0&page=1', LEN(URL))

Or use CHARINDEX
http://msdn.microsoft.com/en-us/library/aa258228(v=SQL.80).aspx

SQL Finding Records Containing Same Character 1 or More Times

If you want strings that are all the same character, one method uses replace():

where len(replace(upper(col), upper(left(col, 1)), '')) = 0

upper() is not needed for case-insensitive collations.

You can also use replicate():

where upper(col) = replicate(left(upper(col), 1), len(col))

How to find similar results and sort by similarity?

I have found out that the Levenshtein distance may be good when you are searching a full string against another full string, but when you are looking for keywords within a string, this method does not return (sometimes) the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than english, so it is quite limited. You could get away with LIKE, but it's really for basic searches. You may want to look into other search methods for what you want to achieve. For example:

You may use Lucene as search base for your projects. It's implemented in most major programming languages and it'd quite fast and versatile. This method is probably the best, as it not only search for substrings, but also letter transposition, prefixes and suffixes (all combined). However, you need to keep a separate index (using CRON to update it from a independent script once in a while works though).

Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM, you can create a temporary table, then perform your fulltext search :

CREATE TABLE IF NOT EXISTS `tests`.`data_table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(2000) CHARACTER SET latin1 NOT NULL,
`description` text CHARACTER SET latin1 NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;

Use a data generator to generate some random data if you don't want to bother creating it yourself...

** NOTE ** : the column type should be latin1_bin to perform a case sensitive search instead of case insensitive with latin1. For unicode strings, I would recommend utf8_bin for case sensitive and utf8_general_ci for case insensitive searches.

DROP TABLE IF EXISTS `tests`.`data_table_temp`;
CREATE TEMPORARY TABLE `tests`.`data_table_temp`
SELECT * FROM `tests`.`data_table`;

ALTER TABLE `tests`.`data_table_temp` ENGINE = MYISAM;

ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` (
`title` ,
`description`
);

SELECT *,
MATCH (`title`,`description`)
AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score`
FROM `tests`.`data_table_temp`
WHERE MATCH (`title`,`description`)
AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE)
ORDER BY `score` DESC;

DROP TABLE `tests`.`data_table_temp`;

Read more about it from the MySQL API reference page

The downside to this is that it will not look for letter transposition or "similar, sounds like" words.

** UPDATE **

Using Lucene for your search, you will simply need to create a cron job (all web hosts have this "feature") where this job will simply execute a PHP script (i.g. "cd /path/to/script; php searchindexer.php") that will update the indexes. The reason being that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, but this is to ensure that all searches are performed as fast as possible. Therefore, you may want to create a delay job to be run by the server. It may be overnight, or in the next hour, this is up to you. The PHP script should look something like this:

$indexer = Zend_Search_Lucene::create('/path/to/lucene/data');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
// change this option for your need
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

$rowSet = getDataRowSet(); // perform your SQL query to fetch whatever you need to index
foreach ($rowSet as $row) {
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8'))
->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8'))
->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable))
->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8'))
;
$indexer->addDocument($doc);
}

// ... you can get as many $rowSet as you want and create as many documents
// as you wish... each document doesn't necessarily need the same fields...
// Lucene is pretty flexible on this

$indexer->optimize(); // do this every time you add more data to you indexer...
$indexer->commit(); // finalize the process

Then, this is basically how you search (basic search) :

$index = Zend_Search_Lucene::open('/path/to/lucene/data');

// same search options
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive()
);

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

$query = 'php +field1:foo'; // search for the word 'php' in any field,
// +search for 'foo' in field 'field1'

$hits = $index->find($query);

$numHits = count($hits);
foreach ($hits as $hit) {
$score = $hit->score; // the hit weight
$field1 = $hit->field1;
// etc.
}

Here are great sites about Lucene in Java, PHP, and .Net.

In conclusion each search methods have their own pros and cons :

  • You mentioned Sphinx search and it looks very good, as long as you can make the deamon run on your web host.
  • Zend Lucene requires a cron job to re-index the database. While it is quite transparent to the user, this means that any new data (or deleted data!) is not always in sync with the data in your database and therefore won't show up right away on user search.
  • MySQL FULLTEXT search is good and fast, but will not give you all the power and flexibility of the first two.

Please feel free to comment if I have forgotten/missed anything.

How to find rows in SQL that start with the same string (similar rows)?

The first solution given will identify the key prefixes; extend it just a bit to get the table rows beginning with those keys:

SELECT * 
FROM TABLE
WHERE SUBSTRING(COLUMN, 0, CHARINDEX('~', COLUMN)) IN
(
SELECT SUBSTRING(COLUMN, 0, CHARINDEX('~', COLUMN)) FROM TABLE
GROUP BY SUBSTRING(COLUMN, 0, CHARINDEX('~', COLUMN))
HAVING COUNT(*) > 1
)

Alternately, you could use a join between a temp table containing the prefixes and the original table - if the number of prefixes becomes very large, using a "where in" can become very expensive.



Related Topics



Leave a reply



Submit