How to Compare If Two Strings Contain the Same Words in T-SQL for SQL Server 2008

Comparing two strings in SQL Server

There is no direct string compare function in SQL Server, but a CASE expression gives the same -1 / 0 / 1 result:

CASE
    WHEN str1 = str2 THEN 0
    WHEN str1 < str2 THEN -1
    WHEN str1 > str2 THEN 1
    ELSE NULL -- one of the strings is NULL, so they cannot be compared
END

Notes

  • you can wrap this in a UDF using CREATE FUNCTION
  • you may need NULL handling (in the code above, a NULL in either string makes the result NULL)
  • str1 and str2 will be column names or @variables
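
To make the first note concrete, here is a minimal sketch of such a wrapper. The name dbo.StrCompare and the NVARCHAR(4000) parameter size are my own choices for illustration, not anything built into SQL Server.

```sql
-- Hypothetical helper: dbo.StrCompare is an illustrative name, not a built-in.
CREATE FUNCTION dbo.StrCompare (@str1 NVARCHAR(4000), @str2 NVARCHAR(4000))
RETURNS INT
AS
BEGIN
    RETURN CASE
               WHEN @str1 = @str2 THEN 0
               WHEN @str1 < @str2 THEN -1
               WHEN @str1 > @str2 THEN 1
               ELSE NULL -- at least one argument is NULL
           END;
END;
GO

-- Usage: 'apple' sorts before 'banana', so this returns -1.
SELECT dbo.StrCompare(N'apple', N'banana') AS Result;
```

Keep in mind that the comparison follows the collation of the database, so on a case-insensitive collation 'ABC' and 'abc' compare as equal.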

SQL Compare Characters in two strings count total identical

You could write a script to help you, something like this PHP script:

$words = array();      // id => list of words for every row seen so far
$duplicates = array(); // "id1-id2" => words shared by those two rows

$mysqli = new mysqli('localhost', 'username', 'password', 'database');
$query = "SELECT id, business_name FROM `table`"; // `table` is a placeholder name

if ($result = $mysqli->query($query)) {
    while ($row = $result->fetch_object()) {
        // Strip punctuation, then split the name into lowercase words.
        $name = preg_replace('#[^\w\s]+#', '', $row->business_name);
        $_words = preg_split('#\s+#', strtolower(trim($name)), -1, PREG_SPLIT_NO_EMPTY);

        // Compare with every earlier row and record the words they share.
        foreach ($words as $id => $other) {
            $common = array_intersect($_words, $other);
            if (!empty($common)) {
                $duplicates[$id . '-' . $row->id] = array_values($common);
            }
        }

        $words[$row->id] = $_words;
    }

    // Free the result set only after the loop has finished.
    $result->close();
}

$mysqli->close();

This is not tested but you need something like this, because I don't think this is possible with SQL alone.

---------- EDIT ----------

Or you could research what the commenters recommend: Levenshtein distance in T-SQL.
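
For reference, here is a minimal sketch of what a Levenshtein function can look like in plain T-SQL that should run on SQL Server 2008. The name dbo.LevenshteinSketch is made up for illustration. It keeps a single row of the distance matrix in a table variable, which is easy to follow but slow, so treat it as a starting point rather than a tuned implementation.

```sql
-- Hypothetical function name; a simple, unoptimised edit-distance sketch.
CREATE FUNCTION dbo.LevenshteinSketch (@s NVARCHAR(255), @t NVARCHAR(255))
RETURNS INT
AS
BEGIN
    IF @s IS NULL OR @t IS NULL RETURN NULL;

    DECLARE @sLen INT, @tLen INT, @i INT, @j INT;
    DECLARE @diag INT, @up INT, @left INT, @cost INT, @best INT;
    DECLARE @row TABLE (j INT PRIMARY KEY, d INT NOT NULL);

    -- DATALENGTH / 2 counts trailing spaces, which LEN would ignore.
    SET @sLen = DATALENGTH(@s) / 2;
    SET @tLen = DATALENGTH(@t) / 2;

    -- First row: building the first j characters of @t from nothing costs j inserts.
    SET @j = 0;
    WHILE @j <= @tLen
    BEGIN
        INSERT INTO @row (j, d) VALUES (@j, @j);
        SET @j = @j + 1;
    END;

    SET @i = 1;
    WHILE @i <= @sLen
    BEGIN
        -- @diag holds the old value of cell j-1, @left the new value of cell j-1.
        SELECT @diag = d FROM @row WHERE j = 0;
        UPDATE @row SET d = @i WHERE j = 0;
        SET @left = @i;

        SET @j = 1;
        WHILE @j <= @tLen
        BEGIN
            SELECT @up = d FROM @row WHERE j = @j;
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                             THEN 0 ELSE 1 END;

            -- Cheapest of substitution, deletion and insertion.
            SET @best = @diag + @cost;
            IF @up + 1 < @best SET @best = @up + 1;
            IF @left + 1 < @best SET @best = @left + 1;

            UPDATE @row SET d = @best WHERE j = @j;
            SET @diag = @up;
            SET @left = @best;
            SET @j = @j + 1;
        END;

        SET @i = @i + 1;
    END;

    SELECT @best = d FROM @row WHERE j = @tLen;
    RETURN @best;
END;
GO

-- Usage: the classic example pair, edit distance 3.
SELECT dbo.LevenshteinSketch(N'kitten', N'sitting') AS Distance;
```

Characters are compared with the database collation, so on a case-insensitive collation 'A' and 'a' count as a match.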

Hope it helps, good luck!

TSQL function to compare two strings

Not really sure what you are looking for. From your question, I understand that you need to check 2 email addresses for similarity / dissimilarity.

Why can you not use this?

DECLARE @email1 VARCHAR(100); SET @email1 = 'billg@microsoft.com';
DECLARE @email2 VARCHAR(100); SET @email2 = 'melinda@microsoft.com';

IF @email1 = @email2
BEGIN
    PRINT 'Same Email';
END
ELSE
BEGIN
    PRINT 'Not Same Email';
END

Raj

Compare Two Strings For Common Value

If your database is SQL Server 2016 or later (compatibility level 130 or higher), you can apply the STRING_SPLIT() function with CROSS APPLY to each of your tables and then keep only the common values with the INTERSECT operator:

SELECT value
FROM tab1
CROSS APPLY STRING_SPLIT(str, ' ')
INTERSECT
SELECT value
FROM tab2
CROSS APPLY STRING_SPLIT(str, ' ')

This returns the words that the two tables have in common. The match is case-insensitive when the columns use a case-insensitive collation.
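
The same building blocks can also answer the stricter question of whether two strings contain exactly the same set of words, in any order. This is only a sketch under the same SQL Server 2016+ assumption; the variables @a and @b and the sample values are made up for illustration.

```sql
-- Sample values are illustrative only.
DECLARE @a NVARCHAR(200) = N'red green blue';
DECLARE @b NVARCHAR(200) = N'blue red green';

-- The word sets are equal when neither side has a word the other lacks.
SELECT CASE
           WHEN NOT EXISTS (SELECT value FROM STRING_SPLIT(@a, N' ')
                            EXCEPT
                            SELECT value FROM STRING_SPLIT(@b, N' '))
            AND NOT EXISTS (SELECT value FROM STRING_SPLIT(@b, N' ')
                            EXCEPT
                            SELECT value FROM STRING_SPLIT(@a, N' '))
           THEN 1
           ELSE 0
       END AS SameWords; -- 1 for the sample values above
```

Because EXCEPT works on distinct values, repeated words are ignored: 'red red blue' and 'blue red' count as the same set.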

Find sql records containing similar strings

If you really want to define similarity in the exact way that you have formulated in your question, then you would, as you say, have to implement the Levenshtein distance calculation, either in application code that runs on each row retrieved by a DataReader or as a SQL Server function.

The problem stated is actually more tricky than it may appear at first sight, because you cannot assume to know what the mutually shared elements between two strings may be.

So in addition to Levenshtein distance you probably also want to specify a minimum number of consecutive characters that actually have to match (in order for sufficient similarity to be concluded).

In sum: It sounds like an overly complicated and time consuming/slow approach.

Interestingly, in SQL Server 2008 you have the DIFFERENCE function which may be used for something like this.

It compares the phonetic (SOUNDEX) codes of two strings and returns a value from 0 (little or no similarity) to 4 (very similar). I'm unsure if you will get it to work properly for multi-word expressions such as movie titles, since it doesn't deal well with spaces or numbers and puts too much emphasis on the beginning of the string, but it is still an interesting function to be aware of.
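
A quick way to get a feel for DIFFERENCE is to put it next to SOUNDEX, which produces the codes it compares. The sample names below are just an illustration.

```sql
-- Both names get the SOUNDEX code G650, so DIFFERENCE returns 4 (the maximum).
SELECT SOUNDEX('Green')              AS Soundex1,
       SOUNDEX('Greene')             AS Soundex2,
       DIFFERENCE('Green', 'Greene') AS Similarity;
```

A SOUNDEX code is only four characters long and is built from the start of the string, which is why long multi-word titles tend to get misleading scores.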

If what you are actually trying to describe is some sort of search feature, then you should look into the Full Text Search capabilities of SQL Server 2008. It provides built-in thesaurus support, dedicated SQL predicates and a ranking mechanism for "best matches".
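
As a rough sketch of the ranking part, assuming a hypothetical table dbo.Movies(MovieId INT PRIMARY KEY, Title NVARCHAR(200)) that already has a full-text index on Title, a ranked search could look like this. The table, columns and search text are all made up for illustration.

```sql
-- Assumes a hypothetical dbo.Movies table with a full-text index on Title.
SELECT m.MovieId,
       m.Title,
       ft.[RANK] AS MatchRank
FROM FREETEXTTABLE(dbo.Movies, Title, N'lord rings fellowship') AS ft
INNER JOIN dbo.Movies AS m
        ON m.MovieId = ft.[KEY]   -- KEY is the full-text key column of the table
ORDER BY ft.[RANK] DESC;          -- best matches first
```

The RANK values are only meaningful relative to each other within one query, so use them for ordering rather than as an absolute similarity score.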

EDIT: If you are looking to eliminate duplicates maybe you could look into SSIS Fuzzy Lookup and Fuzzy Group Transformation. I have not tried this myself, but it looks like a promising lead.

EDIT2: If you don't want to dig into SSIS and still struggle with the performance of the Levenshtein distance algorithm, you could perhaps try this algorithm which appears to be less complex.

I need to identify strings, in sql server, that contain the same keywords as a given string in no particular order

Your approach is very much "row based". Here is a set-based approach: less code, easier to maintain and faster.

DECLARE @forbiddenWords TABLE(item VARCHAR(100));
INSERT INTO @forbiddenWords VALUES ('&'),( 'a'),( 'and'),( 'at'),( 'by'),( 'can'),( 'for'),( 'if'),( 'in'),( 'is'),( 'it'),( 'of'),( 'on'),( 'or'),( 'the'),( 'this'),( 'to'),( 'too'),( 'verizon'),( 'with'),( 'your')

DECLARE @breakingCharacters TABLE(item VARCHAR(100));
INSERT INTO @breakingCharacters VALUES(':'),(';'),(','),('!'),('-'),('?'),('.'),('%'),('$'),('&'),('£'),('"');

DECLARE @Phrase1 VARCHAR(MAX)='This is a text where I try to find similar words. Let''s see if it works!';
DECLARE @Phrase2 VARCHAR(MAX)='This is another text where I use some words of Phrase1 to check their similarity!';

--Replace all breaking Characters
SELECT @Phrase1=REPLACE(@Phrase1,item,' ')
FROM @breakingCharacters;

SELECT @Phrase2=REPLACE(@Phrase2,item,' ')
FROM @breakingCharacters;

WITH Splitted AS
(
SELECT CAST('<x>' + REPLACE(LOWER(@Phrase1),' ','</x><x>') + '</x>' AS xml) AS Phrase1AsXml
,CAST('<x>' + REPLACE(LOWER(@Phrase2),' ','</x><x>') + '</x>' AS xml) AS Phrase2AsXml
)
,Phrase1AsFilteredWords AS
(
SELECT DISTINCT The.word.value('.','varchar(max)') AS OneWord
FROM Splitted
CROSS APPLY Phrase1AsXml.nodes('/x') AS The(word)
WHERE LEN(The.word.value('.','varchar(max)'))>0
AND NOT EXISTS(SELECT * FROM @forbiddenWords AS fw WHERE fw.item = The.word.value('.','varchar(max)') )
)
,Phrase2AsFilteredWords AS
(
SELECT DISTINCT The.word.value('.','varchar(max)') AS OneWord
FROM Splitted
CROSS APPLY Phrase2AsXml.nodes('/x') AS The(word)
WHERE LEN(The.word.value('.','varchar(max)'))>0
AND NOT EXISTS(SELECT * FROM @forbiddenWords AS fw WHERE fw.item = The.word.value('.','varchar(max)') )
)
,CommonWords AS
(
SELECT p1.OneWord
FROM Phrase1AsFilteredWords AS p1
INNER JOIN Phrase2AsFilteredWords AS p2 ON p1.OneWord=p2.OneWord
)
,WordCounter AS
(
SELECT
(SELECT COUNT(*) FROM Phrase1AsFilteredWords) AS CountPhrase1
,(SELECT COUNT(*) FROM Phrase2AsFilteredWords) AS CountPhrase2
,(SELECT COUNT(*) FROM CommonWords) AS CountCommon
)
SELECT WordCounter.*
,(CountCommon*100) / CountPhrase1 AS Phrase1PC
,(CountCommon*100) / CountPhrase2 AS Phrase2PC
,STUFF((
SELECT ', ' + OneWord
FROM CommonWords
FOR XML PATH('')
),1,2,'') AS CommonWords
FROM WordCounter

The result:

CountPhrase1  CountPhrase2  CountCommon  Phrase1PC  Phrase2PC  CommonWords
10            11            4            40         36         i, text, where, words

One hint: If you compare many with many it will cost a lot to do the calculation again and again. I'd advise you to prepare all phrases in one go and compare these prepared results...

One more hint: If you do this more often and your phrases don't change, it could be clever to store the prepared word list permanently.
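
To illustrate the two hints, here is one possible shape for such a stored word list. The table dbo.PhraseWord and its columns are hypothetical names; the idea is to fill it once with the filtered words of every phrase and then compare phrases with a plain join.

```sql
-- Hypothetical table: one row per distinct, already filtered word of each phrase.
CREATE TABLE dbo.PhraseWord
(
    PhraseId INT          NOT NULL,
    Word     VARCHAR(100) NOT NULL,
    CONSTRAINT PK_PhraseWord PRIMARY KEY (PhraseId, Word)
);

-- Count the words shared by every pair of phrases with a single self-join.
SELECT a.PhraseId AS Phrase1Id,
       b.PhraseId AS Phrase2Id,
       COUNT(*)   AS CountCommon
FROM dbo.PhraseWord AS a
INNER JOIN dbo.PhraseWord AS b
        ON b.Word = a.Word
       AND b.PhraseId > a.PhraseId   -- each pair once, never a phrase with itself
GROUP BY a.PhraseId, b.PhraseId
ORDER BY CountCommon DESC;
```

Pairs without any common word simply do not appear in the result.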

Happy coding!

T-SQL - compare strings char by char

For columns in a table you don't want to use a row-by-row approach. Try this one (the CTE generates positions 1 to 9, so raise that limit to your maximum string length):

with cte(n) as (
    select 1
    union all
    select n + 1 from cte where n < 9
)
select
    t.s1, t.s2,
    sum(
        case
            when substring(t.s1, c.n, 1) <> substring(t.s2, c.n, 1) then 1
            else 0
        end
    ) as diff
from test as t
    cross join cte as c
group by t.s1, t.s2
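
If you want to try it out, here is a self-contained variant with sample data. The temp table #test and its rows are made up for illustration, and instead of the fixed limit the position list runs up to the longest string in the table.

```sql
-- Illustrative sample data in a temp table.
CREATE TABLE #test (s1 VARCHAR(50), s2 VARCHAR(50));
INSERT INTO #test (s1, s2)
VALUES ('abcdefghi', 'abcxefgzi'),
       ('kitten',    'mitten'),
       ('abc',       'abcd');

-- Longest string in either column decides how many positions to check.
DECLARE @maxLen INT;
SELECT @maxLen = MAX(CASE WHEN LEN(s1) > LEN(s2) THEN LEN(s1) ELSE LEN(s2) END)
FROM #test;

WITH cte(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM cte WHERE n < @maxLen
)
SELECT t.s1, t.s2,
       SUM(CASE
               WHEN SUBSTRING(t.s1, c.n, 1) <> SUBSTRING(t.s2, c.n, 1) THEN 1
               ELSE 0
           END) AS diff
FROM #test AS t
CROSS JOIN cte AS c
GROUP BY t.s1, t.s2
OPTION (MAXRECURSION 0); -- lift the default limit of 100 recursion levels

DROP TABLE #test;
```

With the sample rows this gives a diff of 2, 1 and 1. A position past the end of the shorter string counts as a difference, which is why 'abc' and 'abcd' differ by 1.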
