Use Soundex() Word by Word on SQL Server

Use SOUNDEX() word by word on SQL Server

Have you looked into the Full-Text Search feature in SQL Server? I know this is not exactly what you asked for. Its just that the SOUNDEX() function is used to find similar SOUNDING names (EX: SMITH and SMYTHE sound the same). In a search engine, however, how a word sounds is less important than the search words themselves. Full-Text Search also lets you use synonyms (allowing you to specify certain words that mean the same thing within your application's context), and have them automatically considered during your search.

Look at these pages for more information about Full Text Search in SQL Server:

Introduction to Full-Text Search

CONTAINS

CONTAINSTABLE

FREETEXT

FREETEXTTABLE

How to use SOUNDEX dynamically in SQL Server 2008

If you mean using different string values within your function, you can just use parameters within your query:

declare @Search nvarchar(50) = 'DIPAK';
select soundex(@Search);

If you want to compare this to values in a table, you will need to either calculate the soundex of your lookup values within your where or join conditions or add a column to your table that holds this value:

declare @SearchSoundex nvarchar(10) = soundex('DIPAK');

select *
from Names
where soundex(FirstName) = @SearchSoundex;

declare @SearchSoundex nvarchar(10) = soundex('DIPAK');

select *
from Names
where FirstNameSoundex = @SearchSoundex;

Find SOUNDEX of each word in the column

You need split the name(s) into their respective parts and the "remerge" them. SQL Server 2008 (which is almost entirely out of support, so you should be looking at your upgrade plans) doesn't have a splitter built in. SQL Server 2016+ does, however, it doesn't provide ordinal position; therefore I have used DelimitedSplit8K (A google will find this). If you are using 2012+, I would recommend DelimitedSplit8K_LEAD (even on 2016+, as ordinal position is important):

WITH VTE AS(
SELECT *
FROM (VALUES('Mr.Alex James Henrry'),
('Mr John Desto'),
('Ms.Lisa Jack Jerry Han'),
('Smith White'),
('Rowny James Duest')) V([Name]))
SELECT [name],
STUFF((SELECT ' ' + SOUNDEX(DS.item)
FROM dbo.DelimitedSplit8K(REPLACE([name],'.',' '),' ') DS
WHERE DS.item NOT IN ('Mr','Mrs','Miss','...') --You know what your acceptable titles are
--Although, seeing as you have both "Mr {name}" and Mr.{name}", maybe not :/
ORDER BY DS.itemnumber
FOR XML PATH('')),1,1,'') AS name_soundex
FROM VTE;

Selecting words out of table which sound similar

SoundEx()

SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken.

Difference()

Returns an integer value that indicates the difference between the SOUNDEX values of two character expressions.

SELECT word
, SoundEx(word) As word
, SoundEx(word_that_sounds_similar) As word_that_sounds_similar
, Difference(SoundEx(word), SoundEx(word_that_sounds_similar)) As how_similar
FROM wordtable
WHERE Difference(SoundEx(word), SoundEx(word_that_sounds_similar)) <= 1 /* quite close! */

The value returned by Difference() indicates how similar the two words are.

A value of 0 indicates a strong match and a value of 4 means slim-to-no match.

SQL find Word in String (any Spelling)

Convert it to upper case before comparing

SELECT * FROM dreams
WHERE
upper(title)
LIKE '%LO%'

How does the Soundex function work in SQL Server?

Take a look a this article

The first letter of the code corresponds to the first letter of the
name. The remainder of the code consists of three digits derived from
the syllables of the word according to the following code:

  • 1 = B, F, P, V
  • 2 = C, G, J, K, Q, S, X, Z
  • 3 = D, T
  • 4 = L
  • 5 = M,N
  • 6 = R

The double letters with the same Soundex code, A, E, I, O, U, H, W, Y,
and some prefixes are being disregarded...

So for Smith and Smythe the code is created like this:

S  S   ->   S
m m -> 5
i y -> 0
t t -> 3
h h -> 0
e -> -

SQL Server soundex and MySQL soundex difference

From the MySQL documentation:

A standard soundex string is four characters long, but the SOUNDEX() function returns an arbitrarily long string. You can use SUBSTRING() on the result to get a standard soundex string.

The version that MSSQL returns is the normal length, MySQL is just being different for some reason.

Also looks like MSSQL is only looking at the first word. With a third soundex implementation, SOUNDEX('game') gives G500 and SOUNDEX('game of thrones') gives G513.

PATINDEX with SOUNDEX

Here's one option, not sure how it would perform with 100 million records considering all that you need to do. You'll have to test that out.

At a high level how I understand this is you basically need

  • Search all words in a string based on the words of another string
  • Returning the character starting position in the original string where that word equals or sounds like the search word.

You can use DIFFERENCE() for the comparison:

DIFFERENCE compares two different SOUNDEX values, and returns an
integer value. This value measures the degree that the SOUNDEX values
match, on a scale of 0 to 4. A value of 0 indicates weak or no
similarity between the SOUNDEX values; 4 indicates strongly similar,
or even identically matching, SOUNDEX values.

You'll need to split the string based on the space ' ' and since you're 2008 you'd have to roll your own function.

I used the XML function from here, https://sqlperformance.com/2012/07/t-sql-queries/split-strings, for my examples, you'll obviously need to adjust if you have your own or want to use something different:

CREATE FUNCTION dbo.SplitStrings_XML
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')
FROM
(
SELECT x = CONVERT(XML, '<i>'
+ REPLACE(@List, @Delimiter, '</i><i>')
+ '</i>').query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
);
GO

I switched and use table variables to show the example, I would suggest not doing that with the amount of data you have and create and use physical tables.

Option 1 - Not dynamic:

DECLARE @tbl_pat_soundex TABLE
(
[col_str] VARCHAR(MAX)
);

INSERT INTO @tbl_pat_soundex
VALUES ( 'Smith A Steve' )
,( 'Steve A Smyth' )
,( 'A Smeeth Stive' )
,( 'Steve Smith A' )
,( 'Smit Steve A' )

SELECT DISTINCT [aa].[col_str]
, MAX([aa].[Smith]) OVER ( PARTITION BY [aa].[col_str] ) AS [Smith]
, MAX([aa].[A]) OVER ( PARTITION BY [aa].[col_str] ) AS [A]
, MAX([aa].[Steve]) OVER ( PARTITION BY [aa].[col_str] ) AS [Steve]
FROM (
SELECT [a].[col_str]
, CASE WHEN DIFFERENCE([b].[item], 'Smith') = 4 THEN
CHARINDEX([b].[item], [a].[col_str])
ELSE 0
END AS [Smith]
, CASE WHEN DIFFERENCE([b].[item], 'A') = 4 THEN
CHARINDEX([b].[item], [a].[col_str])
ELSE 0
END AS [A]
, CASE WHEN DIFFERENCE([b].[item], 'Steve') = 4 THEN
CHARINDEX([b].[item], [a].[col_str])
ELSE 0
END AS [Steve]
FROM @tbl_pat_soundex [a]
CROSS APPLY [dbo].[SplitStrings_XML]([a].[col_str], ' ') [b]
) AS [aa];
  • Using the function we split the string into individual words
  • Then we use a case statement to check the DIFFERENCE value
  • If that DIFFERENCE value equals 4 we then return the CHARINDEX value of the original word against string.
  • If doesn't equal we return 0

Then from there it's a matter of getting the max value of each based on the original string:

          , MAX([aa].[Smith]) OVER ( PARTITION BY [aa].[col_str] ) AS [Smith]
, MAX([aa].[A]) OVER ( PARTITION BY [aa].[col_str] ) AS [A]
, MAX([aa].[Steve]) OVER ( PARTITION BY [aa].[col_str] ) AS [Steve]

To get you your final results:

Sample Image

Option 2 - Dynamic with a pivot:

We'll declare the string we want to search, split that out and search for those individuals words in the original string and then pivot the results.

--This example is using global temp tables as it's showing how
--to build a dynamic pivot
IF OBJECT_ID('tempdb..##tbl_pat_soundex') IS NOT NULL
DROP TABLE [##tbl_pat_soundex];

IF OBJECT_ID('tempdb..##tbl_col_str_SearchString') IS NOT NULL
DROP TABLE [##tbl_col_str_SearchString];

CREATE TABLE [##tbl_pat_soundex]
(
[col_str] VARCHAR(MAX)
);

INSERT INTO [##tbl_pat_soundex]
VALUES ( 'Smith A Steve' )
, ( 'Steve A Smyth' )
, ( 'A Smeeth Stive' )
, ( 'Steve Smith A' )
, ( 'Smit Steve A' );

--What are you searching for?
DECLARE @SearchString NVARCHAR(200);
SET @SearchString = N'Smith A Steve';

--We build a table we load with every combination of the words from the string and the words from the SearchString for easier comparison.
CREATE TABLE [##tbl_col_str_SearchString]
(
[col_str] NVARCHAR(MAX)
, [col_str_value] NVARCHAR(MAX)
, [SearchValue] NVARCHAR(200)
);

--Load that table for comparison
--split our original string into individual words
--also split our search string into individual words and give me all combinations.
INSERT INTO [##tbl_col_str_SearchString] (
[col_str]
, [col_str_value]
, [SearchValue]
)
SELECT DISTINCT [a].[col_str]
, [b].[item]
, [c].[item]
FROM [##tbl_pat_soundex] [a]
CROSS APPLY [dbo].[SplitStrings_XML]([a].[col_str], ' ') [b]
CROSS APPLY [dbo].[SplitStrings_XML](@SearchString, ' ') [c]
ORDER BY [a].[col_str];

--Then we can easily compare each word and search word for those that match or sound alike using DIFFERNCE()
SELECT [col_str], [col_str_value], [SearchValue], CASE WHEN DIFFERENCE([col_str_value], [SearchValue]) = 4 THEN CHARINDEX([col_str_value], [col_str]) ELSE 0 END AS [Match] FROM ##tbl_col_str_SearchString

--Then we can pivot on it
--and we will need to make it dynamic since we are not sure what what @SearchString could be.
DECLARE @PivotSQL NVARCHAR(MAX);
DECLARE @pivotColumn NVARCHAR(MAX);

SET @pivotColumn = N'[' + REPLACE(@SearchString, ' ', '],[') + N']';

SET @PivotSQL = N'SELECT * FROM (
SELECT [col_str], [SearchValue], CASE WHEN DIFFERENCE([col_str_value], [SearchValue]) = 4 THEN CHARINDEX([col_str_value], [col_str]) ELSE 0 END AS [Match] FROM ##tbl_col_str_SearchString
) aa
PIVOT (MAX([Match]) FOR [SearchValue] IN (' + @pivotColumn
+ N')) AS MaxMatch
ORDER BY [MaxMatch].[col_str]
';

--Giving us the final results.
EXEC sp_executesql @PivotSQL


Related Topics



Leave a reply



Submit