SQL Fuzzy Join - MSSQL

Here is how this could be done using Levenshtein Distance:

Create this function (execute this first):

CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
    DECLARE @s1_len int, @s2_len int
    DECLARE @i int, @j int, @s1_char nchar(1), @c int, @c_temp int
    DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

    SELECT
        @s1_len = LEN(@s1),
        @s2_len = LEN(@s2),
        @cv1 = 0x0000,
        @j = 1, @i = 1, @c = 0

    -- Initialize the first row of the distance matrix: 1, 2, ..., @s2_len
    WHILE @j <= @s2_len
        SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

    -- Compute each subsequent row from the previous one (two-row DP)
    WHILE @i <= @s1_len
    BEGIN
        SELECT
            @s1_char = SUBSTRING(@s1, @i, 1),
            @c = @i,
            @cv0 = CAST(@i AS binary(2)),
            @j = 1

        WHILE @j <= @s2_len
        BEGIN
            SET @c = @c + 1
            SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j - 1, 2) AS int) +
                CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
            IF @c > @c_temp SET @c = @c_temp
            SET @c_temp = CAST(SUBSTRING(@cv1, @j + @j + 1, 2) AS int) + 1
            IF @c > @c_temp SET @c = @c_temp
            SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
        END

        SELECT @cv1 = @cv0, @i = @i + 1
    END

    RETURN @c
END

(Function developed by Joseph Gama)
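For reference, the two-row dynamic-programming approach the T-SQL function uses can be sketched in Python (an illustration of the algorithm, not part of the original answer):

```python
def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn s1 into s2, using two rolling rows."""
    prev = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete from s1
                            curr[j - 1] + 1,      # insert into s1
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

# Space-stripped comparison, as in the join query below:
print(levenshtein("EdsBarbershop", "Ed'sBarbershop"))  # 1
```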

Then use this query to get matches:

SELECT a.Customer,
       b.ID,
       b.Customer
FROM #POTENTIALCUSTOMERS a
LEFT JOIN #ExistingCustomers b
    ON dbo.ufn_levenshtein(REPLACE(a.Customer, ' ', ''), REPLACE(b.Customer, ' ', '')) < 5;

Complete script to run after you create that function:

IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
DROP TABLE #ExistingCustomers;

CREATE TABLE #ExistingCustomers
(Customer VARCHAR(255),
ID INT
);

INSERT INTO #ExistingCustomers
VALUES
('Ed''s Barbershop',
1002
);

INSERT INTO #ExistingCustomers
VALUES
('GroceryTown',
1003
);

INSERT INTO #ExistingCustomers
VALUES
('Candy Place',
1004
);

INSERT INTO #ExistingCustomers
VALUES
('Handy Man',
1005
);

IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
DROP TABLE #POTENTIALCUSTOMERS;

CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Eds Barbershop');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Grocery Town');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Candy Place');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Handee Man');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Beauty Salon');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('The Apple Farm');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Igloo Ice Cream');

INSERT INTO #POTENTIALCUSTOMERS
VALUES('Ride-a-Long Bikes');

SELECT a.Customer,
       b.ID,
       b.Customer
FROM #POTENTIALCUSTOMERS a
LEFT JOIN #ExistingCustomers b
    ON dbo.ufn_levenshtein(REPLACE(a.Customer, ' ', ''), REPLACE(b.Customer, ' ', '')) < 5;

You can find another T-SQL example at http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx

Fuzzy Matching in SQL

There is no simple answer to this; several algorithms are available, some of which may require developing a CLR function. There is a good discussion in this question and its answers.

Fuzzy phrase similarity in SQL Server

A relatively fast way to find the best matching string using fuzzy matching logic is to count matching 3-grams in the strings.

It can use prebuilt SQL functions and an indexed table, which should speed up the search. In particular, it does not have to compute the distance from the search string to every string in the data set.

First, for convenience, create a table-valued function that breaks strings into 3-letter tokens.

DROP FUNCTION IF EXISTS dbo.get_triplets;
GO
CREATE FUNCTION dbo.get_triplets
(
    @data varchar(1000)
)
RETURNS TABLE AS RETURN
(
    WITH Nums AS
    (
        SELECT n = ROW_NUMBER() OVER (ORDER BY [object_id]) FROM sys.all_objects
    )
    SELECT triplet, COUNT(*) c, LEN(@data) - 2 triplet_count
    FROM (
        SELECT SUBSTRING(@data, n, 3) triplet
        FROM (SELECT TOP (LEN(@data) - 2) n FROM Nums) n
    ) triplets
    GROUP BY triplet
)
GO
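The tokenization itself is straightforward; here is the same idea sketched in Python (the `get_triplets` name mirrors the T-SQL function, for illustration only):

```python
from collections import Counter

def get_triplets(data: str) -> dict:
    """Break a string into overlapping 3-letter tokens with counts,
    mirroring dbo.get_triplets (assumes len(data) >= 3)."""
    triplet_count = len(data) - 2  # total number of 3-grams
    grams = Counter(data[i:i + 3] for i in range(triplet_count))
    return {"counts": grams, "triplet_count": triplet_count}

t = get_triplets("lorem ipsum")
print(t["triplet_count"])  # 9 triplets in an 11-character string
```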

Create a string dataset

drop table if exists #data;
select * into #data
from (
values
(1, 'the quick brown fox jumps over the lazy dog'),
(2, 'the quick brown cat jumps over the lazy frog'),
(3, 'the lazy quick brown frog jumps over the cat'),
(4, 'lorem ipsum dolor sit amet')
) a(id,data);

Create an indexed table of 3-letter tokens

drop table if exists #triplets;
select id,triplet,c,triplet_count data_triplet_count
into #triplets
from #data d
cross apply dbo.get_triplets(d.data);

CREATE unique CLUSTERED INDEX IX_triplet_index ON #triplets(triplet,id);

Then an efficient fuzzy search for a match to a given string looks similar to this:

declare @string_to_search varchar(1000) = 'the quick brown ox jumps over a lazy dog';

select matched.*, d.data,
       cast(
           cast(matched_triplets as float)
           /
           cast(case when data_triplet_count > string_triplet_count
                     then data_triplet_count
                     else string_triplet_count
                end as float)
       as decimal(4,3)) score
from (
    select id,
           sum(case when a.c < b.c then a.c else b.c end) matched_triplets,
           max(a.data_triplet_count) data_triplet_count,
           max(b.triplet_count) string_triplet_count
    from #triplets a
    join dbo.get_triplets(@string_to_search) b
      on a.triplet = b.triplet
    group by id
) matched
join #data d
  on d.id = matched.id;
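The score this query computes is the number of shared 3-grams (counting duplicates via the smaller of the two counts) divided by the larger of the two 3-gram totals. A Python sketch of the same scoring, for illustration only:

```python
from collections import Counter

def trigram_score(a: str, b: str) -> float:
    """Matched 3-grams divided by the larger triplet total, as in the
    scoring query above (assumes both strings have length >= 3)."""
    ga = Counter(a[i:i + 3] for i in range(len(a) - 2))
    gb = Counter(b[i:i + 3] for i in range(len(b) - 2))
    matched = sum((ga & gb).values())  # min of counts per triplet
    return matched / max(len(a) - 2, len(b) - 2)

s = trigram_score('the quick brown ox jumps over a lazy dog',
                  'the quick brown fox jumps over the lazy dog')
print(round(s, 3))
```

Identical strings score 1.0 and strings with no shared 3-grams score 0.0, so the threshold for a "good" match can be tuned on that scale.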

SQL Fuzzy Matching

A rather quick, domain-specific solution is to calculate string similarity using SOUNDEX plus a numeric distance between the two strings. This only really helps when you have a lot of product codes.

Using a simple UDF like the one below, you can extract the numeric characters from a string: it returns 2200 for 'CLC 2200npk' and 1100 for 'CLC 1100'. You can then determine closeness based on both the SOUNDEX output of each input and the proximity of the numeric components.

CREATE FUNCTION [dbo].[ExtractNumeric](@input VARCHAR(1000))
RETURNS INT
AS
BEGIN
    -- Strip non-digit characters one at a time
    WHILE PATINDEX('%[^0-9]%', @input) > 0
    BEGIN
        SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '')
    END
    IF @input = '' OR @input IS NULL
        SET @input = '0'
    RETURN CAST(@input AS INT)
END
GO
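The same digit-extraction idea can be sketched in Python (an illustration, not part of the original answer; the `extract_numeric` name is hypothetical):

```python
import re

def extract_numeric(text: str) -> int:
    """Keep only the digit characters, mirroring dbo.ExtractNumeric;
    returns 0 when the string contains no digits."""
    digits = re.sub(r"[^0-9]", "", text or "")
    return int(digits) if digits else 0

print(extract_numeric("CLC 2200npk"))  # 2200
print(extract_numeric("CLC 1100"))     # 1100
```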

As far as general-purpose algorithms go, there are a couple which might help you, with varying degrees of success depending on data set size and performance requirements (both links have T-SQL implementations available):

  • Double Metaphone - This algorithm will give you a better match than SOUNDEX at the cost of speed; it is particularly good for spelling correction.
  • Levenshtein Distance - This calculates how many single-character edits it takes to turn one string into another; for instance, getting from 'CLC 2200npk' to 'CLC 2200' takes 3 edits, while getting from 'CLC 2200npk' to 'CLC 1100' takes 5.
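The distances quoted in the second bullet can be checked with any Levenshtein implementation; for example, a compact Python DP (a sketch for verification, not the linked T-SQL code):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein DP, used here just to check the
    distances quoted above."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        row, prev = [i] + [0] * len(b), row
        for j, cb in enumerate(b, 1):
            row[j] = min(prev[j] + 1,            # delete
                         row[j - 1] + 1,         # insert
                         prev[j - 1] + (ca != cb))  # substitute
    return row[-1]

print(edit_distance("CLC 2200npk", "CLC 2200"))  # 3
print(edit_distance("CLC 2200npk", "CLC 1100"))  # 5
```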

Here is an interesting article which applies both algorithms together and may give you a few ideas.

Well hopefully some of that helps a little.

EDIT: Here is a much faster partial Levenshtein Distance implementation (read the post; it won't return exactly the same results as the normal one). On my test table of 125,000 rows it runs in 6 seconds, compared to 60 seconds for the first one I linked to.


