SQL Server Fuzzy Search with Percentage of match
This is how I could accomplish this:
Explained further @ SQL Server Fuzzy Search - Levenshtein Algorithm
Create the file below using any editor of your choice:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class StoredFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = false)]
    public static SqlDouble Levenshtein(SqlString stringOne, SqlString stringTwo)
    {
        #region Handle for Null value
        if (stringOne.IsNull)
            stringOne = new SqlString("");
        if (stringTwo.IsNull)
            stringTwo = new SqlString("");
        #endregion

        #region Convert to Uppercase
        string strOneUppercase = stringOne.Value.ToUpper();
        string strTwoUppercase = stringTwo.Value.ToUpper();
        #endregion

        #region Quick Check and quick match score
        int strOneLength = strOneUppercase.Length;
        int strTwoLength = strTwoUppercase.Length;
        if (strOneLength + strTwoLength == 0)
        {
            return 100; // two empty strings match completely
        }
        else if (strOneLength == 0)
        {
            return 0;
        }
        else if (strTwoLength == 0)
        {
            return 0;
        }
        #endregion

        #region Levenshtein Formula
        // dimension[i, j] is the edit distance between the first i characters
        // of string one and the first j characters of string two.
        int[,] dimension = new int[strOneLength + 1, strTwoLength + 1];
        int matchCost = 0;
        for (int i = 0; i <= strOneLength; i++)
            dimension[i, 0] = i;
        for (int j = 0; j <= strTwoLength; j++)
            dimension[0, j] = j;
        for (int i = 1; i <= strOneLength; i++)
        {
            for (int j = 1; j <= strTwoLength; j++)
            {
                if (strOneUppercase[i - 1] == strTwoUppercase[j - 1])
                    matchCost = 0;
                else
                    matchCost = 1;
                // Cheapest of: deletion, insertion, substitution (or match)
                dimension[i, j] = System.Math.Min(
                    System.Math.Min(dimension[i - 1, j] + 1, dimension[i, j - 1] + 1),
                    dimension[i - 1, j - 1] + matchCost);
            }
        }
        #endregion

        // Calculate Percentage of match
        double percentage = System.Math.Round(
            (1.0 - ((double)dimension[strOneLength, strTwoLength] / (double)System.Math.Max(strOneLength, strTwoLength))) * 100.0, 2);
        return percentage;
    }
}
Name it levenshtein.cs
Go to Command Prompt. Go to the file directory of levenshtein.cs, then call csc.exe /t:library /out:UserFunctions.dll levenshtein.cs (no space after the colons). You may have to give the full path of csc.exe from the .NET Framework 2.0 directory.
Once your DLL is ready, add it to the assemblies: Database >> Programmability >> Assemblies >> New Assembly.
Create function in your database:
CREATE FUNCTION dbo.LevenshteinSVF
(
@S1 NVARCHAR(200) ,
@S2 NVARCHAR(200)
)
RETURNS FLOAT
AS EXTERNAL NAME
UserFunctions.StoredFunctions.Levenshtein
GO
In my case I had to enable clr:
sp_configure 'clr enabled', 1
GO
reconfigure
GO
Test the function:
SELECT dbo.LevenshteinSVF('James','James Bond')
Result: 50 % match
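To see where the 50 comes from, here is a rough Python sketch of the same arithmetic. It is not part of the SQL Server setup; the function name levenshtein_percentage is made up for illustration, and Python's round may differ from Math.Round in rare tie cases.

```python
def levenshtein_percentage(s1, s2):
    """Percentage match, following the arithmetic of the C# function above."""
    # Treat None the way the C# code treats SQL NULL: as an empty string.
    a = (s1 or "").upper()
    b = (s2 or "").upper()
    if not a and not b:
        return 100.0
    if not a or not b:
        return 0.0

    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match

    longest = max(len(a), len(b))
    return round((1.0 - dist[len(a)][len(b)] / longest) * 100.0, 2)


# 'James' -> 'James Bond' needs 5 insertions; the longer string has 10 characters.
print(levenshtein_percentage("James", "James Bond"))  # 50.0
```

So the score is 1 minus (edit distance / length of the longer string), scaled to 0..100.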
Fuzzy string matching SQL - words in different order
I have not found anything that could measure the shuffling of words in a string. For a shuffling of letters I ended up using this answer: https://stackoverflow.com/a/26389197/1903793
CREATE ASSEMBLY [FuzzyString]
FROM 0x4D5A90000300000004000000FFFF0000B800000000000000400000000000000000000000000000000000000000000000000000000000000000000000800000000E1FBA0E00B409CD21B8014CCD21546869732070726F6772616D2063616E6E6F742062652072756E20696E20444F53206D6F64652E0D0D0A2400000000000000504500004C010300BBB08A5A0000000000000000E00022200B013000000C000000060000000000007A2B0000002000000040000000000010002000000002000004000000000000000600000000000000008000000002000000000000030060850000100000100000000010000010000000000000100000000000000000000000282B00004F000000004000009003000000000000000000000000000000000000006000000C000000F02900001C0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000200000080000000000000000000000082000004800000000000000000000002E74657874000000800B000000200000000C000000020000000000000000000000000000200000602E72737263000000900300000040000000040000000E0000000000000000000000000000400000402E72656C6F6300000C00000000600000000200000012000000000000000000000000000040000042000000000000000000000000000000005C2B00000000000048000000020005006022000090070000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000013300800F901000001000011000F00281000000A130711072C0C0F007201000070281100000A0F01281000000A130811082C0C0F017201000070281100000A0F00281200000A6F1300000A0A0F01281200000A6F1300000A0B066F1400000A0C076F1400000A0D081758091758731500000A130416130508095816FE01130911092C1600230000000000005940281600000A130A38690100000816FE01130B110B2C1600230000000000000000281600000A130A38490100000916FE01130C110C2C1600230000000000000000281600000A130A382901000016130D2B121104110D16110D281700000A110D1758130D110D08FE0216FE01130E110E2DE016130F2B12110416110F110F281700000A110F1758130F110F09FE0216FE01131011102DE0171311388C000000001713122B710006111117596F1800000A07111217596F1800000AFE01131311132C051613052B031713051104111111121104111117591112281900000A17581104111111121759281900000A1758281A00000A11041111175911121759281900000A110558281A000
00A281700000A00111217581312111209FE0216FE01131411142D8100111117581311111108FE0216FE01131511153A63FFFFFF23000000000000F03F11040809281900000A6C0809281B00000A6C5B592300000000000059405A18281C00000A13061106281600000A130A2B00110A2A2202281D00000A002A000042534A4201000100000000000C00000076342E302E33303331390000000005006C0000005C020000237E0000C80200001C03000023537472696E677300000000E40500000400000023555300E8050000100000002347554944000000F80500009801000023426C6F620000000000000002000001471502080900000000FA01330016000001000000150000000200000002000000020000001D0000000F000000010000000100000001000000020000000000F101010000000000060028019E02060095019E02060047006C020F00BE02000006006F00270206000B0127020600D700270206007C01270206004801270206006101270206008600270206005B007F02060039007F020600BA0027020600A100BD010600FC020C020A00F6004B020A002500CD020A00D701CD020600DA010C020600E1010C02000000000100000000000100010001001000E202000041000100010050200000000096001702810001005522000000008618660206000300000001002F00000002003902090066020100110066020600190066020A00290066021000310066021000390066021000410066021000490066021000510066021000590066021000610066021500690066021000710066021000790066021000890066020600990001023A009900660210009900B3013E00A10043023E00A100E60142000C0066024E0091000B0354000C0007035A00A100F20261000C0003036600A90013026C00A90017036C00A9001F00720081006602060020007B0071012E000B008A002E00130093002E001B00B2002E002300BB002E002B00CE002E003300CE002E003B00CE002E004300BB002E004B00D4002E005300CE002E005B00CE002E006300EC002E006B0016012E00730023011A0046000480000001000000000000000000000000001B020000040000000000000000000000780016000000000004000000000000000000000078000A00000000000000003C4D6F64756C653E0053797374656D2E44617461006D73636F726C696200526F756E640053716C446F75626C6500737472696E674F6E6500477569644174747269627574650044656275676761626C6541747472696275746500436F6D56697369626C6541747472696275746500417373656D626C795469746C6541747472696275746500417373656D626C7954726164656D61726B417474726962757465005461726
765744672616D65776F726B41747472696275746500417373656D626C7946696C6556657273696F6E41747472696275746500417373656D626C79436F6E66696775726174696F6E4174747269627574650053716C46756E6374696F6E41747472696275746500417373656D626C794465736372697074696F6E41747472696275746500436F6D70696C6174696F6E52656C61786174696F6E7341747472696275746500417373656D626C7950726F6475637441747472696275746500417373656D626C79436F7079726967687441747472696275746500417373656D626C79436F6D70616E794174747269627574650052756E74696D65436F6D7061746962696C697479417474726962757465006765745F56616C75650053797374656D2E52756E74696D652E56657273696F6E696E670053716C537472696E67004D617468006765745F4C656E677468004C6576656E73687465696E2E646C6C006765745F49734E756C6C0053797374656D004D696E004861426F4C6576656E73687465696E0053797374656D2E5265666C656374696F6E00737472696E6754776F00546F5570706572004D6963726F736F66742E53716C5365727665722E536572766572002E63746F720053797374656D2E446961676E6F73746963730053797374656D2E52756E74696D652E496E7465726F7053657276696365730053797374656D2E52756E74696D652E436F6D70696C6572536572766963657300446562756767696E674D6F6465730053797374656D2E446174612E53716C54797065730053746F72656446756E6374696F6E73006765745F4368617273004F626A6563740047657400536574006F705F496D706C69636974004D61780000000100008C1A28518DAA994B969F8C9B2C0CD20400042001010803200001052001011111042001010E04200101021F07160E0E080814080200020000080D02020211490202080208020808020202032000020320000E03200008071408020002000005200201080805000111490D0620030108080804200103080520020808080500020808080500020D0D0808B77A5C561934E0890800021149114D114D0801000800000000001E01000100540216577261704E6F6E457863657074696F6E5468726F7773010801000701000000001201000D436C6173734C69627261727931000005010000000017010012436F7079726967687420C2A920203230313800002901002465356266373439622D363661392D343637332D396233332D39616639656462383961663100000C010007312E302E302E3000004D01001C2E4E45544672616D65776F726B2C56657273696F6E3D76342E362E310100540E144672616D65776F726B446973706C61794E616D651
42E4E4554204672616D65776F726B20342E362E31240100020054020F497344657465726D696E69737469630154020949735072656369736500000000000000BBB08A5A00000000020000001C0100000C2A00000C0C0000525344536AF89DEC4586C4488693EFBD73C73D1E01000000433A5C315C53514C5C646C6C5C436C6173734C696272617279315C436C6173734C696272617279315C6F626A5C44656275675C4C6576656E73687465696E2E7064620000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000502B000000000000000000006A2B00000020000000000000000000000000000000000000000000005C2B0000000000000000000000005F436F72446C6C4D61696E006D73636F7265652E646C6C0000000000FF2500200010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100100000001800008000000000000000000000000000000100010000003000008000000000000000000000000000000100000000004800000058400000340300000000000000000000340334000000560053005F00560045005200530049004F004E005F0049004E0046004F0000000000BD04EFFE00000100000001000000000000000100000000003F000000000000000400000002000000000000000000000000000000440000000100560061007200460069006C00650049006E0066006F00000000002400040000005400720061006E0073006C006100740069006F006E00000000000000B00494020000010053007400720069006E006700460069006C00650049006E0066006F0000007002000001003000300030003000300034006200300000001A000100010043006F006D006D0065006E007400730000000000000022000100010043006F006D00700061006E0079004E0061006D006500000000000000000044000E000100460069006C0065004400650073006300720069007000740069006F006E00000000004
3006C006100730073004C0069006200720061007200790031000000300008000100460069006C006500560065007200730069006F006E000000000031002E0030002E0030002E003000000040001000010049006E007400650072006E0061006C004E0061006D00650000004C006500760065006E00730068007400650069006E002E0064006C006C0000004800120001004C006500670061006C0043006F007000790072006900670068007400000043006F0070007900720069006700680074002000A90020002000320030003100380000002A00010001004C006500670061006C00540072006100640065006D00610072006B00730000000000000000004800100001004F0072006900670069006E0061006C00460069006C0065006E0061006D00650000004C006500760065006E00730068007400650069006E002E0064006C006C0000003C000E000100500072006F0064007500630074004E0061006D0065000000000043006C006100730073004C0069006200720061007200790031000000340008000100500072006F006400750063007400560065007200730069006F006E00000031002E0030002E0030002E003000000038000800010041007300730065006D0062006C0079002000560065007200730069006F006E00000031002E0030002E0030002E00300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002000000C0000007C3B0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
WITH PERMISSION_SET = SAFE
GO
CREATE FUNCTION [dbo].[Levenshtein](@S1 [nvarchar](200), @S2 [nvarchar](200))
RETURNS [float] WITH EXECUTE AS CALLER
AS
EXTERNAL NAME [FuzzyString].[StoredFunctions].[HaBoLevenshtein]
GO
Example of how to use it:
select [dbo].[Levenshtein] ('Apple', 'Appleee')
Fuzzy matching a string in SQL
In PostgreSQL you can use the fuzzystrmatch extension. It provides a levenshtein function that returns the distance between two texts; you can then perform fuzzy matching with a predicate like this one:
where levenshtein(street_address, '123 Main Avex') <= 1
This will match a record whose street_address is '123 Main Ave', because the distance between '123 Main Ave' and '123 Main Avex' is 1 (one insertion).
Of course, the value 1 here is just an example and makes the matching quite strict (a difference of only one character). You should either use a larger number or, as @IVO GELOV suggests, use a relative distance (the distance divided by the length).
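The relative-distance idea can be sketched outside the database like this. This is only an illustration in Python: dividing by the length of the longer string and the 0.25 cutoff are my assumptions, and the sample addresses are made up.

```python
from functools import lru_cache


def levenshtein(a, b):
    """Plain Levenshtein distance via memoized recursion (fine for short strings)."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) = distance between a[:i] and b[:j]
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)

    return d(len(a), len(b))


def relative_distance(a, b):
    """Distance divided by the longer length (assumed normalization), in 0..1."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


# Hypothetical sample data and threshold, for illustration only.
addresses = ["123 Main Ave", "123 Main Avenue", "456 Oak St"]
target = "123 Main Avex"
matches = [s for s in addresses if relative_distance(s, target) <= 0.25]
print(matches)
```

With a relative threshold, longer strings are allowed proportionally more edits, which a fixed cutoff like <= 1 does not give you.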
SQL Fuzzy Matching
A rather quick, domain-specific solution may be to calculate a string similarity using SOUNDEX together with a numeric distance between the two strings. This will only really help when you have a lot of product codes.
Using a simple UDF like the one below you can extract the numeric characters from a string, so that you get 2200 out of 'CLC 2200npk' and 1100 out of 'CLC 1100'. You can then determine closeness based on the SOUNDEX output of each input as well as the closeness of the numeric component of each input.
CREATE FUNCTION [dbo].[ExtractNumeric](@input VARCHAR(1000))
RETURNS INT
AS
BEGIN
    -- Remove non-digit characters one at a time until only digits remain
    WHILE PATINDEX('%[^0-9]%', @input) > 0
    BEGIN
        SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '')
    END
    IF @input = '' OR @input IS NULL
        SET @input = '0'
    RETURN CAST(@input AS INT)
END
GO
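For reference, the numeric half of that idea looks roughly like this in Python. It only mirrors the digit extraction; the SOUNDEX half is left out, and numeric_gap is a hypothetical helper name for the "closeness of the numeric component".

```python
def extract_numeric(text):
    """Keep only the characters 0-9 and read them as one integer (0 if none)."""
    digits = "".join(ch for ch in (text or "") if "0" <= ch <= "9")
    return int(digits) if digits else 0


def numeric_gap(a, b):
    """Hypothetical closeness measure: absolute difference of the extracted numbers."""
    return abs(extract_numeric(a) - extract_numeric(b))


print(extract_numeric("CLC 2200npk"))          # 2200
print(extract_numeric("CLC 1100"))             # 1100
print(numeric_gap("CLC 2200npk", "CLC 1100"))  # 1100
```

Note that all digits in the string are concatenated, so a code like 'A1B2' yields 12, exactly as the UDF does.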
As far as general-purpose algorithms go, there are a couple which might help you, with varying degrees of success depending on data set size and performance requirements (both links have T-SQL implementations available):
- Double Metaphone - This algorithm will give you a better match than SOUNDEX at the cost of speed; it is really good for spelling correction, though.
- Levenshtein Distance - This calculates how many keypresses it would take to turn one string into another. For instance, getting from 'CLC 2200npk' to 'CLC 2200' takes 3, while getting from 'CLC 2200npk' to 'CLC 1100' takes 5.
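The two distances quoted for the product codes can be checked with a small Python sketch. This is the textbook two-row variant, written only for illustration; it is not one of the T-SQL implementations referred to above.

```python
def levenshtein_distance(a, b):
    """Edit distance keeping only the previous row of the matrix in memory."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[len(b)]


print(levenshtein_distance("CLC 2200npk", "CLC 2200"))  # 3 (delete n, p, k)
print(levenshtein_distance("CLC 2200npk", "CLC 1100"))  # 5 (3 deletions + 2 substitutions)
```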
Here is an interesting article which applies both algos together which may give you a few ideas.
Well hopefully some of that helps a little.
EDIT: Here is a much faster partial Levenshtein Distance implementation (read the post; it won't return exactly the same results as the normal one). On my test table of 125000 rows it runs in 6 seconds, compared to 60 seconds for the first one I linked to.
fuzzy matching in sql
There is no simple answer to this. Some algorithms are available, but they may require developing a CLR function. There is a good discussion in this question and its answers.