Range Wildcard Pattern Matching Behaviour with Case-Sensitive Collations

Unfortunately, the range operators behave a little oddly. The range of letters from A-Z expands to:

AbBcCdDeE...yYzZ

That is, each lower case character immediately precedes its upper case counterpart. It also means that, in a case-sensitive collation, the range A-Z excludes lower case a, which matters if you want to handle both upper and lower case characters.


I should say the above, regarding how the range expands out, is based on the collations I generally work with. How the range actually expands is collation dependent. If you can find a collation where, for instance, all upper case characters occur before all lower case characters, then the range would work as you expect. (Possibly one of the binary collations?)
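To make the expansion concrete, here is a small Python sketch that models the ordering described above with a sort key. The key function is a simplification of my own (an assumption, not how SQL Server implements collations); it only reproduces the rule that each lower case letter sorts just before its upper case counterpart, and compares it with plain code point order as a stand-in for a binary collation.

```python
import string

def dictionary_key(ch):
    # Hypothetical model: order by letter first, then lower case before upper case.
    return (ch.lower(), ch.isupper())

def in_range(ch, low, high, key):
    # A range [low-high] matches every character that sorts between the endpoints.
    return key(low) <= key(ch) <= key(high)

letters = string.ascii_letters

# Dictionary-order model: every letter except lower case 'a' falls inside A-Z.
excluded = [c for c in letters if not in_range(c, "A", "Z", dictionary_key)]
print(excluded)

# Binary model (plain code point order): only the 26 upper case letters match.
binary_matches = [c for c in letters if in_range(c, "A", "Z", ord)]
print("".join(binary_matches))
```

Under the first model only 'a' is left out of the range; under the second the range behaves the way most people expect.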

How to join 2 fields of same data that are not case sensitive

Some databases support collation functionality where you can change how the comparisons are done. But the more general solution is to use lower() or upper():

SELECT u.email, u.UserName, ud.CustomerName, ud.CustomerAddress
FROM Userid u JOIN
     Userdetails ud
     ON LOWER(ud.Customer) = LOWER(u.Email);

Note: The use of a function in the ON clause will generally impede performance. I would recommend changing the collation on your database so that comparisons are case-insensitive, or changing the data so it is all one case:

update userid
set email = lower(email)
where email <> lower(email);
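As a quick way to see the effect of LOWER() in the join, here is a sketch using Python's built-in sqlite3 module with an in-memory database. SQLite compares text case-sensitively by default, so it stands in for a case-sensitive database here; the table and column names follow the query above, while the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Userid (Email TEXT, UserName TEXT);
    CREATE TABLE Userdetails (Customer TEXT, CustomerName TEXT);
    -- Made-up sample rows: same address, different case.
    INSERT INTO Userid VALUES ('Alice@Example.com', 'alice');
    INSERT INTO Userdetails VALUES ('alice@example.com', 'Alice A.');
""")

# A plain join finds nothing because the comparison is case-sensitive.
plain = conn.execute(
    "SELECT u.UserName FROM Userid u JOIN Userdetails ud ON ud.Customer = u.Email"
).fetchall()

# Normalising both sides with LOWER() makes the rows match.
lowered = conn.execute(
    "SELECT u.UserName, ud.CustomerName FROM Userid u "
    "JOIN Userdetails ud ON LOWER(ud.Customer) = LOWER(u.Email)"
).fetchall()

print(plain)
print(lowered)
conn.close()
```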

Transact-SQL collate doesn't work

Try forcing a BINARY comparison so that characters are compared by their code values.

This guy's link was helpful in solving your problem. LINK

I reproduced the 2 you were getting, and with COLLATE Latin1_General_BIN it returned the expected 9.

SELECT PATINDEX('%[A-Z].%', 'he.llo MA. asd ' COLLATE Latin1_General_BIN )
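The two results can be reproduced with a rough Python model of the '%[A-Z].%' pattern. This is only a sketch: the helper names are hypothetical, and I am assuming the original query ran under a case-insensitive collation, which would explain the 2.

```python
def patindex_range_dot(text, in_range):
    # Return the 1-based position of the first character that satisfies
    # in_range and is immediately followed by '.', or 0 if there is none.
    for i in range(len(text) - 1):
        if in_range(text[i]) and text[i + 1] == ".":
            return i + 1
    return 0

def case_insensitive_range(ch):
    # Assumed model of [A-Z] under a case-insensitive collation.
    return "A" <= ch.upper() <= "Z"

def binary_range(ch):
    # Model of [A-Z] under a binary collation: plain code point comparison.
    return "A" <= ch <= "Z"

text = "he.llo MA. asd "
print(patindex_range_dot(text, case_insensitive_range))  # 2: the 'e' before the first '.'
print(patindex_range_dot(text, binary_range))            # 9: the 'A' before the second '.'
```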

Is the LIKE operator case-sensitive with SQL Server?

It is not the operator that is case sensitive, it is the column itself.

When SQL Server is installed, a default collation is chosen for the instance. Unless explicitly stated otherwise (see the collate clause below), a new database inherits its collation from the instance, and a new column inherits its collation from the database it belongs to.

A collation like sql_latin1_general_cp1_ci_as dictates how the content of the column should be treated. CI stands for case insensitive and AS stands for accent sensitive.

A complete list of collations is available at https://msdn.microsoft.com/en-us/library/ms144250(v=sql.105).aspx

(a) To check an instance collation

select serverproperty('collation')

(b) To check a database collation

select databasepropertyex('databasename', 'collation') sqlcollation

(c) To create a database using a different collation

create database exampledatabase
collate sql_latin1_general_cp1_cs_as

(d) To create a column using a different collation

create table exampletable (
    examplecolumn varchar(10) collate sql_latin1_general_cp1_ci_as null
)

(e) To modify a column collation

alter table exampletable
alter column examplecolumn varchar(10) collate sql_latin1_general_cp1_ci_as null

It is possible to change the collation of an instance or a database, but doing so does not affect previously created objects.

It is also possible to change a column collation on the fly for string comparison, but this is strongly discouraged in a production environment because it is extremely costly.

select
column1 collate sql_latin1_general_cp1_ci_as as column1
from table1
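The point that the column, not the operator, decides case sensitivity can be tried out quickly with Python's sqlite3 module. This is a sketch on SQLite rather than SQL Server, so the collation names (NOCASE, BINARY) are SQLite's own and the column names are made up; the idea of per-column collation and a COLLATE override is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exampletable ("
    " ci_column TEXT COLLATE NOCASE,"   # case-insensitive column
    " cs_column TEXT COLLATE BINARY)"   # case-sensitive column
)
conn.execute("INSERT INTO exampletable VALUES ('Hello', 'Hello')")

# The same = operator gives different answers depending on the column collation.
ci_equal, cs_equal = conn.execute(
    "SELECT ci_column = 'HELLO', cs_column = 'HELLO' FROM exampletable"
).fetchone()

# A COLLATE clause overrides the column collation for a single comparison.
overridden = conn.execute(
    "SELECT cs_column = 'HELLO' COLLATE NOCASE FROM exampletable"
).fetchone()[0]

print(ci_equal, cs_equal, overridden)
conn.close()
```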

SQL Query: how to break column name into pieces (regexp?)

As @Chris Lively indicates, a solid camel-case splitting function of some form is probably the best approach, but here's one query that does it all inline. Collation is obviously a consideration for the pattern matching, and in this instance I have used a case-sensitive one in the PATINDEX function. (As an aside, I had to explicitly list every uppercase letter in the matching expression because [A-Z] did not return the correct results. I think that's a topic for another question.)

CREATE TABLE dbo.OriginalNames
(
    CamelCaseName VARCHAR(30)
)
GO

INSERT INTO dbo.OriginalNames VALUES ('thisIsColumnName')
INSERT INTO dbo.OriginalNames VALUES ('thisIsAttributeName')
INSERT INTO dbo.OriginalNames VALUES ('thisIsAnotherAttributeName')

GO

SELECT * FROM dbo.OriginalNames;
GO

WITH
L0 AS(SELECT 1 AS c UNION ALL SELECT 1),
L1 AS(SELECT 1 AS c FROM L0 AS A, L0 AS B),
L2 AS(SELECT 1 AS c FROM L1 AS A, L1 AS B),
L3 AS(SELECT 1 AS c FROM L2 AS A, L2 AS B),
Numbers AS(SELECT ROW_NUMBER() OVER(ORDER BY c) AS n FROM L3)
SELECT DISTINCT(SplitNames.Value)
FROM (
SELECT nums.n, names.CamelCaseName, LTRIM(RTRIM(SUBSTRING(names.CamelCaseName, nums.n - 1, PATINDEX('%[|ABCDEFGHIJKLMNOPQRSTUVWXYZ]%', SUBSTRING(names.CamelCaseName + N'|', nums.n, LEN(names.CamelCaseName)) COLLATE SQL_Latin1_General_Cp1_CS_AS)))) AS [Value]
FROM Numbers AS nums INNER JOIN dbo.OriginalNames AS names ON nums.n <= CONVERT(int, LEN(names.CamelCaseName) + 1) AND PATINDEX('%[|ABCDEFGHIJKLMNOPQRSTUVWXYZ]%', SUBSTRING(N'|' + names.CamelCaseName, nums.n, 1) COLLATE SQL_Latin1_General_Cp1_CS_AS) > 0) AS SplitNames

GO

--DROP TABLE dbo.OriginalNames

-- OUTPUT as follows
--
-- Value
-- =========
-- Another
-- Attribute
-- Column
-- Is
-- Name
-- this

Equals(=) vs. LIKE

Different Operators

LIKE and = are different operators. Most answers here focus on the wildcard support, which is not the only difference between these operators!

= is a comparison operator that operates on numbers and strings. When comparing strings, the comparison operator compares whole strings.

LIKE is a string operator that compares character by character.

To complicate matters, both operators use a collation which can have important effects on the result of the comparison.

Motivating Example

Let us first identify an example where these operators produce obviously different results. Allow me to quote from the MySQL manual:

Per the SQL standard, LIKE performs matching on a per-character basis, thus it can produce results different from the = comparison operator:

mysql> SELECT 'ä' LIKE 'ae' COLLATE latin1_german2_ci;
+-----------------------------------------+
| 'ä' LIKE 'ae' COLLATE latin1_german2_ci |
+-----------------------------------------+
|                                       0 |
+-----------------------------------------+
mysql> SELECT 'ä' = 'ae' COLLATE latin1_german2_ci;
+--------------------------------------+
| 'ä' = 'ae' COLLATE latin1_german2_ci |
+--------------------------------------+
|                                    1 |
+--------------------------------------+

Please note that this page of the MySQL manual is called String Comparison Functions, and = is not discussed, which implies that = is not strictly a string comparison function.

How Does = Work?

The SQL Standard § 8.2 describes how = compares strings:

The comparison of two character strings is determined as follows:

a) If the length in characters of X is not equal to the length
in characters of Y, then the shorter string is effectively
replaced, for the purposes of comparison, with a copy of
itself that has been extended to the length of the longer
string by concatenation on the right of one or more pad
characters, where the pad character is chosen based on CS. If
CS has the NO PAD attribute, then the pad character is an
implementation-dependent character different from any
character in the character set of X and Y that collates less
than any string under CS. Otherwise, the pad character is a
<space>.

b) The result of the comparison of X and Y is given by the
collating sequence CS.

c) Depending on the collating sequence, two strings may
compare as equal even if they are of different lengths or
contain different sequences of characters. When the operations
MAX, MIN, DISTINCT, references to a grouping column, and the
UNION, EXCEPT, and INTERSECT operators refer to character
strings, the specific value selected by these operations from
a set of such equal values is implementation-dependent.

(Emphasis added.)

What does this mean? It means that when comparing strings, the = operator is just a thin wrapper around the current collation. A collation is a library that has various rules for comparing strings. Here is an example of a binary collation from MySQL:

static int my_strnncoll_binary(const CHARSET_INFO *cs __attribute__((unused)),
                               const uchar *s, size_t slen,
                               const uchar *t, size_t tlen,
                               my_bool t_is_prefix)
{
  size_t len= MY_MIN(slen,tlen);
  int cmp= memcmp(s,t,len);
  return cmp ? cmp : (int)((t_is_prefix ? len : slen) - tlen);
}

This particular collation happens to compare byte-by-byte (which is why it's called "binary" — it doesn't give any special meaning to strings). Other collations may provide more advanced comparisons.

For example, here is a UTF-8 collation that supports case-insensitive comparisons. The code is too long to paste here, but go to that link and read the body of my_strnncollsp_utf8mb4(). This collation can process multiple bytes at a time and it can apply various transforms (such as case insensitive comparison). The = operator is completely abstracted from the vagaries of the collation.

How Does LIKE Work?

The SQL Standard § 8.5 describes how LIKE compares strings:

The <predicate>

M LIKE P

is true if there exists a partitioning of M into substrings
such that:

i) A substring of M is a sequence of 0 or more contiguous
<character representation>s of M and each <character
representation> of M is part of exactly one substring.

ii) If the i-th substring specifier of P is an arbitrary
character specifier, the i-th substring of M is any single
<character representation>.

iii) If the i-th substring specifier of P is an arbitrary string
specifier, then the i-th substring of M is any sequence of
0 or more <character representation>s.

iv) If the i-th substring specifier of P is neither an
arbitrary character specifier nor an arbitrary string specifier,
then the i-th substring of M is equal to that substring
specifier according to the collating sequence of
the <like predicate>, without the appending of <space>
characters to M, and has the same length as that substring
specifier.

v) The number of substrings of M is equal to the number of
substring specifiers of P.

(Emphasis added.)

This is pretty wordy, so let's break it down. Items ii and iii refer to the wildcards _ and %, respectively. If P does not contain any wildcards, then only item iv applies. This is the case of interest posed by the OP.

In this case, it compares each "substring" (individual characters) in M against each substring in P using the current collation.
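The difference between the two rules can be sketched in a few lines of Python. This is a deliberately simplified model, not MySQL's implementation: the expansion table is a hypothetical stand-in for a collation such as latin1_german2_ci, and = is modelled with space padding as in item (a) of § 8.2.

```python
# Hypothetical expansion table standing in for a collation that
# treats 'ä' as equal to 'ae' (and similarly for 'ö' and 'ü').
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}

def collation_key(s):
    # Case-insensitive key with expansions applied.
    return "".join(EXPANSIONS.get(ch, ch) for ch in s.lower())

def equals(x, y):
    # Model of '=': padding the shorter string with spaces is equivalent to
    # ignoring trailing spaces, after which the strings are compared as a whole.
    return collation_key(x).rstrip(" ") == collation_key(y).rstrip(" ")

def like_no_wildcards(m, p):
    # Model of LIKE when P has no wildcards: no padding, same length required,
    # and each character of M must match the character of P at that position.
    if len(m) != len(p):
        return False
    return all(collation_key(a) == collation_key(b) for a, b in zip(m, p))

print(equals("ä", "ae"), like_no_wildcards("ä", "ae"))            # True False
print(equals("abc", "abc  "), like_no_wildcards("abc", "abc  "))  # True False
print(equals("ABC", "abc"), like_no_wildcards("ABC", "abc"))      # True True
```

The first line mirrors the MySQL manual example quoted earlier; the second shows the padding difference; the third shows that both operators still agree when only case differs under a case-insensitive key.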

Conclusions

The bottom line is that when comparing strings, = compares the entire string while LIKE compares one character at a time. Both comparisons use the current collation. This difference leads to different results in some cases, as evidenced in the first example in this post.

Which one should you use? Nobody can tell you that — you need to use the one that's correct for your use case. Don't prematurely optimize by switching comparison operators.

Entity Framework core - Contains is case sensitive or case insensitive?

It used to be the case for older versions of EF Core. Now string.Contains is case-sensitive, and for SQLite, for example, it maps to the SQLite function instr() (I don't know about PostgreSQL).

If you want to compare strings in a case-insensitive way, you can use DbFunctions to do the job.

context.Counties.Where(x => EF.Functions.Like(x.Name, $"%{keyword}%")).ToList();

UPDATE to @Gert:

Part of the assumption in the question is incorrect. string.Contains does NOT convert into a LIKE expression, even though it USED to be the case in EF Core versions <= 1.0 (I think).

  • In SQL Server, string.Contains converts into CHARINDEX(); in Oracle and SQLite it converts into instr(). These are case-sensitive by default UNLESS the database or column collation is defined otherwise (again, I don't know about PostgreSQL).
  • In all cases, EF.Functions.Like() converts into a SQL LIKE expression, which is case-insensitive by default unless the database or column collation is defined otherwise.

So yes, it all comes down to collation, but (correct me if I'm wrong) the code can influence whether the search is case-sensitive or case-insensitive, depending on which of the above methods you use.

Now, I might not be completely up to date, but I don't think EF Core migrations deal with database collation naturally, and unless you've already created the table manually you will end up with the default collation (case-sensitive for SQLite, and I honestly don't know for the others).

Getting back to the original question, you have at least 2 options to perform this case-insensitive search, if not 3 in a future release:

  1. Specify the column collation on creation using DbContext.OnModelCreating() using this trick
  2. Replace your string.Contains with EF.Functions.Like()
  3. Or wait for a promising feature still in discussion: the EF.Functions.Collate() function
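The SQLite behaviour described above can be checked directly against the database, without EF Core in the picture. The following sketch uses Python's sqlite3 module to run the two kinds of SQL that the answer says are generated (instr() and LIKE); the Counties table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Counties (Name TEXT)")
conn.executemany(
    "INSERT INTO Counties VALUES (?)", [("Yorkshire",), ("New York",), ("Kent",)]
)
keyword = "york"

# instr() is case-sensitive, so only names containing lower case 'york' match.
with_instr = conn.execute(
    "SELECT Name FROM Counties WHERE instr(Name, ?) > 0 ORDER BY Name", (keyword,)
).fetchall()

# LIKE is case-insensitive for ASCII letters by default in SQLite.
with_like = conn.execute(
    "SELECT Name FROM Counties WHERE Name LIKE ? ORDER BY Name", (f"%{keyword}%",)
).fetchall()

print(with_instr)
print(with_like)
conn.close()
```

With these rows the instr() query returns nothing, while the LIKE query returns both 'New York' and 'Yorkshire'.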

