String.ToLower() and String.ToLowerInvariant()

string.ToLower() and string.ToLowerInvariant()

Depending on the current culture, ToLower might produce a culture-specific lowercase letter that you aren't expecting. Under a Turkish culture, for example, it can produce ınfo (without the dot on the i) instead of info, which breaks string comparisons. For that reason, use ToLowerInvariant on any non-language-specific data. ToLower is generally appropriate only when you are dealing with user input that may be in the user's native language or character set.

See this question for an example of this issue:
C#- ToLower() is sometimes removing dot from the letter "I"
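A minimal sketch of the difference described above, assuming the runtime has Turkish culture data available (with invariant globalization mode enabled, tr-TR may be unavailable or may behave like the invariant culture). The class and method names are illustrative only, not part of any library.

```csharp
using System;
using System.Globalization;

public static class LowercaseDemo
{
    // Lowercases using the rules of an explicitly named culture.
    // The result can differ from culture to culture (for example tr-TR).
    public static string LowerWith(string value, string cultureName) =>
        value.ToLower(new CultureInfo(cultureName));

    // Lowercases using the invariant culture.
    // The result does not depend on the current thread culture.
    public static string LowerInvariant(string value) =>
        value.ToLowerInvariant();
}
```

LowercaseDemo.LowerInvariant("INFO") returns "info" regardless of the machine's culture settings. With Turkish culture data present, LowercaseDemo.LowerWith("INFO", "tr-TR") is expected to return "ınfo" with a dotless ı, which is why the two results should not be compared against each other.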

ToLower vs ToLowerInvariant

Try the Turkish dotted İ:

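using System;
using System.Globalization;

var culture = new CultureInfo("tr-TR");

string test = "İ"; // U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE

// Compare Turkish lowercasing with invariant lowercasing.
if (test.ToLower(culture) == test.ToLowerInvariant())
    Console.WriteLine("Same");
else
    Console.WriteLine("Different"); // Prints this!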

How does String.ToLowerInvariant() determine to what string/character it must convert?

According to the Unicode standard, the sources for Case Mapping Information are

UnicodeData.txt: Contains the case mappings that map to a single character. These do not increase the length of strings, nor do they contain context-dependent mappings.

SpecialCasing.txt: Contains additional case mappings that map to more than one character, such as “ß” to “SS”. Also contains context-dependent mappings, with flags to distinguish them from the normal mappings, as well as some locale-dependent mappings.

In UnicodeData.txt, you'll find:

0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
010C;LATIN CAPITAL LETTER C WITH CARON;Lu;0;L;0043 030C;;;;N;LATIN CAPITAL LETTER C HACEK;;;010D;
010D;LATIN SMALL LETTER C WITH CARON;Ll;0;L;0063 030C;;;;N;LATIN SMALL LETTER C HACEK;;010C;;010C

(The last three columns contain the simple uppercase, lowercase and titlecase mapping.)
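To make the column layout concrete, here is a small sketch that pulls the last three fields out of one UnicodeData.txt record. It assumes the record has the 15 semicolon-separated fields seen in the lines above; the class and method names are hypothetical.

```csharp
using System;

public static class UnicodeDataLine
{
    // Returns the simple uppercase, lowercase and titlecase mappings
    // (fields 12, 13 and 14, counting from 0) of one UnicodeData.txt record.
    // An empty string means the record lists no mapping for that case.
    public static (string Upper, string Lower, string Title) SimpleCaseMappings(string record)
    {
        string[] fields = record.Split(';');
        if (fields.Length != 15)
            throw new FormatException("Expected 15 semicolon-separated fields.");

        return (fields[12], fields[13], fields[14]);
    }
}
```

For the 0069 record above this yields an uppercase mapping of 0049, no lowercase mapping, and a titlecase mapping of 0049. For the 010C record only the lowercase field (010D) is filled in.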

So, unless there are locale-dependent exceptions, every Unicode implementation will use these mappings, resulting in:

uppercase(i) = I
uppercase(č) = Č
lowercase(Č) = č

The file SpecialCasing.txt says:

The entries in this file are in the following machine-readable format:

<code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>

and

A condition list overrides the normal behavior if all of the listed conditions are true.

For Turkish, it contains the following exception:

# When uppercasing, i turns into a dotted capital I

0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I

So, for the Turkish (and Azeri) languages:

uppercase(i) = İ

There are also some exceptions for Lithuanian. Except for these few exceptions, case mappings should always be the same, regardless of the .NET "culture".
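The SpecialCasing.txt entry format quoted above can be read with a few lines of string handling. This is only a sketch under the assumption that each entry follows the `<code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>` layout exactly; the type and member names are made up for illustration.

```csharp
using System;

public static class SpecialCasingLine
{
    // Splits one SpecialCasing.txt entry into its fields.
    // Conditions is an empty string when the entry has no condition list.
    public static (string Code, string Lower, string Title, string Upper, string Conditions) Parse(string line)
    {
        // Everything after '#' is a comment.
        int hash = line.IndexOf('#');
        string data = hash >= 0 ? line.Substring(0, hash) : line;

        // Each field ends with ';', so the final piece is what follows the last ';'
        // (normally just whitespace).
        string[] parts = data.Split(';');
        if (parts.Length < 5)
            throw new FormatException("Expected at least four ';'-terminated fields.");

        // A sixth piece means the optional condition list is present.
        string conditions = parts.Length > 5 ? parts[4].Trim() : "";

        return (parts[0].Trim(), parts[1].Trim(), parts[2].Trim(), parts[3].Trim(), conditions);
    }
}
```

Parsing the Turkish entry shown earlier gives Code 0069, Lower 0069, Title 0130, Upper 0130 and Conditions "tr", meaning the mapping to 0130 applies only when the language condition is met.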

What is wrong with ToLowerInvariant()?

Google gives a hint pointing to CA1308: Normalize strings to uppercase

It says:

Strings should be normalized to uppercase. A small group of characters, when they are converted to lowercase, cannot make a round trip. To make a round trip means to convert the characters from one locale to another locale that represents character data differently, and then to accurately retrieve the original characters from the converted characters.

So, yes - ToUpper is more reliable than ToLower.
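In practice, following CA1308 could look like the sketch below, which normalizes non-linguistic keys (identifiers, file extensions, protocol tokens) with ToUpperInvariant before storing or comparing them. KeyNormalizer is a hypothetical helper, not something from the framework.

```csharp
using System;

public static class KeyNormalizer
{
    // Normalizes a non-linguistic key to uppercase using the invariant culture,
    // so the result does not depend on the current thread culture.
    public static string Normalize(string key)
    {
        if (key == null)
            throw new ArgumentNullException(nameof(key));

        return key.ToUpperInvariant();
    }
}
```

With this, KeyNormalizer.Normalize("Info") and KeyNormalizer.Normalize("INFO") produce the same key. If all you need is a one-off comparison rather than a stored key, string.Equals with StringComparison.OrdinalIgnoreCase avoids allocating normalized copies at all.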

In the future I suggest googling first - I do that for all those FxCop warnings I get thrown around ;) Helps a lot to read the corresponding documentation ;)

What is the correct way to compare char ignoring case?

It depends on what you mean by "work for all cultures". Would you want "i" and "I" to be equal even in Turkey?

You could use:

bool equal = char.ToUpperInvariant(x) == char.ToUpperInvariant(y);

... but I'm not sure whether that "works" according to all cultures by your understanding of "works".

Of course you could convert both characters to strings and then perform whatever comparison you want on the strings. Somewhat less efficient, but it does give you all the range of comparisons available in the framework:

bool equal = x.ToString().Equals(y.ToString(),
    StringComparison.InvariantCultureIgnoreCase);

For surrogate pairs, a Comparer<char> isn't going to be feasible anyway, because you don't have a single char. You could create a Comparer<int> though.
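One way to sketch the code-point idea is an equality comparer over int values, where each int is a Unicode code point. This is an assumption-laden illustration: it uses IEqualityComparer<int> rather than Comparer<int>, it picks ordinal case-insensitive comparison (a different rule could be swapped in), and the class name is hypothetical. How characters outside the Basic Multilingual Plane are case-folded may vary between runtime versions.

```csharp
using System;
using System.Collections.Generic;

public sealed class CodePointIgnoreCaseComparer : IEqualityComparer<int>
{
    // Compares two Unicode code points, ignoring case, by turning each into
    // a string (one or two UTF-16 chars) and using an ordinal ignore-case comparison.
    // char.ConvertFromUtf32 throws ArgumentOutOfRangeException for invalid code points.
    public bool Equals(int x, int y) =>
        string.Equals(char.ConvertFromUtf32(x), char.ConvertFromUtf32(y),
                      StringComparison.OrdinalIgnoreCase);

    // Uses the matching ignore-case string hash so that code points
    // considered equal also hash equally.
    public int GetHashCode(int codePoint) =>
        StringComparer.OrdinalIgnoreCase.GetHashCode(char.ConvertFromUtf32(codePoint));
}
```

A code point can be obtained from a string with char.ConvertToUtf32(text, index), which handles a surrogate pair at that index as a single value.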

.NET Core / .NET Standard: string.ToLower() has no culture parameter

It looks like the capability is there, just in a more roundabout way. Instead of:

string output = input.ToLower(culture);

use

string output = culture.TextInfo.ToLower(input);

Also note that the overload has been added in netstandard2.0. The implementation is basically the code above.

C#- ToLower() is sometimes removing dot from the letter I

Try using String.ToLowerInvariant().

Difference between string.ToLower and TextInfo.ToLower

There is none.

string.ToLower calls TextInfo.ToLower behind the scenes.

From String.cs:

// Creates a copy of this string in lower case.
public String ToLower() {
    return this.ToLower(CultureInfo.CurrentCulture);
}

// Creates a copy of this string in lower case. The culture is set by culture.
public String ToLower(CultureInfo culture) {
    if (culture==null) {
        throw new ArgumentNullException("culture");
    }
    return culture.TextInfo.ToLower(this);
}

