Understanding logic in CaseInsensitiveComparator
Normally, we would expect to convert the case once, compare, and be done with it. However, the code converts the case twice, and the reason is stated in a comment in a different method, public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len):
Unfortunately, conversion to uppercase does not work properly for the Georgian alphabet, which has strange rules about case conversion. So we need to make one last check before exiting.
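The two-step check described above can be sketched as a standalone comparator. This is only an illustration: the class name TwoStepIgnoreCaseComparator is hypothetical, and the code is a simplified re-creation of the idea rather than the JDK source.

```java
import java.util.Comparator;

// Illustrative sketch only; the class name is hypothetical and this is not the JDK source.
final class TwoStepIgnoreCaseComparator implements Comparator<String> {

    @Override
    public int compare(String s1, String s2) {
        int n = Math.min(s1.length(), s2.length());
        for (int i = 0; i < n; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 == c2) {
                continue; // identical characters, nothing to convert
            }
            // First conversion: compare the uppercase forms.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Second conversion: lowercase the uppercase forms and compare again.
            char l1 = Character.toLowerCase(u1);
            char l2 = Character.toLowerCase(u2);
            if (l1 != l2) {
                return l1 - l2; // first position that differs under both conversions
            }
        }
        // All shared positions matched; the shorter string sorts first.
        return s1.length() - s2.length();
    }

    public static void main(String[] args) {
        Comparator<String> cmp = new TwoStepIgnoreCaseComparator();
        System.out.println(cmp.compare("Hello", "hELLO"));      // 0
        System.out.println(cmp.compare("apple", "Banana") < 0); // true
    }
}
```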
Appendix
The code of regionMatches has a few differences from the code in CaseInsensitiveComparator, but essentially does the same thing. The full code of the method is quoted below for cross-checking:
public boolean regionMatches(boolean ignoreCase, int toffset,
        String other, int ooffset, int len) {
    char ta[] = value;
    int to = offset + toffset;
    char pa[] = other.value;
    int po = other.offset + ooffset;
    // Note: toffset, ooffset, or len might be near -1>>>1.
    if ((ooffset < 0) || (toffset < 0) || (toffset > (long)count - len) ||
            (ooffset > (long)other.count - len)) {
        return false;
    }
    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion. So we need to make one last check before
            // exiting.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }
    return true;
}
Curious about the implementation of CaseInsensitiveComparator
There are Unicode characters that differ in lowercase but share the same uppercase form. For example, the Greek letter sigma has two lowercase forms (σ, and ς, which is used only at the end of a word) but only one uppercase form (Σ).
I could not find any examples of the reverse, but if such a situation arises in the future, the current Java implementation is already prepared for it. Your version of the Comparator would definitely handle the Sigma case correctly.
You can find more information in the Case Mapping FAQ on the Unicode website.
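The sigma case is easy to check directly. The small class below is a hypothetical demo (the class and method names are made up for illustration); it uses Unicode escapes so the result does not depend on the source file encoding.

```java
// Hypothetical demo class; names are for illustration only.
final class SigmaCaseDemo {
    static final char SIGMA = '\u03C3';       // σ, regular lowercase sigma
    static final char FINAL_SIGMA = '\u03C2'; // ς, word-final lowercase sigma

    static boolean sameUpper(char a, char b) {
        return Character.toUpperCase(a) == Character.toUpperCase(b);
    }

    static boolean sameLower(char a, char b) {
        return Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    public static void main(String[] args) {
        // Both lowercase forms map to the single uppercase form Σ (U+03A3).
        System.out.println(sameUpper(SIGMA, FINAL_SIGMA)); // true
        // Lowercasing alone leaves them different.
        System.out.println(sameLower(SIGMA, FINAL_SIGMA)); // false
        // The uppercase step is what makes the strings compare as equal.
        System.out.println("\u03C3".equalsIgnoreCase("\u03C2")); // true
    }
}
```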
Why CaseInsensitiveComparator does not uppercase and lowercase comparisons
From Unicode Standard:
In addition, because of the vagaries of natural language, there are
situations where two different Unicode characters have the same
uppercase or lowercase
So sometimes the lowercase forms of two letters are the same while their uppercase forms differ, which is why both comparisons are made.
Also check the source which says:
Unfortunately, conversion to uppercase does not work properly
for the Georgian alphabet, which has strange rules about case
conversion. So we need to make one last check before exiting.
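One pair where only the lowercase step produces a match, chosen here as an illustration and not taken from the quoted source, is the Kelvin sign (U+212A) and the ordinary letter k. The class below is a hypothetical sketch; the names are made up.

```java
// Hypothetical demo class; the Kelvin sign is an example chosen for illustration.
final class KelvinSignDemo {
    static final char KELVIN = '\u212A'; // K, KELVIN SIGN

    static boolean matchesAfterUpper(char a, char b) {
        return Character.toUpperCase(a) == Character.toUpperCase(b);
    }

    static boolean matchesAfterUpperThenLower(char a, char b) {
        char u1 = Character.toUpperCase(a);
        char u2 = Character.toUpperCase(b);
        return Character.toLowerCase(u1) == Character.toLowerCase(u2);
    }

    public static void main(String[] args) {
        // Uppercasing leaves the Kelvin sign unchanged, while 'k' becomes 'K' (U+004B).
        System.out.println(matchesAfterUpper(KELVIN, 'k'));          // false
        // Lowercasing both results yields 'k' on each side.
        System.out.println(matchesAfterUpperThenLower(KELVIN, 'k')); // true
        System.out.println("\u212A".equalsIgnoreCase("k"));          // true
    }
}
```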
Java: Why String.compareToIgnoreCase() uses both Character.toUpperCase() and Character.toLowerCase()?
Here's an example using Turkish i's:
System.out.println(Character.toUpperCase('i') == Character.toUpperCase('İ'));
System.out.println(Character.toLowerCase('i') == Character.toLowerCase('İ'));
The first line prints false; the second prints true.
Java String ignore case implementation
Some characters exist only in lower case, some only exist in upper case. For example, in German we have the character "ß", which is lower case. It has no single-character upper case mapping in Java: Character.toUpperCase('ß') returns 'ß' unchanged.
I assume that the same can happen in the opposite direction in other languages.
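A quick way to see how Java treats "ß" is sketched below. The class name is hypothetical; the calls are standard java.lang methods, and Locale.ROOT is used so the result does not depend on the default locale.

```java
import java.util.Locale;

// Hypothetical demo class; names are for illustration only.
final class SharpSDemo {
    static final char SHARP_S = '\u00DF'; // ß

    public static void main(String[] args) {
        // The char-level mapping leaves ß unchanged.
        System.out.println(Character.toUpperCase(SHARP_S) == SHARP_S); // true
        // The String-level mapping expands it to two characters.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));         // SS
        // A char-by-char comparison cannot match strings of different length.
        System.out.println("\u00DF".equalsIgnoreCase("SS"));           // false
    }
}
```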
Case-insensitive Comparator breaks my TreeMap
It happens because TreeMap considers elements equal if a.compareTo(b) == 0. It's documented in the JavaDoc for TreeMap:
Note that the ordering maintained by a tree map, like any sorted map, and whether or not an explicit comparator is provided, must be consistent with equals if this sorted map is to correctly implement the Map interface. (See Comparable or Comparator for a precise definition of consistent with equals.) This is so because the Map interface is defined in terms of the equals operation, but a sorted map performs all key comparisons using its compareTo (or compare) method, so two keys that are deemed equal by this method are, from the standpoint of the sorted map, equal. The behavior of a sorted map is well-defined even if its ordering is inconsistent with equals; it just fails to obey the general contract of the Map interface.
Your comparator isn't consistent with equals.
If you want to keep not-equal-but-equal-ignoring-case elements, put a second level of checking into your comparator, to use case-sensitive ordering:
public int compare(String o1, String o2) {
    int cmp = o1.compareToIgnoreCase(o2);
    if (cmp != 0) return cmp;
    return o1.compareTo(o2);
}
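The difference between the two orderings can be seen in a small experiment. This is a sketch under the assumption that the original comparator behaves like String.CASE_INSENSITIVE_ORDER; the class, field, and method names are hypothetical.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical demo class; names are for illustration only.
final class TreeMapCaseDemo {

    // Case-insensitive first, then case-sensitive as a tie-breaker.
    static final Comparator<String> TIE_BREAKING = (o1, o2) -> {
        int cmp = o1.compareToIgnoreCase(o2);
        return cmp != 0 ? cmp : o1.compareTo(o2);
    };

    // Inserts two keys that differ only in case and reports how many survive.
    static int sizeAfterPuts(Comparator<String> comparator) {
        TreeMap<String, Integer> map = new TreeMap<>(comparator);
        map.put("abc", 1);
        map.put("ABC", 2);
        return map.size();
    }

    public static void main(String[] args) {
        // The keys compare as equal, so the second put overwrites the first value.
        System.out.println(sizeAfterPuts(String.CASE_INSENSITIVE_ORDER)); // 1
        // The tie-breaker keeps both keys as distinct entries.
        System.out.println(sizeAfterPuts(TIE_BREAKING));                  // 2
    }
}
```

With the tie-breaking comparator, keys that differ only in case still sort next to each other, so the map stays grouped case-insensitively while keeping every distinct key.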