Understanding Logic in CaseInsensitiveComparator

Understanding logic in CaseInsensitiveComparator

Normally, we would expect to convert the case once, compare, and be done with it. However, the code converts the case twice, and the reason is stated in a comment on a different method, public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len):

Unfortunately, conversion to uppercase does not work properly for the Georgian alphabet, which has strange rules about case conversion. So we need to make one last check before exiting.
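The two-step check can be sketched as a standalone comparison method. This is a minimal illustration modeled on the logic described above, not a copy of the JDK source; the class name TwoStepCompareSketch and the method name compareIgnoreCase are made up for the example.

```java
// Illustrative sketch only: class and method names are hypothetical.
class TwoStepCompareSketch {

    // Compares two strings character by character, ignoring case.
    static int compareIgnoreCase(String s1, String s2) {
        int n1 = s1.length();
        int n2 = s2.length();
        int min = Math.min(n1, n2);
        for (int i = 0; i < min; i++) {
            char c1 = s1.charAt(i);
            char c2 = s2.charAt(i);
            if (c1 == c2) {
                continue;
            }
            // First attempt: compare the uppercase forms.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Second attempt: lowercase the uppercased values, which
            // catches pairs that only meet in lowercase.
            char l1 = Character.toLowerCase(u1);
            char l2 = Character.toLowerCase(u2);
            if (l1 != l2) {
                return l1 - l2;
            }
        }
        // All shared positions matched, so the shorter string sorts first.
        return n1 - n2;
    }

    public static void main(String[] args) {
        System.out.println(compareIgnoreCase("Hello", "hELLO"));
        System.out.println(compareIgnoreCase("apple", "Banana") < 0);
    }
}
```

With these inputs the first call should return 0 and the second comparison should be true, since 'a' sorts before 'b' once case is ignored.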

Appendix

The code of regionMatches differs in a few details from the code in CaseInsensitiveComparator, but it essentially does the same thing. The full code of the method is quoted below for cross-checking:

public boolean regionMatches(boolean ignoreCase, int toffset,
                       String other, int ooffset, int len) {
    char ta[] = value;
    int to = offset + toffset;
    char pa[] = other.value;
    int po = other.offset + ooffset;
    // Note: toffset, ooffset, or len might be near -1>>>1.
    if ((ooffset < 0) || (toffset < 0) || (toffset > (long)count - len) ||
            (ooffset > (long)other.count - len)) {
        return false;
    }
    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion.  So we need to make one last check before
            // exiting.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }
    return true;
}
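As a usage illustration of the method quoted above, here is a small sketch that calls regionMatches with ignoreCase set to true. The helper endsWithIgnoreCase and the class name are hypothetical, introduced only for this example; the sketch relies on the bounds check returning false when the computed offset is negative.

```java
// Illustrative sketch only: the helper below is hypothetical.
class RegionMatchesDemo {

    // Returns true if text ends with suffix, ignoring case.
    // A suffix longer than text yields a negative offset, which
    // regionMatches rejects by returning false.
    static boolean endsWithIgnoreCase(String text, String suffix) {
        int start = text.length() - suffix.length();
        return text.regionMatches(true, start, suffix, 0, suffix.length());
    }

    public static void main(String[] args) {
        System.out.println(endsWithIgnoreCase("report.PDF", ".pdf"));
        System.out.println(endsWithIgnoreCase("report.PDF", ".txt"));
        System.out.println(endsWithIgnoreCase("a", ".pdf"));
        // Same region, but case-sensitive this time.
        System.out.println("report.PDF".regionMatches(false, 6, ".pdf", 0, 4));
    }
}
```

Only the first call is expected to return true; the case-sensitive call on the last line fails because "PDF" and "pdf" differ in case.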

Curious about the implementation of CaseInsensitiveComparator

There are Unicode characters which are different in lowercase, but have the same uppercase form. For example the Greek letter Sigma - it has two lowercase forms (σ, and ς which is only used at the end of the word), but only one uppercase form (Σ).

I could not find any examples of the reverse, but if such a situation happened in the future, the current Java implementation is already prepared for this. Your version of the Comparator would definitely handle the Sigma case correctly.
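The Sigma case can be checked directly with the Character methods. The sketch below is only an illustration; the class and helper names are made up, and the characters are written as Unicode escapes so the source does not depend on file encoding.

```java
// Illustrative sketch only: class and helper names are hypothetical.
class SigmaDemo {

    static final char SMALL_SIGMA = '\u03C3';       // U+03C3, regular lowercase sigma
    static final char SMALL_FINAL_SIGMA = '\u03C2'; // U+03C2, word-final lowercase sigma

    static boolean sameUpper(char a, char b) {
        return Character.toUpperCase(a) == Character.toUpperCase(b);
    }

    static boolean sameLower(char a, char b) {
        return Character.toLowerCase(a) == Character.toLowerCase(b);
    }

    public static void main(String[] args) {
        // Both lowercase forms share the capital sigma (U+03A3).
        System.out.println(sameUpper(SMALL_SIGMA, SMALL_FINAL_SIGMA));
        // Lowercasing leaves each form unchanged, so they stay different.
        System.out.println(sameLower(SMALL_SIGMA, SMALL_FINAL_SIGMA));
        // The uppercase step is what lets the strings compare as equal.
        System.out.println("\u03C3".equalsIgnoreCase("\u03C2"));
    }
}
```

A lowercase-only comparison would treat the two sigmas as different, which is why the uppercase step matters for this pair.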

You can find more information in the Case Mapping FAQ on the Unicode website.

Why does CaseInsensitiveComparator do both uppercase and lowercase comparisons?

From the Unicode Standard:

In addition, because of the vagaries of natural language, there are
situations where two different Unicode characters have the same
uppercase or lowercase

So for some pairs of characters the lowercase forms are the same while the uppercase forms differ (or the other way around), which is why the comparison checks both.

Also check the source which says:

Unfortunately, conversion to uppercase does not work properly
for the Georgian alphabet, which has strange rules about case
conversion. So we need to make one last check before exiting.

Java: Why does String.compareToIgnoreCase() use both Character.toUpperCase() and Character.toLowerCase()?

Here's an example using Turkish i's:

System.out.println(Character.toUpperCase('i') == Character.toUpperCase('İ'));
System.out.println(Character.toLowerCase('i') == Character.toLowerCase('İ'));

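The first line prints false; the second prints true. toUpperCase('i') is 'I', while 'İ' is already uppercase and stays as it is, yet both characters lowercase to 'i'.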
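The same pair can be tried at the String level to see the effect of the extra lowercase check. This is a small sketch with a made-up class name; the dotted capital I is written as the escape \u0130 to avoid encoding issues.

```java
// Illustrative sketch only: the class name is hypothetical.
class TurkishIDemo {

    static final String SMALL_I = "i";
    static final String CAPITAL_DOTTED_I = "\u0130"; // U+0130, capital I with dot above

    // Compares two single-character strings by uppercase form only,
    // to show what a one-step comparison would conclude.
    static boolean upperOnlyEquals(String a, String b) {
        return Character.toUpperCase(a.charAt(0)) == Character.toUpperCase(b.charAt(0));
    }

    public static void main(String[] args) {
        // Uppercase-only: 'I' versus U+0130, so no match.
        System.out.println(upperOnlyEquals(SMALL_I, CAPITAL_DOTTED_I));
        // The library methods also try lowercase, so they do match.
        System.out.println(SMALL_I.equalsIgnoreCase(CAPITAL_DOTTED_I));
        System.out.println(SMALL_I.compareToIgnoreCase(CAPITAL_DOTTED_I));
    }
}
```

The uppercase-only helper reports a mismatch, while equalsIgnoreCase and compareToIgnoreCase treat the two strings as equal because of the final lowercase comparison.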

Java String ignore case implementation

Some characters exist only in lower case, and some exist only in upper case. For example, in German we have the character "ß", which is lower case. It has no single-character upper case mapping in Java: Character.toUpperCase('ß') returns 'ß' unchanged.

I assume that the same can happen in the opposite direction in other languages.
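The behavior for "ß" can be observed with a few calls. The sketch below uses a made-up class name and Unicode escapes (U+00DF for the small sharp s, U+1E9E for the capital sharp s that Unicode defines); it is only meant to show what the per-character methods return.

```java
import java.util.Locale;

// Illustrative sketch only: the class name is hypothetical.
class SharpSDemo {

    static final char SMALL_SHARP_S = '\u00DF';   // U+00DF
    static final char CAPITAL_SHARP_S = '\u1E9E'; // U+1E9E

    // True if the character is returned unchanged by toUpperCase.
    static boolean upperCaseIsUnchanged(char c) {
        return Character.toUpperCase(c) == c;
    }

    public static void main(String[] args) {
        // The per-character mapping leaves U+00DF as it is.
        System.out.println(upperCaseIsUnchanged(SMALL_SHARP_S));
        // The String-level mapping expands it to two letters instead.
        System.out.println("\u00DF".toUpperCase(Locale.ROOT));
        // Lengths differ, so the case-insensitive comparison fails.
        System.out.println("\u00DF".equalsIgnoreCase("SS"));
        // The capital sharp s lowercases to U+00DF, so this pair
        // only meets in lowercase.
        System.out.println(Character.toLowerCase(CAPITAL_SHARP_S) == SMALL_SHARP_S);
    }
}
```

The last line shows a pair that meets only in lowercase, which is the kind of situation the extra toLowerCase check is there for.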

Case-insensitive Comparator breaks my TreeMap

It happens because TreeMap considers two keys equal if the comparison returns 0, that is, a.compareTo(b) == 0, or compare(a, b) == 0 when a comparator is supplied. This is documented in the Javadoc for TreeMap:

Note that the ordering maintained by a tree map, like any sorted map, and whether or not an explicit comparator is provided, must be consistent with equals if this sorted map is to correctly implement the Map interface. (See Comparable or Comparator for a precise definition of consistent with equals.) This is so because the Map interface is defined in terms of the equals operation, but a sorted map performs all key comparisons using its compareTo (or compare) method, so two keys that are deemed equal by this method are, from the standpoint of the sorted map, equal. The behavior of a sorted map is well-defined even if its ordering is inconsistent with equals; it just fails to obey the general contract of the Map interface.

Your comparator isn't consistent with equals.

If you want to keep not-equal-but-equal-ignoring-case elements, put a second level of checking into your comparator, to use case-sensitive ordering:

    public int compare(String o1, String o2) {
        int cmp = o1.compareToIgnoreCase(o2);
        if (cmp != 0) return cmp;

        return o1.compareTo(o2);
    }
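To see the difference between the two orderings, here is a small self-contained sketch. The class name, the fill helper, and the sample keys are made up for illustration; the two-level comparator is built with thenComparing, which should behave like the compare method above.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Illustrative sketch only: class, helper, and sample data are hypothetical.
class TreeMapCaseDemo {

    // Case-insensitive first, then case-sensitive as a tie-breaker.
    static final Comparator<String> TWO_LEVEL =
            String.CASE_INSENSITIVE_ORDER.thenComparing(Comparator.<String>naturalOrder());

    // Inserts the same three keys into a TreeMap using the given ordering.
    static TreeMap<String, Integer> fill(Comparator<String> order) {
        TreeMap<String, Integer> map = new TreeMap<>(order);
        map.put("apple", 1);
        map.put("Apple", 2);
        map.put("banana", 3);
        return map;
    }

    public static void main(String[] args) {
        // "Apple" compares equal to "apple", so it only replaces the value.
        System.out.println(fill(String.CASE_INSENSITIVE_ORDER));
        // With the tie-breaker both keys survive as separate entries.
        System.out.println(fill(TWO_LEVEL));
    }
}
```

With the purely case-insensitive ordering the map ends up with two entries, because the second put overwrites the value stored under "apple". With the tie-breaker it keeps all three.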
