Using Locales with Java's toLowerCase() and toUpperCase()

I think you should specify a locale explicitly.

For instance, "TITLE".toLowerCase() in a Turkish locale returns
"tıtle", where 'ı' is the LATIN SMALL LETTER DOTLESS I character. To
obtain correct results for locale insensitive strings, use
toLowerCase(Locale.ENGLISH).
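A minimal sketch of the case the Javadoc describes. The class and method names (TurkishTitleDemo, lowerTurkish, lowerEnglish) are made up for illustration, and the locale is passed explicitly so the JVM default is left untouched.

```java
import java.util.Locale;

class TurkishTitleDemo {
    // Explicit Turkish locale; the JVM default locale is not modified.
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    static String lowerEnglish(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        String turkish = lowerTurkish("TITLE");
        String english = lowerEnglish("TITLE");
        // Under Turkish rules the 'I' becomes U+0131 (dotless i),
        // so only the English result equals "title".
        System.out.println("Turkish rules: " + turkish);
        System.out.println("English rules: " + english);
        System.out.println("Equal to \"title\"? " + turkish.equals("title")
                + " / " + english.equals("title"));
    }
}
```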

I am referring to these links as a solution to your problem; they
raise a point to keep in mind in your situation (Turkish).

**FROM THE LINKS**

toLowerCase() respects internationalization (i18n). It performs the
case conversion according to your Locale. When you call
toLowerCase(), toLowerCase(Locale.getDefault()) is called internally.
It is locale sensitive, so you should not build logic on it that
assumes a locale-independent result.

    import java.util.Locale;

    public class ToLocaleTest {
        public static void main(String[] args) {
            // Set Lithuanian as the default locale
            Locale.setDefault(Locale.forLanguageTag("lt"));
            String str = "\u00cc"; // LATIN CAPITAL LETTER I WITH GRAVE
            System.out.println("Before case conversion is " + str
                    + " and length is " + str.length()); // length 1
            String lowerCaseStr = str.toLowerCase();
            // Lithuanian rules yield i + U+0307 (dot above) + U+0300 (grave)
            System.out.println("Lower case is " + lowerCaseStr
                    + " and length is " + lowerCaseStr.length()); // length 3
        }
    }

In the above program, compare the string length before and after
conversion: it is 1 and 3, respectively. Yes, the length of the string
changes with case conversion. Any logic that depends on the string
length will break in this scenario, and a program that works in one
environment may fail when it runs in another. This is a nice catch for
a code review.

To make it safer, you can use the overload
toLowerCase(Locale.ENGLISH) and always override the locale with English.
But then you are not internationalized.
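As a rough sketch of that trade-off, the same string can be lowercased under two explicitly passed locales, without calling Locale.setDefault at all. The class name ExplicitLocaleLength and the helper lowerLength are hypothetical.

```java
import java.util.Locale;

class ExplicitLocaleLength {
    // Length of the lowercased string under the given locale's rules.
    static int lowerLength(String s, Locale locale) {
        return s.toLowerCase(locale).length();
    }

    public static void main(String[] args) {
        String str = "\u00cc"; // LATIN CAPITAL LETTER I WITH GRAVE
        Locale lithuanian = Locale.forLanguageTag("lt");
        // Lithuanian rules expand the character; English rules keep it a single char.
        System.out.println("Lithuanian: " + lowerLength(str, lithuanian));
        System.out.println("English:    " + lowerLength(str, Locale.ENGLISH));
    }
}
```

Passing the locale as an argument keeps the result the same on every machine, at the cost of ignoring the user's language rules.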

So the crux is, toLowerCase() is locale specific.

reference 1

reference 2

reference 3

Dotless-i is a lowercase 'i' without a dot. The uppercase of this character is the usual "I". There is another character, "I with dot". The lowercase of this character is the usual lowercase "i".

Have you noticed the problem? This asymmetrical conversion causes a serious problem in programming. We face this problem mostly in Java applications because of the (IMHO) poor implementation of the toLowerCase and toUpperCase functions.

In Java, the String.toLowerCase() method converts characters to lowercase according to the default locale. This causes problems if your application runs in a Turkish locale, especially if you are using this function for a file name or a URL that must obey a certain character set.
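To illustrate the file name case, here is a small sketch. FileNameMatch and its two helpers are invented names, and the Turkish locale is passed explicitly to stand in for a machine whose default locale is Turkish. It also shows one possible alternative, String.equalsIgnoreCase, which does not take a locale into account.

```java
import java.util.Locale;

class FileNameMatch {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Fragile: the result depends on which locale is passed (or is the default).
    static boolean matchesByLowercasing(String name, String expected, Locale locale) {
        return name.toLowerCase(locale).equals(expected);
    }

    // Locale-independent: equalsIgnoreCase compares char by char without a locale.
    static boolean matchesIgnoringCase(String name, String expected) {
        return name.equalsIgnoreCase(expected);
    }

    public static void main(String[] args) {
        // Turkish rules turn 'I' into dotless i, so the first match fails.
        System.out.println(matchesByLowercasing("INDEX.HTML", "index.html", TURKISH));
        System.out.println(matchesByLowercasing("INDEX.HTML", "index.html", Locale.ROOT));
        System.out.println(matchesIgnoringCase("INDEX.HTML", "index.html"));
    }
}
```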

I have blogged about two serious examples before: The compile errors with Script libraries with "i" in their names and XSP Manager's fault if an XPage is in a database with "I" in its name.

There is a long history, as I said. For instance, in some R7 version, the router was unable to send a message to a recipient whose name started with "I". Message reporting agents did not run in a Turkish locale until R8. Nobody with a Turkish locale could install Lotus Notes 8.5.1 (it's real!). The list goes on...

There are almost no beta testers from Turkey, and customers don't open PMRs for these problems. So these problems never reach top priority for the development teams.

Even the Java team has added a special warning to the latest documentation:

This method is locale sensitive, and may produce unexpected results if
used for strings that are intended to be interpreted locale
independently. Examples are programming language identifiers, protocol
keys, and HTML tags. For instance, "TITLE".toLowerCase() in a Turkish
locale returns "tıtle", where 'ı' is the LATIN SMALL LETTER DOTLESS I
character. To obtain correct results for locale insensitive strings,
use toLowerCase(Locale.ENGLISH).

Why Java Character.toUpperCase/toLowerCase has no Locale parameter like String.toUpperCase/toLowerCase

From the Character#toUpperCase(int) Javadoc,

In general, String.toUpperCase() should be used to map characters to uppercase. String case mapping methods have several benefits over Character case mapping methods. String case mapping methods can perform locale-sensitive mappings, context-sensitive mappings, and 1:M character mappings, whereas the Character case mapping methods cannot.
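A short sketch of two of the differences the Javadoc lists, the locale-sensitive mapping and the 1:M mapping. The class name CharVersusString and its helpers are hypothetical.

```java
import java.util.Locale;

class CharVersusString {
    // Character mapping: one char in, one char out, no locale.
    static char upperChar(char c) {
        return Character.toUpperCase(c);
    }

    // String mapping: may use locale rules and may change the length.
    static String upperString(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");

        // Locale-sensitive mapping: 'i' becomes 'I' via Character,
        // but U+0130 (capital I with dot above) via String under Turkish rules.
        System.out.println((int) upperChar('i'));
        System.out.println((int) upperString("i", turkish).charAt(0));

        // 1:M mapping: the ligature U+FB01 uppercases to the two letters "FI",
        // which cannot fit in the single char that Character.toUpperCase returns.
        System.out.println(upperChar('\uFB01') == '\uFB01');
        System.out.println(upperString("\uFB01", Locale.ROOT));
    }
}
```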

So the answer is your second example (String.toUpperCase).

Purpose of String.toLowerCase() with default locale?

Several blog posts suggest that default locales and charsets indeed were a design mistake and have no meaningful use.

Scala vs Java toUpperCase/toLowerCase

The standard approach is close to your method 2, but much simpler. In shared code you just call

Platform.toUpperLocaleInsensitive(string)

which has different implementations on JVM and JS:

    // JVM
    import java.util.Locale

    object Platform {
      def toUpperLocaleInsensitive(s: String): String = s.toUpperCase(Locale.ROOT)

      // other methods with different implementations
    }

    // JS
    object Platform {
      def toUpperLocaleInsensitive(s: String): String = s.toUpperCase()

      // other methods with different implementations
    }

See the description of a similar case in Hands-on Scala.js.

This works because shared code doesn't need to compile by itself, only together with platform-specific code.

Which Locale should I specify when I call String#toLowerCase?

Yes, Locale.ENGLISH is a safe choice for case operations on things like programming language identifiers and URL parts, since it doesn't involve any special casing rules and all 7-bit ASCII characters case-convert to 7-bit ASCII characters in the ENGLISH locale.
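One way to check this claim yourself is to scan the whole 7-bit range. This is only a sketch; AsciiCaseScan and escapees are names I made up.

```java
import java.util.Locale;

class AsciiCaseScan {
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    // Collects every 7-bit ASCII character whose upper- or lowercase form
    // leaves the ASCII range under the given locale.
    static String escapees(Locale locale) {
        StringBuilder out = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            String s = String.valueOf(c);
            if (!isAscii(s.toLowerCase(locale)) || !isAscii(s.toUpperCase(locale))) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("ENGLISH: [" + escapees(Locale.ENGLISH) + "]");
        System.out.println("Turkish: [" + escapees(Locale.forLanguageTag("tr")) + "]");
    }
}
```

With ENGLISH the list comes back empty, while the Turkish locale reports 'I' and 'i'.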

That is not true for all other locales. In Turkish, the 'I' and 'i' characters are not case-converted to one another.

"Dotted and dotless I" explains:

The Turkish alphabet, which is a variant of the Latin alphabet, includes two distinct versions of the letter I, one dotted and the other dotless.

In Unicode, U+0131 is a lower case letter dotless i (ı). U+0130 (İ) is capital i with dot. ISO-8859-9 has them at positions 0xFD and 0xDD respectively. In normal typography, when lower case i is combined with other diacritics, the dot is generally removed before the diacritic is added; however, Unicode still lists the equivalent combining sequences as including the dotted i, since logically it is the normal dotted i character that is being modified.

Most Unicode software uppercases ı to I and lowercases İ to i, but, unless specifically set up for Turkish, it lowercases I to i and uppercases i to I. Thus uppercasing then lowercasing, or vice versa, changes the letters.
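The round trip described above can be sketched in Java like this (RoundTrip and upperThenLower are illustrative names, not part of any library):

```java
import java.util.Locale;

class RoundTrip {
    // Uppercase then lowercase, both under the same locale.
    static String upperThenLower(String s, Locale locale) {
        return s.toUpperCase(locale).toLowerCase(locale);
    }

    public static void main(String[] args) {
        String dotless = "\u0131"; // LATIN SMALL LETTER DOTLESS I
        Locale turkish = Locale.forLanguageTag("tr");
        // Without Turkish rules the dotless i comes back as a plain 'i'.
        System.out.println(upperThenLower(dotless, Locale.ROOT).equals(dotless));
        // With Turkish rules the round trip preserves the letter.
        System.out.println(upperThenLower(dotless, turkish).equals(dotless));
    }
}
```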

The list of special exceptions is maintained at http://unicode.org/Public/UNIDATA/SpecialCasing.txt

# ================================================================================

# Turkish and Azeri

# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.

0130; 0069; 0130; 0130; tr; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; 0069; 0130; 0130; az; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# When lowercasing, remove dot_above in the sequence I + dot_above, which will turn into i.
# This matches the behavior of the canonically equivalent I-dot_above

0307; ; 0307; 0307; tr After_I; # COMBINING DOT ABOVE
0307; ; 0307; 0307; az After_I; # COMBINING DOT ABOVE

...
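Each data row in that file reads as code; lowercase; titlecase; uppercase; condition. As a sketch of how the first Turkish and Azeri rows show up in Java, the snippet below prints the code points produced for U+0130 under both locales. SpecialCasingRow and codePoints are hypothetical names.

```java
import java.util.Locale;

class SpecialCasingRow {
    // Formats each code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format(Locale.ROOT, "U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String capitalDottedI = "\u0130"; // LATIN CAPITAL LETTER I WITH DOT ABOVE
        for (String tag : new String[] {"tr", "az"}) {
            Locale locale = Locale.forLanguageTag(tag);
            System.out.println(tag
                    + " lower=" + codePoints(capitalDottedI.toLowerCase(locale))
                    + " upper=" + codePoints(capitalDottedI.toUpperCase(locale)));
        }
    }
}
```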

toLowerCase() method in Java when used with Locale does not produce the exact result

Different languages have different rules to transform to upper- or lower-case.

For example, in German, the lowercase ß becomes two uppercase S, so the word "straße" (a street), which is 6 characters long, becomes "STRASSE", which is 7 characters long.

This is why your upper-cased and lower-cased strings have different lengths.
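Here is the German example as a small runnable sketch. The class name SharpS and the helper upperGerman are mine.

```java
import java.util.Locale;

class SharpS {
    static String upperGerman(String s) {
        return s.toUpperCase(Locale.GERMAN);
    }

    public static void main(String[] args) {
        String lower = "stra\u00dfe"; // six chars, the fifth is U+00DF (sharp s)
        String upper = upperGerman(lower);
        // The sharp s expands to "SS", so the uppercase string is one char longer.
        System.out.println(lower.length() + " -> " + upper.length());
        System.out.println(upper);
    }
}
```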

I wrote about this in one of my Java Quizzes:
http://thecodersbreakfast.net/index.php?post/2010/09/24/Java-Quiz-42-%3A-A-string-too-far

Using toUpperCase with Correct Locale

Double-check that the bytecode you are analyzing is actually your most recent build output, and that you're looking at the same line forbiddenapis is :). This looks to me like your source, bytecode, and analysis are falling out of sync: the relevant rule shouldn't flag an error on String.toUpperCase(Locale).

Disclaimer: I haven't used forbiddenapis myself; I wrote this answer based on the repo and on a blog post I found.
