How Do Locales Work in Linux/Posix and What Transformations Are Applied

How do locales work in Linux / POSIX and what transformations are applied?

I have boiled down the problem to an issue with the strcoll() function, which is not related to Unicode normalization. Recap: My minimal example that demonstrates the different behaviour of uniq depending on the current locale was:

$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
ɢ
ɬ
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
ɢ
ɬ

Obviously, if the locale is en_US.UTF-8, uniq treats ɢ and ɬ as duplicates, which shouldn't be the case. I then ran the same commands again under valgrind and investigated both call graphs with kcachegrind.

$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &

The only difference was that the version with LC_COLLATE=en_US.UTF-8 called strcoll() whereas the one with LC_COLLATE=C did not. So I came up with the following minimal example using strcoll():

#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    // The same two UTF-8 encoded characters as in test.txt: ɢ (U+0262) and ɬ (U+026C).
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl; // locale-aware comparison
    std::cout << std::strcmp(s1, s2) << std::endl;  // plain byte-wise comparison

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::cout << std::endl;

    // Only the second byte of each character, which is not valid UTF-8 on its own.
    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}

Output:

ɢ
ɬ
0
-1
-10
-1



0
-1
-10
-1

So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?

Treatment of spaces in the sort command: difference between LC_COLLATE=C and LC_COLLATE=en_US.UTF-8

Punctuation (and whitespace) is essentially ignored when ordering in the en_US locale.

Note that sort can explicitly skip leading whitespace with the -b option,
but that option is tricky to use, so I'd advise checking the result with
sort --debug when using it.
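
To make the difference concrete, here is a small sketch (assuming glibc with the en_US.UTF-8 locale installed): sort's default comparison is essentially strcoll(), and what changes between the two locales is the weight given to spaces and punctuation. The program only prints whatever the system's collation tables return rather than asserting specific values.

#include <clocale>
#include <cstring>
#include <iostream>

static void compare(const char* a, const char* b)
{
    std::cout << "strcoll(\"" << a << "\", \"" << b << "\") = "
              << std::strcoll(a, b) << '\n';
}

int main()
{
    std::setlocale(LC_COLLATE, "C");
    std::cout << "LC_COLLATE=C\n";
    compare("foo bar", "foobar");  // ' ' (0x20) compares below letters byte-wise
    compare("foo-bar", "foobar");  // '-' (0x2D) compares below letters byte-wise

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << "\nLC_COLLATE=en_US.UTF-8\n";
    compare("foo bar", "foobar");  // space/punctuation typically get low-priority weights here
    compare("foo-bar", "foobar");
}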

Unicode normalization in strcoll

To the best of my knowledge, there is no mention of Unicode normalization in the C, C++, or POSIX standards.

Therefore, implementations may leave normalization as something to be done explicitly by the programmer.

More specifically, in glibc, European locales apparently use ISO 14651 as the collation algorithm. The Unicode Collation FAQ implies that ISO 14651 does not do normalization: uniform handling of canonical equivalents is listed as a difference between the UCA and ISO 14651.
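
For example, the precomposed character é (U+00E9) and the decomposed sequence e + U+0301 are canonically equivalent, yet nothing in the standards forces strcoll() to consider them equal. A minimal sketch, assuming a glibc system with the en_US.UTF-8 locale installed:

#include <clocale>
#include <cstring>
#include <iostream>

int main()
{
    const char* precomposed = "\xc3\xa9";   // U+00E9, é in NFC form
    const char* decomposed  = "e\xcc\x81";  // 'e' + U+0301 COMBINING ACUTE ACCENT (NFD form)

    std::setlocale(LC_COLLATE, "en_US.UTF-8");

    // strcmp() compares raw bytes, so it is certainly non-zero here.
    // Whether strcoll() returns 0 depends entirely on the implementation's
    // collation tables; no normalization is required by C, C++, or POSIX.
    std::cout << "strcoll: " << std::strcoll(precomposed, decomposed) << '\n';
    std::cout << "strcmp : " << std::strcmp(precomposed, decomposed) << '\n';
}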

Undesired character encoding translation in transform output

OK, it seems I don't need the jvmArgs line in the above xsltCleanup task if the following two variables are set in ~/.bashrc. Having the jvmArgs line in the task fixed local DEV builds, but it still didn't resolve the behaviour when the Jenkins build (CM team) had LANG set to something other than UTF-8 at the Jenkins system level.

export LANG=en_US.UTF-8
export GRADLE_OPTS="-Dfile.encoding=UTF-8"

Making the above change in my ~/.bashrc and ~/.bash_profile (which sources ~/.bashrc) fully resolved the issue for local builds on Windows (Cygwin) and on a local Linux machine. Setting the same two variables/properties on the Jenkins global settings configuration page did the trick for the Jenkins build as well. One can also set these at the job configuration level.

Arun

How to apply cctype functions on text files with different encoding in c++

Unicode defines "code points" for characters. A code point is a value in the range U+0000 to U+10FFFF, usually handled as a 32-bit integer.

There are several kinds of encodings. ASCII uses only 7 bits, which gives 128 different chars. The 8th bit was used by Microsoft to define another 128 locale-dependent chars in so-called "code pages"; some of these correspond to standard single-byte charsets such as "Latin-1" / "ISO-8859-1". Nowadays MS uses the 2-byte UTF-16 encoding; because 2 bytes are not enough for the whole Unicode set, characters outside the Basic Multilingual Plane are encoded as 4-byte surrogate pairs.

Most used in Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 chars are exactly the same as the ASCII chars, with just one byte per character. To represent a character, UTF-8 can use up to 4 bytes. More info on Wikipedia.

While MS uses UTF-16 for both files and RAM, Linux typically uses UTF-8 for files and UTF-32 (a 4-byte wchar_t) in RAM.
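
As a small illustration of the variable length (the characters chosen are arbitrary), this prints how many bytes each UTF-8 encoded character occupies:

#include <cstring>
#include <iostream>

int main()
{
    const char* samples[] = {
        "A",                // U+0041, plain ASCII: 1 byte
        "\xC3\x96",         // U+00D6 'Ö': 2 bytes
        "\xE2\x82\xAC",     // U+20AC '€': 3 bytes
        "\xF0\x9D\x84\x9E"  // U+1D11E musical G clef: 4 bytes
    };
    for (const char* s : samples)
        std::cout << s << " -> " << std::strlen(s) << " byte(s)\n";
}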

In order to read a file you need to know its encoding. Trying to detect it is a real nightmare and may not succeed. Using std::basic_ios::imbue allows you to set the desired locale for your stream, as in this SO answer.
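
A minimal sketch of that approach, assuming the file is UTF-8 encoded and the en_US.UTF-8 locale is installed ("input.txt" is just a placeholder name):

#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wifstream in;
    // Imbue before any reading is done: the locale's codecvt facet converts
    // the UTF-8 bytes in the file into fixed-width wchar_t values.
    in.imbue(std::locale("en_US.UTF-8"));
    in.open("input.txt");

    std::wstring line;
    while (std::getline(in, line)) {
        // line.size() now counts characters rather than UTF-8 bytes, so
        // per-character functions like std::towlower() can be applied safely.
        std::cout << line.size() << " character(s)\n";
    }
}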

tolower() and similar functions can work with a locale, e.g.:

#include <iostream>
#include <locale>

int main() {
    wchar_t s  = L'\u00D6'; // latin capital 'O' with diaeresis, decimal 214
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); // hex = 00F6, dec = 246
    std::cout << "s = " << s << std::endl;
    std::cout << "sL= " << sL << std::endl;

    return 0;
}

outputs:

s = 214
sL= 246

In this other SO answer you can find good solutions, such as using the iconv library on Linux or the iconv Win32 port.
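
As a rough sketch of the iconv approach (this assumes a glibc-style iconv; on some platforms you also need to link with -liconv), converting a Latin-1 string to UTF-8:

#include <iconv.h>
#include <cstring>
#include <iostream>

int main()
{
    char in[] = "\xD6";   // 'Ö' in ISO-8859-1
    char out[16] = {0};

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1) {
        std::cerr << "iconv_open failed\n";
        return 1;
    }

    char*  inptr   = in;
    size_t inleft  = std::strlen(in);
    char*  outptr  = out;
    size_t outleft = sizeof(out) - 1;

    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1)
        std::cerr << "conversion failed\n";
    else
        std::cout << out << std::endl;  // prints 'Ö' (bytes 0xC3 0x96) on a UTF-8 terminal

    iconv_close(cd);
}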

In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:

# German
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"

# English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"

