How do locales work in Linux / POSIX and what transformations are applied?
I have boiled the problem down to an issue with the strcoll() function, which is not related to Unicode normalization.
Recap: my minimal example that demonstrates the different behaviour of uniq depending on the current locale was:
$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
ɢ
ɬ
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
ɢ
ɬ
Obviously, if the locale is en_US.UTF-8, uniq treats ɢ and ɬ as duplicates, which shouldn't be the case. I then ran the same commands again with valgrind and investigated both call graphs with kcachegrind.
$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &
The only difference was that the version with LC_COLLATE=en_US.UTF-8 called strcoll(), whereas the one with LC_COLLATE=C did not. So I came up with the following minimal example on strcoll():
#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    // Two-byte UTF-8 sequences for U+0262 and U+026C
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::cout << std::endl;

    // Lone continuation bytes, which are not valid UTF-8 on their own
    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}
Output:
ɢ
ɬ
0
-1
-10
-1
�
�
0
-1
-10
-1
So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?
Treatment of spaces in sort command. Difference between LC_COLLATE=C and LC_COLLATE=en_US.UTF-8
Punctuation is ignored when ordering in the en_US locale.
Note that sort can explicitly skip leading whitespace with the -b option, but that option is tricky to use, so I'd advise checking the result with the sort --debug option when using it.
Unicode normalization in strcoll
To the best of my knowledge, Unicode normalization is not mentioned in the C, the C++, or the POSIX standards.
Therefore, implementations may leave normalization as something to be done explicitly by the programmer.
More specifically, glibc's European locales apparently use ISO 14651 as their collation algorithm. The Unicode Collation FAQ implies that ISO 14651 doesn't do normalization: uniform handling of canonical equivalents is listed as a difference between the UCA and ISO 14651.
Undesired character encoding translation in transform output
OK, it seems I don't need the jvmArgs line in the xsltCleanup task above if the following two variables are set in ~/.bashrc. Having the jvmArgs line in the task fixed local developer builds, but it did not fix the behavior when the Jenkins build (CM team) had LANG set to something other than UTF-8 (at the Jenkins/system level).
export LANG=en_US.UTF-8
export GRADLE_OPTS="-Dfile.encoding=UTF-8"
Making the above change in my ~/.bashrc and ~/.bash_profile (which calls ~/.bashrc) fully resolved the issue for local builds on Windows (Cygwin) and on a local Linux machine. Setting the same two variables in the Jenkins global settings config page did the trick for the Jenkins build as well. They can also be set at the job config level.
Arun
How to apply cctype functions on text files with different encoding in c++
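Unicode defines "code points" for characters. A code point is a number in the range U+0000 to U+10FFFF, so it needs 21 bits and is often stored in a 32-bit value.
There are several kinds of encodings. ASCII uses only 7 bits, which gives 128 different characters. The 8th bit was used by vendors such as Microsoft to define another 128 characters, depending on the locale; these sets are called "code pages", and the ISO-8859 family (for example ISO-8859-1, also known as Latin-1) works the same way. Nowadays MS uses UTF-16, whose code units are 2 bytes wide. Because 2 bytes are not enough for the whole Unicode set, UTF-16 encodes code points above U+FFFF as a pair of code units (a surrogate pair, 4 bytes). Unlike code pages, UTF-16 does not depend on the locale.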
Most used in Linux (typically for files) is UTF-8, which uses a variable number of bytes for each character. The first 128 characters are exactly the same as the ASCII characters, with just one byte per character. To represent a character, UTF-8 can use up to 4 bytes. More info on Wikipedia.
While Windows APIs use UTF-16 in memory (wchar_t is 16 bits wide there), on Linux with glibc wchar_t is 32 bits wide and holds UTF-32 code points, although many programs simply keep UTF-8 in char strings.
In order to read a file you need to know its encoding. Trying to detect it is a real nightmare and may not succeed. std::basic_ios::imbue lets you set the desired locale for your stream, as in this SO answer.
tolower and such functions can work with a locale, e.g.
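#include <iostream>
#include <locale>

int main() {
    wchar_t s = L'\u00D6'; // Latin capital letter O with diaeresis, decimal 214
    // std::locale throws std::runtime_error if en_US.UTF-8 is not installed
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); // hex = 00F6, dec = 246
    // Cast so the numeric value is printed; since C++20, streaming a wchar_t
    // to std::cout no longer compiles
    std::cout << "s = " << static_cast<unsigned long>(s) << std::endl;
    std::cout << "sL= " << static_cast<unsigned long>(sL) << std::endl;
    return 0;
}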
outputs:
s = 214
sL= 246
In this other SO answer you can find good solutions, such as using the iconv library, which is available for both Linux and Win32.
In Linux the terminal can be set to use a locale with the help of LC_ALL, LANG and LANGUAGE, e.g.:
# German
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"
# English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"