Difference between C.UTF-8 and en_US.UTF-8 locales

What is the difference between C.UTF-8 and en_US.UTF-8 locales?

In general, C is for computers; en_US is for people in the US who speak English (and anyone else who wants the same behaviour).

The "for computers" part means that the strings are sometimes more standardized (but still in English), so the output of one program can be read by another program. With en_US, strings can be improved, and the alphabetic order can be improved (perhaps following new editions of the Chicago Manual of Style, etc.). So it is more user-friendly, but possibly less stable. Note: locales are not just for the translation of strings, but also for collation (alphabetic order), number formatting (e.g. the thousands separator), currency (I think it is safe to predict that $ and 2 decimal digits will remain), names of months and days of the week, etc.

In your case, it is just the UTF-8 version of both locales.

In general it should not matter. I usually prefer en_US.UTF-8, but in your case (a server app) it should only change log and error messages (if you use locale.setlocale()). You should handle client locales inside your app. Programs that read the output of other programs should set the C locale before opening the pipe, so it should not really matter.
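For example, a minimal POSIX sketch of that last point (the date command is just a placeholder): forcing the child into the C locale keeps its output format stable no matter what locale the user has configured.

#include <cstdio>

int main() {
    // Prefix the command with LC_ALL=C so the child produces
    // stable, machine-parseable output regardless of the user's locale.
    FILE* p = popen("LC_ALL=C date", "r");
    if (!p) return 1;

    char buf[256];
    while (fgets(buf, sizeof buf, p))
        fputs(buf, stdout);        // e.g. "Mon Jun  2 12:00:00 UTC 2025"

    return pclose(p);
}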

As you can see, it probably doesn't matter. You may also use the POSIX locale, which is also defined on Debian. You can get the list of installed locales with locale -a.

Note: micro-optimization would prescribe the C/C.UTF-8 locale: no translation of message catalogs (gettext), and simple rules for collation and number formatting; but the difference should be visible only on the server side.
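To make the number-formatting difference concrete, here is a small sketch using localeconv() (it assumes both locale names are installed on the machine):

#include <clocale>
#include <cstdio>

static void show(const char* name) {
    if (!std::setlocale(LC_NUMERIC, name)) {
        std::printf("%s: not installed\n", name);
        return;
    }
    const std::lconv* lc = std::localeconv();
    // In the C locale thousands_sep is empty; in en_US.UTF-8 it is ",".
    std::printf("%-12s decimal_point=\"%s\" thousands_sep=\"%s\"\n",
                name, lc->decimal_point, lc->thousands_sep);
}

int main() {
    show("C");
    show("en_US.UTF-8");
}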

Locale environment variables: difference between C and C.UTF-8

I'd recommend using a UTF-8 locale, which is more versatile.

For example, in Git Bash:

LC_ALL=C grep -P hello /dev/null
# output:
# grep: -P supports only unibyte and UTF-8 locales

LC_ALL=C.UTF-8 grep -P hello /dev/null
# No output

std::wcin.eof(), UTF-8 and locales on different systems

This is a libc++ bug.

Note that the bug report says it only affects std::wcin and not file streams, but in my experiments this is not the case: all wchar_t streams seem to be affected.

The other major open source implementation, libstdc++, doesn't have this bug. It is possible to sidestep the libc++ bug by building the entire application (including all dynamic libraries, if any) against libstdc++.

If this is not an option, then one way to cope with the bug is to use narrow char streams, and then, when needed, recode the characters (presumably arriving encoded as UTF-8) to wchar_t (presumably UCS-4) separately. Another way is to get rid of wchar_t altogether and work in UTF-8 throughout the program, which is probably better in the long run.
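A minimal sketch of that first workaround, assuming the input arrives as UTF-8 and that a UTF-8 locale is installed: read through a narrow stream, then recode with mbrtowc().

#include <clocale>
#include <cwchar>
#include <iostream>
#include <string>

// Decode a UTF-8 byte string into a wide string via the C locale
// machinery; requires setlocale() to have selected a UTF-8 locale.
std::wstring widen_utf8(const std::string& bytes) {
    std::wstring out;
    std::mbstate_t state{};
    const char* p = bytes.data();
    std::size_t left = bytes.size();
    while (left > 0) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, left, &state);
        if (n == (std::size_t)-1 || n == (std::size_t)-2)
            break;                 // invalid or incomplete sequence
        if (n == 0) n = 1;         // embedded NUL byte
        out += wc;
        p += n;
        left -= n;
    }
    return out;
}

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");   // or C.UTF-8, if available
    std::string line;
    while (std::getline(std::cin, line)) {   // narrow stream: bug avoided
        std::wstring wide = widen_utf8(line);
        std::wcout << wide.size() << L" characters\n";
    }
}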

C.UTF-8 C++ locale on Windows?

The Windows API does not respect the CRT locales, and the CRT implementation of fopen etc. directly calls the narrow-char API, so changing the locale will not affect the encoding.

However, the Windows 10 May 2019 Update (version 1903) introduced support for UTF-8 in its narrow-char APIs. It can be enabled by embedding an appropriate manifest into your executable. Unfortunately it's a very recent addition, so it might not be an option if you need to target older systems.

Your other options include converting manually to wchar_t or using a layer that does that for you (like Boost.Filesystem, or even better, Boost.Nowide).
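Here is a minimal sketch of the manual-conversion option, assuming your narrow strings are UTF-8: convert with MultiByteToWideChar and call the wide-char CRT function (_wfopen) directly.

#include <windows.h>
#include <cstdio>
#include <string>

// Convert a UTF-8 string to UTF-16 for the wide-char Windows APIs.
std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(),
                                  nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}

// fopen replacement that accepts UTF-8 paths on any Windows version.
std::FILE* fopen_utf8(const std::string& path, const std::string& mode) {
    return _wfopen(utf8_to_wide(path).c_str(),
                   utf8_to_wide(mode).c_str());
}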

glibc's isalpha function and the en_US.UTF-8 locale

How does the C function isalpha work when the locale is set to something other than C (in other words, something like en_US.UTF-8)?

The first 128 characters of Unicode are the same as ASCII, so nothing changes there (assuming the C locale uses ASCII).

What really changes is that instead of using a hardcoded list, glibc opens and loads the locale data. I believe that comes from /usr/lib/locale/locale-archive, which contains the compiled locales built from the /usr/share/i18n/locales/* files. In my /usr/share/i18n/locales/en_US file I see LC_CTYPE copy "en_GB"; I can go to en_GB, which has copy "i18n", then to i18n, which has copy "i18n_ctype", and finally to the i18n_ctype file, which has:

% The "alpha" class of the "i18n" FDCC-set is reflecting
% the recommendations in TR 10176 annex A
alpha /
<U0041>..<U005A>;<U0061>..<U007A>;<U00AA>;<U00B5>;<U00BA>;/
<U00C0>..<U00D6>;<U00D8>..<U00F6>;<U00F8>..<U02C1>;<U02C6>..<U02D1>;/
.... many more lines ....

I can confirm that isalpha returns true (nonzero) for values outside the traditional ASCII ranges.

From C99 7.4p1:

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

The loop for(int i=0;i<INT_MAX;i++) { if(isalpha(i)) { ... is simply undefined behavior for any i greater than UCHAR_MAX. Some programmers write isalpha((unsigned char)c) for exactly this reason. (I remember getting a warning in some cases when the argument to the is<ctype>() functions was not an unsigned char.)
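A well-defined version of that scan, together with the wide-character counterpart iswalpha() for looking beyond single bytes (a sketch; the exact counts depend on the installed locale data):

#include <cctype>
#include <climits>
#include <clocale>
#include <cstdio>
#include <cwctype>

int main() {
    std::setlocale(LC_ALL, "en_US.UTF-8");

    // Defined behavior: only values representable as unsigned char
    // (or equal to EOF) may be passed to isalpha().
    int narrow = 0;
    for (int i = 0; i <= UCHAR_MAX; i++)
        if (std::isalpha(i))
            narrow++;

    // Characters beyond a single byte in a UTF-8 locale are classified
    // through the wide-character interface instead.
    int wide = 0;
    for (std::wint_t wc = 0; wc <= 0x2FFF; wc++)   // small sample range
        if (std::iswalpha(wc))
            wide++;

    std::printf("alphabetic bytes: %d, alphabetic code points <= U+2FFF: %d\n",
                narrow, wide);
}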

Is this just a hardcoded list of UTF code points in a range somewhere? Or something less direct?

Yes, as mentioned above, in the /usr/share/i18n/locales/* files.

And the hardcoded list for the C locale is stored in locale/C-ctype.c and is meant to match POSIX.

Where is en_US.UTF-8 defined?

It's Unicode.

/usr/lib/locale/en_US.utf8/LC_COLLATE is created by localedef. man localedef shows the input path /usr/share/i18n/locales.

The LC_COLLATE section of /usr/share/i18n/locales/en_US references the file iso14651_t1, which in turn references iso14651_t1_common, a file published by ISO that names its originating source: unidata-9.0.0.txt. Run git clone git://sourceware.org/git/glibc.git to browse the history of these files.

http://enwp.org/ISO_14651 says the ISO standard and the UCA (Unicode Collation Algorithm) are aligned, so the corresponding file at unicode.org is allkeys.txt.

What is the locale of UTF-8?

To convert data that is not associated with the user's configured locale, but rather has an explicitly specified encoding, you should use iconv, not mbsrtowcs. You don't need setlocale at all for this.
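A minimal sketch with POSIX iconv(3), converting explicitly UTF-8 data to wide characters with no setlocale involved (the "WCHAR_T" target encoding name is a glibc extension):

#include <iconv.h>
#include <cstdio>
#include <cstring>

int main() {
    // The encodings are named explicitly; the process locale is irrelevant.
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    char in[] = "h\xc3\xa9llo";            // "héllo" as UTF-8 bytes
    wchar_t out[16];
    char* inp = in;
    char* outp = (char*)out;
    size_t inleft = std::strlen(in);
    size_t outleft = sizeof out;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        std::perror("iconv");
        return 1;
    }
    size_t n = (sizeof out - outleft) / sizeof(wchar_t);
    std::printf("decoded %zu wide characters\n", n);

    iconv_close(cd);
}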

How do locales work in Linux / POSIX and what transformations are applied?

I have boiled down the problem to an issue with the strcoll() function, which is not related to Unicode normalization. Recap: my minimal example demonstrating the locale-dependent behaviour of uniq was:

$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
ɢ
ɬ
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
ɢ
ɬ

Obviously, if the locale is en_US.UTF-8, uniq treats ɢ and ɬ as duplicates, which shouldn't be the case. I then ran the same commands again under valgrind and investigated both call graphs with kcachegrind.

$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &

The only difference was that the version with LC_COLLATE=en_US.UTF-8 called strcoll(), whereas the one with LC_COLLATE=C did not. So I came up with the following minimal example using strcoll():

#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    // Two valid UTF-8 sequences: U+0262 (ɢ) and U+026C (ɬ).
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::cout << std::endl;

    // The same trailing bytes on their own (not valid UTF-8).
    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}

Output:

ɢ
ɬ
0
-1
-10
-1



0
-1
-10
-1

So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?


