Why Is It That UTF-8 Encoding Is Used When Interacting with a Unix/Linux Environment

Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment?

Partly because the file systems expect NUL ('\0') bytes to terminate file names, and UTF-16 puts NUL bytes in the middle of even plain ASCII text, so UTF-16 would not work well. You'd have to modify a lot of code to make that change.
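As an illustration (this snippet is mine, not part of the original answer): even plain ASCII text encoded as UTF-16 contains a 0x00 byte in every other position, and a NUL-terminated file-name API would treat the first of those as the end of the name.

#include <cstdio>
#include <string>

int main()
{
    // "hello" as UTF-16 code units: 0x0068 0x0065 0x006c 0x006c 0x006f
    std::u16string s = u"hello";

    // Dump the bytes in little-endian order. Every second byte is 0x00,
    // so a NUL-terminated interface would stop reading after the 'h'.
    for (char16_t c : s)
        std::printf("%02x %02x ", (unsigned)(c & 0xff), (unsigned)(c >> 8));
    std::printf("\n");
    return 0;
}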

Which Unicode encoding does the Linux kernel use?

http://www.xsquawkbox.net/xpsdk/mediawiki/Unicode says

Linux

On Linux, UTF8 is the 'native' encoding for all strings, and is the format accepted by system routines like fopen().

so Linux is like Plan 9 in that respect, and boost::filesystem and Unicode under Linux and Windows notes:

It looks to me like boost::filesystem under Linux does not provide a wide character string in path::native(), despite boost::filesystem::path having been initialized with a wide string.

which would rule out UTF-16 and UTF-32, since all variants of those require wide-character support: they allow NUL bytes inside strings, so they cannot be passed through the byte-oriented, NUL-terminated system interfaces.
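To make the fopen() point concrete, here is a minimal sketch (the file name is my own example, not taken from the quoted sources): the UTF-8 bytes of the name go straight through the narrow-char API, since the kernel treats file names as opaque byte strings in which only '/' and '\0' are special.

#include <cstdio>

int main()
{
    // "résumé.txt" spelled out as UTF-8 bytes (adjacent literals are concatenated).
    const char *name = "r\xC3\xA9" "sum\xC3\xA9" ".txt";

    FILE *f = std::fopen(name, "w");   // ordinary narrow-char call, no wide strings involved
    if (f) {
        std::fputs("created via a UTF-8 file name\n", f);
        std::fclose(f);
    }
    return 0;
}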

Handling UTF-8 in C++

Don't use wstring on Linux.

std::wstring VS std::string

Take a look at the first answer. I'm sure it answers your question.

  1. When should I use std::wstring over std::string?

On Linux? Almost never (§).

On Windows? Almost always (§).
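A small sketch of my own that underlines the split: wchar_t is a different size on the two platforms, which is a large part of why wide strings are the natural choice on Windows but rarely needed on Linux, where UTF-8 in plain std::string does the job.

#include <cstdio>
#include <string>

int main()
{
    // Typically 4 bytes (UTF-32) with glibc on Linux and 2 bytes (UTF-16) on Windows.
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));

    std::string  narrow = "hello";    // UTF-8 friendly: 5 bytes
    std::wstring wide   = L"hello";   // 5 wide characters: 20 bytes on Linux, 10 on Windows
    std::printf("%zu narrow bytes vs %zu wide bytes\n",
                narrow.size() * sizeof(char), wide.size() * sizeof(wchar_t));
    return 0;
}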

Sed and UTF-8 encoding

It was in the PuTTY configuration: Translation ==> Received data assumed to be in which character set ==> choose UTF-8.

Why is hexdump of UTF-16 string when passed in as a command line argument different from what it is directly on the terminal?

The command echo -n hello | iconv -f ascii -t utf-16 | hexdump -C just pipes data directly between programs. Whatever bytes come out of iconv are taken directly as input to hexdump.

With the command ./test $(echo -n hello | iconv -f ascii -t utf-16), the shell takes the output of iconv, and effectively pastes it into a new command, parses the new command, and then executes it.

So the bytes coming out of iconv are: "ff fe 68 00 65 00 6c 00 6c 00 6f 00", and the shell parses this. It appears as though the shell simply skips null bytes when parsing, so the argument input to your program is just the non-null bytes. Since your string is ASCII, that means the result is just an ASCII string (preceded by a UTF-16 BOM).

We can demonstrate this using a character like U+3300 (㌀). If we pass this instead of an ASCII character and the above is correct, then the output will include 0x33 (the digit '3').

./test $(echo -n ㌀ | iconv -f utf-8 -t utf-16)

My terminal happens to use UTF-8, which supports the character U+3300, so I have iconv convert from that to UTF-16. I get the output:

The string:
0000 ff fe 33 ..3

By the way, your program includes a hard coded size for the array:

hexDump("The string", str, 12);

You really shouldn't do that. If the array isn't that big then you get undefined behavior, and your post shows some garbage being printed out after the real argument (the garbage appears to be the beginning of the environment variable array). There's really no reason for this. Just use the right value:

hexDump("The string", str, strlen(str));

When encoding actually matters? (e.g., string storing, printing?)

(I remember that Bjarne says that encoding is the mapping between char and integer(s), so char should be stored as integer(s) in memory)

Not quite. Make sure you understand one important distinction.

  • A character is the minimum unit of text. A letter, digit, punctuation mark, symbol, space, etc.
  • A byte is the minimum unit of memory. On the overwhelming majority of computers, this is 8 bits.

Encoding is converting a sequence of characters to a sequence of bytes. Decoding is converting a sequence of bytes to a sequence of characters.

The confusing thing for C and C++ programmers is that char means byte, NOT character! The name char for the byte type is a legacy from the pre-Unicode days when everyone (except East Asians) used single-byte encodings. But nowadays we have Unicode and its encoding schemes, which use up to 4 bytes per character.
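A quick demonstration of that distinction (my own example; the particular character is arbitrary): a single character that needs four UTF-8 bytes makes strlen() report 4, because strlen() counts char units, i.e. bytes, not characters.

#include <cstdio>
#include <cstring>

int main()
{
    // U+1F600, a single character, written out as its four UTF-8 bytes.
    const char *one_char = "\xF0\x9F\x98\x80";

    std::printf("characters: 1, strlen (bytes): %zu\n", std::strlen(one_char));   // prints 4
    return 0;
}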

Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer value depend on the encoding currently in use?

Yes, it will. Suppose you have std::string euro = "€"; Then:

  • With the windows-1252 encoding, the string will be encoded as the byte 0x80.
  • With the ISO-8859-15 encoding, the string will be encoded as the byte 0xA4.
  • With the UTF-8 encoding, the string will be encoded as the three bytes 0xE2, 0x82, 0xAC.
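The UTF-8 case is easy to verify with a short check (a sketch of mine; the literal is written as explicit bytes so the result does not depend on the compiler's execution character set):

#include <cstdio>
#include <string>

int main()
{
    std::string euro = "\xE2\x82\xAC";   // "€" as UTF-8 bytes

    for (unsigned char byte : euro)
        std::printf("0x%02X ", (unsigned)byte);   // prints 0xE2 0x82 0xAC
    std::printf("(%zu bytes)\n", euro.size());
    return 0;
}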

Question 3: What is the default encoding on a particular system, and how do I change it (is it the so-called "locale")?

Depends on the platform. On Unix, the encoding can be specified as part of the LANG environment variable.

~$ echo $LANG
en_US.utf8

Windows has a GetACP function to get the "ANSI" code page number.
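On the POSIX side, a program can ask which encoding those locale settings select; here is a minimal sketch using the standard setlocale plus the POSIX nl_langinfo(CODESET) call:

#include <clocale>
#include <cstdio>
#include <langinfo.h>

int main()
{
    std::setlocale(LC_ALL, "");                          // adopt LANG / LC_* from the environment
    std::printf("codeset: %s\n", nl_langinfo(CODESET));  // e.g. "UTF-8" for en_US.utf8
    return 0;
}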

Question 4: What if I print a string to the screen with std::cout? Is it the same encoding?

Not necessarily. On Windows, the command line uses the "OEM" code page, which is usually different from the "ANSI" code page used elsewhere.
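Both code pages can be queried directly; a minimal Windows-only sketch (GetACP and GetOEMCP are the Win32 calls for the "ANSI" and "OEM" pages respectively):

#include <cstdio>
#include <windows.h>

int main()
{
    // The two are usually different, e.g. 1252 ("ANSI") vs 850 or 437 (OEM)
    // on Western-European and US systems.
    std::printf("ANSI code page: %u\n", GetACP());
    std::printf("OEM  code page: %u\n", GetOEMCP());
    return 0;
}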

’ showing on page instead of '

Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.

Or use an HTML character reference such as &rsquo; instead of a literal ’ character.
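The garbage is predictable, by the way (my own illustration): the right single quotation mark U+2019 is the three UTF-8 bytes 0xE2 0x80 0x99, and read back through a Windows-1252 table those bytes are 'â', '€' and '™', which is exactly the ’ shown on the page.

#include <cstdio>

int main()
{
    // U+2019 (') encoded as UTF-8.
    const unsigned char rsquo[] = { 0xE2, 0x80, 0x99 };

    // Misinterpreted as Windows-1252, 0xE2 is 'â', 0x80 is '€' and 0x99 is '™'.
    for (unsigned char b : rsquo)
        std::printf("0x%02X ", (unsigned)b);
    std::printf("\n");
    return 0;
}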

Is it possible to use a Unicode argv?

In general, no. It will depend on the O/S, but the C standard says that the arguments to 'main()' must be 'main(int argc, char **argv)' or equivalent, so unless char and wchar_t are the same basic type, you can't do it.

Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.
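A minimal sketch of that approach (it assumes the program runs under a UTF-8 locale such as en_US.utf8; the conversion uses the standard mbstowcs, and on Linux the resulting wchar_t strings are effectively UTF-32):

#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main(int argc, char **argv)
{
    std::setlocale(LC_ALL, "");   // pick up the (assumed UTF-8) locale from the environment

    std::vector<std::wstring> wargs;
    for (int i = 0; i < argc; ++i) {
        // First call measures the converted length, second call converts.
        std::size_t n = std::mbstowcs(nullptr, argv[i], 0);
        if (n == static_cast<std::size_t>(-1)) {
            std::fprintf(stderr, "argv[%d] is not valid in this locale\n", i);
            continue;
        }
        std::wstring w(n, L'\0');
        std::mbstowcs(&w[0], argv[i], n);
        wargs.push_back(w);
    }

    std::printf("converted %zu argument(s) to wide strings\n", wargs.size());
    return 0;
}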

On a Mac (10.5.8, Leopard), I got:

Osiris JL: echo "ï€" | odx
0x0000: C3 AF E2 82 AC 0A ......
0x0006:
Osiris JL:

That's all UTF-8 encoded. (odx is a hex dump program).

See also: Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment


