Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment?
Partly because the file systems and system calls expect file names to be NUL ('\0')-terminated byte strings, so UTF-16, which encodes ASCII characters with embedded NUL bytes, would not work well. You'd have to modify a lot of code to make that change.
Which Unicode encoding does the Linux kernel use?
http://www.xsquawkbox.net/xpsdk/mediawiki/Unicode says
Linux
On Linux, UTF-8 is the 'native' encoding for all strings, and is the format accepted by system routines like fopen().
so Linux is like Plan 9 in that respect, and boost::filesystem and Unicode under Linux and Windows notes:
It looks to me like boost::filesystem under Linux does not provide a wide character string in path::native(), despite boost::filesystem::path having been initialized with a wide string.
which would rule out UTF-16 and UTF-32, since both allow NUL bytes inside strings and therefore require wide-character support rather than NUL-terminated byte strings.
Handling UTF-8 in C++
Don't use wstring on Linux.
std::wstring VS std::string
Take a look at the first answer. I'm sure it answers your question.
- When should I use std::wstring over std::string?
On Linux? Almost never (§).
On Windows? Almost always (§).
Sed and UTF-8 encoding
It was in the PuTTY configuration ==> Translation ==> "Received data assumed to be in which character set" ==> choose UTF-8.
Why is hexdump of UTF-16 string when passed in as a command line argument different from what it is directly on the terminal?
The command echo -n hello | iconv -f ascii -t utf-16 | hexdump -C
just pipes data directly between programs. Whatever bytes come out of iconv are taken directly as input to hexdump.
With the command ./test $(echo -n hello | iconv -f ascii -t utf-16)
, the shell takes the output of iconv, and effectively pastes it into a new command, parses the new command, and then executes it.
So the bytes coming out of iconv are: "ff fe 68 00 65 00 6c 00 6c 00 6f 00" and the shell parses this. It appears as though the shell simply skips null bytes when parsing, so the argument input to your program is just the non-null bytes. Since your string is ascii that means the result is just an ascii string (preceded by a UTF-16 BOM).
We can demonstrate this using a character like U+3300 (㌀). If we pass this instead of an ascii character and the above is correct, then the output will include 0x33 (the digit '3').
./test $(echo -n ㌀ | iconv -f utf-8 -t utf-16)
My terminal happens to use UTF-8, which supports the character U+3300, so I have iconv convert from that to UTF-16. I get the output:
The string:
0000 ff fe 33 ..3
By the way, your program includes a hard coded size for the array:
hexDump("The string", str, 12);
You really shouldn't do that. If the array isn't that big then you get undefined behavior, and your post shows some garbage being printed out after the real argument (the garbage appears to be the beginning of the environment variable array). There's really no reason for this. Just use the right value:
hexDump("The string", str, strlen(str));
When encoding actually matters? (e.g., string storing, printing?)
(I remember that Bjarne says that encoding is the mapping between char and integer(s), so char should be stored as integer(s) in memory.)
Not quite. Make sure you understand one important distinction.
- A character is the minimum unit of text. A letter, digit, punctuation mark, symbol, space, etc.
- A byte is the minimum unit of memory. On the overwhelming majority of computers, this is 8 bits.
Encoding is converting a sequence of characters to a sequence of bytes. Decoding is converting a sequence of bytes to a sequence of characters.
The confusing thing for C and C++ programmers is that char means byte, NOT character! The name char for the byte type is a legacy from the pre-Unicode days, when everyone (except East Asians) used single-byte encodings. Nowadays we have Unicode, whose encoding schemes use up to 4 bytes per character.
Question 1: If I store a one-byte string in std::string or a two-byte string in std::wstring, will the underlying integer value depend on the encoding currently in use?
Yes, it will. Suppose you have std::string euro = "€";
Then:
- With the windows-1252 encoding, the string will be encoded as the byte 0x80.
- With the ISO-8859-15 encoding, the string will be encoded as the byte 0xA4.
- With the UTF-8 encoding, the string will be encoded as the three bytes 0xE2, 0x82, 0xAC.
Question 3: What is the default encoding in one particular system, and how do I change it (is it the so-called "locale")?
Depends on the platform. On Unix, the encoding can be specified as part of the LANG
environment variable.
~$ echo $LANG
en_US.utf8
Windows has a GetACP
function to get the "ANSI" code page number.
Question 4: If I print a string to the screen with std::cout, is it in the same encoding?
Not necessarily. On Windows, the command line uses the "OEM" code page, which is usually different from the "ANSI" code page used elsewhere.
’ showing on page instead of '
Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.
Or use the HTML entity &amp;rsquo; instead of the literal character.
Is it possible to use a Unicode argv?
In general, no. It depends on the OS, but the C standard says that the arguments to main() must be main(int argc, char **argv) or equivalent, so unless char and wchar_t are the same basic type, you can't do it.
Having said that, you could get UTF-8 argument strings into the program, convert them to UTF-16 or UTF-32, and then get on with life.
On a Mac (10.5.8, Leopard), I got:
Osiris JL: echo "ï€" | odx
0x0000: C3 AF E2 82 AC 0A ......
0x0006:
Osiris JL:
That's all UTF-8 encoded. (odx is a hex dump program.)
See also: Why is it that UTF-8 encoding is used when interacting with a UNIX/Linux environment