How do I properly use std::string on UTF-8 in C++?
Unicode Glossary
Unicode is a vast and complex topic. I do not wish to wade too deep there; however, a quick glossary is necessary:
- Code Points: Code Points are the basic building blocks of Unicode. A code point is just an integer mapped to a meaning. The integer fits into 32 bits (21 bits really, since the largest Code Point is U+10FFFF), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
- Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points. For example, a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.
These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can be mostly glossed over because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.
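To make the Code Point versus Grapheme Cluster distinction concrete, here is a minimal sketch. The helper name count_code_points and the three sample strings are my own illustration, and the counting trick assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Count Code Points in a well-formed UTF-8 string by counting every byte
// that is not a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Each sample below is a single user-perceived character (one Grapheme Cluster).

// U+00E9, the precomposed "e with acute": 1 Code Point, 2 bytes.
const std::string precomposed = "\xC3\xA9";

// U+0065 U+0301, "e" followed by a combining acute accent: 2 Code Points, 3 bytes.
const std::string decomposed = "e\xCC\x81";

// U+1F1EB U+1F1F7, regional indicators F and R (a flag): 2 Code Points, 8 bytes.
const std::string flag = "\xF0\x9F\x87\xAB\xF0\x9F\x87\xB7";
```

Counting Code Points is the easy part; deciding that the last two samples are one "character" each is what needs a Unicode-aware library.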
UTF Primer
Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.
In UTF-X, X is the size in bits of the Code Unit. Each Code Point is represented as one or several Code Units, depending on its magnitude:
- UTF-8: 1 to 4 Code Units,
- UTF-16: 1 or 2 Code Units,
- UTF-32: 1 Code Unit.
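The "1 to 4 Code Units" rule for UTF-8 can be sketched as a small encoder. The function name encode_utf8 is hypothetical, and returning an empty string for invalid input is just one possible convention, not something the standard library prescribes.

```cpp
#include <string>

// Encode one Code Point as UTF-8. Returns an empty string for values that
// cannot be encoded (surrogates, or anything above U+10FFFF).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 Code Unit:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 Code Units: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 Code Units: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not Code Points to encode
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 Code Units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, encode_utf8(0x41) yields 1 byte, encode_utf8(0xE9) yields 2, encode_utf8(0x54C8) yields 3 and encode_utf8(0x1F1EB) yields 4.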
std::string and std::wstring
- Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
- The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
- While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.
If you are only reading or composing strings, you should have little to no trouble with std::string or std::wstring.
Troubles start when you start slicing and dicing: then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires a Unicode-aware library.
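As an illustration of handling Code Point boundaries on your own, here is a sketch of truncating a UTF-8 string to a byte budget without cutting a Code Point in half. The names is_continuation and truncate_utf8 are mine, and the code assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (10xxxxxx).
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Return at most max_bytes bytes of utf8, never splitting a Code Point.
std::string truncate_utf8(const std::string& utf8, std::size_t max_bytes)
{
    if (utf8.size() <= max_bytes) return utf8;
    std::size_t end = max_bytes;
    // If the first dropped byte is a continuation byte, the cut falls inside
    // a Code Point: back up until the cut sits just before a lead byte.
    while (end > 0 && is_continuation(static_cast<unsigned char>(utf8[end]))) {
        --end;
    }
    return utf8.substr(0, end);
}
```

This only protects Code Points. It can still separate a letter from a following combining mark, or split a flag in two, which is the Grapheme Cluster problem.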
Picking std::string or std::u32string?
If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint, though text dominated by Chinese (mostly 3 bytes per character in UTF-8, against 4 in UTF-32) narrows the gap. As always, profile.
If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.
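If you go the std::u32string route, you will need to decode UTF-8 at the boundary. The following is only a sketch: decode_utf8 is a hypothetical helper, it substitutes U+FFFD for invalid lead bytes and truncated sequences, and it does not reject overlong encodings or check that trailing bytes really are continuation bytes, which a production decoder should.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into one char32_t per Code Point.
std::u32string decode_utf8(const std::string& utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out += U'\uFFFD'; ++i; continue; }  // stray continuation or invalid lead byte

        if (extra > utf8.size() - i - 1) {         // sequence cut short by end of input
            out += U'\uFFFD';
            break;
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

Once decoded, indexing and size() on the result count Code Points rather than bytes.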
If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.
UTF-8 in std::string
UTF-8 actually works quite well in std::string.
Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.
Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:
- str.find('\n') works,
- str.find("...") works for matching byte by byte (see note 1),
- str.find_first_of("\r\n") works if searching for ASCII characters.
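A small sketch of the searches listed above, using a sample string of my own. Note that the results are byte offsets into the std::string, not Code Point indexes.

```cpp
#include <cstddef>
#include <string>

// "Ph" + U+1EDF + "\n" + U+54C8 + "!", written out as UTF-8 bytes.
const std::string sample = "Ph\xE1\xBB\x9F\n\xE5\x93\x88!";

// ASCII search: every byte of a multi-byte Code Point is 0x80 or above,
// so '\n' can only match a real newline.
const std::size_t newline_at = sample.find('\n');           // byte offset 5

// Byte-by-byte search for a non-ASCII Code Point (U+54C8).
const std::size_t ha_at = sample.find("\xE5\x93\x88");      // byte offset 6

// Searching for any of a set of ASCII characters.
const std::size_t eol_at = sample.find_first_of("\r\n");    // byte offset 5
```

The same find_first_of call with non-ASCII characters in the set would match individual bytes, which is why the list above restricts it to ASCII.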
Similarly, regex should mostly work out of the box. As a sequence of characters ("哈") is just a sequence of bytes ("\xE5\x93\x88"), basic search patterns should work out of the box.
Be wary, however, of character classes (such as [:alnum:]), as depending on the regex flavor and implementation they may or may not match Unicode characters.
Similarly, be wary of applying repeaters to non-ASCII "characters": "哈?" may only consider the last byte to be optional. Use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
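The repeater pitfall can be sketched with std::regex, which works on individual char values when used with std::string. The two helper names are mine, and the behaviour shown assumes the default ECMAScript grammar; other regex flavors may differ.

```cpp
#include <regex>
#include <string>

// U+54C8 as three UTF-8 bytes.
const std::string ha = "\xE5\x93\x88";

// Pattern "<E5><93><88>?!": the '?' applies to the last byte only.
bool matches_ungrouped(const std::string& s)
{
    static const std::regex re(ha + "?!");
    return std::regex_match(s, re);
}

// Pattern "(<E5><93><88>)?!": the '?' applies to the whole byte sequence.
bool matches_grouped(const std::string& s)
{
    static const std::regex re("(" + ha + ")?!");
    return std::regex_match(s, re);
}
```

With these, matches_ungrouped accepts the broken sequence "\xE5\x93!" and rejects a plain "!", whereas matches_grouped does the opposite, which is what was intended.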
Note 1: The key concepts to look up are normalization and collation; they affect all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
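To see what byte-by-byte comparison means in practice, here is a short sketch with sample words of my own choosing. The helper sorted_bytewise simply calls std::sort with the default std::string ordering.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sort with std::string's built-in ordering, i.e. by unsigned byte value.
std::vector<std::string> sorted_bytewise(std::vector<std::string> words)
{
    std::sort(words.begin(), words.end());
    return words;
}

// Uppercase ASCII sorts before lowercase, and anything non-ASCII sorts
// after all ASCII, whatever a dictionary would say.
const std::vector<std::string> sorted_words =
    sorted_bytewise({"zebra", "\xC3\xA9t\xC3\xA9", "apple", "Zulu"});

// Two spellings of the same visible text: U+00E9 versus "e" + U+0301.
// Without normalization they are simply different byte strings.
const std::string e_acute_nfc = "\xC3\xA9";
const std::string e_acute_nfd = "e\xCC\x81";
const bool spellings_equal = (e_acute_nfc == e_acute_nfd);  // false
```

The resulting order is "Zulu", "apple", "zebra", then the accented word, which is rarely what a human reader expects.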
How to write Unicode string to file with UTF-8 BOM by C++?
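The example below works fine in VS 2015 or newer and in recent GCC versions (compiled as C++11 to C++17; in C++20, u8 literals have type char8_t and no longer initialize a std::string directly):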
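#include <fstream>
#include <string>

int main()
{
    // A u8 literal is stored as UTF-8 whatever the execution charset is.
    std::string utf8 = u8"日本医療政策機構\nPhở\n";

    std::ofstream f("c:\\test\\ut8.txt");
    if (!f) return 1; // the file could not be opened

    // Write the UTF-8 byte order mark (EF BB BF) first, then the text.
    const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
    f.write(reinterpret_cast<const char*>(bom), sizeof(bom));
    f << utf8;
    return 0;
}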
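The same BOM handling can be factored into helpers that work on any std::ostream, which also makes it easy to check without touching the disk. This is a sketch: write_utf8_with_bom, to_bytes_with_bom and strip_utf8_bom are hypothetical names, and stripping the BOM on read is my addition for completeness.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// The UTF-8 encoding of U+FEFF, used as a byte order mark.
const char utf8_bom[] = "\xEF\xBB\xBF";

// Write the BOM followed by the (already UTF-8 encoded) text.
void write_utf8_with_bom(std::ostream& out, const std::string& utf8)
{
    out.write(utf8_bom, 3);
    out << utf8;
}

// Same as above, but into memory; handy for tests.
std::string to_bytes_with_bom(const std::string& utf8)
{
    std::ostringstream out;
    write_utf8_with_bom(out, utf8);
    return out.str();
}

// Remove a leading BOM, if present, from text read back from a file.
std::string strip_utf8_bom(const std::string& text)
{
    if (text.compare(0, 3, utf8_bom) == 0) return text.substr(3);
    return text;
}
```

Passing a std::ofstream to write_utf8_with_bom gives the same file contents as the example above.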
In older versions of Visual Studio you have to declare a UTF-16 string (with the L prefix), then convert from UTF-16 to UTF-8:
#include <fstream>
#include <string>
#include <Windows.h>

// Convert UTF-16 (std::wstring on Windows) to UTF-8.
std::string get_utf8(const std::wstring &wstr)
{
    if (wstr.empty()) return std::string();
    // First call: ask for the required buffer size in bytes.
    int sz = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string res(sz, 0);
    // Second call: convert into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &res[0], sz, NULL, NULL);
    return res;
}

// Convert UTF-8 to UTF-16 (std::wstring on Windows).
std::wstring get_utf16(const std::string &str)
{
    if (str.empty()) return std::wstring();
    // First call: ask for the required buffer size in wchar_t units.
    int sz = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring res(sz, 0);
    // Second call: convert into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &res[0], sz);
    return res;
}

int main()
{
    std::string utf8 = get_utf8(L"日本医療政策機構\nPhở\n");
    std::ofstream f("c:\\test\\ut8.txt");
    // Write the UTF-8 byte order mark (EF BB BF) first, then the text.
    const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };
    f.write(reinterpret_cast<const char*>(bom), sizeof(bom));
    f << utf8;
    return 0;
}
Storing unicode UTF-8 string in std::string
If you were using C++11 then this would be easy:
std::string msg = u8"महसुस";
But since you are not, you can use escape sequences and not rely on the source file's charset to manage the encoding for you. This makes your code more portable (in case you accidentally save it in a non-UTF-8 format):
std::string msg = "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8"; // "महसुस"
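Writing such escape sequences by hand is error-prone, so a throwaway helper can generate them from a string you already have in UTF-8. This is a sketch; to_hex_escapes is a hypothetical name, not a standard function.

```cpp
#include <string>

// Turn every byte of the input into a "\xNN" escape, ready to paste
// between the quotes of a C++ string literal.
std::string to_hex_escapes(const std::string& utf8)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : utf8) {
        out += "\\x";
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}
```

Escaping every byte, ASCII included, is deliberate: a hex escape consumes all the hex digits that follow it, so a literal such as "\xAEb" would not mean the byte AE followed by the letter b.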
Otherwise, you might consider doing a conversion at runtime instead:
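#include <string>
#include <Windows.h>

// Convert UTF-16 (std::wstring on Windows) to UTF-8.
std::string toUtf8(const std::wstring &str)
{
    std::string ret;
    // First call: ask for the required buffer size in bytes.
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), (int)str.length(), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        // Second call: convert into the buffer.
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), (int)str.length(), &ret[0], len, NULL, NULL);
    }
    return ret;
}

std::string msg = toUtf8(L"महसुस");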
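Outside Windows, WideCharToMultiByte is not available. As a rough portable alternative, the conversion can be written by hand for std::u16string. This is a sketch under my own assumptions: utf16_to_utf8 is a hypothetical name, and replacing unpaired surrogates with U+FFFD is one possible policy among several.

```cpp
#include <cstddef>
#include <string>

// Convert UTF-16 to UTF-8, combining surrogate pairs along the way.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low  = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Surrogate pair: two Code Units form one Code Point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            cp = 0xFFFD;  // unpaired surrogate: substitute the replacement character
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, utf16_to_utf8(u"\u092E") produces the bytes E0 A4 AE, the first three bytes of the escaped literal shown earlier.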
Write Unicode (UTF-8) text file
You shouldn't be using old Pascal I/O at all. That did its job back in the 80s but is very obsolete today.
This century, you can use the TStringList. This is very commonly used in Delphi. For instance, VCL controls use TStrings to access a memo's lines of text and a combo box's or list box's items.
var SL := TStringList.Create;
try
  SL.Add('∫cos(x)dx = sin(x) + C');
  SL.Add('¬(a ∧ b) ⇔ ¬a ∨ ¬b');
  SL.SaveToFile(FileName, TEncoding.UTF8);
finally
  SL.Free;
end;
For more advanced needs, you can use a TStreamWriter:
var SW := TStreamWriter.Create(FileName, False, TEncoding.UTF8);
try
  SW.WriteLine('αβγδε');
  SW.WriteLine('ωφψξη');
finally
  SW.Free;
end;
And for very simple needs, there are the new TFile methods in IOUtils.pas:
var S := '⌬ is aromatic.';
TFile.WriteAllText(FileName, S, TEncoding.UTF8); // string (possibly with linebreaks)
var Lines: TArray<string>;
Lines := ['☃ is cold.', '☼ is hot.'];
TFile.WriteAllLines(FileName, Lines, TEncoding.UTF8); // string array
As you can see, all these modern options allow you to specify UTF8 as encoding. If you prefer to use some other encoding, like UTF16, that's fine too.
Just forget about AssignFile, Reset, Rewrite, Append, CloseFile etc.