What Encoding Does std::string.c_str() Use

What encoding does std::string.c_str() use?

std::string per se uses no encoding -- it will return the bytes you put in it. For example, those bytes might be using ISO-8859-1 encoding... or any other, really: the information about the encoding is just not there -- you have to know where the bytes came from!
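For instance, a minimal sketch of that point (the byte values below assume ISO-8859-1 and UTF-8 respectively): the same character stored under two encodings produces two different byte sequences, and c_str() hands back exactly the bytes you stored:

#include <cstdio>
#include <string>

int main()
{
    // The same character 'é' under two encodings; std::string just stores
    // the bytes verbatim and attaches no encoding information to them.
    std::string latin1 = "\xE9";    // 'é' in ISO-8859-1: one byte
    std::string utf8 = "\xC3\xA9";  // 'é' in UTF-8: two bytes
    std::printf("%zu %zu\n", latin1.size(), utf8.size());   // prints: 1 2
    std::printf("%02X\n", (unsigned char)utf8.c_str()[0]);  // prints: C3
    return 0;
}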

Does std::string in C++ have an encoding format?

The simple answer

std::string is defined as std::basic_string<char>, which means it is a collection of chars. As a collection of chars it can potentially hold chars that are the encoded result of a UTF-8 string.

The following code is valid until C++20:

std::string s = u8"1 שלום Hello";
std::cout << s << std::endl;

And it prints, in a console that supports it:

1 שלום Hello

The u8 before the quoted string is the UTF-8 string-literal prefix, telling the compiler that the following string literal is UTF-8 encoded.

Without the u8 prefix, the compiler interprets the string according to its source/execution character set, so if the default encoding, or the encoding explicitly set for the compiler, supports the characters in the string, it can also be written like this:

std::string s = "1 שלום Hello";
std::cout << s << std::endl;

with the same output as above. However, this is platform- and compiler-dependent.

If the execution character set of the compiler doesn't support these characters, for example if we set GCC's execution character set to Latin-1 with the flag -fexec-charset=ISO-8859-1, the string without the u8 prefix gives the following compilation error:

converting to execution character set:
Invalid or incomplete multibyte or wide character
std::string s = "1 שלום Hello";
^~~~~~~~~~~~~~

Since C++20, a u8 string literal cannot be converted to std::string:

std::string s = u8"1 שלום Hello";
std::cout << s << std::endl;

gives the following compilation error in C++20:

conversion from 'const char8_t [17]' to non-scalar type 'std::string'
{aka 'std::__cxx11::basic_string<char>'} requested
std::string s = u8"1 שלום Hello";
^~~~~~~~~~~~~~~~~

This is because in C++20 the type of a u8 string literal is not const char[SIZE] but rather const char8_t[SIZE] (the type char8_t was introduced in C++20).

In C++20, however, you can use the new type std::u8string:

std::u8string s = u8"1 שלום Hello"; // good - std::u8string added in C++20
// std::cout << s << std::endl; // oops, std::ostream doesn't support u8string

A few interesting notes:

  1. until C++20, a u8 string literal is const char[SIZE]
  2. from C++20, a u8 string literal is const char8_t[SIZE]
  3. char8_t has the same size as char, but it is a distinct type (see the sketch below)
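
For completeness, a common C++20 workaround (a sketch, not a standard-sanctioned conversion) is to copy the char8_t bytes into a std::string, which works precisely because of note 3 above:

#include <iostream>
#include <string>

int main()
{
    std::u8string u8s = u8"1 שלום Hello";
    // char8_t has the same size and representation as unsigned char, so its
    // bytes can simply be copied into a std::string holding UTF-8 data.
    std::string s(reinterpret_cast<const char*>(u8s.data()), u8s.size());
    std::cout << s << std::endl; // prints the UTF-8 bytes unchanged
    return 0;
}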

The long story

Encoding is a sad story in C++. This is probably why there is no "simple answer" to your question. There is still no fully fledged, end-to-end standard solution for handling character encoding. There are standard converters, third-party libraries, etc., but no really tight and simple solution. Hopefully C++23 will improve this.

See the CppCon 2019 session on the subject by JeanHeyd Meneide.

Also see a related question: how std::u8string will be different from std::string?

How do I properly use std::string on UTF-8 in C++?

Unicode Glossary

Unicode is a vast and complex topic. I do not wish to wade too deep there, however a quick glossary is necessary:

  1. Code Points: Code Points are the basic building blocks of Unicode; a Code Point is just an integer mapped to a meaning. The integer fits into 32 bits (well, 21 bits really), and the meaning can be a letter, a diacritic, a whitespace character, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
  2. Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points; for example a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.

These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can mostly be glossed over, because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.



UTF Primer

Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.

In UTF-X, X is the size in bits of a Code Unit; each Code Point is represented as one or several Code Units, depending on its magnitude (a small UTF-8 sketch follows the list):

  • UTF-8: 1 to 4 Code Units,
  • UTF-16: 1 or 2 Code Units,
  • UTF-32: 1 Code Unit.
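
To make the UTF-8 case concrete, here is a small sketch (assuming the std::string really holds valid UTF-8, and that the source/execution character set is UTF-8) that counts Code Points by skipping continuation bytes:

#include <cstdio>
#include <string>

// Count Code Points in a UTF-8 string: every byte that is NOT a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx) starts a Code Point.
std::size_t codePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}

int main()
{
    std::string s = "1 שלום";  // 6 Code Points encoded in 10 Code Units (bytes)
    std::printf("%zu bytes, %zu code points\n", s.size(), codePointCount(s));
    return 0;
}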


std::string and std::wstring.

  1. Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
  2. The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
  3. While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.

If you are only reading or composing strings, you should have little to no issues with std::string or std::wstring.

Troubles start when you start slicing and dicing; then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires using a Unicode-aware library.



Picking std::string or std::u32string?

If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint; though heavy use of Chinese may change the picture. As always, profile.

If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string work out of the box.

If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.



UTF-8 in std::string.

UTF-8 actually works quite well in std::string.

Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:

  • str.find('\n') works,
  • str.find("...") works for matching byte by byte [1],
  • str.find_first_of("\r\n") works if searching for ASCII characters.

Similarly, regex should mostly work out of the box. Since a sequence of characters such as "哈哈" ("haha") is just a sequence of bytes, basic search patterns should work out of the box.

Be wary, however, of character classes (such as [[:alnum:]]), as depending on the regex flavor and implementation it may or may not match Unicode characters.

Similarly, be wary of applying repetition operators to non-ASCII "characters": "哈?" may only consider the last byte to be optional; use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
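
A small sketch of both points (assuming a UTF-8 source and execution character set): byte-wise searching in a UTF-8 std::string, and grouping a multi-byte character before applying a quantifier:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string text = "1 哈 Hello\n";
    // Byte-wise searches are safe: an ASCII byte never occurs inside a
    // multi-byte sequence, and a Code Point cannot match mid-sequence.
    std::cout << text.find('\n') << '\n';
    std::cout << text.find("哈") << '\n';  // byte offset of the 3-byte character
    // Group the multi-byte character so '?' applies to all of its bytes.
    std::regex re("1 (哈 )?Hello");
    std::cout << std::boolalpha << std::regex_search(text, re) << '\n'; // true
    return 0;
}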

[1] The key concepts to look up are normalization and collation; these affect all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.

Can std::string::c_str() be used whenever a string literal is expected?

Looking at the documentation you linked to, it seems like you are trying to call the overload of AddMember taking two StringRefTypes (and an Allocator). StringRefType is a typedef for GenericStringRef<Ch>, which has two overloaded constructors taking a single argument:

template<SizeType N>
GenericStringRef(const CharType(&str)[N]) RAPIDJSON_NOEXCEPT;

explicit GenericStringRef(const CharType *str);

When you pass a string literal, the type is const char[N], where N is the length of the string + 1 (for the null terminator). This can be implicitly converted to a GenericStringRef<Ch> using the first constructor overload. However, std::string::c_str() returns a const char*, which cannot be converted implicitly to a GenericStringRef<Ch>, because the second constructor overload is declared explicit.

The error message you get from the compiler is caused by it choosing another overload of AddMember which is a closer match.
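
To see why the overloads behave this way, here is a stripped-down sketch (the StringRefLike type below is hypothetical, not the actual RapidJSON class): the array-reference constructor accepts a string literal implicitly, while the explicit pointer constructor blocks the implicit conversion from c_str():

#include <cstddef>
#include <string>

struct StringRefLike {
    const char* s;
    // Implicit: binds only to arrays of known size, i.e. string literals.
    template <std::size_t N>
    StringRefLike(const char (&str)[N]) : s(str) {}
    // Explicit: a plain pointer has to be wrapped on purpose.
    explicit StringRefLike(const char* str) : s(str) {}
};

void addMember(StringRefLike /*name*/, StringRefLike /*value*/) {}

int main()
{
    std::string key = "key";
    addMember("key", "value");                       // OK: array constructor, implicit
    // addMember(key.c_str(), "value");              // error: constructor is explicit
    addMember(StringRefLike(key.c_str()), "value");  // OK: conversion made explicit
    return 0;
}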

Does `std::wregex` support utf-16/unicode or only UCS-2?

The C++ standard doesn't enforce any encoding on std::string and std::wstring. They're simply a series of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.

  • What encoding does std::string.c_str() use?
  • Does std::string in C++ have an encoding format?

Similarly, std::regex and std::wregex also wrap around std::basic_string and CharT. Their constructors accept a std::basic_string, and the encoding used for the std::basic_string will also be used for the std::basic_regex. So what you said, "std::regex isn't utf-8 or does not support it", is wrong. If the current locale is UTF-8, then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).

On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2, because UTF-16 is self-synchronizing and a proper needle string can never match starting from the middle of a haystack. The same goes for UTF-8. If a tool doesn't understand UTF-16, then it's highly likely it doesn't know that UTF-8 is variable-length either, and will truncate the UTF-8 string in the middle.

Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.

https://en.wikipedia.org/wiki/UTF-8#Description
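
That property is easy to exploit directly. A small sketch (hypothetical helper, assuming the string holds valid UTF-8): from any byte index, back up over continuation bytes to find the first byte of the enclosing character:

#include <cstdio>
#include <string>

// Step back from any byte index to the first byte of the Code Point that
// contains it. Continuation bytes look like 10xxxxxx, so at most 3 steps
// are ever needed in valid UTF-8.
std::size_t codePointStart(const std::string& s, std::size_t i)
{
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
        --i;
    return i;
}

int main()
{
    std::string s = "a\xE5\x93\x88z";            // 'a', U+54C8 (3 bytes), 'z'
    std::printf("%zu\n", codePointStart(s, 3));  // prints 1: back to the lead byte
    return 0;
}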

The only things you need to care about are: avoid truncating in the middle of a character, and normalize the string before matching if necessary. The former issue can be avoided even in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments; replace them with a group instead.

"In other languages normalization takes place."

This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages".

If you want a little more assurance, then use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. You'll still need a UTF-16-aware library though; otherwise that'll still only work for regex strings that contain only literal words.

The better solution may be switching to another library like ICU regex. You can check Comparison of regular expression engines for some suggestions; it even has a column indicating native UTF-16 support for each library.

Related:

  • Do C++11 regular expressions work with UTF-8 strings?
  • How well is Unicode supported in C++11?
  • How do I properly use std::string on UTF-8 in C++?
  • How to use Unicode range in C++ regex

See also

  • Unicode Regular Expressions
  • Unicode Support in the Standard Library

Storing a Unicode UTF-8 string in std::string

If you were using C++11 then this would be easy:

std::string msg = u8"महसुस";

But since you are not, you can use escape sequences and not rely on the source file's charset to manage the encoding for you. This way your code is more portable (in case you accidentally save it in a non-UTF-8 format):

std::string msg = "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8"; // "महसुस"

Otherwise, you might consider doing a conversion at runtime instead:

#include <windows.h>
#include <string>

// Convert a UTF-16 std::wstring to a UTF-8 std::string using the Win32 API.
std::string toUtf8(const std::wstring &str)
{
    std::string ret;
    // First call computes the required buffer size in bytes.
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), (int)str.length(), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        // Second call performs the actual conversion into the buffer.
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), (int)str.length(), &ret[0], len, NULL, NULL);
    }
    return ret;
}

std::string msg = toUtf8(L"महसुस");

escaped std::string to NSString*

After reading all the internet sites, forums, blog posts, etc., I think I've found the reason for the problem I described above.

I couldn't believe this string was correct all the time, and I was right: it is garbage.
The C++ object of the Codegen class was incorrect all along.

This problem is explained well here.

I solved this codegen problem by modifying the class to be an Objective-C++ class.
My fork of the original echoprint-ios-example is here. Feel free to get it and use it.

A pull request has also been sent to the parent repository.

C++ character encoding when converting from string to const char* for Ruby FFI interface

The pointer returned by c_str() becomes invalid as soon as the std::string goes out of scope (or is modified).
If you intend to pass this value to your script, you should allocate memory and copy the string into the newly allocated space. See this example: http://www.cplusplus.com/reference/string/string/c_str/

You should also ensure the Ruby script correctly releases the memory.

I think this is what is explained there: https://github.com/ffi/ffi/wiki/Examples.

Example with a struct passed to Ruby from C:
https://github.com/ffi/ffi/wiki/Examples#-structs
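
For the string case, here is a minimal sketch of that idea (the function names make_message and free_message are hypothetical): copy the std::string into heap memory that outlives it, and export a matching release function for the Ruby side to call:

#include <cstdlib>
#include <cstring>
#include <string>

// Returns a heap-allocated copy of the string so it stays valid after the
// local std::string is destroyed. The caller (the Ruby script, via FFI)
// owns the memory and must release it through free_message().
extern "C" const char* make_message()
{
    std::string msg = "hello from C++";
    char* copy = static_cast<char*>(std::malloc(msg.size() + 1));
    if (copy)
        std::memcpy(copy, msg.c_str(), msg.size() + 1);
    return copy;
}

// Matching release function, so the memory is freed by the same allocator.
extern "C" void free_message(const char* p)
{
    std::free(const_cast<char*>(p));
}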


