Is a Wide Character String Literal Starting with L Like L"Hello World" Guaranteed to Be Encoded in Unicode

Is a wide character string literal starting with L, like L"Hello World", guaranteed to be encoded in Unicode?

The L prefix in front of a string literal simply means that each character in the string will be stored as a wchar_t. But this doesn't necessarily imply Unicode. For example, you could use a wide character string to encode GB 18030, a Chinese character set that covers the same repertoire as Unicode but encodes it differently. The C++03 standard doesn't have anything to say about Unicode (C++11, however, defines Unicode character types and string literals), so in C++03 it's up to you to represent Unicode strings properly.

Regarding string literals, Chapter 2 (Lexical Conventions) of the C++ standard defines a "basic source character set", which is essentially a subset of ASCII. So you are guaranteed that "abc" will be represented as a 3-byte string (not counting the null terminator), and that L"abc" will be represented as a 3 * sizeof(wchar_t)-byte string of wide characters.
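A minimal sketch of those size guarantees (using C++11 static_assert purely for brevity; the guarantees themselves already hold in C++03):

#include <cstdio>

int main()
{
    // Each literal array includes the terminating null character.
    static_assert(sizeof("abc") == 4 * sizeof(char), "3 chars + null");
    static_assert(sizeof(L"abc") == 4 * sizeof(wchar_t), "3 wide chars + null");

    // What is NOT fixed by the standard: the size of wchar_t itself
    // (commonly 2 bytes on Windows, 4 on Linux).
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
}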

The standard also provides "universal-character-names", which let you refer to non-ASCII characters using the \uXXXX (or \UXXXXXXXX) hexadecimal notation. Universal-character-names are defined in terms of ISO/IEC 10646 (effectively Unicode) code points, though how those code points end up encoded in the execution character set is implementation-defined. Using universal-character-names at least pins down exactly which characters your string contains; correct Unicode output then depends on the runtime environment supporting Unicode, having the appropriate fonts installed, and so on.
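For instance, this sketch keeps the source file pure ASCII while still naming a non-ASCII character (it assumes only that U+00E9 fits in a single wchar_t, which it does on common platforms):

#include <cstdio>
#include <cwchar>

int main()
{
    // U+00E9 ("é") spelled with a universal-character-name, so the source
    // file itself contains only ASCII.
    const wchar_t* cafe = L"caf\u00E9";
    std::printf("%zu wide characters\n", std::wcslen(cafe));  // 4
}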

As for string literals in C++03 source files, again there is no guarantee. If you have a Unicode string literal in your code which contains characters outside of the ASCII range, it is up to your compiler to decide how to interpret these characters. If you want to explicitly guarantee that the compiler will "do the right thing", you'd need to use \uXXXX notation in your string literals.

Is handling Unicode characters with wchar_t good? Doesn't it cause any problems?

You are confusing two different things:

  1. Storage

    How you store the bytes that make up your text string. Will that be in an array of char (single-byte) values? Or will it be in the form of wchar_t (wide) values?

  2. Encoding

    Your computer (and you!) needs to know what to do with the values in those bytes. What do they mean? Regardless of storage, they could be ASCII, some code page, UTF-8, UTF-16, UTF-32, Klingon, anything.

Usually, for historical reasons, we pick char for single-byte encodings (e.g. ASCII) and UTF-8, and wchar_t for UTF-16 (particularly on Windows, which has 16-bit wchar_ts and generally assumes this combination throughout its API — note that it inaccurately calls this simply "Unicode").
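A minimal sketch of the two conventional pairings (the byte values shown are the real UTF-8 and UTF-16 encodings of "é"; nothing here is enforced by the types themselves):

#include <cstdio>
#include <string>

int main()
{
    // The same character under the two conventional pairings.
    std::string  utf8  = "\xC3\xA9";   // char storage, UTF-8 bytes
    std::wstring utf16 = L"\u00E9";    // wchar_t storage, one UTF-16 unit
                                       // (one UTF-32 unit where wchar_t is 4 bytes)

    std::printf("char units: %zu, wchar_t units: %zu\n",
                utf8.size(), utf16.size());  // 2 and 1
}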

Performance doesn't really come into it, though you'll save time and energy converting between different encodings if you pick one and stick to it (and use a storage mechanism that fits the string libraries you're using). Sometimes your OS will help determine that choice, but we can't tell you what it will be.

Similarly, your statements about what "works" and "doesn't work" are very vague, and likely false.

We can't say what's "ok" without knowing the requirements of your project, and what sort of computer it'll run on, and with what technologies. I will, though, make a tremendous generalisation: in the olden days, you might have used Mazovia encoding, an altered codepage that included Polish characters; nowadays, you probably want to make portability and interchange as easy as possible (because why not?!), so you'd be encouraged to stick with UTF-16 over wchar_t on Windows, and UTF-8 over char otherwise.

(From C++20 we'll also have char8_t, a storage mechanism specifically designed to signify that it stores UTF-8-encoded data; however, it's going to be some time before you see this in widespread use, if at all. You can read more about C++'s character types in cppreference.com's article on "Fundamental types".)
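A minimal sketch, assuming a compiler running in C++20 mode (where the u8 prefix yields char8_t rather than char):

#include <cstdio>
#include <string>

int main()
{
    // The UTF-8 intent is now visible in the type system rather than
    // being a mere convention.
    std::u8string s = u8"caf\u00E9";            // "é" is two UTF-8 code units
    std::printf("%zu code units\n", s.size());  // 5
}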

Why is there an L when assigning a wstring?

You're on the right track that this has something to do with LPSTR, LPWSTR, and LPCSTR, but that's not precisely what's going on here. Those types aren't actually a part of standard C++ and are a Microsoft-specific set of types that are used with the Windows API.

C++ has two built-in types for characters: char, which is used with std::string, and wchar_t, which is used with std::wstring. If you write a string literal in regular quotes, C++ treats it as a string made of chars, which can be used with std::string. To tell C++ that you're writing a string literal made of wchar_ts (which is what you'll need in order to work with std::wstring), you prefix it with an L; that's simply how the language is defined.

Note that this is totally independent of the Microsoft types mentioned above.
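A minimal sketch of the two pairings:

#include <string>

int main()
{
    std::string  narrow = "Hello";   // array of char
    std::wstring wide   = L"Hello";  // array of wchar_t; the L prefix is required

    // std::wstring bad = "Hello";   // error: const char* does not
    //                               // convert to std::wstring
    (void)narrow;
    (void)wide;
}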

C++: Native to Managed String Conversion Problem (Maybe Character Set)?

LPCWSTR returnLpcuwstr = returnWString.c_str(); // pointer into returnWString's internal buffer
return returnLpcuwstr;                          // returnWString is destroyed here, so the pointer dangles

This is returning a pointer to data that gets freed immediately after the return, when returnWString goes out of scope. The returned pointer is invalid before the receiver can even use it. This is undefined behavior.

To do what you are attempting, you will have to return a pointer to dynamically allocated memory, and then the receiver will have to free that memory when done using it.

Assuming that by "managed" you are referring to .NET: .NET's marshaller frees unmanaged memory using CoTaskMemFree(), so if you are using default marshaling, the returned pointer must point at memory that was allocated with CoTaskMemAlloc() or an equivalent (SysAllocString...(), for instance).
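A minimal sketch of that approach, assuming a Windows DLL consumed by .NET's default marshaller; the exported function name GetMessageText and the string contents are hypothetical:

#include <windows.h>
#include <objbase.h>   // CoTaskMemAlloc
#include <cstring>
#include <string>

extern "C" __declspec(dllexport) LPWSTR GetMessageText()
{
    std::wstring returnWString = L"Hello from native code";

    // Allocate with the allocator that the default marshaller pairs with
    // CoTaskMemFree(), including room for the terminating null.
    size_t bytes = (returnWString.size() + 1) * sizeof(wchar_t);
    LPWSTR buffer = static_cast<LPWSTR>(CoTaskMemAlloc(bytes));
    if (buffer != nullptr)
        std::memcpy(buffer, returnWString.c_str(), bytes);
    return buffer;  // stays valid after returnWString is destroyed
}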

Otherwise, if you are not using default marshaling (i.e., you are calling Marshal.PtrToStringUni() manually instead), then you will have to have the .NET code pass the memory pointer back to your C++ code so that it can free the memory properly. In that case your C++ code can allocate the memory however you like (as long as it is still allocated dynamically, so that it survives past the function return).
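As an alternative to the sketch above, here is the manual route; the exported names GetMessageText and FreeMessageText are again hypothetical, and the allocation/deallocation pair must match:

#include <cwchar>
#include <string>

extern "C" __declspec(dllexport) wchar_t* GetMessageText()
{
    std::wstring returnWString = L"Hello from native code";

    // Plain new[] is fine here because our own FreeMessageText(), not the
    // .NET marshaller, will release the buffer.
    wchar_t* buffer = new wchar_t[returnWString.size() + 1];
    std::wmemcpy(buffer, returnWString.c_str(), returnWString.size() + 1);
    return buffer;
}

extern "C" __declspec(dllexport) void FreeMessageText(wchar_t* p)
{
    delete[] p;  // must match the new[] above
}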

Does `std::wregex` support UTF-16/Unicode or only UCS-2?

The C++ standard doesn't enforce any encoding on std::string or std::wstring; they're simply sequences of CharT. Only std::u8string, std::u16string, and std::u32string have a defined encoding. See:

  • What encoding does std::string.c_str() use?
  • Does std::string in c++ has encoding format
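A minimal sketch of that point: std::string just holds code units (bytes) and attaches no meaning to them.

#include <cstdio>
#include <string>

int main()
{
    // Two bytes that happen to be the UTF-8 encoding of "é";
    // std::string neither knows nor cares.
    std::string s = "\xC3\xA9";
    std::printf("%zu bytes\n", s.size());  // 2 code units, 1 character
}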

Similarly, std::regex and std::wregex also wrap around std::basic_string and CharT. Their constructors accept a std::basic_string, and the encoding used for that std::basic_string is also the encoding the std::basic_regex operates on. So what you said, that "std::regex isn't utf-8 or does not support it", is wrong. If the current locale is UTF-8, then std::regex and std::string will be UTF-8 (yes, modern Windows does support UTF-8 locales).
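A minimal sketch under the Windows convention, where wide strings are UTF-16 (on Linux wchar_t is typically 32-bit, but the code is identical):

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Pattern and haystack use the same wide encoding, so the regex
    // engine compares code units consistently.
    std::wstring haystack = L"one caf\u00E9 here";
    std::wregex  pattern(L"caf\u00E9");

    std::wcout << std::boolalpha
               << std::regex_search(haystack, pattern) << L'\n';  // true
}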

On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts strings in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2, because UTF-16 is self-synchronizing and a proper needle string can never match starting from the middle of a haystack character. The same goes for UTF-8. If a tool doesn't understand UTF-16, it's highly likely it doesn't know that UTF-8 is variable-length either, and will truncate UTF-8 mid-character just the same.

Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.

https://en.wikipedia.org/wiki/UTF-8#Description

The only things you need to care about are avoiding truncation in the middle of a character, and normalizing the string before matching if necessary. The former issue can be avoided even in UCS-2-only regex engines if you never use characters outside the BMP in a character class, as noted in the comments; replace them with an alternation group instead, as sketched below.
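For instance, U+1F600 and U+1F601 are non-BMP, so each becomes a surrogate pair when wchar_t is 16 bits; in an alternation group they are matched as intact two-unit sequences, whereas a character class like [\U0001F600\U0001F601] would degenerate into a set of lone surrogates:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::wstring input = L"\U0001F600";  // non-BMP character (an emoji)

    // Alternation keeps each character whole even when it is stored as
    // a surrogate pair; a character class would not.
    std::wregex pattern(L"(?:\U0001F600|\U0001F601)");

    std::wcout << std::boolalpha
               << std::regex_search(input, pattern) << L'\n';  // true
}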

"In other languages normalization takes place."

This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages".

If you want a little more type-level assurance, you could use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively, although note that the standard library only provides std::regex_traits specializations for char and wchar_t, so you'd have to supply your own traits class. And you'd still need a UTF-16-aware engine underneath; otherwise it will still only work correctly for patterns made of literal text.

The better solution may be to switch to another library such as ICU regex. You can check Wikipedia's Comparison of regular expression engines for suggestions; it even has a column indicating native UTF-16 support for each library.

Related:

  • Do C++11 regular expressions work with UTF-8 strings?
  • How well is Unicode supported in C++11?
  • How do I properly use std::string on UTF-8 in C++?
  • How to use Unicode range in C++ regex

See also

  • Unicode Regular Expressions
  • Unicode Support in the Standard Library

