std::wstring vs std::string

std::string, wstring, u16/32string clarification

The difference is that the details of char and wchar_t are implementation-defined, while the encodings of char16_t and char32_t are explicitly defined by the C++11 standard (UTF-16 and UTF-32, respectively).

This means that wstring is likely to store the same data as either u16string or u32string, but we don't know which one. And it is allowed for some odd implementation to make them all different, as the size and encoding of the old char types are just not defined by the standard.
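
For illustration, here is a minimal sketch of the three wide string types (the comments describe typical implementations; the wstring line is not a standard guarantee):

#include <string>

int main()
{
    // The encodings of these two are fixed by the C++11 standard:
    std::u16string u16 = u"\u65E5\u672C"; // UTF-16 code units for 日本
    std::u32string u32 = U"\u65E5\u672C"; // UTF-32 code points for 日本

    // wchar_t is whatever the implementation chose: typically 16-bit
    // UTF-16 units on Windows and 32-bit UTF-32 on Linux/macOS.
    std::wstring w = L"\u65E5\u672C";
}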

Compare std::wstring and std::string

Since you asked, here are my standard conversion functions from string to wide string, implemented using the C++ std::string and std::wstring classes.

First off, make sure to start your program with a call to std::setlocale:

#include <clocale>

int main()
{
    std::setlocale(LC_CTYPE, ""); // before any string operations
}

Now for the functions. First off, getting a wide string from a narrow string:

#include <string>
#include <vector>
#include <iostream>
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>

// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
    return s;
}

// Real worker
std::wstring get_wstring(const std::string & s)
{
    const char * cs = s.c_str();
    const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);

    if (wn == size_t(-1))
    {
        std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
        return L"";
    }

    std::vector<wchar_t> buf(wn + 1);
    const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);

    if (wn_again == size_t(-1))
    {
        std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
        return L"";
    }

    assert(cs == NULL); // successful conversion

    return std::wstring(buf.data(), wn);
}

And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:

// Dummy
std::string get_locale_string(const std::string & s)
{
    return s;
}

// Real worker
std::string get_locale_string(const std::wstring & s)
{
    const wchar_t * cs = s.c_str();
    const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);

    if (wn == size_t(-1))
    {
        std::cout << "Error in wcsrtombs(): " << errno << std::endl;
        return "";
    }

    std::vector<char> buf(wn + 1);
    const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);

    if (wn_again == size_t(-1))
    {
        std::cout << "Error in wcsrtombs(): " << errno << std::endl;
        return "";
    }

    assert(cs == NULL); // successful conversion

    return std::string(buf.data(), wn);
}

Some notes:

  • If you don't have std::vector::data(), you can say &buf[0] instead.
  • I've found that the r-style (restartable) conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1); and wcstombs(buf.data(), cs, wn + 1); (a sketch of that variant follows this list).
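
If you go that route, a minimal sketch of the worker using the non-restartable mbstowcs (same includes as above; the function name here is mine) could look like this:

// Variant of get_wstring() using mbstowcs() instead of mbsrtowcs().
std::wstring get_wstring_nonrestartable(const std::string & s)
{
    const char * cs = s.c_str();
    const size_t wn = std::mbstowcs(NULL, cs, 0); // query required length

    if (wn == size_t(-1))
    {
        std::cout << "Error in mbstowcs(): " << errno << std::endl;
        return L"";
    }

    std::vector<wchar_t> buf(wn + 1);
    const size_t wn_again = std::mbstowcs(buf.data(), cs, wn + 1);

    if (wn_again == size_t(-1))
    {
        std::cout << "Error in mbstowcs(): " << errno << std::endl;
        return L"";
    }

    return std::wstring(buf.data(), wn);
}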


In response to your question, if you want to compare two strings, you can convert both of them to wide string and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from your known encoding to WCHAR and then compare with the wide string.
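
For illustration only, a minimal iconv() sketch (the helper name, encoding names, and buffer size are my assumptions, not part of the answer above):

#include <iconv.h>
#include <cerrno>
#include <string>
#include <stdexcept>

// Hypothetical helper: convert a byte string from one named encoding to another.
std::string convert_encoding(const std::string & in, const char * from, const char * to)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open() failed");

    std::string out;
    char buf[1024];
    char * inptr = const_cast<char *>(in.data());
    size_t inleft = in.size();

    while (inleft > 0)
    {
        char * outptr = buf;
        size_t outleft = sizeof buf;
        if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1 && errno != E2BIG)
        {
            iconv_close(cd);
            throw std::runtime_error("iconv() failed");
        }
        out.append(buf, sizeof buf - outleft); // flush whatever was converted
    }

    iconv_close(cd);
    return out;
}

// Usage, e.g.: convert_encoding(fileContents, "ISO-8859-1", "UTF-8");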

Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
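
For example, a sketch using ICU's Normalizer2 (the exact calls reflect my understanding of the ICU C++ API and should be checked against its documentation):

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <string>

// Normalize two UTF-8 strings to NFC before comparing them.
bool equal_after_nfc(const std::string & a, const std::string & b)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 * nfc = icu::Normalizer2::getNFCInstance(status);

    icu::UnicodeString ua = icu::UnicodeString::fromUTF8(a);
    icu::UnicodeString ub = icu::UnicodeString::fromUTF8(b);

    return U_SUCCESS(status)
        && nfc->normalize(ua, status) == nfc->normalize(ub, status)
        && U_SUCCESS(status);
}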

Convert from std::wstring to std::string

std::string simply holds an array of bytes. It does not hold information about the encoding in which these bytes are supposed to be interpreted, nor do the standard library functions or std::string member functions generally assume anything about the encoding. They handle the contents as just an array of bytes.

Therefore when the contents of a std::string need to be presented, the presenter needs to make some guess about the intended encoding of the string, if that information is not provided in some other way.
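
For instance, the same byte sequence can be interpreted in several ways, and nothing in std::string itself says which one is intended (a minimal sketch):

#include <cstdio>
#include <string>

int main()
{
    std::string s = "\xE6\x97\xA5\xE6\x9C\xAC"; // the UTF-8 bytes of 日本

    // std::string only knows it holds 6 bytes; whether they mean two
    // Japanese characters (UTF-8) or six cp1252 characters is up to
    // whatever code eventually interprets them.
    std::printf("size = %zu bytes\n", s.size());
    for (unsigned char c : s)
        std::printf("%02X ", c);
    std::printf("\n");
}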

I am assuming that the encoding you intend to convert to is UTF8, given that you are using std::codecvt_utf8.

But if you are using Visual Studio, the debugger simply assumes one specific encoding, at least by default. That encoding is not UTF8, but most likely code page 1252.

As verification, Python gives the following:

>>> '日本'.encode('utf8').decode('cp1252')
'æ—¥æœ¬'

Your string does seem to be the UTF8 encoding of 日本 interpreted as if it was cp1252 encoded.

Therefore the conversion seems to have worked as intended.


As mentioned by @MarkTolonen in the comments, the encoding to assume for a string variable can be specified to UTF8 in the Visual Studio debugger with the s8 specifier, as explained in the documentation.

Convert std::string to std::wstring and vice versa

std::wstring holds wchar_t elements, and wchar_t has a different size across platforms (2 bytes on Windows, 4 bytes elsewhere), so std::wstring uses different encodings across platforms (UTF-16 on Windows, UTF-32 elsewhere), just as std::string can hold different 8-bit encodings (UTF-8, ISO-8859-x, Windows-125x, etc.).

So, you are not asking how to convert between std::string and std::wstring themselves, but how to convert between different encodings. And the fact is, C++ simply doesn't support that natively. C++11 tried to address this with std::codecvt and std::wstring_convert, but they are limited and, as you have noted, have since been deprecated in C++17, with no replacement in sight.

So, your best option is to use 3rd-party cross-platform libraries, such as ICU, iconv, etc.

When should we prefer wide-character strings?

If we talk about Windows, then I'd use std::wstring (because we often need its string features), or wchar_t* if you just pass strings around.

Note that Microsoft recommends this here: Working with Strings

Windows natively supports Unicode strings for UI elements, file names,
and so forth. Unicode is the preferred character encoding, because it
supports all character sets and languages. Windows represents Unicode
characters using UTF-16 encoding, in which each character is encoded
as a 16-bit value. UTF-16 characters are called wide characters, to
distinguish them from 8-bit ANSI characters. The Visual C++ compiler
supports the built-in data type wchar_t for wide characters.

Also:

When Microsoft introduced Unicode support to Windows, it eased the
transition by providing two parallel sets of APIs, one for ANSI
strings and the other for Unicode strings. [...] Internally, the ANSI
version translates the string to Unicode.

Also:

New applications should always call the Unicode versions. Many world
languages require Unicode. If you use ANSI strings, it will be
impossible to localize your application. The ANSI versions are also
less efficient, because the operating system must convert the ANSI
strings to Unicode at run time. [...] Most newer APIs in Windows have
just a Unicode version, with no corresponding ANSI version.

C++ issue with conversion of std::string to std::wstring - Windows vs Linux

This worked for me on POSIX.

#include <codecvt>
#include <string>
#include <locale>

int main() {

    std::string a = "pokémon";
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> cv;
    std::wstring wide = cv.from_bytes(a);

    return 0;
}

The wstring holds the correct string at the end.

Important note by @NathanOliver: std::codecvt_utf8_utf16 was deprecated in C++17 and may be removed from the standard in a future version.
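
For the opposite direction, the same (deprecated) facility offers to_bytes; a minimal sketch under the same caveat:

#include <codecvt>
#include <string>
#include <locale>

int main() {

    std::wstring wide = L"pokémon";
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> cv;
    std::string narrow = cv.to_bytes(wide); // UTF-8 encoded result

    return 0;
}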

How to handle a value that may be an std::string or std::wstring

I suggest doing it in a different form:

enum class Encoding {
    UTF8,
    UTF16
};

Encoding readFile(const char* path, std::string& utf8Result, std::wstring& utf16result);

Now read the file into the correct object and return the actual encoding as the result.
Callers of this function can then write generic code around it, using a template generalized over std::basic_string:

template <class T>
void doNext(const std::basic_string<T>& result) { /*...*/ }

std::string possibleUTF8Result;
std::wstring possibleUTF16Result;
auto res = readFile("text.txt", possibleUTF8Result, possibleUTF16Result);
if (res == Encoding::UTF8) {
    doNext(possibleUTF8Result);
} else {
    doNext(possibleUTF16Result);
}

Note: wstring is UTF-16 on Windows but UTF-32 on Linux.
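
One possible way to implement readFile (my sketch, reusing the Encoding enum above; real code would also handle big-endian and BOM-less UTF-16 files) is to sniff a byte order mark and fall back to UTF-8:

#include <fstream>
#include <iterator>
#include <string>

Encoding readFile(const char* path, std::string& utf8Result, std::wstring& utf16result)
{
    std::ifstream in(path, std::ios::binary);
    std::string bytes((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());

    // UTF-16 LE BOM: copy the 16-bit units after the BOM (naive; assumes a little-endian host).
    if (bytes.size() >= 2 && (unsigned char)bytes[0] == 0xFF && (unsigned char)bytes[1] == 0xFE)
    {
        for (size_t i = 2; i + 1 < bytes.size(); i += 2)
            utf16result.push_back((wchar_t)((unsigned char)bytes[i] | ((unsigned char)bytes[i + 1] << 8)));
        return Encoding::UTF16;
    }

    // Strip a UTF-8 BOM if present, otherwise assume plain UTF-8.
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        bytes.erase(0, 3);
    utf8Result = bytes;
    return Encoding::UTF8;
}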

Why doesn't STL officially support std::string to std::wstring conversions?

The character encoding of std::string is not defined by the C++ standard; a std::string can hold any encoding that can be represented using 1-byte char elements, including UTF-7/8, ISO-8859-x, Windows-125x, etc.

Also, the size of wchar_t is implementation-defined, so the encoding of std::wstring can vary, too. On Windows, wchar_t is 2 bytes, so std::wstring uses UCS-2/UTF-16 encoding, whereas on other platforms wchar_t is 4 bytes and std::wstring uses UCS-4/UTF-32.

So, there is no single conversion that covers every std::string <-> std::wstring combination across all platforms and use-cases. You need to know the encoding of the source string, and the intended encoding of the target string, in order to perform a conversion.

And yes, the C++ standard did provide std::codecvt and std::wstring_convert/std::wbuffer_convert for this task, but they have been deprecated, as you have noted. There is no standard replacement provided (yet?).

So, you are best off using 3rd party Unicode API/libraries to handle character conversions.
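
On Windows specifically, the platform's own API can also perform the conversion once both encodings are known; a minimal sketch (the helper name is mine, not part of the answer):

#include <windows.h>
#include <string>

// UTF-8 std::string -> UTF-16 std::wstring via the Win32 API.
std::wstring utf8_to_wide(const std::string & utf8)
{
    if (utf8.empty()) return L"";

    const int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0); // first pass: required length
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);           // second pass: convert
    return wide;
}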


