How to Convert a UTF-8 std::string to a UTF-16 std::wstring

How to convert UTF-8 std::string to UTF-16 std::wstring?

Here's some code. It's only lightly tested and there are probably a few possible improvements. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it decides the input string is not valid UTF-8 it throws an exception; otherwise it returns the equivalent UTF-16 wstring.

#include <stdexcept>
#include <string>
#include <vector>

std::wstring utf8_to_utf16(const std::string& utf8)
{
    std::vector<unsigned long> unicode;
    size_t i = 0;
    // First pass: decode the UTF-8 bytes into Unicode code points.
    while (i < utf8.size())
    {
        unsigned long uni;
        size_t todo;
        unsigned char ch = utf8[i++];
        if (ch <= 0x7F)
        {
            uni = ch;
            todo = 0;
        }
        else if (ch <= 0xBF)
        {
            // A continuation byte cannot start a sequence.
            throw std::logic_error("not a UTF-8 string");
        }
        else if (ch <= 0xDF)
        {
            uni = ch & 0x1F;
            todo = 1;
        }
        else if (ch <= 0xEF)
        {
            uni = ch & 0x0F;
            todo = 2;
        }
        else if (ch <= 0xF7)
        {
            uni = ch & 0x07;
            todo = 3;
        }
        else
        {
            throw std::logic_error("not a UTF-8 string");
        }
        // Consume the expected continuation bytes (10xxxxxx).
        for (size_t j = 0; j < todo; ++j)
        {
            if (i == utf8.size())
                throw std::logic_error("not a UTF-8 string");
            unsigned char ch = utf8[i++];
            if (ch < 0x80 || ch > 0xBF)
                throw std::logic_error("not a UTF-8 string");
            uni <<= 6;
            uni += ch & 0x3F;
        }
        // Surrogate code points and anything above U+10FFFF are invalid.
        if (uni >= 0xD800 && uni <= 0xDFFF)
            throw std::logic_error("not a UTF-8 string");
        if (uni > 0x10FFFF)
            throw std::logic_error("not a UTF-8 string");
        unicode.push_back(uni);
    }
    // Second pass: encode the code points as UTF-16, using surrogate pairs
    // for anything outside the BMP.
    std::wstring utf16;
    for (size_t i = 0; i < unicode.size(); ++i)
    {
        unsigned long uni = unicode[i];
        if (uni <= 0xFFFF)
        {
            utf16 += (wchar_t)uni;
        }
        else
        {
            uni -= 0x10000;
            utf16 += (wchar_t)((uni >> 10) + 0xD800);
            utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
        }
    }
    return utf16;
}
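
A quick usage sketch (the names here are just for illustration):

#include <iostream>

int main()
{
    // "äöüß" written out as explicit UTF-8 bytes so the example does not
    // depend on the source file's encoding.
    std::string utf8 = "\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F";
    std::wstring utf16 = utf8_to_utf16(utf8);
    std::cout << utf16.size() << "\n"; // 4 UTF-16 code units, one per character
    return 0;
}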

How is a const std::wstring encoded, and how to change it to UTF-16

As clarified in the comments, the source .cpp file is UTF-8 encoded. Without a BOM, and without an explicit /source-charset:utf-8 switch, the Visual C++ compiler defaults to assuming the source file is saved in the active codepage encoding. From the Set Source Character Set documentation:

By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you specify a character set name or code page by using the /source-charset option.

The UTF-8 encoding of äöüß is C3 A4 C3 B6 C3 BC C3 9F, and therefore the line:

    std::wstring wstr = L"äöüß";

is seen by the compiler as:

    std::wstring wstr = L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F"`;

Assuming the active codepage to be the usual Windows-1252, the (extended) characters map as:

    win-1252    char    unicode
    \xC3        Ã       U+00C3
    \xA4        ¤       U+00A4
    \xB6        ¶       U+00B6
    \xBC        ¼       U+00BC
    \x9F        Ÿ       U+0178

Therefore L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F" gets translated to:

    std::wstring wstr = L"\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC\u00C3\u0178"`;

To avoid such (mis)translation, Visual C++ needs to be told that the source file is encoded as UTF-8 by passing an explicit /source-charset:utf-8 (or /utf-8) compiler switch. For CMake based projects, this can be done using add_compile_options as shown at Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819.
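
As a side note (not part of the original answer), the encoding problem can also be sidestepped for individual literals by spelling the characters as universal character names, which mean the same thing regardless of the source character set:

    // ä ö ü ß written as Unicode code points, independent of the file's encoding
    std::wstring wstr = L"\u00E4\u00F6\u00FC\u00DF";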

Convert C++ std::string to UTF-16-LE encoded string

Apologies up front... this will be an ugly reply with some long code. I ended up using the following function, effectively compiling iconv into my Windows application file by file :)

Hope this helps.

#include <iconv.h>

#include <cerrno>
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

char* conver(const char* in, size_t in_len, size_t* used_len)
{
    const int CC_MUL = 2; // 16 bit
    setlocale(LC_ALL, "");
    char* t1 = setlocale(LC_CTYPE, "");
    char* locn = (char*)calloc(strlen(t1) + 1, sizeof(char));
    if (locn == NULL)
    {
        return 0;
    }

    strcpy(locn, t1);
    // The locale name looks like "en_US.UTF-8" or "English_US.1252"; the part
    // after the dot is the source encoding to hand to iconv.
    const char* enc = strchr(locn, '.') + 1;

#if _WINDOWS
    // On Windows the codepage number needs a "WINDOWS-" prefix for iconv.
    std::string win = "WINDOWS-";
    win += enc;
    enc = win.c_str();
#endif

    iconv_t foo = iconv_open("UTF-16LE", enc);

    if (foo == (iconv_t)-1)
    {
        if (errno == EINVAL)
        {
            fprintf(stderr, "Conversion from %s is not supported\n", enc);
        }
        else
        {
            fprintf(stderr, "Initialization failure:\n");
        }
        free(locn);
        return 0;
    }

    size_t out_len = CC_MUL * in_len;
    size_t saved_in_len = in_len;
    iconv(foo, NULL, NULL, NULL, NULL); // reset the conversion state
    char* converted = (char*)calloc(out_len, sizeof(char));
    char* converted_start = converted;
    char* t = const_cast<char*>(in);
    size_t ret = iconv(foo,
                       &t,
                       &in_len,
                       &converted,
                       &out_len);
    iconv_close(foo);
    *used_len = CC_MUL * saved_in_len - out_len;

    if (ret == (size_t)-1)
    {
        switch (errno)
        {
        case EILSEQ:
            fprintf(stderr, "EILSEQ\n");
            break;
        case EINVAL:
            fprintf(stderr, "EINVAL\n");
            break;
        }

        perror("iconv");
        free(converted_start);
        free(locn);
        return 0;
    }
    else
    {
        free(locn);
        return converted_start;
    }
}
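
A quick usage sketch (the caller owns the returned buffer; the names here are just illustrative):

const char* text = "hello";
size_t used = 0;
char* utf16le = conver(text, strlen(text), &used);
if (utf16le != NULL)
{
    // "used" now holds the number of UTF-16LE bytes produced.
    // ... hand the buffer to whatever needs it ...
    free(utf16le);
}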

UTF8 to UTF16 conversion using std::filesystem::path

What kind of drawbacks can be expected from such a converter?

Well, let's get the most obvious drawback out of the way. For a user who doesn't know what you're doing, it makes no sense. Doing UTF-8-to-16 conversion by using a path type is bonkers, and should be seen immediately as a code smell. It's the kind of awful hack you do when you are needlessly averse to just downloading a simple library that would do it correctly.

Also, it doesn't have to work. path is meant for storing... paths. Hence the name. Specifically, they're meant for storing paths in a way easily consumed by the filesystem in question. As such, the string stored in a path can carry any limitations that the filesystem wants to put on it, beyond the handful of things the C++ standard requires it to do.

For example, if the filesystem is case-insensitive (or even just ASCII-case-insensitive), it is a legitimate implementation to have it just case-convert all strings to lowercase when they are stored in a path. Or to case-convert them when you extract them from a path. Or anything of the like.

path can convert all of your \s into /s. Or your :s into /s. Or pull any other implementation-dependent trick it wants to.

If you're afraid of using a deprecated facility, just download a simple UTF-8/16 converting library. Or write one yourself; it isn't that difficult.
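
For reference, the conversion the question is asking about presumably looks something like the sketch below (C++17; note that u8path is itself deprecated in C++20), with all of the caveats described above:

#include <filesystem>
#include <string>

std::u16string utf8_to_utf16_via_path(const std::string& utf8)
{
    // Round-trips the text through a path object purely for its encoding
    // conversion machinery; subject to every filesystem-specific caveat above.
    return std::filesystem::u8path(utf8).u16string();
}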

Issue when converting utf16 wide std::wstring to utf8 narrow std::string for rare characters

The problem is that std::wstring_convert<std::codecvt_utf8<wchar_t>> converts from UCS-2, not from UTF-16. Characters inside the BMP (U+0000..U+FFFF) have identical encodings in both UCS-2 and UTF-16 and so will work, but characters outside the BMP (U+10000..U+10FFFF), such as your Emoji, do not exist in UCS-2 at all. This means the conversion doesn't understand the character and produces incorrect UTF-8 bytes (technically, it converts each half of the UTF-16 surrogate pair into a separate UTF-8 character).

You need to use std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> instead.
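
A minimal sketch of the corrected conversion, in the question's context of a 16-bit, UTF-16 wchar_t (the facility is deprecated since C++17 but still shipped by the major implementations):

#include <codecvt>
#include <locale>
#include <string>

std::string utf16_to_utf8(const std::wstring& utf16)
{
    // codecvt_utf8_utf16 treats the wchar_t sequence as UTF-16, so surrogate
    // pairs (e.g. emoji) are converted correctly instead of per code unit.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(utf16);
}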

How does one convert std::u16string to std::wstring using codecvt?

The std::wstring_convert and std::codecvt... classes are deprecated from C++17 onward. There is no longer a standard way to convert between the various string classes.

If your compiler still supports the classes, you can certainly use them. However, you cannot convert directly from std::u16string to std::wstring (and vice versa) with them. You will have to convert to an intermediate UTF-8 std::string first, and then convert that afterwards, e.g.:

std::u16string utf16 = ...;

std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> utf16conv;
std::string utf8 = utf16conv.to_bytes(utf16);

std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> wconv;
std::wstring wstr = wconv.from_bytes(utf8);

Just know that this approach will break when the classes are eventually dropped from the standard library.

Using std::copy() (or simply the various std::wstring data construct/assign methods) will work only on Windows, where wchar_t and char16_t are both 16-bit in size representing UTF-16:

std::u16string utf16 = ...;
std::wstring wstr;

#ifdef _WIN32
wstr.reserve(utf16.size());
std::copy(utf16.begin(), utf16.end(), std::back_inserter(wstr));
/*
or: wstr = std::wstring(utf16.begin(), utf16.end());
or: wstr.assign(utf16.begin(), utf16.end());
or: wstr = std::wstring(reinterpret_cast<const wchar_t*>(utf16.c_str()), utf16.size());
or: wstr.assign(reinterpret_cast<const wchar_t*>(utf16.c_str()), utf16.size());
*/
#else
// do something else ...
#endif

But, on other platforms, where wchar_t is 32-bit in size representing UTF-32, you will need to actually convert the data, using the code shown above, or a platform-specific API or a 3rd party Unicode library that can do the data conversion, such as libiconv, ICU, etc.
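
For the non-Windows branch, a hand-rolled sketch of the UTF-16 to UTF-32 expansion (assuming a 32-bit wchar_t; unpaired surrogates are simply passed through) could look like this:

#include <string>

std::wstring utf16_to_wstring_utf32(const std::u16string& utf16)
{
    std::wstring wstr;
    for (size_t i = 0; i < utf16.size(); ++i)
    {
        char16_t c = utf16[i];
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < utf16.size())
        {
            // Combine a high/low surrogate pair into one code point.
            char16_t low = utf16[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                wstr += (wchar_t)(0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00));
                ++i;
                continue;
            }
        }
        wstr += (wchar_t)c;
    }
    return wstr;
}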

How to get C++ std::string from Little-Endian UTF-16 encoded bytes

For the sake of completeness, here's the simplest iconv-based conversion I came up with:

#include <iconv.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

auto iconv_eng = ::iconv_open("UTF-8", "UTF-16LE");
if (reinterpret_cast<::iconv_t>(-1) == iconv_eng)
{
    std::cerr << "Unable to create ICONV engine: " << strerror(errno) << std::endl;
}
else
{
    // src       a char * to the UTF-16LE bytes (iconv advances it as it reads)
    // src_size  the maximum number of bytes to convert
    // dest      a char * to the UTF-8 buffer to fill (iconv advances it as it writes)
    // dest_size the maximum number of bytes to write
    char* dest_start = dest; // remember where the output began
    if (static_cast<std::size_t>(-1) == ::iconv(iconv_eng, &src, &src_size, &dest, &dest_size))
    {
        std::cerr << "Unable to convert from UTF16: " << strerror(errno) << std::endl;
    }
    else
    {
        // Build the result from the output buffer, not the (consumed) input.
        std::string utf8_str(dest_start, dest - dest_start);
    }
    ::iconv_close(iconv_eng);
}
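
For context, here is one way the src/dest variables referenced in that fragment might be set up (a sketch; the names and the 2x output bound are my own assumptions):

#include <cstddef>
#include <string>
#include <vector>

std::u16string input = u"hello";   // native byte order, i.e. UTF-16LE on a little-endian machine

// iconv works on char pointers and byte counts for both buffers.
char* src = reinterpret_cast<char*>(const_cast<char16_t*>(input.data()));
std::size_t src_size = input.size() * sizeof(char16_t);

std::vector<char> out(src_size * 2); // generous upper bound for the UTF-8 output
char* dest = out.data();
std::size_t dest_size = out.size();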

UTF8 data to std::string or std::wstring

Storing UTF-8 in a std::string is no more than storing a sequence of bytes in a vector. std::string is not aware of any encoding whatsoever, and member functions like find, or <algorithm> functions like std::find, will not work as you expect once you need to go beyond standard ASCII. So it is up to you how you handle this situation: you can convert your input (L"Ñ") to a UTF-8 byte sequence and try to find that in the std::string, or you can convert your string to a wstring and work directly on it. IMHO, in your case, where you have to manipulate the input (search, extract words, split by letters, replace, all of this beyond the ASCII range), you are better off sticking to wstring and converting to a UTF-8 std::string just before handing the result back to the client.

EDIT001: Regarding std::codecvt_utf8, mentioned in a comment above, and my comment about performance concerns, here is the test:

#include <Windows.h>

#include <chrono>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

std::wstring foo(const std::string& input)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(input.c_str());
}

std::wstring baz(const std::string& input)
{
    std::wstring retVal;
    auto targetSize = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), static_cast<int>(input.size()), NULL, 0);
    retVal.resize(targetSize);
    auto res = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), static_cast<int>(input.size()),
                                   const_cast<LPWSTR>(retVal.data()), targetSize);
    if (res == 0)
    {
        // handle error, throw, do something...
    }
    return retVal;
}

int main()
{
    std::string input = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut "
                        "labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco "
                        "laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in "
                        "voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat "
                        "cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

    {
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 100'000; ++i)
        {
            auto result = foo(input);
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto res = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Elapsed time: " << res << std::endl;
    }

    {
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < 100'000; ++i)
        {
            auto result = baz(input);
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto res = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Elapsed time: " << res << std::endl;
    }
    return 0;
}

Results when compiled and run as Release x64:

Elapsed time: 3065
Elapsed time: 29

Two orders of magnitude...


