Converting Narrow String to Wide String


On Windows, you can do it like this:

#include <string>
#include <windows.h>

inline std::wstring convert( const std::string& as )
{
    // deal with trivial case of empty string
    if( as.empty() ) return std::wstring();

    // determine required length of new string (0 means the call failed)
    const int reqLength = ::MultiByteToWideChar( CP_UTF8, 0, as.c_str(), (int)as.length(), 0, 0 );
    if( reqLength <= 0 ) return std::wstring();

    // construct new string of required length
    std::wstring ret( reqLength, L'\0' );

    // convert old string to new string
    ::MultiByteToWideChar( CP_UTF8, 0, as.c_str(), (int)as.length(), &ret[0], reqLength );

    // return new string ( the copy is normally elided or moved )
    return ret;
}

This expects the std::string to be UTF-8 (CP_UTF8). If your string uses another encoding, replace the code page accordingly.

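Another way could be:

#include <cwchar>
#include <string>
#include <vector>

inline std::wstring convert( const std::string& as )
{
    std::vector<wchar_t> buf( as.size() * 2 + 2 );
    // %S (a narrow string in a wide format) is Microsoft-specific and uses the current locale
    const int written = swprintf( buf.data(), buf.size(), L"%S", as.c_str() );
    if( written < 0 ) return std::wstring();
    return std::wstring( buf.data(), written );
}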

Why mask a char with 0xFF when converting narrow string to wide string?

Masking with 0xFF reduces any negative values into the range 0-255.

This is reasonable if, for example, your platform's char is an 8-bit signed type representing ISO-8859-1 characters, and your wchar_t is representing UCS-2, UTF-16 or UCS-4.


Without this correction (or something similar, such as casting to unsigned char or std::byte), you would find that characters are sign-extended when promoted to the wider type.

Example: 0xa9 (© in Unicode and Latin-1, -87 in signed 8-bit) would become \uffa9 instead of \u00a9.


I think it's clearer to convert the char to an unsigned char - that works for any size char, and conveys the intent better. You can change that expression directly, or create a codecvt subclass that gives a name to what you're doing.
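As a minimal sketch of changing the expression directly, the loop below widens each byte through unsigned char. The name widen_latin1 is made up for illustration, and the code assumes the input really is ISO-8859-1, so that each byte value is already the code point.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widen ISO-8859-1 bytes to wchar_t code points.
std::wstring widen_latin1(const std::string& input)
{
    std::wstring output;
    output.reserve(input.size());
    for (char c : input) {
        // Without the unsigned char step, a signed char holding 0xA9 (-87)
        // would be sign-extended instead of becoming 0x00A9.
        const unsigned char byte = static_cast<unsigned char>(c);
        output.push_back(static_cast<wchar_t>(byte));
    }
    // Latin-1 is one byte per character, so the lengths must match.
    assert(output.size() == input.size());
    return output;
}
```

The cast does the same job as masking with 0xFF, but it does not depend on char being exactly 8 bits wide.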

Here's how to write and use a minimal codecvt (for narrow → wide conversion only):

#include <codecvt>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Note: <codecvt> and std::wstring_convert are deprecated since C++17.
class codecvt_latin1 : public std::codecvt<wchar_t,char,std::mbstate_t>
{
protected:
    virtual result do_in(std::mbstate_t&,
                         const char* from,
                         const char* from_end,
                         const char*& from_next,
                         wchar_t* to,
                         wchar_t* to_end,
                         wchar_t*& to_next) const override
    {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        while (from != from_end && to != to_end)
            *to++ = (unsigned char)*from++;
        from_next = from;
        to_next = to;
        // Report partial if the output buffer filled up before the input ran out.
        return from == from_end ? result::ok : result::partial;
    }
};

// Try UTF-8 first; if the input is not valid UTF-8, fall back to Latin-1.
std::wstring convert(const std::string& input)
{
    using codecvt_utf8 = std::codecvt_utf8<wchar_t>;
    try {
        return std::wstring_convert<codecvt_utf8>().from_bytes(input);
    } catch (const std::range_error&) {
        return std::wstring_convert<codecvt_latin1>{}.from_bytes(input);
    }
}

int main()
{
    std::locale::global(std::locale{""});

    // UTF-8: £© おはよう
    // (before C++20; from C++20 on, u8 literals are char8_t and need a different call)
    std::wcout << convert(u8"\xc2\xa3\xc2\xa9 おはよう") << std::endl;
    // Latin-1: Â£©
    std::wcout << convert("\xc2\xa3\xa9") << std::endl;
}

Output:

£© おはよう
Â£©

Issue when converting utf16 wide std::wstring to utf8 narrow std::string for rare characters

The problem is that, where wchar_t is 16 bits wide (as on Windows), std::wstring_convert<std::codecvt_utf8<wchar_t>> converts from UCS-2, not from UTF-16. Characters inside the BMP (U+0000..U+FFFF) have identical encodings in UCS-2 and UTF-16, so they work, but characters outside the BMP (U+10000..U+10FFFF), such as your emoji, do not exist in UCS-2 at all. The conversion therefore doesn't understand the character and produces incorrect UTF-8 bytes (technically, it converts each half of the UTF-16 surrogate pair into a separate UTF-8 character).

You need to use std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> instead.
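To make the surrogate-pair point concrete, here is a hand-written sketch of a UTF-16 aware conversion. It is not how the standard facet is implemented, and the names append_utf8 and utf16_to_utf8 are made up for illustration. It takes a std::u16string rather than a std::wstring so that it behaves the same on platforms where wchar_t is 32 bits, and it replaces unpaired surrogates with U+FFFD, which is one possible policy among several.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: append one Unicode code point as UTF-8.
void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert UTF-16 to UTF-8, combining surrogate pairs.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (high && i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Both halves together form one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (high || low) {
            cp = 0xFFFD;  // unpaired surrogate: substitute the replacement character
        }
        append_utf8(out, cp);
    }
    return out;
}
```

A UCS-2 conversion skips the pairing step and encodes each half on its own, which is why the emoji comes out as two invalid three-byte sequences instead of one four-byte sequence.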

How to convert wstring into string?

Here is a worked-out solution based on the other suggestions:

#include <clocale>
#include <cwchar>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main() {
    std::setlocale(LC_ALL, "");
    const std::wstring ws = L"ħëłlö";
    const std::locale locale("");
    typedef std::codecvt<wchar_t, char, std::mbstate_t> converter_type;
    const converter_type& converter = std::use_facet<converter_type>(locale);
    // worst case: every wide character needs max_length() bytes
    std::vector<char> to(ws.length() * converter.max_length());
    std::mbstate_t state = std::mbstate_t();  // must start in the initial conversion state
    const wchar_t* from_next;
    char* to_next;
    const converter_type::result result = converter.out(state, ws.data(), ws.data() + ws.length(), from_next, &to[0], &to[0] + to.size(), to_next);
    if (result == converter_type::ok || result == converter_type::noconv) {
        const std::string s(&to[0], to_next);
        std::cout << "std::string = " << s << std::endl;
    }
}

This usually works on Linux, but it can cause problems on Windows.

How to convert a std::string to L data type

auto is not a data-type. It is a placeholder that gets deduced depending on the initializer used.

In your case, the initializer is a wide string-literal of type const wchar_t[size], which decays to const wchar_t* when used to initialize the variable.

A wide string can be stored in a std::wstring.
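A small sketch of the deduction described above, assuming C++11 or later. The function name copy_wide_literal is made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical helper showing what auto deduces from a wide literal.
std::wstring copy_wide_literal()
{
    auto text = L"hello";  // deduced as const wchar_t*, not an array or a string class
    static_assert(std::is_same<decltype(text), const wchar_t*>::value,
                  "auto deduces a pointer from a wide string literal");

    std::wstring owned = text;  // std::wstring copies the characters and owns them
    assert(owned.size() == 5);
    return owned;
}
```

The pointer only refers to the literal, whereas the std::wstring holds its own copy that can be modified and returned safely.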

How to convert a std::string (a narrow string) to a wide string depends on the source's character encoding.

Anyway, there are many others who asked that too:

C++ Convert string (or char*) to wstring (or wchar_t*)

Inserting narrow character string to std::basic_ostream<wchar_t>

I have just checked in Visual Studio 2015 and you are right. The chars are only widened to wchar_ts without any conversion. It seems to me that you will have to convert the narrow character string into a wide character string yourself. There are several ways to do it, some of which have already been suggested.

Here I propose using pure C++ facilities, assuming your C++ compiler and standard library are complete enough (Visual Studio, or GCC, but only on Linux):

#include <cstddef>
#include <cstring>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

void clear_mbstate (std::mbstate_t & mbs);

void
towstring_internal (std::wstring & outstr, const char * src, std::size_t size,
    std::locale const & loc)
{
    if (size == 0)
    {
        outstr.clear ();
        return;
    }

    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;
    const CodeCvt & cdcvt = std::use_facet<CodeCvt>(loc);
    std::mbstate_t state;
    clear_mbstate (state);

    char const * from_first = src;
    std::size_t const from_size = size;
    char const * const from_last = from_first + from_size;
    char const * from_next = from_first;

    std::vector<wchar_t> dest (from_size);

    wchar_t * to_first = &dest.front ();
    std::size_t to_size = dest.size ();
    wchar_t * to_last = to_first + to_size;
    wchar_t * to_next = to_first;

    CodeCvt::result result;
    std::size_t converted = 0;
    while (true)
    {
        result = cdcvt.in (
            state, from_first, from_last,
            from_next, to_first, to_last,
            to_next);
        // XXX: Even if only half of the input has been converted the
        // in() method returns CodeCvt::ok. I think it should return
        // CodeCvt::partial.
        if ((result == CodeCvt::partial || result == CodeCvt::ok)
            && from_next != from_last)
        {
            // Output buffer is full: double it and continue where we stopped.
            to_size = dest.size () * 2;
            dest.resize (to_size);
            converted = to_next - to_first;
            to_first = &dest.front ();
            to_last = to_first + to_size;
            to_next = to_first + converted;
            continue;
        }
        else if (result == CodeCvt::ok && from_next == from_last)
            break;
        else if (result == CodeCvt::error
            && to_next != to_last && from_next != from_last)
        {
            // Skip the offending byte and emit '?' in its place.
            clear_mbstate (state);
            ++from_next;
            from_first = from_next;
            *to_next = L'?';
            ++to_next;
            to_first = to_next;
        }
        else
            break;
    }
    converted = to_next - &dest[0];

    outstr.assign (dest.begin (), dest.begin () + converted);
}

void
clear_mbstate (std::mbstate_t & mbs)
{
    // Initialize/clear mbstate_t type.
    // XXX: This is just a hack that works. The shape of mbstate_t varies
    // from single unsigned to char[128]. Without some sort of initialization
    // the codecvt::in/out methods randomly fail because the initial state is
    // random/invalid.
    std::memset (&mbs, 0, sizeof (std::mbstate_t));
}

This function is part of the log4cplus library and it works. It uses the codecvt facet to do the conversion. You have to give it an appropriately set up locale.

Visual Studio might have trouble giving you an appropriately set up locale for GB2312. You might have to use _setmbcp() for it to work. See "double byte character sequence conversion issue in Visual Studio 2015" for details.

I want to convert std::string into a const wchar_t *

If you have a std::wstring object, you can call c_str() on it to get a const wchar_t*:

std::wstring name( L"Steve Nash" );
const wchar_t* szName = name.c_str();

Since you are operating on a narrow string, however, you would first need to widen it. There are various options here; one is to use Windows' built-in MultiByteToWideChar routine. That fills a buffer you supply as an LPWSTR, which is equivalent to wchar_t*.
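As a portable alternative to the Windows routine, the sketch below widens with the standard std::mbstowcs function instead. This is a swapped-in technique, not MultiByteToWideChar: it interprets the input in the current C locale's multibyte encoding and stops at the first embedded NUL. The name widen is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: widen using the current C locale's multibyte encoding.
std::wstring widen(const std::string& narrow)
{
    // A multibyte string never yields more wide characters than it has bytes,
    // so size() + 1 leaves room for the terminator.
    std::vector<wchar_t> buffer(narrow.size() + 1);
    const std::size_t count = std::mbstowcs(buffer.data(), narrow.c_str(), buffer.size());
    if (count == static_cast<std::size_t>(-1))
        return std::wstring();  // invalid multibyte sequence for this locale
    assert(count < buffer.size());
    return std::wstring(buffer.data(), count);
}
```

To get the pointer, keep the result alive for as long as the pointer is in use, for example std::wstring wide = widen(name); followed by const wchar_t* szName = wide.c_str();. Calling c_str() on a temporary would leave the pointer dangling.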

Wide to narrow characters

The most native way is std::ctype<wchar_t>::narrow(), but that does little more than std::copy as gishu suggested and you still need to manage your own buffers.

If you're not trying to perform any translation but just want a one-liner, you can do std::string my_string( my_wstring.begin(), my_wstring.end() ).

If you want actual encoding translation, you can use locales/codecvt or one of the libraries from another answer, but I'm guessing that's not what you're looking for.
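Note that the one-liner converts each wchar_t to char by plain truncation, so it is only safe for ASCII content. The sketch below makes that limitation explicit by substituting '?' for anything else. The name narrow_ascii and the choice of '?' are assumptions for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: narrow without translation, replacing anything
// outside ASCII with '?' instead of silently truncating it.
std::string narrow_ascii(const std::wstring& wide)
{
    std::string narrow;
    narrow.reserve(wide.size());
    for (wchar_t wc : wide) {
        // Compare as unsigned so a negative wchar_t is never treated as ASCII.
        const bool is_ascii = static_cast<unsigned long>(wc) < 0x80;
        narrow.push_back(is_ascii ? static_cast<char>(wc) : '?');
    }
    assert(narrow.size() == wide.size());
    return narrow;
}
```

If the '?' substitutions matter for your data, that is a sign you need a real encoding conversion rather than a copy.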


