Convert from std::wstring to std::string
std::string simply holds an array of bytes. It does not hold information about the encoding in which these bytes are supposed to be interpreted, nor do the standard library functions or std::string member functions generally assume anything about the encoding. They handle the contents as just an array of bytes.
Therefore, when the contents of a std::string need to be presented, the presenter has to make some guess about the intended encoding of the string, if that information is not provided in some other way.
I am assuming that the encoding you intend to convert to is UTF-8, given that you are using std::codecvt_utf8.
But if you are using Visual Studio, the debugger simply assumes one specific encoding, at least by default. That encoding is not UTF-8; it is most likely Windows code page 1252.
As verification, Python gives the following:
>>> '日本'.encode('utf8').decode('cp1252')
'æ—¥æœ¬'
Your string does indeed look like the UTF-8 encoding of 日本 interpreted as if it were cp1252 encoded. The conversion therefore seems to have worked as intended.
As mentioned by @MarkTolonen in the comments, the encoding to assume for a string variable can be set to UTF-8 in the Visual Studio debugger with the s8 format specifier, as explained in the documentation.
How to convert wstring into string?
Here is a worked-out solution based on the other suggestions:
#include <string>
#include <iostream>
#include <clocale>
#include <locale>
#include <vector>

int main() {
    std::setlocale(LC_ALL, "");
    const std::wstring ws = L"ħëłlö";
    const std::locale locale("");
    typedef std::codecvt<wchar_t, char, std::mbstate_t> converter_type;
    const converter_type& converter = std::use_facet<converter_type>(locale);
    std::vector<char> to(ws.length() * converter.max_length());
    std::mbstate_t state{};  // must be zero-initialized before use
    const wchar_t* from_next;
    char* to_next;
    const converter_type::result result = converter.out(
        state,
        ws.data(), ws.data() + ws.length(), from_next,
        &to[0], &to[0] + to.size(), to_next);
    if (result == converter_type::ok || result == converter_type::noconv) {
        const std::string s(&to[0], to_next);
        std::cout << "std::string = " << s << std::endl;
    }
}
This will usually work on Linux, where the default locale is typically UTF-8. On Windows, however, the default locale's conversion targets the active ANSI code page rather than UTF-8, so characters outside that code page cannot be represented and the conversion can fail or lose data.
How is const std::wstring encoded and how to change to UTF-16
As clarified in the comments, the source .cpp file is UTF-8 encoded. Without a BOM, and without an explicit /source-charset:utf-8 switch, the Visual C++ compiler defaults to assuming the source file is saved in the active code page encoding. From the Set Source Character Set documentation:
By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you specify a character set name or code page by using the /source-charset option.
The UTF-8 encoding of äöüß is C3 A4 C3 B6 C3 BC C3 9F, and therefore the line:
std::wstring wstr = L"äöüß";
is seen by the compiler as:
std::wstring wstr = L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F";
Assuming the active code page is the usual Windows-1252, the (extended) bytes map as:

win-1252 byte   char   Unicode
\xC3            Ã      U+00C3
\xA4            ¤      U+00A4
\xB6            ¶      U+00B6
\xBC            ¼      U+00BC
\x9F            Ÿ      U+0178
Therefore L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F" gets translated to:
std::wstring wstr = L"\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC\u00C3\u0178";
To avoid such (mis)translation, Visual C++ needs to be told that the source file is encoded as UTF-8 by passing an explicit /source-charset:utf-8 (or /utf-8) compiler switch. For CMake based projects, this can be done using add_compile_options as shown at Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819.
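A minimal sketch of that CMake approach (a config fragment, assuming a standard MSVC toolchain; the guard keeps the flag away from other compilers):

```cmake
# /utf-8 sets both the source and execution character sets to UTF-8.
# It is an MSVC-only flag, so guard it.
if(MSVC)
  add_compile_options(/utf-8)
endif()
```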
UTF8 data to std::string or std::wstring
Storing UTF-8 in a std::string is no more than storing a sequence of bytes in a vector. std::string is not aware of any encoding whatsoever, and member functions like find, or <algorithm> functions like std::find, stop being character-aware once you go beyond plain ASCII. So it is up to you how to handle this situation: you can convert your input (L"Ñ") to its UTF-8 byte sequence and search for that in the std::string, or you can convert the string to a wstring and work on it directly. IMHO, when you have to manipulate the input (search, extract words, split by letters or replace, all beyond the ASCII range), you are better off sticking to wstring and converting to a UTF-8 std::string only before handing the result to the client.
EDIT001: Regarding the std::codecvt_utf8 mentioned above in a comment, and my comment about performance concerns, here is the test:
#include <Windows.h>

#include <chrono>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

std::wstring foo(const std::string& input)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(input.c_str());
}

std::wstring baz(const std::string& input)
{
    std::wstring retVal;
    // First call computes the required size of the target buffer.
    auto targetSize = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), static_cast<int>(input.size()), NULL, 0);
    retVal.resize(targetSize);
    // Second call performs the actual conversion.
    auto res = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), static_cast<int>(input.size()),
                                   const_cast<LPWSTR>(retVal.data()), targetSize);
    if(res == 0)
    {
        // handle error, throw, do something...
    }
    return retVal;
}

int main()
{
    std::string input = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut "
                        "labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco "
                        "laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in "
                        "voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat "
                        "cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";
    {
        auto start = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < 100'000; ++i)
        {
            auto result = foo(input);
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto res = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Elapsed time: " << res << std::endl;
    }
    {
        auto start = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < 100'000; ++i)
        {
            auto result = baz(input);
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto res = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Elapsed time: " << res << std::endl;
    }
    return 0;
}
Results when compiled and run as Release x64:
Elapsed time: 3065
Elapsed time: 29
Two orders of magnitude...