Is <codecvt> Not a Standard Header?

Is codecvt not a std header?

The reason GCC rejects this code is simple: libstdc++ doesn't support <codecvt> yet (the header was eventually added in GCC 5).

The C++11 support status page confirms this, marking the feature as not yet implemented:

22.5 Standard code conversion facets — N
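
For reference, this is the kind of program that trips over the missing header on those GCC versions (a minimal sketch; std::codecvt_utf8 and std::wstring_convert are declared in <codecvt> and <locale> respectively):

#include <codecvt>   // missing from libstdc++ before GCC 5: compilation stops here
#include <locale>
#include <string>

int main()
{
    // any use of the <codecvt> facets fails the same way
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string utf8 = conv.to_bytes(L"hello");
    (void)utf8;
}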

Codecvt doesn't work in gcc

Only file streams are required to use std::codecvt<...>, and there is no requirement that any of the standard stream objects be implemented in terms of file streams. Implementers have reasons for either choice: Dinkumware's implementation uses <stdio.h> for most of its operations, so it makes sense to use the same machinery under the hood in this case, while libstdc++ avoids some overhead by directly accessing a buffer shared between the standard C and C++ streams and thus uses a different stream implementation.

When file streams are used, however, use of the std::codecvt<...> facets should be consistent across implementations.

Deprecated header codecvt replacement

The std::codecvt template from <locale> itself isn't deprecated. For UTF-8 to UTF-16, there is still the std::codecvt<char16_t, char, std::mbstate_t> specialization.

However, since std::wstring_convert and std::wbuffer_convert are deprecated along with the standard conversion facets, there isn't any easy way to convert strings using the facets.

So, as Bolas already answered: implement it yourself (or use a third-party library, as always) or keep using the deprecated API.
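
As a sketch of the "implement it yourself" route, the still-supported std::codecvt<char16_t, char, std::mbstate_t> facet can be driven directly through its out() member, without going through the deprecated std::wstring_convert. The buffer sizing and error handling here are illustrative assumptions:

#include <locale>
#include <stdexcept>
#include <string>

std::string utf16_to_utf8(const std::u16string& s)
{
    std::locale loc;  // keep the locale alive while the facet reference is in use
    const auto& cvt =
        std::use_facet<std::codecvt<char16_t, char, std::mbstate_t>>(loc);

    std::mbstate_t state{};
    // one UTF-16 code unit never expands to more than 3 UTF-8 bytes
    std::string out(3 * s.size(), '\0');

    const char16_t* from_next = nullptr;
    char* to_next = nullptr;
    auto result = cvt.out(state, s.data(), s.data() + s.size(), from_next,
                          &out[0], &out[0] + out.size(), to_next);

    if (result != std::codecvt_base::ok)
        throw std::range_error("UTF-16 to UTF-8 conversion failed");

    out.resize(to_next - &out[0]);
    return out;
}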

Visual Studio C++ 2015 std::codecvt with char16_t or char32_t

Old question, but for future reference: this is a known bug in Visual Studio 2015, as explained in the latest post (January 7th, 2016) in this MSDN Social thread.

The workaround for your example looks like this (I implemented your method as a free function for simplicity):

#include <codecvt>
#include <locale>
#include <string>
#include <iostream>

#if _MSC_VER >= 1900

// Visual Studio 2015/2017 fail to export the char16_t specializations of
// std::codecvt from the standard library, which causes unresolved external
// symbol errors at link time. Substituting int16_t works around the bug.
std::string utf16_to_utf8(std::u16string utf16_string)
{
    std::wstring_convert<std::codecvt_utf8_utf16<int16_t>, int16_t> convert;
    auto p = reinterpret_cast<const int16_t *>(utf16_string.data());
    return convert.to_bytes(p, p + utf16_string.size());
}

#else

std::string utf16_to_utf8(std::u16string utf16_string)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16_string);
}

#endif

int main()
{
    std::cout << utf16_to_utf8(u"Élémentaire, mon cher Watson!") << std::endl;

    return 0;
}

Hopefully the problem will be fixed in a future release; otherwise the #if condition will need refining.
UPDATE: nope, not fixed in VS 2017 either. I've therefore updated the preprocessor conditional to >= 1900 (it was initially == 1900).

Why is std::codecvt only used by file I/O streams?

The std::codecvt facet was originally intended to handle I/O conversions between disk and memory character representation. Quoted from paragraph 39.4.6 of Bjarne Stroustrup's The C++ Programming Language fourth edition:

Sometimes, the representation of characters stored in a file differs from the desired representation of those same characters in main memory. ... the codecvt facet provides a mechanism for converting characters from one representation to another as they are read or written.

The intended purpose was thus to use std::codecvt only for adapting characters between file (disk) and memory, which partly answers your question:

Why is std::codecvt only used by file I/O streams?

From the docs we see that:

All file I/O operations performed through std::basic_fstream<CharT> use the std::codecvt<CharT, char, std::mbstate_t> facet of the locale imbued in the stream.

Which then answers the question of why std::ofstream (which uses a file-based stream buffer) and std::cout (linked to the standard output FILE stream) invoke std::codecvt.
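
As a small illustration, a wide file stream converts through whatever codecvt facet is imbued on it. This sketch uses std::codecvt_utf8, which is itself deprecated since C++17 but serves to show the mechanism:

#include <codecvt>
#include <fstream>
#include <locale>

int main()
{
    std::wofstream out("utf8.txt");
    // the locale constructor takes ownership of the facet pointer;
    // every wide character written is now converted to UTF-8 on disk
    out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
    out << L"gr\u00FC\u00DF\n";
}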

Now, to use the high-level std::ostream interface you need to provide an underlying streambuf. std::ofstream provides a filebuf and std::ostringstream provides a stringbuf (which is not linked to the use of std::codecvt). See this post on the streams, which also highlights the following:

...in the case of ofstream, there are also a few extra functions which forward to additional functions in the filebuf interface

But to invoke the character-conversion functionality of a std::codecvt when you have a std::ostringstream (a std::ostream with an underlying std::basic_streambuf), you can use std::wbuffer_convert, as indicated in your post.

In your second update you have only used std::wstring_convert, not std::wbuffer_convert.

When using the std::wbuffer_convert you can wrap the original std::ostringstream with a std::ostream as follows:

// Create a std::ostringstream
auto osstream = std::ostringstream{};

// Create the wrapper for the ostringstream
std::wbuffer_convert<custom_facet, char> wrapper(osstream.rdbuf());

// Now create a std::ostream which uses the wrapper to send data to
// the original std::ostringstream
std::ostream normal_ostream(&wrapper);
normal_ostream << "test\n";

// Flush the stream to invoke the conversion
normal_ostream << std::flush;

// Check the invocation_counter
std::cout << "invocation_counter after wrapping std::ostringstream with "
"std::wbuffer_convert = "
<< invocation_counter << "\n";

Together with the complete example here, the output would be:

invocation_counter start of test1 = 0
invocation_counter after std::ofstream = 1
> test printed to std::cout
invocation_counter after std::cout = 2
invocation_counter after std::ostringstream (should not have changed)= 2
ic after test1 = 2
invocation_counter after std::ostringstream with std::wstring_convert = 3
ic after test2 = 3
invocation_counter after wrapping std::ostringstream with std::wbuffer_convert = 4
ic after test3 = 4

Conclusion

std::codecvt was intended for converting between disk and memory representation. That is why the std::codecvt implementation is only called with streams using an underlying filebuf such as std::ofstream and std::cout.
However, a stream using an underlying stringbuf can be wrapped using std::wbuffer_convert into a std::ostream instance which would then invoke the underlying std::codecvt.

Where to put std::wstring_convert<std::codecvt_utf8<wchar_t>>?

I wouldn't store the std::wstring_convert in a global variable because that's not thread-safe and doesn't buy you much. There might be a performance hit from instantiating std::wstring_convert every time you need it, but that shouldn't be your primary concern at the beginning (premature optimization).

So I'd just wrap that thing into functions:

#include <codecvt>
#include <locale>
#include <string>

std::wstring utf8_to_wstr( const std::string& utf8 ) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> wcu8;
    return wcu8.from_bytes( utf8 );
}

std::string wstr_to_utf8( const std::wstring& utf16 ) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> wcu8;
    return wcu8.to_bytes( utf16 );
}

You have to catch the std::range_error exception somewhere. It can be thrown by std::wstring_convert if the conversion fails for some reason (invalid code points, etc.).
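
For instance, around the helper above (the invalid byte sequence is only an illustration):

#include <iostream>
#include <stdexcept>
#include <string>

int main()
{
    try {
        // "\xC3" opens a two-byte UTF-8 sequence that '(' cannot complete,
        // so from_bytes inside utf8_to_wstr throws std::range_error
        std::wstring w = utf8_to_wstr("\xC3(");
    } catch (const std::range_error& e) {
        std::cerr << "conversion failed: " << e.what() << '\n';
    }
}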

If you hit performance bottlenecks from string conversions later, you can still instantiate std::wstring_convert directly at critical points in your code, e.g. outside a long-running loop that converts many strings.
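
A sketch of that approach (utf8_to_wstr_batch is a hypothetical name):

#include <codecvt>
#include <locale>
#include <string>
#include <vector>

std::vector<std::wstring> utf8_to_wstr_batch(const std::vector<std::string>& input)
{
    // one converter for the whole batch instead of one per string
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> wcu8;
    std::vector<std::wstring> out;
    out.reserve(input.size());
    for (const std::string& s : input)
        out.push_back(wcu8.from_bytes(s));
    return out;
}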


