C++ & Boost: encode/decode UTF-8

Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/ -- it's a header-only library that's very lightweight and easy to use. I'm sharing some demo code here in case anyone finds it useful:

#include <string>
#include <iterator>
#include "utf8.h" // utfcpp, header-only

// These helpers assume a 32-bit wchar_t (Linux/macOS); on Windows use utf8to16/utf16to8 instead.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}

inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}

Usage:

std::wstring ws(L"\u05e9\u05dc\u05d5\u05dd"); // "שלום"
std::string s;
encode_utf8(ws, s); // s now holds the UTF-8 bytes
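
Decoding goes the other way (a short sketch; remember the helpers above assume a 32-bit wchar_t, so on Windows you would switch to utf8::utf8to16/utf8::utf16to8):

std::wstring round_trip;
decode_utf8(s, round_trip);
assert(round_trip == ws); // needs <cassert>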

UTF-8 names with boost::filestream under MinGW

I found two solutions, each based on another library, and both have their drawbacks.

  1. Pathie (Docu): looks like a full replacement for boost::filesystem, providing UTF-8-aware streams and path handling as well as symlink creation and other file/folder operations. A really nice extra is the built-in support for locating special directories (temp, HOME, the programs folder and many more).

    Drawback: it only works as a dynamic library, since the static build has bugs. It may also be overkill if you already use Boost.
  2. Boost.Nowide (Docu): provides alternatives to almost all file and stream handlers to support UTF-8 on Windows and falls back to the standard functions on other platforms. The file streams accept UTF-8 encoded file names, and it uses Boost itself (see the sketch after this list).

    Drawback: no path handling, and it does not accept bfs::path or wide strings (the internal format of bfs::path on Windows is UTF-16), so a small patch would be required. It also needs to be built on Windows if you want to use std::cout etc. with UTF-8 strings (yes, that works directly!).

    Another nice feature: it provides a class that converts argc/argv to UTF-8 on Windows.
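
A minimal sketch of how Boost.Nowide is typically wired in (the file name and the std::cout replacement are just illustrations; on non-Windows platforms these wrappers simply forward to the standard ones):

#include <boost/nowide/args.hpp>
#include <boost/nowide/fstream.hpp>
#include <boost/nowide/iostream.hpp>

int main(int argc, char** argv)
{
    boost::nowide::args a(argc, argv); // argv[] entries are now UTF-8, even on Windows

    // The file name is taken as UTF-8 and converted to UTF-16 internally on Windows.
    boost::nowide::ofstream out("\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d.txt"); // "שלום.txt"
    out << "hello\n";

    boost::nowide::cout << "wrote a file next to " << argv[0] << "\n"; // UTF-8-aware cout
    return 0;
}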

Unicode to UTF-8 in C++

Boost.Locale also has functions for encoding conversions:

#include <boost/locale.hpp>
#include <cassert>

int main() {
    unsigned int point = 0x5e9; // U+05E9 HEBREW LETTER SHIN
    std::string utf8 = boost::locale::conv::utf_to_utf<char>(&point, &point + 1);
    assert(utf8.length() == 2);
    assert(utf8[0] == '\xD7');
    assert(utf8[1] == '\xA9');
}
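
For whole strings, utf_to_utf also has overloads taking std::basic_string; a minimal sketch (assuming you link against Boost.Locale, e.g. -lboost_locale):

#include <boost/locale.hpp>
#include <string>

int main()
{
    // UTF-8 narrow string -> wide string (UTF-16 on Windows, UTF-32 elsewhere)
    std::wstring wide = boost::locale::conv::utf_to_utf<wchar_t>("\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"); // "שלום"
    // ... and back to UTF-8
    std::string narrow = boost::locale::conv::utf_to_utf<char>(wide);
    return narrow == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ? 0 : 1;
}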

Encoding decoded URLs in C++

On POSIX you can print a UTF-8 string directly:

std::string utf8 = "\xc3\xb6"; // or just u8"ö"
std::printf("%s\n", utf8.c_str()); // printf needs a C string, not a std::string

On Windows, you have to convert to UTF-16. Use wchar_t rather than char16_t, even though char16_t is nominally the right type; both are 2 bytes per code unit on Windows, and the Win32 APIs expect wchar_t.

You want convert.from_bytes to convert from UTF-8; convert.to_bytes goes the other way, to UTF-8.

Printing Unicode in the Windows console is another headache; see the related questions on that topic.

Note that std::wstring_convert is deprecated (since C++17) and the standard currently offers no replacement.

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <windows.h>

int main()
{
    std::string utf8 = "\xc3\xb6"; // "ö"

    // codecvt_utf8_utf16 converts UTF-8 <-> UTF-16; plain codecvt_utf8 would only handle UCS-2.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
    std::wstring utf16 = convert.from_bytes(utf8);

    // Call the W versions explicitly so this works regardless of the UNICODE macro.
    MessageBoxW(0, utf16.c_str(), 0, 0);
    DWORD count;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), utf16.c_str(),
                  static_cast<DWORD>(utf16.size()), &count, 0);

    return 0;
}
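
If you want to avoid the deprecated facet entirely, a hedged alternative on Windows is to call the Win32 conversion API directly (utf8_to_utf16 is just an illustrative helper name):

#include <string>
#include <windows.h>

// Convert UTF-8 to UTF-16 via the Win32 API instead of std::wstring_convert.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), static_cast<int>(utf8.size()), &utf16[0], len);
    return utf16;
}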

Encoding/Decoding URL

"URL safe characters" don't need encoding. All other characters, including non-ASCII characters, should be encoded. Example:

#include <iomanip>
#include <sstream>
#include <string>

std::string encode_url(const std::string& s)
{
    // Unreserved characters per RFC 3986; everything else gets percent-encoded.
    const std::string safe_characters =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    std::ostringstream oss;
    for (auto c : s) {
        if (safe_characters.find(c) != std::string::npos)
            oss << c;
        else
            oss << '%' << std::setfill('0') << std::setw(2)
                << std::uppercase << std::hex << (0xff & c);
    }
    return oss.str();
}

std::string decode_url(const std::string& s)
{
    std::string result;
    for (std::size_t i = 0; i < s.size(); i++) {
        if (s[i] == '%') {
            try {
                // Parse the two hex digits that follow '%'.
                auto v = std::stoi(s.substr(i + 1, 2), nullptr, 16);
                result.push_back(0xff & v);
            } catch (...) { } // malformed escape: handle the error as you see fit
            i += 2;
        } else {
            result.push_back(s[i]);
        }
    }
    return result;
}
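
A quick round-trip sanity check of the two helpers above (the expected strings assume a UTF-8 source encoding):

#include <iostream>

int main()
{
    std::string url  = encode_url("\xc3\xb6 is safe? no"); // "%C3%B6%20is%20safe%3F%20no"
    std::string back = decode_url(url);                    // "ö is safe? no" again
    std::cout << url << "\n" << back << "\n";
    return 0;
}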

UTF-16BE to UTF-8 using Boost.Locale yields garbage

It looks like in your case utf_to_utf is processing the input as if it were little-endian UTF-16.

Taking the first four bytes (written in decimal, as in your dump):

You meant 00 72 00 101 to encode U+0048 U+0065 ("He").

Interpreted with the opposite (little-endian) byte order, those same bytes encode U+4800 U+6500.

Converting that to UTF-8 yields the bytes e4 a0 80 e6 94 80.

In decimal that is 228 160 128 230 148 128, which matches the first values of your "garbage".
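
A minimal sketch of one way to fix it, assuming the input really is raw UTF-16BE bytes: assemble native-endian 16-bit code units first (high byte first) and only then hand them to utf_to_utf (utf16be_to_utf8 is an illustrative helper name):

#include <boost/locale.hpp>
#include <cstdint>
#include <string>
#include <vector>

// Interpret a raw byte buffer as UTF-16BE and convert it to UTF-8.
std::string utf16be_to_utf8(const std::vector<unsigned char>& bytes)
{
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
        units.push_back(static_cast<std::uint16_t>((bytes[i] << 8) | bytes[i + 1])); // big-endian: high byte first
    return boost::locale::conv::utf_to_utf<char>(units.data(), units.data() + units.size());
}

int main()
{
    std::vector<unsigned char> he = {0, 72, 0, 101}; // "He" as UTF-16BE, decimal bytes as above
    return utf16be_to_utf8(he) == "He" ? 0 : 1;      // no more e4 a0 80 ... garbage
}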


