UTF-8 To/From Wide Char Conversion in STL

UTF-8 vs Wide Char?

If by "wide char" you mean wchar_t, keep in mind that it is 16-bit (holding UCS-2 or UTF-16) on some platforms, such as Windows, but 32-bit (holding UTF-32) on others, such as Linux and macOS. So before asking how to convert to/from "wide char", you first have to define what "wide char" actually means. When dealing with UTF-16 or UTF-32, use data types that are guaranteed to be 16 or 32 bits wide.
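To see what "wide char" means on a given platform, a small sketch like the one below can help. The helper name wchar_bits and the two sample strings are made up for illustration; char16_t and char32_t are the fixed-width character types available since C++11.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical helper: width of wchar_t in bits on this platform
// (typically 16 on Windows, 32 on Linux and macOS).
constexpr std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Fixed-width alternatives whose encoding does not depend on the platform.
// U+1F600 lies outside the Basic Multilingual Plane.
const std::u16string smiley16 = u"\U0001F600"; // UTF-16: a surrogate pair
const std::u32string smiley32 = U"\U0001F600"; // UTF-32: a single code unit
```

Here smiley16.size() is 2 and smiley32.size() is 1 on every platform, whereas the number of wchar_t units needed for the same character depends on wchar_bits().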

Pretty much any Unicode library, including utf8-cpp and ICU, has functions for converting between UTF-8<->UTF-16 and UTF-8<->UTF-32 using appropriate data types rather than relying on wchar_t.
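To give an idea of what such a conversion looks like when it uses a fixed-width type instead of wchar_t, here is a hand-written sketch of UTF-32 to UTF-8 encoding. The function name utf32_to_utf8 is hypothetical and is not the API of utf8-cpp or ICU; for production code a tested library is the safer choice.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& utf32)
{
    std::string utf8;
    for (char32_t cp : utf32)
    {
        // Reject values beyond U+10FFFF and UTF-16 surrogates.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a valid Unicode code point");

        if (cp <= 0x7F)          // 1 byte: 0xxxxxxx
        {
            utf8 += static_cast<char>(cp);
        }
        else if (cp <= 0x7FF)    // 2 bytes: 110xxxxx 10xxxxxx
        {
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp <= 0xFFFF)   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        {
            utf8 += static_cast<char>(0xE0 | (cp >> 12));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else                     // 4 bytes: 11110xxx followed by three 10xxxxxx
        {
            utf8 += static_cast<char>(0xF0 | (cp >> 18));
            utf8 += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
}
```

For example, utf32_to_utf8(U"\u20AC") yields the three bytes E2 82 AC.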

STL and UTF-8 file input/output. How to do it?

Use the std::codecvt facet template to perform the conversion.

You may use the standard std::codecvt_byname, or a non-standard codecvt facet implementation.

#include <iostream>
#include <locale>
using namespace std;

int main()
{
    // codecvt_byname derives from codecvt<wchar_t, char, mbstate_t>
    locale utf8locale(locale(), new codecvt_byname<wchar_t, char, mbstate_t>("en_US.UTF-8"));
    wcout.imbue(utf8locale);
    wcout << L"Hello, wide to multibyte world!" << endl;
}

Beware that on some platforms codecvt_byname can only perform conversions for locales that are installed on the system.
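One way to guard against a missing locale is to probe for it before imbuing a stream. The sketch below assumes that the implementation reports an unknown locale name by throwing std::runtime_error from the facet constructor, which is what the common standard libraries do; the helper name codecvt_available is made up for illustration.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns true if a codecvt_byname facet can be
// constructed for the given locale name on this system.
bool codecvt_available(const char* locale_name)
{
    try
    {
        // Assumption: an unrecognized name makes the facet constructor
        // throw std::runtime_error. The locale takes ownership of the facet.
        std::locale probe(std::locale::classic(),
            new std::codecvt_byname<wchar_t, char, std::mbstate_t>(locale_name));
        return true;
    }
    catch (const std::runtime_error&)
    {
        return false;
    }
}
```

A caller could check codecvt_available("en_US.UTF-8") first and fall back to another conversion method when it returns false.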

How to convert LPWSTR to char * with UTF-8 encoding

You should be able to use QString's functions. For example:

QString str = QString::fromUtf16((const ushort*)argvW[0]);
::MessageBoxW(0, (const wchar_t*)str.utf16(), 0, 0);

When using WideCharToMultiByte, first pass zero for both the output buffer and the output buffer's length. The function then returns the buffer size you need, in bytes (including the terminating null when the input length is -1). For example:

const wchar_t* wbuf = argvW[0];
// First call: query the required size in bytes, terminating null included.
int len = WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, 0, 0, 0, 0);

std::string buf(len, 0);

// Second call: perform the actual conversion into the buffer.
WideCharToMultiByte(CP_UTF8, 0, wbuf, -1, &buf[0], len, 0, 0);
QString utf8 = QString::fromUtf8(buf.c_str());
::MessageBoxW(0, (const wchar_t*)utf8.utf16(), 0, 0);

The same information should be available in QCoreApplication::arguments. For example, run this code with a Unicode argument and look at the output file:

#include <QCoreApplication>
#include <QFile>
#include <QStringList>
#include <QTextStream>

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    QString filename = QString::fromUtf8("ελληνική.txt");
    QFile fout(filename);
    if (fout.open(QIODevice::WriteOnly | QIODevice::Text))
    {
        QTextStream oss(&fout);
        oss.setCodec("UTF-8"); // Qt 4/5 API; Qt 6 uses setEncoding instead
        oss << filename << "\n";
        QStringList list = a.arguments();
        for (int i = 0; i < list.count(); i++)
            oss << list[i] << "\n";
    }
    fout.close();
    return 0; // no event loop is needed here; a.exec() would block
}

Note that in the above example the filename is converted to UTF-16 internally by Qt, because WinAPI uses UTF-16, not UTF-8.

UTF-8 Conversion

Something like this:

extern void someFunctionThatAcceptsUTF8(const char* utf8);

const char* ss1 = "string in system default multibyte encoding";

// The conversion you need:
// a2w: "ansi" -> widechar string
// w2u: widechar string -> utf8 string
someFunctionThatAcceptsUTF8( w2u( a2w(ss1) ) );

You just need to grab and include this file:
http://code.google.com/p/tiscript/source/browse/trunk/sdk/include/aux-cvt.h

It should work on Builder just fine.

How to convert UTF-8 std::string to UTF-16 std::wstring?

Here's some code. It is only lightly tested and there are probably a few improvements to be made. Call this function to convert a UTF-8 string to a UTF-16 wstring. If it thinks the input string is not UTF-8 it throws an exception; otherwise it returns the equivalent UTF-16 wstring. Note that storing UTF-16 in a std::wstring only makes sense where wchar_t is 16 bits wide, as on Windows.

#include <stdexcept>
#include <string>
#include <vector>

std::wstring utf8_to_utf16(const std::string& utf8)
{
    // Pass 1: decode the UTF-8 bytes into Unicode code points.
    std::vector<unsigned long> unicode;
    size_t i = 0;
    while (i < utf8.size())
    {
        unsigned long uni;
        size_t todo; // number of continuation bytes still expected
        unsigned char ch = utf8[i++];
        if (ch <= 0x7F)
        {
            uni = ch;
            todo = 0;
        }
        else if (ch <= 0xBF)
        {
            // A continuation byte cannot start a sequence.
            throw std::logic_error("not a UTF-8 string");
        }
        else if (ch <= 0xDF)
        {
            uni = ch & 0x1F;
            todo = 1;
        }
        else if (ch <= 0xEF)
        {
            uni = ch & 0x0F;
            todo = 2;
        }
        else if (ch <= 0xF7)
        {
            uni = ch & 0x07;
            todo = 3;
        }
        else
        {
            throw std::logic_error("not a UTF-8 string");
        }
        for (size_t j = 0; j < todo; ++j)
        {
            if (i == utf8.size())
                throw std::logic_error("not a UTF-8 string");
            unsigned char cont = utf8[i++];
            if (cont < 0x80 || cont > 0xBF)
                throw std::logic_error("not a UTF-8 string");
            uni <<= 6;
            uni += cont & 0x3F;
        }
        // Surrogates and values beyond U+10FFFF are not valid code points.
        if (uni >= 0xD800 && uni <= 0xDFFF)
            throw std::logic_error("not a UTF-8 string");
        if (uni > 0x10FFFF)
            throw std::logic_error("not a UTF-8 string");
        unicode.push_back(uni);
    }

    // Pass 2: encode the code points as UTF-16.
    std::wstring utf16;
    for (size_t k = 0; k < unicode.size(); ++k)
    {
        unsigned long uni = unicode[k];
        if (uni <= 0xFFFF)
        {
            utf16 += (wchar_t)uni;
        }
        else
        {
            // Split into a high and a low surrogate.
            uni -= 0x10000;
            utf16 += (wchar_t)((uni >> 10) + 0xD800);
            utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
        }
    }
    return utf16;
}
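The surrogate arithmetic at the end of that function is easier to follow in isolation. The sketch below pulls it out into two small helpers; the names to_surrogates and from_surrogates are made up for illustration, and the asserts simply document the ranges each helper expects.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: split a code point in U+10000..U+10FFFF
// into a (high, low) UTF-16 surrogate pair.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000; // leaves a 20-bit value
    char16_t high = static_cast<char16_t>(0xD800 + (cp >> 10));   // top 10 bits
    char16_t low  = static_cast<char16_t>(0xDC00 + (cp & 0x3FF)); // bottom 10 bits
    return {high, low};
}

// Hypothetical helper: combine a valid (high, low) surrogate pair
// back into a single code point.
char32_t from_surrogates(char16_t high, char16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}
```

For U+1F600, subtracting 0x10000 leaves 0xF600; the top 10 bits (0x3D) give the high surrogate 0xD83D and the bottom 10 bits (0x200) give the low surrogate 0xDE00.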

