What Is the Linux Equivalent of MultiByteToWideChar & WideCharToMultiByte

Why use MultiByteToWideChar and WideCharToMultiByte at the same time?

This code fragment first converts the string from a multibyte representation in the system default code page to Unicode (UTF-16), then converts that to the UTF-8 multibyte representation. In effect, it converts text from the default code page to UTF-8.

The code is fragile in that it assumes the UTF-8 version will at most double in size (this probably works most of the time, but the worst case is that a single byte in the default code page may map to 4 bytes in UTF-8).

Multiplatform way to convert between std::string and std::wstring

Using <cstdlib>, you can get away with an implementation similar to what you already have, as mentioned by Joachim Pileborg, as long as you have set the locale to whatever you want it to be (for example, setlocale(LC_ALL, "en_US.utf8")). The substitutions are:

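MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0) => mbstowcs(nullptr, data(str), 0)

MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed) => mbstowcs(data(wstrTo), data(str), size(wstrTo))

WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL) => wcstombs(nullptr, data(wstr), 0)

WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL) => wcstombs(data(strTo), data(wstr), size(strTo))

Note that the third argument of mbstowcs and wcstombs is the capacity of the destination (in wide characters and bytes respectively), not the length of the source, and that it is ignored when the destination is nullptr. Unlike the Windows functions, both stop at the first null character in the source.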

EDIT:

C++11 requires strings to be stored contiguously, which may matter if you are compiling cross-platform, because earlier standards did not guarantee it. Before C++11, calling &str[0], &strTo[0], &wstr[0], or &wstrTo[0] could have caused problems.

Since C++17 is now widely supported, I've updated my suggested substitutions to use data (which returns a non-const pointer for non-const strings as of C++17) rather than taking the address of the first element.
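As a rough illustration of the substitutions above, here is a minimal sketch of the usual two-pass pattern: ask for the required length first, then convert. The helper names to_wide and to_narrow are made up for this example, and it assumes the caller has already selected a suitable LC_CTYPE locale with setlocale (a UTF-8 locale if the narrow strings are meant to be UTF-8).

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helpers (not from the original answer). Both depend on the
// current LC_CTYPE locale and stop at the first embedded null character.
std::wstring to_wide(const std::string& str) {
    // First pass: a null destination returns the number of wide characters needed.
    const std::size_t needed = std::mbstowcs(nullptr, str.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence for the current locale");

    std::wstring result(needed, L'\0');
    // Second pass: the third argument is the destination capacity in wide characters.
    // &result[0] also compiles before C++17; result.data() is equivalent in C++17.
    std::mbstowcs(&result[0], str.c_str(), result.size());
    return result;
}

std::string to_narrow(const std::wstring& wstr) {
    // First pass: a null destination returns the number of bytes needed.
    const std::size_t needed = std::wcstombs(nullptr, wstr.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1))
        throw std::runtime_error("wide character not representable in the current locale");

    std::string result(needed, '\0');
    // Second pass: the third argument is the destination capacity in bytes.
    std::wcstombs(&result[0], wstr.c_str(), result.size());
    return result;
}
```

A caller would typically run setlocale(LC_ALL, "") or request a specific UTF-8 locale once at startup, then call to_wide and to_narrow as needed. Which locale names are installed varies between systems, so the setlocale return value is worth checking.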

How do I convert a char* to a char* that is UTF-8 encoded?

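Look into iconv(3); that's the API you want. On platforms where iconv is not part of the C library you'll need -liconv (glibc on Linux includes it, so no extra library is needed there).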
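For what it's worth, a minimal sketch of an iconv-based conversion might look like the following. The function name convert_encoding is invented for this example, and the encoding names you pass (such as "ISO-8859-1" or "UTF-8") have to be ones your iconv implementation recognises. The sketch assumes the glibc-style prototype in which the input pointer is char**, and it skips the final flush call that stateful target encodings would need.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from encoding `from` to encoding `to`.
std::string convert_encoding(const std::string& input, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);  // note the order: target first, then source
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    std::string output;
    char buffer[256];
    // iconv takes a non-const pointer but does not modify the input bytes.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof buffer;
        const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;  // save before anything else can change it
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the chunk buffer filled up, so loop again.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return output;
}
```

Because the source and target encodings are named explicitly, this does not depend on the process locale, which is the main practical difference from mbstowcs and wcstombs.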

WideCharToMultiByte() vs. wcstombs()

In a nutshell: the WideCharToMultiByte function exposes the encodings/code pages used for the conversion in the parameter list, while wcstombs does not. This is a major PITA, as the standard does not define what encoding is to be used to produce the wchar_t, while you as a developer certainly need to know what encoding you are converting to/from.

Apart from that, WideCharToMultiByte is of course a Windows API function and is not available on any other platform.

Therefore I would suggest using WideCharToMultiByte without a moment's thought if your application is not specifically written to be portable to non-Windows OSes. Otherwise, you might want to wrestle with wcstombs or (preferably IMHO) look into using a full-feature portable Unicode library such as ICU.

WideCharToMultiByte in QB64

Some more args need to be passed with the BYVAL keyword:

FUNCTION MultiByteToWideChar& (BYVAL codePage~&, BYVAL dwFlags~&, lpszMbstring$, BYVAL byteCount&, lpwszWcstring$, BYVAL wideCount&)
FUNCTION WideCharToMultiByte& (BYVAL codePage~&, BYVAL dwFlags~&, lpWideString$, BYVAL ccWideChar%, lpMultiByte$, BYVAL multibyte%, BYVAL defaultchar&, BYVAL usedchar&)

Aside from that, the length of a STRING * 260 is always 260, regardless of the value stored. This means Filename = Filename + CHR$(0) won't work as intended. Neither MultiByteToWideChar nor WideCharToMultiByte requires null-terminated input anyway (that's why the byteCount and ccWideChar params exist; sometimes you only want to operate on part of a string).

Worse, even if you use _MEMFILL to set all bytes of Filename to 0 to allow you to deal with things using ASCIIZ strings, INPUT and LINE INPUT will fill any remaining bytes not explicitly entered into Filename with CHR$(32) (i.e. a blank space as if you pressed the spacebar). For example, if you enter "Hello", there would be 5 bytes for the string entered and 255 bytes of character code 32 (or &H20 if you prefer hexadecimal).

To save yourself this terrible headache ("hello world.bas" is a valid filename!), you'll want to use STRING, not STRING * 260. If the length is greater than 260, you should probably print an error message. Whether you allow a user to enter a new filename or not after that is up to you.

You'll also want to use the return value of MultiByteToWideChar since it is the number of characters in NewFilename:

DIM Filename AS STRING
DIM NewFilename AS STRING * 260
DIM MultiByte AS STRING * 260
...

' Note: LEN(NewFilename) = 260 (**always**)
' This is why the number of wide chars written
' is saved.
NewFilenameLen = MultiByteToWideChar(0, 0, Filename, LEN(Filename), NewFilename, LEN(NewFilename))

...

' Note: LEN(MultiByte) = 260 (**always**)
x = WideCharToMultiByte(65001, 0, NewFilename, NewFilenameLen, MultiByte, LEN(MultiByte), 0, 0)

...

ICU C++ Converting Encodings

You can use ICU, but you may find iconv() sufficient; it is a lot simpler to set up and operate (and it's part of POSIX, and easily available for Windows).

With either library, you have to convert your Unicode string to a wide string. In iconv() that target encoding is called WCHAR_T. Once you have a wide string, you can use it directly in Windows.

In Linux, you can either proceed to use wcstombs() to transform the wide character into the system's (and locale's) narrow character multibyte encoding (don't forget setlocale(LC_CTYPE, "");), or, alternatively, if you are sure that you want UTF-8 rather than the system's encoding, you can transform from your original string to UTF-8 directly (also with either library).
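To make the WCHAR_T route a little more concrete, here is a small sketch that decodes UTF-8 into a std::wstring with iconv. The function name utf8_to_wstring is hypothetical, the code assumes an iconv implementation that accepts "WCHAR_T" as an encoding name (glibc does) and the glibc-style char** input parameter, and it converts in a single call rather than in chunks.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into the platform's wchar_t encoding.
std::wstring utf8_to_wstring(const std::string& utf8) {
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");  // target first, then source
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: WCHAR_T not supported");

    // Each wchar_t produced consumes at least one input byte, so
    // utf8.size() wide characters is always enough room.
    std::wstring wide(utf8.size(), L'\0');
    char* in_ptr = const_cast<char*>(utf8.data());  // input is not modified
    std::size_t in_left = utf8.size();
    char* out_ptr = reinterpret_cast<char*>(&wide[0]);
    std::size_t out_left = wide.size() * sizeof(wchar_t);

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed: invalid or incomplete UTF-8");

    // Trim the unused tail of the buffer.
    wide.resize(wide.size() - out_left / sizeof(wchar_t));
    return wide;
}
```

From there, wcstombs (after setlocale(LC_CTYPE, "")) would take the wide string to the locale's narrow encoding, as described above.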

Maybe you'll find this post of mine to provide some background.


