Unicode Processing in C++

C programming: How to program for Unicode?

Note that this is not about "strict Unicode programming" per se, but about some practical experience.

What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).

Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
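The wrapper itself isn't shown above; a minimal sketch of the idea, assuming a strncpy-like signature (the name utf8_safe_copy is ours, not the library's):

#include <cstddef>
#include <cstring>

// Copy at most size-1 bytes of a UTF-8 string, never splitting a multi-byte
// sequence, and always null-terminate. Continuation bytes match 10xxxxxx.
void utf8_safe_copy(char* dst, const char* src, std::size_t size)
{
    if (size == 0) return;
    std::size_t len = std::strlen(src);
    if (len >= size) {
        len = size - 1;
        // Back up until the first byte *not* copied is no longer a
        // continuation byte, so the copy ends on a character boundary.
        while (len > 0 &&
               (static_cast<unsigned char>(src[len]) & 0xC0) == 0x80)
            --len;
    }
    std::memcpy(dst, src, len);
    dst[len] = '\0';
}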

When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).

We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).

Unicode Processing in C++


  • Use ICU for dealing with your data (or a similar library)
  • In your own data store, make sure everything is stored in the same encoding
  • Make sure you are always using your Unicode library for mundane tasks like string length, capitalization status, etc. Never use standard library builtins like isalpha() unless that is the definition you want.
  • I can't say it enough: never iterate over the indices of a string if you care about correctness; always use your Unicode library for this (see the sketch right after this list).
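To make the last two points concrete, here is a hedged sketch using ICU4C (the calls used are standard ICU API, but this is an illustration under our own assumptions, not the poster's code; link against the ICU common library):

#include <unicode/unistr.h>
#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <iostream>
#include <memory>

int main()
{
    // Source file assumed to be saved as UTF-8.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("señor niño");

    // Code points, as opposed to UTF-16 code units (s.length()).
    std::cout << "code points: " << s.countChar32() << "\n";

    // Graphemes (user-perceived characters) via a character BreakIterator.
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> bi(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status)) return 1;

    bi->setText(s);
    int graphemes = 0;
    bi->first();
    while (bi->next() != icu::BreakIterator::DONE)
        ++graphemes;
    std::cout << "graphemes: " << graphemes << "\n";
}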

Is it actually possible to store and process individual UTF-8 characters in C? If so, how?

C and UTF-8 are still getting to know each other. In other words, IMO, C support for UTF-8 is scant.

Is it ... possible to store and process individual UTF-8 characters ...?

First step is to make certain "ايه الاخبار" is a UTF-8 encoded string. C supports this explicitly with u8"ايه الاخبار".

A UTF-8 string is a sequence of char. Each 1 to 4 chars represents one Unicode character. A Unicode character needs at least 21 bits of storage. Yet the OP does not need to convert a portion of string[] into a Unicode code point so much as to segment that string on UTF-8 character boundaries. These boundaries are readily found by looking for UTF-8 continuation bytes.

The following code builds, one at a time, a single Unicode character encoded as a UTF-8 string with an accompanying terminating null character, then prints that short string.

#include <stdio.h>

int main(void) {
    char *string = u8"ايه الاخبار";
    for (char *s = string; *s; ) {
        printf("<");
        char u[5];                /* one code point: at most 4 UTF-8 bytes + '\0' */
        char *p = u;
        *p++ = *s++;              /* copy the lead byte */
        /* Copy up to 3 continuation bytes (bit pattern 10xxxxxx). */
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        if ((*s & 0xC0) == 0x80) *p++ = *s++;
        *p = 0;
        printf("%s", u);
        printf(">\n");
    }
}

With the output viewed on a UTF-8 aware screen:

<ا>
<ي>
<ه>
< >
<ا>
<ل>
<ا>
<خ>
<ب>
<ا>
<ر>

Process Unicode string in C and Objective-C

Your problem is that you start off with a UTF-16 encoded NSString and produce a sequence of UTF-8 encoded bytes. The number of code units required to represent a string in UTF-16 may not be equal to that number required to represent it in UTF-8, so the offsets in your two forms may not match - as you have found out.
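Not NSString-specific, but a small C++ illustration of the offset mismatch (C++17, where u8 literals are still plain char):

#include <iostream>
#include <string>

int main()
{
    // 'é' (U+00E9) is one UTF-16 code unit but two UTF-8 bytes,
    // so indices past it differ between the two encodings.
    std::u16string utf16 = u"h\u00E9llo";
    std::string    utf8  = u8"h\u00E9llo";

    std::cout << utf16.size() << "\n";  // 5 UTF-16 code units
    std::cout << utf8.size()  << "\n";  // 6 UTF-8 code units (bytes)
}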

Why are you using C to scan the string for matches in the first place? You might want to look at NSString's rangeOfCharacterFromSet:options:range: method, which you can use to find the next occurrence of a character from your set.

If you need to use C then convert your string into a sequence of UTF-16 words and use uint16_t on the C side.

HTH

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?


Which string container should I pick?

That is really up to you to decide, based on your own particular needs. Any of the choices you have presented will work, and they each have their own advantages and disadvantages. Generically, UTF-8 is good to use for storage and communication purposes and is backwards compatible with ASCII, whereas UTF-16/32 is easier to use when processing Unicode data.

std::wstring (don't really know much about it)

The size of wchar_t is compiler-dependent and even platform-dependent. For instance, on Windows, wchar_t is 2 bytes, making std::wstring usable for UTF-16 encoded strings. On other platforms, wchar_t may be 4 bytes instead, making std::wstring usable for UTF-32 encoded strings. That is why wchar_t/std::wstring is generally not used in portable code, and why char16_t/std::u16string and char32_t/std::u32string were introduced in C++11. Even char can have portability issues for UTF-8, since char can be either signed or unsigned at the discretion of the compiler vendor, which is why char8_t/std::u8string was introduced in C++20 for UTF-8.
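A quick way to see what your own toolchain does (the sizes in the comments are typical, not guaranteed):

#include <iostream>
#include <limits>

int main()
{
    std::cout << sizeof(wchar_t)  << "\n";  // 2 on Windows, 4 on most Unix-like platforms
    std::cout << sizeof(char16_t) << "\n";  // 2
    std::cout << sizeof(char32_t) << "\n";  // 4
    std::cout << std::boolalpha
              << std::numeric_limits<char>::is_signed << "\n";  // char signedness varies too
}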

Should I stick entirely to one of the above containers or change them when needed?

Use whatever containers suit your needs.

Typically, you should use one string type throughout your code. Perform data conversions only at the boundaries where string data enters/leaves your program. For instance, when reading/writing files, network communications, platform system calls, etc.

How to properly convert between them?

There are many ways to handle that.

C++11 and later have std::wstring_convert/std::wbuffer_convert. But these were deprecated in C++17.

There are 3rd party Unicode conversion libraries, such as iconv, ICU, etc.

There are C library functions, platform system calls, etc.
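As one example, a UTF-8 / UTF-16 round trip through ICU (one of the third-party options just mentioned) might look like this sketch:

#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main()
{
    // Source file assumed to be saved as UTF-8.
    std::string utf8_in = "zażółć gęślą jaźń";

    // UTF-8 -> ICU's internal UTF-16 representation.
    icu::UnicodeString utf16 = icu::UnicodeString::fromUTF8(utf8_in);

    // ... process the string as UTF-16 here ...

    // UTF-16 -> UTF-8 again.
    std::string utf8_out;
    utf16.toUTF8String(utf8_out);
    std::cout << utf8_out << "\n";
}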

Can I use non-english characters in string literals, when using UTF strings, such as Polish characters: ąćęłńśźż etc?

Yes, if you use appropriate string literal prefixes:

u8 for UTF-8.

L for UTF-16 or UTF-32 (depending on compiler/platform).

u for UTF-16.

U for UTF-32.

Also, be aware that the charset you use to save your source files can affect how the compiler interprets string literals. So whatever charset you choose to save your files in, such as UTF-8, make sure you tell your compiler what that charset is, or else you may end up with the wrong string values at runtime.
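Putting the prefixes together with the Polish characters from the question, a minimal C++17 sketch (with MSVC you would typically compile with /utf-8; with GCC/Clang the UTF-8 defaults or -finput-charset=utf-8 apply):

#include <string>

int main()
{
    std::string    s8  = u8"ąćęłńśźż";  // UTF-8 (in C++20 this needs std::u8string)
    std::wstring   sw  = L"ąćęłńśźż";   // UTF-16 or UTF-32, depending on platform
    std::u16string s16 = u"ąćęłńśźż";   // UTF-16
    std::u32string s32 = U"ąćęłńśźż";   // UTF-32
}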

What changes when we store UTF-8 encoded characters in std::string? Are they limited to one-byte ASCII characters or can they be multi-byte?

Each string character may be a single-byte, or be part of a multi-byte representation of a Unicode codepoint. It depends on the encoding of the string, and the character being encoded.

Likewise, std::wstring (when wchar_t is 2 bytes) and std::u16string can hold strings containing supplementary characters outside of the Unicode BMP, which require UTF-16 surrogate pairs to encode.

When a string container contains a UTF encoded string, each "character" is just a UTF code unit. UTF-8 encodes a Unicode codepoint as 1-4 code units (1-4 chars in a std::string). UTF-16 encodes a codepoint as 1-2 code units (1-2 wchar_t/char16_t elements in a std::wstring/std::u16string). UTF-32 encodes a codepoint as 1 code unit (1 char32_t in a std::u32string).
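To see those counts for one code point outside the BMP (again C++17, where u8 literals are still plain char):

#include <iostream>
#include <string>

int main()
{
    // U+1F600 GRINNING FACE lies outside the BMP.
    std::string    s8  = u8"\U0001F600";
    std::u16string s16 = u"\U0001F600";
    std::u32string s32 = U"\U0001F600";

    std::cout << s8.size()  << "\n";  // 4 code units (bytes)
    std::cout << s16.size() << "\n";  // 2 code units (a surrogate pair)
    std::cout << s32.size() << "\n";  // 1 code unit
}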

What happens when I do the following?

std::string s = u8"foo";
s += 'x';

Exactly what you would expect. A std::string holds char elements. Regardless of encoding, operator+=(char) will simply append a single char to the end of the std::string.

How can I distinguish UTF char[] and non-UTF char[] or std::string?

You would need to have outside knowledge of the string's original encoding, or else perform your own heuristic analysis of the char[]/std::string data to see if it conforms to a UTF or not.
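A simplified sketch of such a heuristic, checking only structural well-formedness (it ignores overlong encodings and surrogate ranges, and the function name is ours):

#include <cstddef>
#include <string>

bool looks_like_utf8(const std::string& s)
{
    for (std::size_t i = 0; i < s.size(); ) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if      (c < 0x80)           extra = 0;   // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;   // 2-byte sequence lead
        else if ((c & 0xF0) == 0xE0) extra = 2;   // 3-byte sequence lead
        else if ((c & 0xF8) == 0xF0) extra = 3;   // 4-byte sequence lead
        else return false;                        // stray continuation or invalid byte
        if (i + extra >= s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;                     // missing continuation byte
        i += extra + 1;
    }
    return true;
}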

What are differences between wchar_t and other multi-byte character types?

Byte size and UTF encoding.

char = ANSI/MBCS or UTF-8

wchar_t = DBCS, UTF-16 or UTF-32, depending on compiler/platform

char8_t = UTF-8

char16_t = UTF-16

char32_t = UTF-32

Is wchar_t character or wchar_t string literal capable of storing UTF encodings?

Yes, UTF-16 or UTF-32, depending on compiler/platform. In case of UTF-16, a single wchar_t can only hold a codepoint value that is in the BMP. A single wchar_t in UTF-32 can hold any codepoint value. A wchar_t string can encode all codepoints in either encoding.

How to properly manipulate UTF strings (such as toupper/tolower conversion) and be compatible with locales simultaneously?

That is a very broad topic, worthy of its own separate question by itself.

How can I use Unicode in Turbo C++?

As stated, Turbo C++ won't give you any direct access to Unicode. It is likely so old that it can't even generate code that uses the system's libraries (DLLs), so even by recreating header files by hand you could not call wprintf, which can output proper Unicode even on the arcane cmd terminal Microsoft ships with Windows to this day.

However, the default character encoding used in the cmd terminal supports some non-ASCII characters; exactly which ones will depend on the language (locale) configuration of your OS. (For example, it is usually CP 852 for Central European languages, and CP 850 or CP 437 for English and Western European ones.)

None of these legacy 8-bit character-map encodings includes all ten digits as superscripts, but you might have some available (CP 850 features "¹,²,³", for example).

So, you could check the terminal code page, and look up its character set on Wikipedia; you can inspect and change the current code page with the chcp command in the Windows terminal. If your Windows version supports UTF-8, which covers all printable Unicode characters, you can switch to it by typing chcp 65001 in the terminal. (I don't know which Windows editions support that, nor which you are using.)

Once you manage to do that, all you need is to print the byte sequences for the superscript digits in UTF-8, using the "\xHH" escape for characters in a string. (I am not sure whether Turbo C++ will allow it; otherwise, printf("%c%c", 0xHH, 0xHH) will work.)
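For instance, to print SUPERSCRIPT TWO from the table below, either of these lines should do (a sketch; the byte values come from the table):

#include <stdio.h>

int main(void)
{
    /* U+00B2 SUPERSCRIPT TWO is the UTF-8 byte sequence 0xC2 0xB2. */
    printf("\xC2\xB2\n");          /* hex escapes inside the string literal */
    printf("%c%c\n", 0xC2, 0xB2);  /* or printing the raw bytes directly */
    return 0;
}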

For your convenience, I am attaching the codepoints and UTF-8 encodings for superscripts:

0x00B2: SUPERSCRIPT TWO - ² - utf-8 seq: b'\xc2\xb2'
0x00B3: SUPERSCRIPT THREE - ³ - utf-8 seq: b'\xc2\xb3'
0x00B9: SUPERSCRIPT ONE - ¹ - utf-8 seq: b'\xc2\xb9'
0x0670: ARABIC LETTER SUPERSCRIPT ALEF - ٰ - utf-8 seq: b'\xd9\xb0'
0x0711: SYRIAC LETTER SUPERSCRIPT ALAPH - ܑ - utf-8 seq: b'\xdc\x91'
0x2070: SUPERSCRIPT ZERO - ⁰ - utf-8 seq: b'\xe2\x81\xb0'
0x2071: SUPERSCRIPT LATIN SMALL LETTER I - ⁱ - utf-8 seq: b'\xe2\x81\xb1'
0x2074: SUPERSCRIPT FOUR - ⁴ - utf-8 seq: b'\xe2\x81\xb4'
0x2075: SUPERSCRIPT FIVE - ⁵ - utf-8 seq: b'\xe2\x81\xb5'
0x2076: SUPERSCRIPT SIX - ⁶ - utf-8 seq: b'\xe2\x81\xb6'
0x2077: SUPERSCRIPT SEVEN - ⁷ - utf-8 seq: b'\xe2\x81\xb7'
0x2078: SUPERSCRIPT EIGHT - ⁸ - utf-8 seq: b'\xe2\x81\xb8'
0x2079: SUPERSCRIPT NINE - ⁹ - utf-8 seq: b'\xe2\x81\xb9'
0x207A: SUPERSCRIPT PLUS SIGN - ⁺ - utf-8 seq: b'\xe2\x81\xba'
0x207B: SUPERSCRIPT MINUS - ⁻ - utf-8 seq: b'\xe2\x81\xbb'
0x207C: SUPERSCRIPT EQUALS SIGN - ⁼ - utf-8 seq: b'\xe2\x81\xbc'
0x207D: SUPERSCRIPT LEFT PARENTHESIS - ⁽ - utf-8 seq: b'\xe2\x81\xbd'
0x207E: SUPERSCRIPT RIGHT PARENTHESIS - ⁾ - utf-8 seq: b'\xe2\x81\xbe'
0x207F: SUPERSCRIPT LATIN SMALL LETTER N - ⁿ - utf-8 seq: b'\xe2\x81\xbf'
0xFC5B: ARABIC LIGATURE THAL WITH SUPERSCRIPT ALEF ISOLATED FORM - ﱛ - utf-8 seq: b'\xef\xb1\x9b'
0xFC5C: ARABIC LIGATURE REH WITH SUPERSCRIPT ALEF ISOLATED FORM - ﱜ - utf-8 seq: b'\xef\xb1\x9c'
0xFC5D: ARABIC LIGATURE ALEF MAKSURA WITH SUPERSCRIPT ALEF ISOLATED FORM - ﱝ - utf-8 seq: b'\xef\xb1\x9d'
0xFC63: ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM - ﱣ - utf-8 seq: b'\xef\xb1\xa3'
0xFC90: ARABIC LIGATURE ALEF MAKSURA WITH SUPERSCRIPT ALEF FINAL FORM - ﲐ - utf-8 seq: b'\xef\xb2\x90'
0xFCD9: ARABIC LIGATURE HEH WITH SUPERSCRIPT ALEF INITIAL FORM - ﳙ - utf-8 seq: b'\xef\xb3\x99'

(This was generated with the following Python snippet in interactive mode:)

import unicodedata

for i in range(0x110000):
    char = chr(i)
    try:
        name = unicodedata.name(char)
    except ValueError:
        continue  # unassigned or unnamed code point
    if "SUPERSCRIPT" not in name:
        continue
    print(f"0x{i:04X}: {name} - {char} - utf-8 seq: {char.encode('utf-8')}")

Processing Unicode characters in C++

This is either impossible, or it’s trivial. Here are the trivial approaches:

  • If no code point exceeds 127, then simply write it out in ASCII. Done.

  • If some code points exceed 127, then you must choose how to represent them in ASCII. A common strategy is to use XML syntax, as in &#x3B1; for U+03B1. This will take up to 8 ASCII characters for each trans-ASCII Unicode code point transcribed.
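A sketch of that second strategy, assuming the input has already been decoded to UTF-32 code points (the function name is ours):

#include <cstdio>
#include <string>

// Escape anything beyond ASCII as an XML numeric character reference.
std::string to_ascii_xml(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else {
            char buf[16];
            std::snprintf(buf, sizeof buf, "&#x%X;", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}

// to_ascii_xml(U"αβ") == "&#x3B1;&#x3B2;"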

The impossible ones I leave as an exercise for the original poster. I won’t even mention the foolish-but-possible (read: stupid) approaches, as these are legion. Data destruction is a capital crime in data processing, and should be treated as such.

Note that I am assuming by ‘Unicode character’ you actually mean ‘Unicode code point’; that is, a programmer-visible character. For user-visible characters, you need ‘Unicode grapheme (cluster)’ instead.

Also, unless you normalize your text first, you’ll hate the world. I suggest NFD.


EDIT

After further clarification by the original poster, it seems that what he wants to do is very easily accomplished using existing tools without writing a new program. For example, this converts a certain set of Arabic characters from a UTF-8 input file into an ASCII output file:

$ perl -CSAD -Mutf8 -pe 'tr[ابتثجحخد][abttjhhd]' < input.utf8 > output.ascii

That only handles these code points:

U+0627 ‭ ا  ARABIC LETTER ALEF
U+0628 ‭ ب ARABIC LETTER BEH
U+0629 ‭ ة ARABIC LETTER TEH MARBUTA
U+062A ‭ ت ARABIC LETTER TEH
U+062B ‭ ث ARABIC LETTER THEH
U+062C ‭ ج ARABIC LETTER JEEM
U+062D ‭ ح ARABIC LETTER HAH
U+062E ‭ خ ARABIC LETTER KHAH
U+062F ‭ د ARABIC LETTER DAL

So you’ll have to extend it to whatever mapping you want.

If you want it in a script instead of a command-line tool, it’s also easy, plus then you can talk about the characters by name by setting up a mapping, such as:

 "\N{ARABIC LETTER ALEF}"   =>  "a",
"\N{ARABIC LETTER BEH}" => "b",
"\N{ARABIC LETTER TEH}" => "t",
"\N{ARABIC LETTER THEH}" => "t",
"\N{ARABIC LETTER JEEM}" => "j",
"\N{ARABIC LETTER HAH}" => "h",
"\N{ARABIC LETTER KHAH}" => "h",
"\N{ARABIC LETTER DAL}" => "d",

If this is supposed to be a component in a larger C++ program, then perhaps you will want to implement this in C++, possibly but not necessarily using the ICU4C library, which includes transliteration support.
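If you go the ICU4C route, its built-in transforms might already be enough; a hedged sketch ("Arabic-Latin" is a standard ICU transliterator ID, but check what your ICU version ships, and link against both the common and i18n libraries):

#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <iostream>
#include <memory>
#include <string>

int main()
{
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> trans(
        icu::Transliterator::createInstance("Arabic-Latin", UTRANS_FORWARD, status));
    if (U_FAILURE(status)) return 1;

    // Source file assumed to be saved as UTF-8.
    icu::UnicodeString text = icu::UnicodeString::fromUTF8("ايه الاخبار");
    trans->transliterate(text);   // transliterate in place

    std::string out;
    text.toUTF8String(out);
    std::cout << out << "\n";
}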

But if all you need is a simple conversion, I don’t understand why you would write a dedicated C++ program. Seems like way too much work.


