How to Use Unicode in C++

C programming: How to program for Unicode?

Note that this is not about "strict unicode programming" per se, but some practical experience.

What we did at my company was to create a wrapper library around IBM's ICU library. The wrapper library has a UTF-8 interface and converts to UTF-16 when it is necessary to call ICU. In our case, we did not worry too much about performance hits. When performance was an issue, we also supplied UTF-16 interfaces (using our own datatype).

Applications could remain largely as-is (using char), although in some cases they need to be aware of certain issues. For instance, instead of strncpy() we use a wrapper which avoids cutting off UTF-8 sequences. In our case, this is sufficient, but one could also consider checks for combining characters. We also have wrappers for counting the number of codepoints, the number of graphemes, etc.
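Such a wrapper is straightforward to sketch. A hypothetical `utf8_strncpy` (the name and details are illustrative, not the actual library code) that truncates only at UTF-8 sequence boundaries might look like this:

```c
#include <stddef.h>
#include <string.h>

/* Copy at most n-1 bytes of src into dst, always NUL-terminate, and
 * never end in the middle of a UTF-8 sequence. UTF-8 continuation
 * bytes have the bit pattern 10xxxxxx (0x80..0xBF). */
static void utf8_strncpy(char *dst, const char *src, size_t n)
{
    size_t len;

    if (n == 0)
        return;

    len = strlen(src);
    if (len > n - 1)
        len = n - 1;

    /* Back up over continuation bytes so we stop just before the
     * lead byte of the sequence we would otherwise cut in half. */
    while (len > 0 && ((unsigned char)src[len] & 0xC0) == 0x80)
        len--;

    memcpy(dst, src, len);
    dst[len] = '\0';
}
```

With a 3-byte destination, copying `"héllo"` (where `é` is the two bytes `0xC3 0xA9`) yields `"h"` rather than `"h"` plus a dangling lead byte.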

When interfacing with other systems, we sometimes need to do custom character composition, so you may need some flexibility there (depending on your application).

We do not use wchar_t. Using ICU avoids unexpected issues in portability (but not other unexpected issues, of course :-).

Unicode stored in C char

There is no magic here - the C language gives you access to the raw bytes, as they are stored in the computer's memory.
If your terminal is using UTF-8 (which is likely), non-ASCII characters take more than one byte in memory. When you display them again, it is your terminal code that converts these sequences back into a single displayed character.

Just change your code to print the strlen of the strings, and you will see what I mean.

To properly handle UTF-8 non-ASCII characters in C, you have to use a library to handle them for you, such as GLib, Qt, or many others.

Printing a Unicode Symbol in C

Two problems: first, a wchar_t must be printed with the %lc format, not %c. Second, unless you call setlocale() the character set is not set up properly, and you will probably get a ? instead of your star. The following code seems to work, though:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_CTYPE, "");
    wchar_t star = 0x2605;
    wprintf(L"%lc\n", star);
}

And for ncurses, just initialize the locale before the call to initscr.

How to print Unicode character in C++?

To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444 and so in C++ you could write it '\u0444' or '\U00000444'. Also if the source code encoding supports this character then you can just write it literally in your source code.

// both of these assume that the character can be represented with
// a single char in the execution encoding
char b = '\u0444';
char a = 'ф'; // this line additionally assumes that the source character encoding supports this character

Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, the terminal emulator is using an encoding that supports this character, and that encoding matches the compiler's execution encoding, then you can do the following:

#include <iostream>

int main() {
    std::cout << "Hello, ф or \u0444!\n";
}

This program does not require that 'ф' can be represented in a single char. On OS X and most any modern Linux install this will work just fine, because the source, execution, and console encodings will all be UTF-8 (which supports all Unicode characters).

Things are harder with Windows and there are different possibilities with different tradeoffs.

Probably the best, if you don't need portable code (you'll be using wchar_t, which should really be avoided on every other platform), is to set the mode of the output file handle to take only UTF-16 data.

#include <iostream>
#include <io.h>
#include <fcntl.h>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Hello, \u0444!\n";
}

Portable code is more difficult.

How to make a unicode variable in C?

You are in a not-so-common use case: Unicode characters outside the Basic Multilingual Plane, that is, with a code point greater than U+FFFF. Those characters can be represented in a C literal with an uppercase U prefix followed by 8 hexadecimal digits, and are converted to their UTF-8 representation. Typically the ace of spades is "\U0001F0A1" or "\xf0\x9f\x82\xa1". All your cards share the same first two bytes, and only the last two vary. Once you know that, and know how the UTF-8 encoding works, you can write your Display function this way:

void Display(card cards[], int num)
{
    char unicode[7] = "\U0001F0A0"; /* U+1F0A0 PLAYING CARD BACK: f0 9f 82 a0 */
    int i = 0;
    for (i = 0; i < num; i++)
    {
        switch (cards[i].suite)
        {
            case 's':               /* spades:   U+1F0A1 ... */
                unicode[2] = 0x82;
                unicode[3] = 0xA0;
                break;
            case 'h':               /* hearts:   U+1F0B1 ... */
                unicode[2] = 0x82;
                unicode[3] = 0xB0;
                break;
            case 'b':               /* diamonds: U+1F0C1 ... */
                unicode[2] = 0x83;
                unicode[3] = 0x80;
                break;
            case 'c':               /* clubs:    U+1F0D1 ... */
                unicode[2] = 0x83;
                unicode[3] = 0x90;
                break;
        }
        unicode[3] += cards[i].val;
        if (cards[i].val >= 12)     /* skip the Knight at U+1F0xC */
        {
            unicode[3] += 1;
        }
        printf("%s\t", unicode);
    }
}

I could not really test the display, because my terminal can only display unicode characters from the BMP.

How to use Unicode block elements in C?

So what worked for me was the following:

#include <stdio.h>
#include <fcntl.h>
#include <io.h>

int main(int argc, char const *argv[]) {
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\x2590 \x2554 \x258c \x2592"); // Output: ▐ ╔ ▌ ▒
    return 0;
}

The function _setmode() sets the console file handle to UTF-16 text mode. wprintf() allows you to print wide characters (Unicode as well). The L"" prefix before the string tells the compiler that the following literal is a wide string. Thanks to everyone for their time and answers!

How can I store a unicode in C?

Since C11, to store a Unicode code point, use char32_t:

#include <uchar.h>

char32_t ch1 = 0x1F319;
char32_t ch2 = U'\U0001f319';

Works on my Windows computer.



char32_t

which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t... C11 §7.27 2

How can I use Unicode in Turbo C++?

As stated, Turbo C++ won't get you any direct access to Unicode. It is likely so old that it can't even generate code that could be made to use the system's libraries (DLLs), so even by recreating header files by hand, you could not call wprintf, which could output proper Unicode even on the arcane cmd terminal Microsoft ships with Windows to this day.

However, the default character encoding used in the cmd terminal supports some non-ASCII characters; exactly which ones depends on the language (locale) configuration of your OS. (For example, it is usually CP 850 for Western European languages and CP 852 for Central European ones, while an English Windows typically uses CP 437.)

None of these legacy 8-bit character-map encodings will include all ten digits as superscripts, but you might have some available (CP 850 features "¹", "²", and "³", for example).

So, you could check the terminal code page, and check on Wikipedia for their codes - you can inspect and change the current code page with the chcp command in the Windows terminal. If your Windows version supports UTF-8, which covers all printable Unicode characters, you can switch to it by typing chcp 65001 in the terminal. (I don't know which Windows editions support that, nor which you are using.)

Once you manage to do that, all you need is to print the byte sequences for the superscript digits in UTF-8, using the "\xHH" escape for characters in a string (I am not sure whether Turbo C++ will allow it; otherwise, printf("%c%c", 0xHH, 0xHH) will work).

For your convenience, I am attaching the codepoints and UTF-8 encodings for superscripts:

0x00B2: SUPERSCRIPT TWO - ² - utf-8 seq: b'\xc2\xb2'
0x00B3: SUPERSCRIPT THREE - ³ - utf-8 seq: b'\xc2\xb3'
0x00B9: SUPERSCRIPT ONE - ¹ - utf-8 seq: b'\xc2\xb9'
0x0670: ARABIC LETTER SUPERSCRIPT ALEF - ٰ - utf-8 seq: b'\xd9\xb0'
0x0711: SYRIAC LETTER SUPERSCRIPT ALAPH - ܑ - utf-8 seq: b'\xdc\x91'
0x2070: SUPERSCRIPT ZERO - ⁰ - utf-8 seq: b'\xe2\x81\xb0'
0x2071: SUPERSCRIPT LATIN SMALL LETTER I - ⁱ - utf-8 seq: b'\xe2\x81\xb1'
0x2074: SUPERSCRIPT FOUR - ⁴ - utf-8 seq: b'\xe2\x81\xb4'
0x2075: SUPERSCRIPT FIVE - ⁵ - utf-8 seq: b'\xe2\x81\xb5'
0x2076: SUPERSCRIPT SIX - ⁶ - utf-8 seq: b'\xe2\x81\xb6'
0x2077: SUPERSCRIPT SEVEN - ⁷ - utf-8 seq: b'\xe2\x81\xb7'
0x2078: SUPERSCRIPT EIGHT - ⁸ - utf-8 seq: b'\xe2\x81\xb8'
0x2079: SUPERSCRIPT NINE - ⁹ - utf-8 seq: b'\xe2\x81\xb9'
0x207A: SUPERSCRIPT PLUS SIGN - ⁺ - utf-8 seq: b'\xe2\x81\xba'
0x207B: SUPERSCRIPT MINUS - ⁻ - utf-8 seq: b'\xe2\x81\xbb'
0x207C: SUPERSCRIPT EQUALS SIGN - ⁼ - utf-8 seq: b'\xe2\x81\xbc'
0x207D: SUPERSCRIPT LEFT PARENTHESIS - ⁽ - utf-8 seq: b'\xe2\x81\xbd'
0x207E: SUPERSCRIPT RIGHT PARENTHESIS - ⁾ - utf-8 seq: b'\xe2\x81\xbe'
0x207F: SUPERSCRIPT LATIN SMALL LETTER N - ⁿ - utf-8 seq: b'\xe2\x81\xbf'
0xFC5B: ARABIC LIGATURE THAL WITH SUPERSCRIPT ALEF ISOLATED FORM - ﱛ - utf-8 seq: b'\xef\xb1\x9b'
0xFC5C: ARABIC LIGATURE REH WITH SUPERSCRIPT ALEF ISOLATED FORM - ﱜ - utf-8 seq: b'\xef\xb1\x9c'
0xFC5D: ARABIC LIGATURE ALEF MAKSURA WITH SUPERSCRIPT ALEF ISOLATED FORM - ﱝ - utf-8 seq: b'\xef\xb1\x9d'
0xFC63: ARABIC LIGATURE SHADDA WITH SUPERSCRIPT ALEF ISOLATED FORM - ﱣ - utf-8 seq: b'\xef\xb1\xa3'
0xFC90: ARABIC LIGATURE ALEF MAKSURA WITH SUPERSCRIPT ALEF FINAL FORM - ﲐ - utf-8 seq: b'\xef\xb2\x90'
0xFCD9: ARABIC LIGATURE HEH WITH SUPERSCRIPT ALEF INITIAL FORM - ﳙ - utf-8 seq: b'\xef\xb3\x99'

(This was generated with the following Python snippet in interactive mode:)

import unicodedata

for i in range(0, 0x10FFFF):
    char = chr(i)
    try:
        name = unicodedata.name(char)
    except ValueError:
        continue  # unnamed code point (the original "pass" could reuse a stale name)
    if "SUPERSCRIPT" not in name:
        continue
    print(f"0x{i:04X}: {name} - {char} - utf-8 seq: {char.encode('utf-8')}")


