How to Print Unicode Character in C++

Printing a Unicode Symbol in C

There are two problems. First, a wchar_t must be printed with the %lc conversion, not %c. Second, unless you call setlocale the character set is not set up properly, and you will probably get ? instead of your star. The following code seems to work, though:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
    setlocale(LC_CTYPE, "");
    wchar_t star = 0x2605;
    wprintf(L"%lc\n", star);
}

And for ncurses, just initialize the locale before the call to initscr.
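For example, a minimal ncurses sketch along those lines (this assumes a wide-character build of ncurses, linked with -lncursesw; the star is the same U+2605 as above):

#define _XOPEN_SOURCE_EXTENDED 1   /* expose the wide-character curses functions */
#include <locale.h>
#include <ncurses.h>

int main() {
    setlocale(LC_ALL, "");   /* must come before initscr() */
    initscr();
    addwstr(L"\u2605\n");    /* wide-character output of the star */
    refresh();
    getch();                 /* wait for a key so the output stays visible */
    endwin();
}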

Printing Unicode characters using write(2) in C

I'm working on a small piece of code that prints characters to the screen, and must support all of Unicode contained in a wchar_t, and I'm limited to only write(2).

That's a problematic combination of requirements. In particular, wchar_t character representation may very well not play nicely with using write() for output.

More generally, there are multiple issues here, among them:

  1. The members of the source and execution character sets.
  2. How to represent extended characters of the execution character set in your source (via the source character set).
  3. How to present extended characters of the execution character set to the output device of your choice, so that device handles them as desired.

Note well that C specifies only a fairly small set of characters that must be present in the execution character set. Additional, "extended", characters may be present in it, and your emoji would fall into this category. Dealing with extended characters via the standard C interfaces is a bit mushy, as the standard affords implementations a great deal of freedom in how they do things there.

So \U seems to be the way to go.

The \U introduces a "universal character name". It is important to understand that these sequences are converted to members of the execution character set during compilation.
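You can see that conversion by dumping the bytes the compiler actually stored for an ordinary string literal containing the \U sequence. A small sketch (on a typical UTF-8 toolchain this should print F0 9F A4 A1; a different execution character set would give different bytes, or a compile-time error if it cannot represent the character):

#include <stdio.h>
#include <string.h>

int main() {
    const char *s = "\U0001f921";   /* converted to the execution charset at compile time */
    for (size_t i = 0; i < strlen(s); ++i)
        printf("%02X ", (unsigned char)s[i]);
    printf("\n");
}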

However, I can't manage to convert the wchar_t into the proper escape sequence, i.e. converting wchar_t c = L'🤡'; into \U0001f921.

It is not safe to assume that '🤡' can be represented directly in the source character set, so as to use it literally in your source code. That depends on your C implementation. A universal character name is safer. Furthermore, if you want a wide character constant then you can try L'\U0001f921', but there's a good chance that wchar_t cannot represent that character. In particular, many implementations have 16-bit wchar_t, and those are unlikely to be able to support your character as a (single) wchar_t.

You may have better luck with a wide string literal: L"\U0001f921", but this is useful to you primarily if you are working with the wide-character-specific functions, which will perform appropriate encoding conversions for you. write() will not perform such conversions, so whether it produces the desired result will depend on the configuration of your runtime environment. I judge your original approach, with an ordinary string literal, to be more likely to work.
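For illustration, a sketch of that wide-character route (whether the emoji actually appears depends on your locale and terminal, and on platforms with 16-bit wchar_t the literal is stored as a surrogate pair):

#include <locale.h>
#include <wchar.h>

int main() {
    setlocale(LC_ALL, "");               /* adopt the environment's encoding */
    wprintf(L"%ls\n", L"\U0001f921");    /* the wide output function converts for you */
}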

If you wish, and if you can use C 2011 (C11) features, then you can also express a (regular) string literal that is defined to be encoded in UTF-8, regardless of what the actual execution character set is. The form for that would be u8"\U0001f921". Again, though, producing your desired result this way depends on your environment. UTF-8 literals are better suited to interacting with interfaces that are specifically defined to use UTF-8.
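A sketch of that approach, feeding the UTF-8 literal to write() (note that write() is POSIX, not standard C, and that the cast matters only if you compile this as C++20, where u8 literals have type char8_t):

#include <string.h>
#include <unistd.h>

int main() {
    const char *emoji = (const char *)u8"\U0001f921";   /* UTF-8 regardless of execution charset */
    write(1, emoji, strlen(emoji));
    write(1, "\n", 1);
}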

Can I even do that in C?

It is not safe to assume that your emoji character can be represented by a single object of type wchar_t. There may be C implementations that support it, but I think they are uncommon.
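If you want to check what your own platform offers, WCHAR_MAX from <wchar.h> tells you whether a single wchar_t can hold a code point as large as U+1F921. A small sketch:

#include <stdio.h>
#include <wchar.h>

int main() {
#if WCHAR_MAX >= 0x10FFFF
    puts("wchar_t can hold any Unicode code point as a single unit (e.g. 32-bit wchar_t on Linux)");
#else
    puts("wchar_t is too narrow for U+1F921 as a single unit (e.g. 16-bit wchar_t on Windows)");
#endif
    printf("sizeof(wchar_t) == %zu\n", sizeof(wchar_t));
}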

One final note: this code ...

write(1, "\U0001f921", 6);

... almost certainly exhibits undefined behavior as a result of overrunning the bounds of the char array you are presenting to write(). I don't see any plausible scenario in which it is longer than 5 characters, but you write 6, overrunning by at least 1. If the internal representation is UTF-8, then that array will have length 5 -- four bytes encoding the character, and one for the string terminator.

You should measure the length to find out how many bytes to write, for example:

const char *emoji = "\U0001f921";
write(1, emoji, strlen(emoji));

How to print Unicode character in C++?

To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444 and so in C++ you could write it '\u0444' or '\U00000444'. Also if the source code encoding supports this character then you can just write it literally in your source code.

// both of these assume that the character can be represented with
// a single char in the execution encoding
char b = '\u0444';
char a = 'ф'; // this line additionally assumes that the source character encoding supports this character

Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, if the terminal emulator is using an encoding that supports this character, and if that encoding matches the compiler's execution encoding, then you can do the following:

#include <iostream>

int main() {
    std::cout << "Hello, ф or \u0444!\n";
}

This program does not require that 'ф' can be represented in a single char. On OS X and most any modern Linux install this will work just fine, because the source, execution, and console encodings will all be UTF-8 (which supports all Unicode characters).

Things are harder with Windows and there are different possibilities with different tradeoffs.

Probably the best option, if you don't need portable code (you'll be using wchar_t, which should really be avoided on every other platform), is to set the mode of the output file handle to take only UTF-16 data:

#include <iostream>
#include <io.h>
#include <fcntl.h>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Hello, \u0444!\n";
}

Portable code is more difficult.

C++: How can I print Unicode characters to a text file?

std::ofstream operates on char data and does not have any operator<< that takes wchar_t data as input. It does have an operator<< that takes int, and wchar_t is implicitly convertible to int. That is why the character is being output as a number.

You need to use std::wofstream instead for wchar_t data.
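For example, a minimal sketch (the file name is just a placeholder; the imbued locale decides how the wide characters are encoded in the file, so with a UTF-8 environment locale you get UTF-8 bytes, and std::locale("") assumes the environment names a valid locale):

#include <fstream>
#include <locale>

int main() {
    std::wofstream out("unicode.txt");
    out.imbue(std::locale(""));      // convert wchar_t to the environment's narrow encoding
    out << L"Hello, \u0444!\n";      // written as text, not as a number
}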

How do I print Unicode to the output console in C with Visual Studio?

This is code that works for me (VS2017, project with Unicode enabled):

#include <stdio.h>
#include <io.h>
#include <fcntl.h>

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    const wchar_t *test = L"the 来. Testing unicode -- English -- Ελληνικά -- Español.";

    wprintf(L"%ls\n", test);
}

This is the console output (see the screenshot).

After copying it to Notepad++ I see the proper string:

the 来. Testing unicode -- English -- Ελληνικά -- Español.

OS - Windows 7 English, Console font - Lucida Console

Edits based on comments

I tried to fix the above code to work with VS2019 on Windows 10, and the best I could come up with is this:

#include <stdio.h>
int main()
{
    const auto *test = L"the 来. Testing unicode -- English -- Ελληνικά -- Español.";

    wprintf(L"%ls\n", test);
}

When it is run "as is", I see the output shown in the screenshot captioned "Default console settings".

When it is run with the console set to the Lucida Console font and UTF-8 encoding, I see the output shown in the screenshot captioned "Console switched to UTF-8".

As for the 来 character being shown as an empty rectangle: I suppose this is a limitation of the font, which does not contain all Unicode glyphs.

When the text is copied from the latter console to Notepad++, all characters are shown correctly.
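If you prefer not to change the console settings by hand, the switch to UTF-8 can also be requested from the program itself. A sketch assuming the Win32 call SetConsoleOutputCP and a UTF-8 execution character set (for example, compiling with /utf-8); the font limitation mentioned above still applies, so 来 may continue to show as an empty rectangle:

#include <windows.h>
#include <stdio.h>

int main() {
    SetConsoleOutputCP(CP_UTF8);   /* ask the console to interpret output as UTF-8 */
    printf("the 来. Testing unicode -- English -- Ελληνικά -- Español.\n");
}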


