Detect When a Unicode Character Cannot Be Displayed Correctly

You can use CTFontGetGlyphsForCharacters() to determine if a font has a glyph for a particular code point (note that supplementary characters need to be checked as surrogate pairs):

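#include <CoreText/CoreText.h>

CTFontRef font = CTFontCreateWithName(CFSTR("Helvetica"), 12, NULL);
const UniChar code_point[] = { 0xD83C, 0xDCA1 };  // U+1F0A1 as a UTF-16 surrogate pair
CGGlyph glyph[] = { 0, 0 };
// Returns false if the font lacks a glyph for any of the characters passed in.
bool has_glyph = CTFontGetGlyphsForCharacters(font, code_point, glyph, 2);
CFRelease(font);  // the Create rule applies, so release the font when done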

Or, in Swift:

import CoreText

let font = CTFontCreateWithName("Helvetica" as CFString, 12, nil)
var code_point: [UniChar] = [0xD83C, 0xDCA1]  // U+1F0A1 as a UTF-16 surrogate pair
var glyphs: [CGGlyph] = [0, 0]
let has_glyph = CTFontGetGlyphsForCharacters(font, &code_point, &glyphs, 2)

If you want to check the complete set of fallback fonts that the system will try to load a glyph from, you will need to check all of the fonts returned by CTFontCopyDefaultCascadeListForLanguages(). Check the answer to this question for information on how the fallback font list is created.

In Java how to detect if a unicode image can be displayed in a JButton?

It depends on the font, so use `Font::canDisplayUpTo`.

Get the button's Font and try the following:

import java.awt.Font;
import javax.swing.JButton;

JButton button = new JButton(str);
Font font = button.getFont();
int failingIndex = font.canDisplayUpTo(str);
if (failingIndex >= 0) {
  // failingIndex is the index of the first character in str that the font cannot display.
} else {
  // canDisplayUpTo returned -1: the font can display the whole string.
}

So if the font cannot render the characters as expected, use another font that can.

Unicode characters not getting displayed correctly on localhost

As far as I know, a browser doesn't have to use UTF-8 as its default encoding; it may fall back to another one, e.g. ISO-8859-2.

The browser doesn't know which encoding the file uses, so you have to send an HTTP header to tell it:

self.send_header("Content-Type", "text/plain; charset=utf-8")

Minimal working example

from http.server import HTTPServer, BaseHTTPRequestHandler

class Serv(BaseHTTPRequestHandler):

    def do_GET(self):
        text = '‾'

        self.send_response(200)

        self.send_header("Content-Type", "text/plain; charset=utf-8")

        self.end_headers()

        #self.wfile.write(bytes(text, 'utf-8'))
        self.wfile.write(text.encode('utf-8'))

print('Serving http://localhost:8080')
httpd = HTTPServer(('localhost', 8080), Serv)
httpd.serve_forever()
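
The effect of a wrong charset guess can be reproduced without a server. The following is a minimal sketch, not part of the original answer: it encodes the same character as UTF-8 and then decodes the bytes twice, once as UTF-8 and once as ISO-8859-2, which stands in here for whatever default a browser might assume.

```python
# The character from the example above and the bytes the server sends.
text = '‾'
payload = text.encode('utf-8')

# A client that is told the right charset recovers the original text.
as_utf8 = payload.decode('utf-8')

# A client that assumes ISO-8859-2 (an assumed fallback, for illustration)
# treats every byte as a separate character and produces mojibake.
as_latin2 = payload.decode('iso8859-2')

print(payload)               # the three UTF-8 bytes of U+203E
print(as_utf8 == text)       # the round trip is lossless
print(repr(as_latin2))       # three unrelated characters
```

The bytes on the wire are identical in both cases; only the declared charset decides how they are read.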

EDIT:

If you send an HTML file, you can instead declare the encoding inside the file with an HTML tag:

<meta charset="utf-8">

Minimal working example

from http.server import HTTPServer, BaseHTTPRequestHandler

class Serv(BaseHTTPRequestHandler):

    def do_GET(self):
        text = '''<!DOCTYPE html>
<html>

<head>
<meta charset="utf-8">
</head>

<body>
‾
</body>

</html>
'''
        
        self.send_response(200)
        
        self.end_headers()
        
        #self.wfile.write(bytes(text, 'utf-8'))
        self.wfile.write(text.encode('utf-8'))

print('Serving http://localhost:8080')
httpd = HTTPServer(('localhost', 8080), Serv)
httpd.serve_forever()

Unicode characters not displaying correctly on my website when using JavaScript

Use the decodeURI function, like this:

if(query != null) document.getElementById("search-text").value = decodeURI(query);

Text with unicode CODES not displaying correctly in Python 3.7

We can do that in two steps:

First, we read the file with encoding='unicode_escape' to convert all of the \uxxxx escape sequences into characters.

Then, we encode the result to a bytes object with the latin-1 codec, which maps each character to the single byte with the same value, and decode those bytes as UTF-8 to get the intended text.

with open('text.txt', encoding='unicode-escape') as f:
    text = f.read()
    print(text)
    #Edward escribiÃ³ la biografÃa de su autor favorito

    # Now we convert it to utf-8
    text = text.encode('latin1').decode('utf8')
    print(text)
    # Edward escribió la biografía de su autor favorito
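
The same two steps can be tried without a file. This is a minimal sketch that assumes the file holds UTF-8 bytes written out as \u00XX escapes; the `raw` value below is a made-up stand-in for the contents of text.txt.

```python
# Hypothetical file content: each UTF-8 byte was written as a \u00XX escape.
raw = b'Edward escribi\\u00c3\\u00b3 la biograf\\u00c3\\u00ada'

# Step 1: interpret the escape sequences. Every escape becomes one character
# whose code point equals the original byte value.
step1 = raw.decode('unicode_escape')

# Step 2: latin-1 maps those characters back to single bytes unchanged,
# and decoding the bytes as UTF-8 yields the intended text.
fixed = step1.encode('latin1').decode('utf8')

print(fixed)
```

If the second step raises UnicodeDecodeError, the escapes probably did not come from UTF-8 bytes and this repair does not apply.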

Identify unicode characters that can't be printed

While it's not very easy to tell if the terminal running your script (or the font your terminal is using) is able to render a given character correctly, you can at least check that the character actually has a representation.

The character \ua62b is defined as VAI SYLLABLE NDOLE DO, whereas the character \ua62c has no definition, which is why it may be rendered as a square or other generic symbol.

One way to check if a character is defined is to use the unicodedata module:

>>> import unicodedata
>>> unicodedata.name(u"\ua62b")
'VAI SYLLABLE NDOLE DO'
>>> unicodedata.name(u"\ua62c")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: no such name

As you can see above, a ValueError is raised for the \ua62c character because it isn't defined.

Another method is to check the category of the character. If it is Cn then the character is not assigned:

>>> import unicodedata
>>> unicodedata.category(u"\ua62b")
'Lo'
>>> unicodedata.category(u"\ua62c")
'Cn'
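
The category check can be wrapped in a small helper to scan a whole string. This is a sketch only: the function names `is_assigned` and `unassigned_chars` are made up for illustration, and the result depends on the Unicode version bundled with your Python (see unicodedata.unidata_version).

```python
import unicodedata

def is_assigned(ch):
    """Return True if ch is an assigned code point in this Python's Unicode data."""
    # 'Cn' is the general category for unassigned code points.
    return unicodedata.category(ch) != 'Cn'

def unassigned_chars(text):
    """Return the characters of text that have no Unicode definition."""
    return [ch for ch in text if not is_assigned(ch)]

sample = 'a\ua62b\ua62c'
print([hex(ord(ch)) for ch in unassigned_chars(sample)])
```

An assigned character can still fail to render if the terminal font has no glyph for it, so this only rules out one cause.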

Why can't my application display Unicode characters correctly?

Make sure to save the source file as UTF-16 or UTF-8 with BOM. Many Windows applications assume the ANSI encoding (default localized Windows code page) otherwise. You can also check compiler switches to force using UTF-8 for source files. For example, MS Visual Studio 2015's compiler has a /utf-8 switch so saving with BOM is not required.

Here's a simple example saved in UTF-8, and then UTF-8 w/ BOM and compiled with the Microsoft Visual Studio compiler. Note that there is no need to define UNICODE if you hard-code the W versions of the APIs and use L"" for wide strings:

#include <windows.h>

int main()
{
    MessageBoxW(NULL,L"ا ب ت ث ج ح خ د ذ",L"中文",MB_OK);
}

Result (UTF-8). The compiler assumed ANSI encoding (Windows-1252) and decoded the wide string incorrectly.

Corrupted image

Result (UTF-8 w/ BOM). The compiler detects the BOM and uses UTF-8 to decode the source code, resulting in the correct data generated for the wide strings.

Correct image

A little Python code demonstrating the decode error:

>>> s='中文,ا ب ت ث ج ح خ د ذ'
>>> print(s.encode('utf8').decode('Windows-1252'))
ä¸æ–‡,Ø§ Ø¨ Øª Ø« Ø¬ Ø Ø® Ø¯ Ø°
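
The BOM the compiler looks for is just three bytes at the start of the file. As a rough illustration, assuming Python's 'utf-8-sig' codec as a stand-in for an editor's "UTF-8 with BOM" save option, the difference between the two files looks like this:

```python
import codecs

# A fragment of the source file; the variable name is illustrative only.
source = 'L"中文"'

without_bom = source.encode('utf-8')      # plain UTF-8
with_bom = source.encode('utf-8-sig')     # UTF-8 preceded by the BOM

print(codecs.BOM_UTF8)                            # the three marker bytes
print(with_bom.startswith(codecs.BOM_UTF8))       # True
print(without_bom.startswith(codecs.BOM_UTF8))    # False

# Apart from the marker, the encoded bytes are the same.
print(with_bom[len(codecs.BOM_UTF8):] == without_bom)
```

A tool that sees the marker can pick UTF-8 with confidence; without it, the tool has to guess or be told through a switch.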
