How to Decode a UTF-16 String into a Unicode Character

How to decode a UTF-16 string into a Unicode character

This is a little bit of a cheat, but UTF-16 happens to be the encoding used by NSString, so you can borrow the methods of NSString to achieve it:

extension String {
    /// Decodes a string of literal "\uXXXX" escape sequences into the characters they represent.
    func decodeBlock() -> String? {
        var chars = [unichar]()

        // Split on the literal "\u" delimiter and parse each chunk as a hexadecimal UTF-16 code unit.
        for substr in self.components(separatedBy: "\\u") where !substr.isEmpty {
            if let value = UInt16(substr, radix: 16) {
                chars.append(value)
            } else {
                return nil
            }
        }

        // NSString is backed by UTF-16, so it can assemble the code units (including surrogate pairs) directly.
        return NSString(characters: chars, length: chars.count) as String
    }
}

if let decoded = "\\uD83E\\uDD1B\\uD83C\\uDFFD".decodeBlock() {
    print(decoded)   // 🤛🏽 (U+1F91B followed by the U+1F3FD skin-tone modifier)
} else {
    print("Cannot decode")
}

How to decode UTF-16 with % as a delimiter string to the original form in Python 3?

An easy way to do it is to replace % with \ so the bytes form a Python literal with escaped Unicode characters, and then decode it with the unicode-escape codec.

s = b'%u062a%u0633%u062a'
print(s.replace(b'%', b'\\').decode('unicode-escape'))   # تست
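
If the escaped data can also contain surrogate pairs (for example %ud83d%ude00 for an emoji), unicode-escape leaves the two surrogate code units as separate characters. A minimal sketch of one way to rejoin them, assuming the %uXXXX escapes always arrive in well-formed UTF-16 order (decode_percent_u is just an illustrative name):

def decode_percent_u(s: bytes) -> str:
    # Turn %uXXXX escapes into \uXXXX and decode them as escape sequences;
    # astral characters come back as two surrogate characters at this point.
    text = s.replace(b'%', b'\\').decode('unicode-escape')
    # Re-encode the surrogates verbatim and decode as UTF-16 to pair them up.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')

print(decode_percent_u(b'%u062a%u0633%u062a'))  # تست
print(decode_percent_u(b'%ud83d%ude00'))        # 😀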

What's the most efficient way to decode a UTF-16 binary?

OK, there doesn't seem to be an easy way to do it, so I just added two codecs, UTF-16LE and UTF-16BE, for this purpose. See this commit: https://github.com/zsx/r3/commit/630945070eaa4ae4310f53d9dbf34c30db712a21

With this change, you can do:

>> b: encode 'utf-16le "hello"
== #{680065006C006C006F00}

>> s: decode 'utf-16le b
== "hello"

>> b: encode 'utf-16be "hello"
== #{00680065006C006C006F}

>> s: decode 'utf-16be b
== "hello"

How can I convert a Unicode string containing characters that are out of range for UTF-8 or UTF-16 into binary or hex in Python?

The answer is: you don't! You can't do that and end up with a byte sequence that will, by itself, represent the original text.

The fact is that if a Unicode character does not have a representation in UTF-8 or UTF-16, it can't be represented as such, end of story.

If you end up with arbitrary data inside a text string and have to store it as bytes, you can use one of the "charmap" codecs, in which each character in the 0-255 range has a representation; those characters can then round trip to bytes and back to text (but you should really just keep the data as bytes anyway).
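
For instance, the latin-1 charmap codec round trips every byte value (a minimal sketch in Python):

raw = bytes(range(256))       # every possible byte value
text = raw.decode("latin-1")  # each byte maps to the code point with the same number
assert text.encode("latin-1") == raw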

If you have arbitrary higher code points that are "non-characters", normally you can't encode them. The UTF-8 and UTF-16 specifications allow arbitrary characters to be encoded, since they describe the encodings as bit-field mappings from which the code point value can be recovered. However, the special "surrogate" character class, which contains exactly the characters UTF-16 uses to represent characters outside of the Basic Multilingual Plane (BMP), is explicitly ruled out.

Fortunately (or unfortunately, since it looks like you may be doing "the wrong thing" to start with), Python has, since Python 3.1, explicitly allowed surrogate characters to be encoded in UTF-8 (and later in UTF-16 and UTF-32) by selecting a special "errors" policy on encoding and decoding.

Keep in mind, as I wrote in the opening sentence, that the resulting byte sequence is not valid UTF-8 (or UTF-16) "as is": any code consuming this data back has to know how the byte sequence was created and use the same "allow surrogates" policy on decoding:


In [75]: a = "maçã\ud875"

In [76]: b = a.encode("utf-8", errors="surrogatepass")

In [77]: b
Out[77]: b'ma\xc3\xa7\xc3\xa3\xed\xa1\xb5'

In [78]: b.decode("utf-8")
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-78-a863a95176d0> in <module>
----> 1 b.decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 6: invalid continuation byte

In [79]: b.decode("utf-8", errors="surrogatepass")
Out[79]: 'maçã\ud875'

In [80]: b.decode("utf-8", errors="surrogatepass") == a
Out[80]: True

You could also use errors="xmlcharrefreplace" or errors="backslashreplace", but reading those back would be even more cumbersome; besides, if the text contained literal sequences that look like those escape representations, they would be converted to characters in the final form. The positive point of doing this is that the resulting bytes are valid UTF-8:

In [82]: a = "maçã\ud875"                                                                                                                                                                      

In [83]: b = a.encode("utf8", errors="backslashreplace")

In [84]: b
Out[84]: b'ma\xc3\xa7\xc3\xa3\\ud875'

In [85]: c = b.decode("utf-8")

In [86]: c == a
Out[86]: False

In [87]: c
Out[87]: 'maçã\\ud875'

In [88]: d = c.encode("latin1").decode("unicode_escape")

In [89]: d
Out[89]: 'maçã\ud875'

In [90]: d == a
Out[90]: True

How to convert a Unicode string into a UTF-8 or UTF-16 string?

Short answer:

No conversion is required if you use Unicode strings such as CString or wstring. Use sqlite3_open16().
You will have to make sure you pass a WCHAR pointer (cast to void *; seems lame, and even though this lib is cross-platform, I guess they could have defined a wide-char type that depends on the platform and is less unfriendly than a void *) to the API. For example, for a CString: (void*)(LPCWSTR)strFilename

The longer answer:

You don't have a Unicode string that you want to convert to UTF-8 or UTF-16. You have a Unicode string represented in your program using a given encoding; Unicode is not a binary representation per se. Encodings say how the Unicode code points (numerical values) are represented in memory (the binary layout of the number). UTF-8 and UTF-16 are the most widely used encodings, and they are very different.

When a VS project says "Unicode charset", it actually means "characters are encoded as UTF-16". Therefore, you can use sqlite3_open16() directly. No conversion is required. Characters are stored in the WCHAR type (as opposed to char), which takes 16 bits (it falls back on the standard C type wchar_t, which takes 16 bits on Win32 but might be different on other platforms; thanks for the correction, Checkers).

There's one more detail you might want to pay attention to: UTF-16 exists in two flavors, big endian and little endian; that's the byte ordering of these 16 bits. The function prototype you give for UTF-16 doesn't say which ordering is used, but you're pretty safe assuming that SQLite uses the same endianness as Windows (little endian IIRC; I know the order but have always had problems with the names :-) ).
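
The difference between the two byte orders is easy to see; here is a small illustration in Python rather than C++, purely to show the bytes:

s = "Aé"
print(s.encode("utf-16-le").hex())  # 4100e900 - low byte of each code unit first
print(s.encode("utf-16-be").hex())  # 004100e9 - high byte first
print(s.encode("utf-16").hex())     # fffe4100e900 - the generic codec prepends a BOM (FF FE on a little-endian machine)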

EDIT: Answer to comment by Checkers:

UTF-16 uses 16-bit code units. Under Win32 (and only on Win32), wchar_t is used for such a storage unit. The trick is that some Unicode characters require a sequence of two such 16-bit code units. They are called surrogate pairs.

In the same way, UTF-8 represents one character using a sequence of 1 to 4 bytes, yet UTF-8 is used with the char type.
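
To make the surrogate-pair point concrete, here is a small Python illustration (not C++, purely to show the counts):

ch = "\U0001F600"                    # 😀, a code point outside the BMP
utf16 = ch.encode("utf-16-le")
utf8 = ch.encode("utf-8")
print(len(utf16) // 2, utf16.hex())  # 2 code units: 3dd800de (the surrogate pair D83D DE00)
print(len(utf8), utf8.hex())         # 4 bytes: f09f9880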

How to get a character from its UTF-16 code points in Python 3?

The trick is not to mess with chr but rather to convert to a byte array, which you can then decode into a string:

a, b = 55357, 56501
x = a.to_bytes(2, 'little') + b.to_bytes(2, 'little')

print(x.decode('UTF-16'))

This can be generalized for any number of integers:

data = [55357, 56501]
b = bytes([x for c in data for x in c.to_bytes(2, 'little')])
result = b.decode('utf-16')

The reason something like chr(55357) + chr(56501) doesn't work is that chr assumes no encoding. It works on the raw Unicode code points, so you are combining two distinct characters. As the other answer points out, you then have to encode this two-character string and re-decode it, or just build the bytes and decode once, as I'm suggesting.
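
The encode-and-re-decode route mentioned above looks roughly like this; surrogatepass is needed because the intermediate string holds lone surrogates:

a, b = 55357, 56501
s = chr(a) + chr(b)   # two surrogate code points, not one character
result = s.encode('utf-16', 'surrogatepass').decode('utf-16')
print(result)         # same character as the byte-array approach above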

UTF-16 Character Encoding of Java

In the UTF-16 version, you get 14 bytes because of a byte-order marker inserted to distinguish between big endian (the default) and little endian. If you specify UTF-16LE you will get 12 bytes (little endian, no byte-order marker added).

See http://www.unicode.org/faq/utf_bom.html#gen7
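
For the five-character string "hello" (as used in the example program below), the same BOM effect can be checked in a couple of lines of Python:

s = "hello"
print(len(s.encode("utf-16")))     # 12: 2-byte BOM + 5 characters * 2 bytes
print(len(s.encode("utf-16-le")))  # 10: no BOM
print(len(s.encode("utf-16-be")))  # 10: no BOM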


EDIT - Use this program to look into the actual bytes generated by different encodings:

public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");

        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i = 0; i < bs.length; i++) {
            int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
            System.err.print(hex[b >> 4] + "" + hex[b & 0xf]
                + (!Character.isISOControl((char) b) ? "" + (char) b : ".")
                + ((i % 4 == 3) ? "\n" : " "));
        }
        System.err.println();
    }
}

For example, when running under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up differently), the output is:

$ javac Test.java  && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo

And

$ javac Test.java  && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.

And

$ javac Test.java  && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo

