UTF-8 to Unicode Code Points

Why can't the UTF-8 encoding of Unicode code points fit in 3 bytes?

"unicode" is not an encoding. The common encodings for Unicode are UTF-8, UTF-16 and UTF-32. UTF-8 uses 1-, 2-, 3- or 4-byte sequences and is explained below. It is the overhead of the leading/trailing bit sequences that requires 4 bytes for a 21-bit value.

The UTF-8 encoding uses up to 4 bytes to represent Unicode code points using the following bit patterns:

1-byte UTF-8 = 0xxxxxxx (binary) = 7 bits = U+0000 to U+007F

2-byte UTF-8 = 110xxxxx 10xxxxxx (binary) = 11 bits = U+0080 to U+07FF

3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx (binary) = 16 bits = U+0800 to U+FFFF

4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (binary) = 21 bits = U+10000 to U+10FFFF

The advantage of UTF-8 is that lead bytes and trailing bytes each have a distinct, recognizable bit pattern, which allows easy validation of a correct UTF-8 sequence.

Note also it is illegal to use a longer encoding for a Unicode value that fits into a smaller sequence. For example:

1100_0001 1000_0001 (binary), or C1 81 (hex), encodes U+0041, but 0100_0001 (binary), i.e. 41 (hex), is the shorter sequence, so the two-byte form is invalid (an "overlong" encoding).

Ref: https://en.wikipedia.org/wiki/UTF-8
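
To see those widths in practice, here is a small Go sketch (it assumes only the standard unicode/utf8 package; the sample code points are chosen arbitrarily, one from each range above). It prints the byte length and UTF-8 bytes of each code point, and shows that the overlong C1 81 sequence is rejected:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// One code point from each range above: 1, 2, 3 and 4 UTF-8 bytes.
	for _, r := range []rune{0x0041, 0x07FF, 0xFFFF, 0x10FFFF} {
		fmt.Printf("U+%04X needs %d byte(s): % X\n", r, utf8.RuneLen(r), []byte(string(r)))
	}

	// The overlong two-byte form of U+0041 (C1 81) is not valid UTF-8.
	fmt.Println(utf8.Valid([]byte{0xC1, 0x81})) // false
}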

How to efficiently convert between unicode code points and UTF-8 literals in python?

Actually, I don't think you need to go via UTF-8 at all here. int will give you the code point:

>>> int('00A1', 16)
161

And then it's just chr

>>> chr(161)
'¡'

UTF-8 hex to Unicode code point (math only)

This video is the perfect source (watch from 6:15), but here is its summary, followed by a code sketch in Go after the table. With letters I mark the bits taken from the UTF-8 bytes; hopefully it makes sense. Once you understand the logic, it's easy to apply the bitwise operators:

Bytes: 1-byte (ASCII)   Char: E
  UTF-8 byte pattern:   1. 0xxx xxxx
  UTF-8 bytes:          0100 0101 or 0x45
  Code point pattern:   0xxx xxxx
  Unicode code point:   0100 0101 or U+0045
  Explanation:          no conversion needed; the value is the same as a UTF-8 byte and as a Unicode code point

Bytes: 2-byte           Char: Ê
  UTF-8 byte patterns:  1. 110x xxxx
                        2. 10yy yyyy
  UTF-8 bytes:          1100 0011 1000 1010 or 0xC38A
  Code point pattern:   0xxx xxyy yyyy
  Unicode code point:   0000 1100 1010 or U+00CA
  Explanation:          1. first 5 bits of the 1st byte
                        2. first 6 bits of the 2nd byte

Bytes: 3-byte           Char: あ
  UTF-8 byte patterns:  1. 1110 xxxx
                        2. 10yy yyyy
                        3. 10zz zzzz
  UTF-8 bytes:          1110 0011 1000 0001 1000 0010 or 0xE38182
  Code point pattern:   xxxx yyyy yyzz zzzz
  Unicode code point:   0011 0000 0100 0010 or U+3042
  Explanation:          1. first 4 bits of the 1st byte
                        2. first 6 bits of the 2nd byte
                        3. first 6 bits of the 3rd byte

Bytes: 4-byte
  UTF-8 byte patterns:  1. 1111 0xxx
                        2. 10yy yyyy
                        3. 10zz zzzz
                        4. 10ww wwww
  UTF-8 bytes:          1111 0000 1001 0000 1000 0100 1001 1111 or 0xF090_849F
  Code point pattern:   000x xxyy yyyy zzzz zzww wwww
  Unicode code point:   0000 0001 0000 0001 0001 1111 or U+1011F
  Explanation:          1. first 3 bits of the 1st byte
                        2. first 6 bits of the 2nd byte
                        3. first 6 bits of the 3rd byte
                        4. first 6 bits of the 4th byte
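
A minimal Go sketch of that bit manipulation (decode here is just an illustrative helper name; it assumes a complete, valid sequence, and real code should also validate the lead and trailing bytes):

package main

import "fmt"

// decode reads one UTF-8 sequence from b and returns the code point,
// using only the bit patterns from the table above (no library calls).
func decode(b []byte) rune {
	switch {
	case b[0]&0x80 == 0x00: // 0xxxxxxx
		return rune(b[0])
	case b[0]&0xE0 == 0xC0: // 110xxxxx 10yyyyyy
		return rune(b[0]&0x1F)<<6 | rune(b[1]&0x3F)
	case b[0]&0xF0 == 0xE0: // 1110xxxx 10yyyyyy 10zzzzzz
		return rune(b[0]&0x0F)<<12 | rune(b[1]&0x3F)<<6 | rune(b[2]&0x3F)
	default: // 11110xxx 10yyyyyy 10zzzzzz 10wwwwww
		return rune(b[0]&0x07)<<18 | rune(b[1]&0x3F)<<12 | rune(b[2]&0x3F)<<6 | rune(b[3]&0x3F)
	}
}

func main() {
	fmt.Printf("%U\n", decode([]byte{0x45}))                   // U+0045
	fmt.Printf("%U\n", decode([]byte{0xC3, 0x8A}))             // U+00CA
	fmt.Printf("%U\n", decode([]byte{0xE3, 0x81, 0x82}))       // U+3042
	fmt.Printf("%U\n", decode([]byte{0xF0, 0x90, 0x84, 0x9F})) // U+1011F
}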

UTF-8 to Unicode Code Points

Converting one character set to another can be done with iconv:

http://php.net/manual/en/function.iconv.php

Note that UTF-8 is already a Unicode encoding.

Another way is simply using htmlentities with the right character set:

http://php.net/manual/en/function.htmlentities.php

Character Encoding depending on the code points

A single 8-bit byte can hold a maximum of 256 values (0-255), so it cannot hold the majority of Unicode codepoints as-is (over 1 million).

UTFs (Unicode Transformation Formats) are standardized encodings designed to represent Unicode codepoints as encoded codeunits, which can then be expressed in byte format. The number expressed in a UTF's name represents the # of bits used to encode each codeunit:

  • UTF-8 uses 8-bit codeunits
  • UTF-16 uses 16-bit codeunits
  • UTF-32 uses 32-bit codeunits
  • and so on (there are other UTFs available, but these 3 are the main ones used).

Most UTFs are variable length (UTF-32 is not), requiring 1 or more codeunits to encode a given codepoint:

  • In UTF-8, codepoints in the ASCII range (U+0000 - U+007F) use 1 codeunit, higher codepoints use 2-4 codeunits depending on codepoint value.

  • In UTF-16, codepoints in the BMP (U+0000 - U+FFFF) use 1 codeunit, higher codepoints use 2 codeunits (known as a "surrogate pair").

  • In UTF-32, all codepoints use 1 32-bit codeunit.

So, for example, using the codepoints you mentioned, they would be encoded as follows:


U+0061 LATIN SMALL LETTER A

UTF    | Codeunits  | Bytes
-----------------------------------------
UTF-8  | x61        | x61
-----------------------------------------
UTF-16 | x0061      | x61 x00 (LE)
       |            | x00 x61 (BE)
-----------------------------------------
UTF-32 | x00000061  | x61 x00 x00 x00 (LE)
       |            | x00 x00 x00 x61 (BE)

U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX

UTF    | Codeunits  | Bytes
-----------------------------------------
UTF-8  | xC3 xA2    | xC3 xA2
-----------------------------------------
UTF-16 | x00E2      | xE2 x00 (LE)
       |            | x00 xE2 (BE)
-----------------------------------------
UTF-32 | x000000E2  | xE2 x00 x00 x00 (LE)
       |            | x00 x00 x00 xE2 (BE)

U+0408 CYRILLIC CAPITAL LETTER JE

UTF    | Codeunits  | Bytes
-----------------------------------------
UTF-8  | xD0 x88    | xD0 x88
-----------------------------------------
UTF-16 | x0408      | x08 x04 (LE)
       |            | x04 x08 (BE)
-----------------------------------------
UTF-32 | x00000408  | x08 x04 x00 x00 (LE)
       |            | x00 x00 x04 x08 (BE)

And just for good measure, here are a couple of other examples:


U+20AC EURO SIGN

UTF    | Codeunits    | Bytes
-------------------------------------------
UTF-8  | xE2 x82 xAC  | xE2 x82 xAC
-------------------------------------------
UTF-16 | x20AC        | xAC x20 (LE)
       |              | x20 xAC (BE)
-------------------------------------------
UTF-32 | x000020AC    | xAC x20 x00 x00 (LE)
       |              | x00 x00 x20 xAC (BE)

U+1F601 GRINNING FACE WITH SMILING EYES

UTF    | Codeunits        | Bytes
-----------------------------------------------
UTF-8  | xF0 x9F x98 x81  | xF0 x9F x98 x81
-----------------------------------------------
UTF-16 | xD83D xDE01      | x3D xD8 x01 xDE (LE)
       |                  | xD8 x3D xDE x01 (BE)
-----------------------------------------------
UTF-32 | x0001F601        | x01 xF6 x01 x00 (LE)
       |                  | x00 x01 xF6 x01 (BE)

As you can see, UTF-8 is not always the most efficient, in terms of byte size. It is good for Latin-based languages, but not so good for Asian languages, symbols, emoji, etc. On the other hand, it doesn't suffer from endian issues, like UTF-16 and UTF-32 do, so it is nice for data storage and communications. For most common uses of Unicode, UTF-8 is decent enough, though UTF-16 is better in some cases. UTF-16 is easier to work with than UTF-8 (UTF-32 is best) when processing Unicode data in memory, as there is less variation to deal with.
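
If you want to cross-check tables like these programmatically, here is a short Go sketch (it relies only on the standard unicode/utf16 and encoding/binary packages; the chosen code points simply mirror the tables above). It prints the code units and their little-endian bytes:

package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

func main() {
	// A few of the code points from the tables above.
	for _, r := range []rune{0x0061, 0x20AC, 0x1F601} {
		// UTF-8: Go strings are UTF-8, so the conversion is direct.
		fmt.Printf("%U  UTF-8 bytes:  % X\n", r, []byte(string(r)))

		// UTF-16: one code unit, or a surrogate pair above U+FFFF,
		// serialised here as little-endian bytes.
		units := utf16.Encode([]rune{r})
		le16 := make([]byte, 2*len(units))
		for i, u := range units {
			binary.LittleEndian.PutUint16(le16[2*i:], u)
		}
		fmt.Printf("%U  UTF-16 units: %X  LE bytes: % X\n", r, units, le16)

		// UTF-32: the code point itself in one 32-bit unit.
		le32 := make([]byte, 4)
		binary.LittleEndian.PutUint32(le32, uint32(r))
		fmt.Printf("%U  UTF-32 LE bytes: % X\n", r, le32)
	}
}

Big-endian output works the same way with binary.BigEndian.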

Manually converting unicode codepoints into UTF-8 and UTF-16

Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far for the rules to encode UCS codepoints to UTF-8 are from the utf-8(7) manpage on many Linux systems:

Encoding
       The following byte sequences are used to represent a
       character.  The sequence to be used depends on the UCS code
       number of the character:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       [... removed obsolete five and six byte forms ...]

       The xxx bit positions are filled with the bits of the
       character code number in binary representation.  Only the
       shortest possible multibyte sequence which can represent the
       code number of the character can be used.

       The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
       as 0xfffe and 0xffff (UCS noncharacters) should not appear in
       conforming UTF-8 streams.

It might be easier to remember a 'compressed' version of the chart:

Initial bytes of mangled codepoints start with a 1, and add padding 1+0. Subsequent bytes start with 10.

0x80      5 bits, one byte
0x800     4 bits, two bytes
0x10000   3 bits, three bytes

You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:

2**(5+1*6) == 2048    == 0x800
2**(4+2*6) == 65536   == 0x10000
2**(3+3*6) == 2097152 == 0x200000

I know I could remember the rules to derive the chart easier than the chart itself. Here's hoping you're good at remembering rules too. :)
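
If you would rather let the machine do that arithmetic, the same derivation fits in one line of Go (a throwaway sketch, nothing library-specific):

package main

import "fmt"

func main() {
	// 2**(payload bits in the lead byte + 6 bits per trailing byte)
	fmt.Println(1<<(5+1*6), 1<<(4+2*6), 1<<(3+3*6)) // 2048 65536 2097152
}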

Update

Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:

U+4E3E

This fits in the 0x00000800 - 0x0000FFFF range (0x0800 <= 0x4E3E <= 0xFFFF), so the representation will be of the form:

   1110xxxx 10xxxxxx 10xxxxxx

0x4E3E is 100111000111110b. Drop the bits into the x above (start from the right, we'll fill in missing bits at the start with 0):

   1110x100 10111000 10111110

There is an x spot left over at the start, fill it in with 0:

   11100100 10111000 10111110

Convert from bits to hex:

   0xE4 0xB8 0xBE
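
If you want to double-check a hand conversion like this, Go will produce the UTF-8 bytes directly, since Go strings are UTF-8 (a quick sketch of that check):

package main

import "fmt"

func main() {
	// U+4E3E encoded as UTF-8 bytes; expect E4 B8 BE.
	fmt.Printf("% X\n", []byte(string(rune(0x4E3E))))
}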

How can I convert all UTF-8 Unicode characters in a string to their corresponding code points using bash/shell/zsh?

Using Perl; this works in any shell as long as the arguments are encoded in UTF-8:

$ perl -CA -E 'for my $arg (@ARGV) { say map { my $cp = ord; $cp > 127 ? sprintf "U+%04X", $cp : $_ } split //, $arg }' "My📔"
MyU+1F4D4

Non-ASCII codepoints are printed as U+XXXX (0-padded, more hex digits if needed), ASCII ones as human-readable letters.


Or for maximum speed, a C program that does the same:

// Compile with: gcc -o print_unicode -std=c11 -O -Wall -Wextra print_unicode.c
#include <assert.h>
#include <inttypes.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>

#if __STDC_VERSION__ < 201112L
#error "Need C11 or newer"
#endif
#ifndef __STDC_UTF_32__
#error "Not using Unicode"
#endif

int main(int argc, char **argv) {
    // Arguments should be encoded according to the locale's character set.
    setlocale(LC_CTYPE, "");

    for (int i = 1; i < argc; i++) {
        char *s = argv[i];
        size_t len = strlen(argv[i]);
        mbstate_t state;
        memset(&state, 0, sizeof state);

        // Decode the multibyte (UTF-8) argument one codepoint at a time.
        while (len > 0) {
            char32_t c;
            size_t rc = mbrtoc32(&c, s, len, &state);
            assert(rc != (size_t)-3);   // no multi-char32_t sequences expected
            if (rc == (size_t)-1) {
                perror("mbrtoc32");
                return EXIT_FAILURE;
            } else if (rc == (size_t)-2) {
                fprintf(stderr, "Argument %d is incomplete!\n", i);
                return EXIT_FAILURE;
            } else if (rc > 0) {
                if (c > 127) {
                    printf("U+%04" PRIXLEAST32, c);   // non-ASCII: print the codepoint
                } else {
                    putchar((char)c);                 // ASCII: print as-is
                }
                s += rc;
                len -= rc;
            }
        }
        putchar('\n');
    }
    return 0;
}
$ ./print_unicode "My📔"
MyU+1F4D4

