The Ultimate Emoji Encoding Scheme

MySQL's utf8 charset is not actually UTF-8; it is a subset of UTF-8 that only supports the Basic Multilingual Plane (characters up to U+FFFF, at most 3 bytes each). Most emoji use code points above U+FFFF. MySQL's utf8mb4 is actual UTF-8, which can encode all of those code points. Outside of MySQL there is no such thing as "utf8mb4"; there is just UTF-8. So:

Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?

Again, there is no such thing as "utf8mb4" outside MySQL. HTTP POST requests can carry any raw bytes; if your client sends UTF-8 encoded data, you're fine.

If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?

Yes.

Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols?

God no, use raw UTF-8 (utf8mb4) for all that is holy.

When I retrieve these symbols in PHP I first need to execute SET CHARACTER SET utf8

Well, there's your problem; channeling your data through MySQL's utf8 charset will discard any characters above U+FFFF. Use utf8mb4 all the way through MySQL.
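To see which characters in a string would be affected, one can look for code points above U+FFFF. The following is only a minimal Python sketch (not part of the asker's PHP/MySQL stack), and the helper name `chars_outside_bmp` is made up for illustration:

```python
def chars_outside_bmp(text: str) -> list[str]:
    """Return the characters whose code point is above U+FFFF.

    These need 4 bytes in UTF-8, so in MySQL they only fit in utf8mb4.
    """
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "hi \U0001F600 \u00e9 \u55e8"  # emoji, e-acute, a CJK character
for ch in chars_outside_bmp(sample):
    # Only the emoji is reported; the other characters are within the BMP.
    print(f"U+{ord(ch):04X} takes {len(ch.encode('utf-8'))} bytes in UTF-8")
```

Anything this reports needs utf8mb4 on the column and on the connection.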

if I get them in utf8mb4 the json_decode function doesn't work

You'll have to specify what that means exactly. PHP's JSON functions should be able to handle any Unicode code point just fine, as long as it's valid UTF-8:

echo json_encode('😀');
"\ud83d\ude00"

echo json_decode('"\ud83d\ude00"');
😀
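The same round trip can be reproduced outside PHP, which helps to rule out the JSON layer when debugging. As an illustration only, here is the equivalent in Python using the standard `json` module:

```python
import json

emoji = "\U0001F600"  # the grinning face used in the PHP example

# With the default ensure_ascii=True, non-ASCII characters are written as
# \uXXXX escapes; code points above U+FFFF become a surrogate pair.
encoded = json.dumps(emoji)
print(encoded)  # "\ud83d\ude00"

# Decoding joins the surrogate pair back into a single code point.
decoded = json.loads(encoded)
print(decoded == emoji, len(decoded))
```

If this round trip works but the database one does not, the data is most likely being damaged on the MySQL connection rather than in the JSON functions.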

Best and clean way to Encode Emojis (Python) from text file

So, I'll assume that you somehow get a raw ASCII string that contains escape sequences with UTF-16 code units that form surrogate pairs, and that you (for whatever reason) want to convert it to the \UXXXXXXXX format.

So, henceforth I assume that your input (bytes!) looks like this:

weirdInput = "hello \\ud83d\\ude04".encode("latin_1")

Now you want to do the following:

  1. Interpret the bytes so that the \uXXXX thingies are transformed into UTF-16 code units. There is the raw_unicode_escape codec, but unfortunately it needs a separate pass to fix the surrogate pairs (I don't know why, to be honest).
  2. Fix the surrogate pairs, transform the data into valid UTF-16
  3. Decode as valid UTF-16
  4. Again, encode as "raw_unicode_escape"
  5. Decode back as good old latin_1, consisting only of good old ASCII with unicode escape sequences in format \UXXXXXXXX.

Something like this:

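output = (weirdInput
    .decode("raw_unicode_escape")        # \uXXXX escapes -> lone surrogates
    .encode('utf-16', 'surrogatepass')   # write the surrogates out as UTF-16 bytes
    .decode('utf-16')                    # read each pair back as one code point
    .encode("raw_unicode_escape")        # re-escape as \UXXXXXXXX
    .decode("latin_1")                   # bytes -> ASCII-only str
)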

Now if you print(output), you get:

hello \U0001f604

Note that if you stop at an intermediate stage:

smiley = (weirdInput
.decode("raw_unicode_escape")
.encode('utf-16', 'surrogatepass')
.decode('utf-16')
)

then you get a unicode-string with smileys:

print(smiley)
# hello 😄

Full code:

weirdInput = "hello \\ud83d\\ude04".encode("latin_1")

output = (weirdInput
.decode("raw_unicode_escape")
.encode('utf-16', 'surrogatepass')
.decode('utf-16')
.encode("raw_unicode_escape")
.decode("latin_1")
)

smiley = (weirdInput
.decode("raw_unicode_escape")
.encode('utf-16', 'surrogatepass')
.decode('utf-16')
)

print(output)
# hello \U0001f604

print(smiley)
# hello 😄

Is there a list of Unicode encoding range for the Emoji characters?

Is this (at archive.org) what you are looking for?

Emoji character (😀) is not working with utf8mb4_bin in MySQL version 5.6

The connection needs to specify utf8mb4 to MySQL. What is under the covers in DOMDocument?

😀 is hex F09F9880, which should work in a column of CHARACTER SET utf8mb4 with any utf8mb4_* collation.

And here is another link: Trouble with UTF-8 characters; what I see is not what I stored

If all else fails, execute SET NAMES utf8mb4 from PHP after establishing a connection.
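When checking what actually landed in the table, it can help to compare the output of MySQL's HEX() on the column with the UTF-8 bytes you expect. A small Python sketch for computing the expected value (the helper name `utf8_hex` is hypothetical):

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as uppercase hex, in the style of MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()


print(utf8_hex("\U0001F600"))  # F09F9880
```

If HEX() on the stored value shows something other than F09F9880 for this emoji, the bytes were changed somewhere between the client and the column.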

How to store Emoji Character in MySQL Database

Step 1, change your database's default charset:

ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

If the database has not been created yet, create it with the correct encoding:

CREATE DATABASE database_name DEFAULT CHARSET = utf8mb4 DEFAULT COLLATE = utf8mb4_unicode_ci;

Step 2, set charset when creating table:

CREATE TABLE IF NOT EXISTS table_name (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_unicode_ci;

Or alter an existing table:

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE table_name MODIFY field_name TEXT CHARSET utf8mb4;

Splitting an emoji sequence in PowerShell

A string is just one element. You want to change it to a character array.

foreach ($i in 'hithere') { $i }
hithere

foreach ($i in [char[]]'hithere') { $i }
h
i
t
h
e
r
e

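Hmm, this doesn't work well for emoji. Their code points are above U+FFFF (U+1F600 and so on), so each one is stored as two 16-bit chars:

foreach ($i in [char[]]'😀😁😂😃😄😅😆') { $i }
� # 16 bit surrogate pairs?
(13 more lines of �, one for each remaining UTF-16 code unit)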
Hmm, OK, add every pair. Here's another way to do it, using the formula from https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates (or just use ConvertToUTF32($emoji, 0)):

$emojis = '😀😁😂😃😄😅😆'
for ($i = 0; $i -lt $emojis.length; $i += 2) {
    [System.Char]::IsHighSurrogate($emojis[$i])
    0x10000 + ($emojis[$i] - 0xD800) * 0x400 + $emojis[$i+1] - 0xDC00 | % tostring x
    # [system.char]::ConvertToUtf32($emojis,$i) | % tostring x # or
    $emojis[$i] + $emojis[$i+1]
}

True
1f600
😀
True
1f601
😁
True
1f602
😂
True
1f603
😃
True
1f604
😄
True
1f605
😅
True
1f606
😆
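The surrogate arithmetic in that loop is not PowerShell-specific. As a sketch, here is the same formula and its inverse in Python; the function names are made up for illustration:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


def split_code_point(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point has no surrogate pair")
    offset = code_point - 0x10000
    # The top 10 bits go into the high surrogate, the low 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(hex(combine_surrogates(0xD83D, 0xDE00)))       # 0x1f600
print([hex(u) for u in split_code_point(0x1F604)])   # ['0xd83d', '0xde04']
```

The range checks correspond to the IsHighSurrogate() test above: a pair is only combined when the first unit is in D800..DBFF and the second in DC00..DFFF.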

Note that unicode in the Unicode.GetBytes() method call refers to utf16le encoding.

Chinese works.

[char[]]'嗨,您好'
嗨
,
您
好

Here it is using UTF-32 encoding. Every character is 4 bytes long, so convert each group of 4 bytes into an int32 and print it as hex.

$emoji = '😀😁😂😃😄😅😆'
$utf32 = [System.Text.Encoding]::utf32.GetBytes($emoji)

for ($i = 0; $i -lt $utf32.count; $i += 4) {
    $int32 = [bitconverter]::ToInt32($utf32[$i..($i+3)],0)
    $int32 | % tostring x
}

1f600
1f601
1f602
1f603
1f604
1f605
1f606
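For comparison, the same 4-bytes-per-character idea can be sketched in Python with the standard `struct` module. This is only an illustration of the technique, not a translation of any particular script:

```python
import struct

# Build the seven emoji U+1F600..U+1F606 from their code points.
emojis = "".join(chr(cp) for cp in range(0x1F600, 0x1F607))

# "utf-32-le" writes 4 little-endian bytes per code point and no BOM.
raw = emojis.encode("utf-32-le")

# Unpack every 4-byte group as an unsigned 32-bit integer.
code_points = [value for (value,) in struct.iter_unpack("<I", raw)]
print([format(cp, "x") for cp in code_points])

# A Python 3 str already iterates by code point, so ord() gives the same numbers.
print(code_points == [ord(ch) for ch in emojis])
```

The byte-level route is mostly useful when the data arrives as raw UTF-32; for a string already in memory, `ord()` on each character is simpler.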

Or going the other way, from int32 to string. Simply casting the int32 to [char] does not work (you have to add pairs of [char]s). Script reference: https://www.powershellgallery.com/packages/Emojis/0.1/Content/Emojis.psm1

for ($i = 0x1f600; $i -le 0x1f606; $i++ ) { [System.Char]::ConvertFromUtf32($i) }

😀
😁
😂
😃
😄
😅
😆

See also How to encode 32-bit Unicode characters in a PowerShell string literal?

EDIT:

PowerShell 7 has a nice EnumerateRunes() method:

$emojis = '😀😁😂😃😄😅😆'
$emojis.enumeraterunes() | % value | % tostring x

1f600
1f601
1f602
1f603
1f604
1f605
1f606

How to convert a string of utf-8 bytes into a unicode emoji in python

Yes, I encountered the same problem when trying to decode a Facebook message dump. Here's how I solved it:

string = "\u00f0\u009f\u0098\u0086".encode("latin-1").decode("utf-8")
# '😆'

Here's why:

  1. This emoji takes 4 bytes to encode in UTF-8 (F0 9F 98 86, check at the bottom of this page)
  2. Facebook could have used UTF-8 for the JSON file, but they chose printable ASCII only. So each of those 4 bytes is written as its own escape: \u00F0\u009F\u0098\u0086
  3. encode("latin-1") is a convenient way to convert those escapes back to the raw bytes, because Latin-1 maps code points U+0000 to U+00FF one-to-one onto byte values.
  4. decode("utf-8") converts the raw bytes into a Unicode character.
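The steps above can be wrapped in a small helper that leaves already-correct text alone. This is a hedged sketch; `fix_mojibake` is a name invented here, and it only handles this one Latin-1-as-UTF-8 pattern:

```python
def fix_mojibake(text: str) -> str:
    """Re-interpret Latin-1 code points as UTF-8 bytes.

    Returns the input unchanged when it is not this kind of mojibake.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # encode fails if a character is above U+00FF (already real Unicode);
        # decode fails if the resulting bytes are not valid UTF-8.
        return text


print(fix_mojibake("\u00f0\u009f\u0098\u0086") == "\U0001F606")  # True
print(fix_mojibake("caf\u00e9") == "caf\u00e9")                  # True
```

The fallback matters for dumps that mix damaged and undamaged strings: a correctly decoded "café" is not valid UTF-8 when taken as Latin-1 bytes, so it passes through untouched.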

