The ultimate emoji encoding scheme
MySQL's utf8 charset is not actually UTF-8; it is a subset of UTF-8 that supports only the Basic Multilingual Plane (characters up to U+FFFF). Most emoji use code points higher than U+FFFF. MySQL's utf8mb4 is actual UTF-8, which can encode all those code points. Outside of MySQL there's no such thing as "utf8mb4", there's just UTF-8. So:
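To make the distinction concrete, here is a minimal Python sketch that lists the characters in a string whose code points lie above U+FFFF, i.e. the ones MySQL's utf8 charset cannot store. The helper name `outside_bmp` is hypothetical and not part of any library.

```python
def outside_bmp(text: str) -> list:
    """Return the characters of `text` whose code point is above U+FFFF."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


sample = "caf\u00e9 \U0001F600"  # "café 😀"

for ch in sample:
    # Code points above U+FFFF always need 4 bytes in UTF-8.
    print(f"U+{ord(ch):04X} -> {len(ch.encode('utf-8'))} byte(s) in UTF-8")

print(outside_bmp(sample))  # ['😀']
```

A check like this could be run on incoming text to see whether a column still using MySQL's utf8 would lose data.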
Does POST allow utf8mb4, or should I convert the data in the client to plain utf8?
Again, no such thing as "utf8mb4". HTTP POST requests support any raw bytes; if your client sends UTF-8 encoded data you're fine.
If my DB has collation and character set utf8mb4, does it mean I should be able to store 'raw' emojis?
Yes.
Should I try to work in the DB with utf8mb4 or is it safer/better/more supported to work in utf8 and encode symbols?
God no, use raw UTF-8 (utf8mb4) for all that is holy.
When I retrieve these symbols in PHP I first need to execute SET CHARACTER SET utf8
Well, there's your problem; channeling your data through MySQL's utf8 charset will discard any characters above U+FFFF. Use utf8mb4 all the way through MySQL.
if I get them in utf8mb4 the json_decode function doesn't work
You'll have to specify what that means exactly. PHP's JSON functions should be able to handle any Unicode code point just fine, as long as it's valid UTF-8:
echo json_encode('😀');
"\ud83d\ude00"
echo json_decode('"\ud83d\ude00"');
😀
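For comparison, Python's standard json module does the same round trip (a small sketch, not part of the original answer). By default the encoder escapes non-ASCII characters and writes characters above U+FFFF as a UTF-16 surrogate pair; the decoder joins the pair back into a single code point.

```python
import json

grinning = "\U0001F600"  # 😀

# Default: non-ASCII is escaped, so the emoji becomes a surrogate pair.
escaped = json.dumps(grinning)
print(escaped)  # "\ud83d\ude00"

# ensure_ascii=False keeps the raw character in the JSON text instead.
raw = json.dumps(grinning, ensure_ascii=False)

# Both forms decode back to the same one-character string.
assert json.loads(escaped) == json.loads(raw) == grinning
```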
Best and clean way to Encode Emojis (Python) from text file
So, I'll assume that you somehow get a raw ASCII string that contains escape sequences with UTF-16 code units that form surrogate pairs, and that you (for whatever reason) want to convert it to \UXXXXXXXX format.
So, henceforth I assume that your input (bytes!) looks like this:
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")
Now you want to do the following:
- Interpret the bytes in a way that \uXXXX thingies are transformed into UTF-16 code units. There is raw_unicode_escape, but unfortunately it needs a separate pass to fix the surrogate pairs (I don't know why, to be honest)
- Fix the surrogate pairs, transform the data into valid UTF-16
- Decode as valid UTF-16
- Again, encode as "raw_unicode_escape"
- Decode back as good old latin_1, consisting only of good old ASCII with unicode escape sequences in format \UXXXXXXXX.
Something like this:
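output = (weirdInput
    .decode("raw_unicode_escape")       # \uXXXX escapes -> lone surrogate code points
    .encode('utf-16', 'surrogatepass')  # write the surrogates out as UTF-16 code units
    .decode('utf-16')                   # surrogate pairs are joined into real code points
    .encode("raw_unicode_escape")       # characters above U+FFFF -> \UXXXXXXXX
    .decode("latin_1")                  # back to a str containing only ASCII
)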
Now if you print(output), you get:
hello \U0001f604
Note that if you stop at an intermediate stage:
smiley = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
)
then you get a unicode-string with smileys:
print(smiley)
# hello 😄
Full code:
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")
output = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
    .encode("raw_unicode_escape")
    .decode("latin_1")
)
smiley = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
)
print(output)
# hello \U0001f604
print(smiley)
# hello 😄
Is there a list of Unicode encoding range for the Emoji characters?
Is this (at archive.org) what you are looking for?
Emoji character (😀) is not working with utf8mb4_bin in MySQL version 5.6
The connection needs to specify utf8mb4 to MySQL. What is under the covers in DOMDocument?
😀 is hex F09F9880, which should work in a column of CHARACTER SET utf8mb4 with any utf8mb4_* collation.
And here is another link: Trouble with UTF-8 characters; what I see is not what I stored
If all else fails, execute SET NAMES utf8mb4 from PHP after establishing a connection.
How to store Emoji Character in MySQL Database
Step 1, change your database's default charset:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
If the database is not created yet, create it with the correct encoding:
CREATE DATABASE database_name DEFAULT CHARSET = utf8mb4 DEFAULT COLLATE = utf8mb4_unicode_ci;
Step 2, set charset when creating table:
CREATE TABLE IF NOT EXISTS table_name (
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE utf8mb4_unicode_ci;
Or alter an existing table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE table_name MODIFY field_name TEXT CHARSET utf8mb4;
Splitting an emoji sequence in PowerShell
A string is just one element. You want to change it to a character array.
foreach ($i in 'hithere') { $i }
hithere
foreach ($i in [char[]]'hithere') { $i }
h
i
t
h
e
r
e
Hmm this doesn't work well. These code points are pretty high, U+1F600 (32-bit), etc
foreach ($i in [char[]]'😀😁😂😃😄😅😆') { $i }
� # 16 bit surrogate pairs?
�
�
�
�
�
�
�
�
�
�
�
�
�
Hmm ok, add every pair. Here's another way to do it using https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates (or just use ConvertToUTF32($emoji, 0) )
$emojis = '😀😁😂😃😄😅😆'
for ($i = 0; $i -lt $emojis.length; $i += 2) {
    [System.Char]::IsHighSurrogate($emojis[$i])
    0x10000 + ($emojis[$i] - 0xD800) * 0x400 + $emojis[$i+1] - 0xDC00 | % tostring x
    # [system.char]::ConvertToUtf32($emojis,$i) | % tostring x # or
    $emojis[$i] + $emojis[$i+1]
}
True
1f600
😀
True
1f601
😁
True
1f602
😂
True
1f603
😃
True
1f604
😄
True
1f605
😅
True
1f606
😆
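The surrogate arithmetic used above can also be sketched in Python. This is an illustration under assumptions, not the answer's own code: the function name `code_points_from_utf16` is hypothetical, and the string is encoded to UTF-16 first only to obtain the 16-bit code units that PowerShell exposes as [char] values.

```python
import struct


def code_points_from_utf16(text: str) -> list:
    """Rebuild code points by hand from the UTF-16 code units of `text`."""
    data = text.encode("utf-16-le")
    units = struct.unpack(f"<{len(data) // 2}H", data)
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            # Same formula as the PowerShell loop above.
            result.append(0x10000 + (unit - 0xD800) * 0x400 + (low - 0xDC00))
            i += 2
        else:
            result.append(unit)  # BMP character: one unit is one code point
            i += 1
    return result


emojis = "".join(chr(cp) for cp in range(0x1F600, 0x1F607))
print([format(cp, "x") for cp in code_points_from_utf16(emojis)])
# ['1f600', '1f601', '1f602', '1f603', '1f604', '1f605', '1f606']
```

In everyday Python none of this is needed, because `ord()` on a character already returns the full code point; the sketch only spells out what the formula does.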
Note that unicode in the Unicode.GetBytes() method call refers to utf16le encoding.
Chinese works.
[char[]]'嗨,您好'
嗨
,
您
好
Here it is using utf32 encoding. All characters are 4 bytes long. Converting every 4 bytes into an int32 and printing them as hex.
$emoji = '😀😁😂😃😄😅😆'
$utf32 = [System.Text.Encoding]::utf32.GetBytes($emoji)
for($i = 0; $i -lt $utf32.count; $i += 4) {
    $int32 = [bitconverter]::ToInt32($utf32[$i..($i+3)],0)
    $int32 | % tostring x
}
1f600
1f601
1f602
1f603
1f604
1f605
1f606
Or going the other way from int32 to string. Simply casting the int32 to [char] does not work (have to add pairs of [char]'s). Script reference: https://www.powershellgallery.com/packages/Emojis/0.1/Content/Emojis.psm1
for ($i = 0x1f600; $i -le 0x1f606; $i++ ) { [System.Char]::ConvertFromUtf32($i) }
😀
😁
😂
😃
😄
😅
😆
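Why a single [char] cannot hold these values: a code point above U+FFFF has to be split into a high and a low surrogate. A minimal Python sketch of that split follows. The function name `to_surrogate_pair` is hypothetical, and the arithmetic is the standard UTF-16 rule, which is presumably what ConvertFromUtf32 applies internally.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20 significant bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


for cp in range(0x1F600, 0x1F607):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:X} -> {high:04x} {low:04x}")
```

For U+1F600 this yields d83d de00, the same pair that appears in the JSON escape "\ud83d\ude00".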
See also How to encode 32-bit Unicode characters in a PowerShell string literal?
EDIT:
Powershell 7 has a nice enumeraterunes() method:
$emojis = '😀😁😂😃😄😅😆'
$emojis.enumeraterunes() | % value | % tostring x
1f600
1f601
1f602
1f603
1f604
1f605
1f606
How to convert a string of utf-8 bytes into a unicode emoji in python
Yes, I encountered the same problem when trying to decode a Facebook message dump. Here's how I solved it:
string = "\u00f0\u009f\u0098\u0086".encode("latin-1").decode("utf-8")
# '😆'
Here's why:
- This emoji takes 4 bytes to encode in UTF-8 (F0 9F 98 86, check at the bottom of this page)
- Facebook could have used UTF-8 for the JSON file but they instead chose printable ASCII only. So it encodes those 4 bytes as \u00F0\u009F\u0098\u0086
- encode("latin-1") was a convenient way to convert these encodings back to the raw bytes.
- decode("utf-8") converts the raw bytes into a Unicode character.
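The same steps, taken one at a time so the intermediate values are visible (a sketch; the variable names are only for illustration):

```python
# Four separate characters, one per original UTF-8 byte.
mangled = "\u00f0\u009f\u0098\u0086"

# Latin-1 maps each character U+00XX to the single byte 0xXX.
raw_bytes = mangled.encode("latin-1")
print(raw_bytes.hex(" "))  # f0 9f 98 86

# Read as UTF-8, those four bytes form one code point.
fixed = raw_bytes.decode("utf-8")
print(len(mangled), len(fixed))  # 4 1
print(f"U+{ord(fixed):X}")  # U+1F606
```

If the text contains characters above U+00FF, encode("latin-1") raises UnicodeEncodeError, which is a useful signal that the string was not mangled in this particular way.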