UTF-8 Safe Equivalent of ord() or charCodeAt() in PHP

ord() works byte by byte (like most, if not all, of PHP's standard string functions). You would need to do the conversion yourself, for example with the help of the multibyte string extension:

$utf8Character = 'Ą';
list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
echo $ord; # 260
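
On PHP 7.2 and later, the mbstring extension also provides mb_ord(), which returns the code point directly. A minimal sketch comparing the two routes, assuming mbstring is loaded and the input is valid UTF-8 (the variable names are only illustrative):

```php
<?php

$utf8Character = "\u{0104}"; // Ą

// Manual route: convert to UCS-4BE and read one 32-bit big-endian integer.
list(, $viaUnpack) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));

// Built-in route (PHP >= 7.2): code point of the first character of the string.
$viaMbOrd = mb_ord($utf8Character, 'UTF-8');

echo $viaUnpack, PHP_EOL; // 260
echo $viaMbOrd, PHP_EOL;  // 260
```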

JS charCodeAt equivalent in PHP (with full unicode and emoji compatibility)

The way that JS handles UTF-16 is not ideal; charCodeAt() picks out code units, including the individual surrogates for characters outside the Basic Multilingual Plane (many emoji, for example). If you want the real code point for each character, String.prototype.codePointAt() would be a better choice. That said, since your use case wasn't explained, this achieves what you were originally asking for without the need for JSON-related functions:

<?php

$original = 't↙️';
$converted = iconv('UTF-8', 'UTF-16LE', $original);

for ($i = 0; $i < iconv_strlen($converted, 'UTF-16LE'); $i++) {
    // One character: 2 bytes, or 4 bytes for a surrogate pair
    $character = iconv_substr($converted, $i, 1, 'UTF-16LE');
    $codeUnits = unpack('v*', $character);

    foreach ($codeUnits as $codeUnit) {
        echo $codeUnit . PHP_EOL;
    }
}

This converts the (assumed) UTF-8 string into UTF-16, then loops over each character. In UTF-16, each character is 2 or 4 bytes in size. unpack() with the repeating v* format returns one short in the former case, or two in the latter (v is the unsigned 16-bit little-endian format code).

It could also be implemented by looping over the UTF-8 and converting each character one-by-one; it doesn't make a great deal of difference though. Also the same could be achieved with the mb_* functions.
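
As a rough illustration of the mb_* route, here is a sketch of a charCodeAt()-style lookup. The function name charCodeAt and its null return for out-of-range indexes are assumptions made for this example (JS itself returns NaN there), and the input is assumed to be valid UTF-8:

```php
<?php

// Hypothetical helper: returns the UTF-16 code unit at $index, like JS charCodeAt().
// Returns null when $index is out of range.
function charCodeAt(string $utf8, int $index): ?int
{
    $utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
    // unpack() returns a 1-based array of unsigned 16-bit values.
    $codeUnits = unpack('v*', $utf16);

    return $codeUnits[$index + 1] ?? null;
}

echo charCodeAt("t\u{2199}", 0), PHP_EOL; // 116
echo charCodeAt("t\u{2199}", 1), PHP_EOL; // 8601
echo charCodeAt("\u{1F600}", 0), PHP_EOL; // 55357 (high surrogate)
echo charCodeAt("\u{1F600}", 1), PHP_EOL; // 56832 (low surrogate)
```

Converting the whole string on every call is wasteful for long inputs; when many lookups are needed, it would make more sense to convert once and index into the unpacked array.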


Edit

Since you've inquired about a quicker way of doing this, combining the above with the solution offered by nwellnhof gives better performance:

<?php

$original = 't↙️';
$converted = iconv('UTF-8', 'UTF-16LE', $original);

for ($i = 0; $i < strlen($converted); $i += 2) {
    // Low byte first, then the high byte shifted into place
    $codeUnit = ord($converted[$i]) + (ord($converted[$i+1]) << 8);
    echo $codeUnit . PHP_EOL;
}

First off, this converts the UTF-8 string into UTF-16LE. We're interested in writing out UTF-16 code units (as per the behaviour of charCodeAt()), and each of these is 16 bits wide, so the loop simply jumps 2 bytes at a time. On each iteration, it takes the numeric value of the byte at that position and adds it to the value of the next byte, left shifted by 8. The second byte is the one shifted because we're dealing with little-endian UTF-16, where the low-order byte comes first.

By way of example, consider the character BENGALI DIGIT ONE (১). This is represented by a single UTF-16 code unit, 2535 (0x09E7). It is easier to first describe how this is encoded as UTF-16BE. The single code unit for this character consumes 16 bits:

0000100111100111 (2535)

In PHP, strings are effectively byte arrays. So, PHP sees this as:

$converted[0] = 00001001 (9)
$converted[1] = 11100111 (231)

Given the 2 above bytes, how do we obtain the code unit? What we really want to do is something like:

  0000100100000000 (2304)
+ 0000000011100111 (231)
= 0000100111100111 (2535)

But we can't do that, since we only have single bytes to play with. One way to deal with this is to use integers instead, giving us a full 64 bits (8 bytes) on a 64-bit build of PHP. We want to represent the code unit in integer form anyway, so that seems like a reasonable route. We can obtain the numeric value of each byte via ord():

ord($converted[0]) == 0000000000000000000000000000000000000000000000000000000000001001 == 9
ord($converted[1]) == 0000000000000000000000000000000000000000000000000000000011100111 == 231

And left shift the first value by 8:

   0000000000000000000000000000000000000000000000000000000000001001 (9)
<< 0000000000000000000000000000000000000000000000000000000000001000 (8)
=  0000000000000000000000000000000000000000000000000000100100000000 (2304)

And then sum together, as before:

  0000000000000000000000000000000000000000000000000000100100000000 (2304)
+ 0000000000000000000000000000000000000000000000000000000011100111 (231)
= 0000000000000000000000000000000000000000000000000000100111100111 (2535)

So we now have the correct code unit value of 2535. The only difference with UTF-16LE is the order of the bytes is reversed. So instead of left shifting the first byte by 8, we need to left shift the second byte.
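
The walkthrough above can be checked with a few lines of PHP. This is only a sketch; the two byte strings are written out by hand rather than produced by iconv, and the variable names are illustrative:

```php
<?php

// BENGALI DIGIT ONE (U+09E7) as raw UTF-16 bytes in both byte orders.
$bigEndian    = "\x09\xE7";
$littleEndian = "\xE7\x09";

// Big endian: the first byte is the high byte, so it is the one shifted.
$fromBE = (ord($bigEndian[0]) << 8) + ord($bigEndian[1]);

// Little endian: the second byte is the high byte.
$fromLE = ord($littleEndian[0]) + (ord($littleEndian[1]) << 8);

echo $fromBE, PHP_EOL; // 2535
echo $fromLE, PHP_EOL; // 2535
```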

P.S.: An equivalent way of performing this step would be:

for ($i = 0; $i < strlen($converted); $i += 2) {
    // unpack() returns a 1-based array, so take element 1
    $codeUnit = unpack('v', $converted[$i] . $converted[$i+1])[1];
    echo $codeUnit . PHP_EOL;
}

The unpack function does exactly what was just described when the v formatter is supplied, which tells it to expect 16 bits arranged in little endian. Note that unpack() returns an array, hence the [1]. It may be worth benchmarking the two if you're interested in optimising for speed.
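
A rough way to compare the two approaches is sketched below. The function names are made up for this example, the unpack variant decodes the whole string in one call with v* rather than pair by pair, and the timings depend entirely on the machine and PHP version, so treat the numbers as indicative only:

```php
<?php

// Hypothetical helpers; both return the UTF-16 code units of a UTF-8 string.
function codeUnitsViaOrd(string $utf8): array
{
    $utf16 = iconv('UTF-8', 'UTF-16LE', $utf8);
    $units = [];
    for ($i = 0, $n = strlen($utf16); $i < $n; $i += 2) {
        $units[] = ord($utf16[$i]) + (ord($utf16[$i + 1]) << 8);
    }
    return $units;
}

function codeUnitsViaUnpack(string $utf8): array
{
    $utf16 = iconv('UTF-8', 'UTF-16LE', $utf8);
    // array_values() drops unpack()'s 1-based keys.
    return array_values(unpack('v*', $utf16));
}

$sample = str_repeat("t\u{2199}\u{1F600}", 10000);

foreach (['codeUnitsViaOrd', 'codeUnitsViaUnpack'] as $fn) {
    $start = microtime(true);
    $fn($sample);
    printf("%s: %.4f s\n", $fn, microtime(true) - $start);
}
```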

ord() doesn't work with utf-8

According to Wikipedia and FileFormat,

  • ISO-8859-1 doesn't have the Euro symbol at all
  • ISO-8859-15 has it at codepoint 164 (0xA4)
  • Windows-1252 has it at codepoint 128 (0x80)
  • Unicode has the Euro symbol at codepoint 8364 (0x20AC)
  • UTF-8 encodes that as 0xE2 0x82 0xAC. The first byte E2 is 226 in decimal.

So it seems your source file is encoded in UTF-8 (and ord() only returns the first byte), whereas your output is in Windows-1252.
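
The numbers in the list above can be reproduced with a short script. This is a sketch that assumes the mbstring extension is available (mb_ord() needs PHP 7.2 or later):

```php
<?php

$euro = "\u{20AC}"; // the Euro sign, stored as UTF-8

echo bin2hex($euro), PHP_EOL;         // e282ac
echo ord($euro), PHP_EOL;             // 226 (first byte only)
echo mb_ord($euro, 'UTF-8'), PHP_EOL; // 8364

// The same character in the single-byte encodings listed above.
$cp1252 = mb_convert_encoding($euro, 'Windows-1252', 'UTF-8');
$latin9 = mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8');

echo ord($cp1252), PHP_EOL; // 128
echo ord($latin9), PHP_EOL; // 164
```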

Searching for a good Unicode-compatible alternative to the PHP ord() function

You might also be able to implement this function using iconv(), but the mb_convert_encoding method you've got looks reasonable to me. Just make sure that $utf8Character is a single character, not a long string, and it'll perform reasonably well.

UCS-4BE is a Unicode encoding which stores each character as a 32-bit (4 byte) integer. This accounts for the "UCS-4"; the "BE" suffix indicates that the integers are stored in big-endian order. The reason for using this encoding is that, unlike the variable-width encodings (UTF-8 and UTF-16), it needs neither multi-byte sequences nor surrogate pairs: each character has a fixed size.
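
To make the fixed-width point concrete, the following sketch dumps the bytes of one character outside the Basic Multilingual Plane in three encodings, assuming mbstring is loaded (the variable names are illustrative):

```php
<?php

$char = "\u{1F600}"; // an emoji outside the Basic Multilingual Plane

$utf8  = bin2hex($char);
$utf16 = bin2hex(mb_convert_encoding($char, 'UTF-16BE', 'UTF-8'));
$ucs4  = bin2hex(mb_convert_encoding($char, 'UCS-4BE', 'UTF-8'));

echo $utf8, PHP_EOL;  // f09f9880 (four-byte UTF-8 sequence)
echo $utf16, PHP_EOL; // d83dde00 (surrogate pair)
echo $ucs4, PHP_EOL;  // 0001f600 (the code point itself)

// A plain ASCII letter still takes four bytes in UCS-4BE.
$ucs4Ascii = bin2hex(mb_convert_encoding('A', 'UCS-4BE', 'UTF-8'));
echo $ucs4Ascii, PHP_EOL; // 00000041
```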

Javascript to PHP domain.charCodeAt(i)

This should be a Unicode safe version.

$domain = "example.com";
$sum1 = 0;
$sum2 = 0;

// this will convert $domain to a UTF-16 string,
// without specifying the third parameter, PHP will
// assume the string uses PHP's internal encoding,
// you might want to explicitly set the `from_encoding`
$domain = mb_convert_encoding($domain, 'UTF-16');

$length = mb_strlen($domain, 'UTF-16');
// walk the string from the last character to the first
for ($i = $length - 1; $i >= 0; $i--) {
    $char = mb_substr($domain, $i, 1, 'UTF-16');
    $sum1 += hexdec(bin2hex($char)) * 13748600747;
    $sum2 += hexdec(bin2hex($char)) * 40216416130;
}
$newsum = "$" . strval($sum1);
$sum2 = strval($sum2);
$x = substr($newsum,0,8) . substr($sum2,0,8);

echo $x;

The conversion to decimal is based on the code in this comment on the ord documentation.

Convert string to its corresponding entity

You could use IntlChar::ord() (part of the intl extension) to find the code point of a character. Below is a transpiled version:

$myStr = preg_replace_callback('~[\x{0022}\x{0027}\x{0080}-\x{ffff}]~u', function ($c) {
    return '&#' . IntlChar::ord($c[0]) . ';';
}, $myStr);
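
If the intl extension is not available, mb_ord() from mbstring (PHP 7.2 or later) can stand in for IntlChar::ord(). The sketch below swaps it in and runs the same pattern on a sample string chosen for this example:

```php
<?php

$myStr = "Caf\u{E9} \"x\"";

$myStr = preg_replace_callback('~[\x{0022}\x{0027}\x{0080}-\x{ffff}]~u', function ($c) {
    // mb_ord() is used here in place of IntlChar::ord().
    return '&#' . mb_ord($c[0], 'UTF-8') . ';';
}, $myStr);

echo $myStr, PHP_EOL; // Caf&#233; &#34;x&#34;
```

Note that the character class stops at U+FFFF, so characters outside the Basic Multilingual Plane are left untouched by this pattern.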

Two different special characters are equal

What can I do to make a valid comparison?

Updated Solution

Perhaps we're looking at this the wrong way. The problem may have nothing to do with PHP, but be about your code editor instead. Perhaps the editor is registering both chars as the same when you enter them, so PHP doesn't see any difference. Here is what you can do:

  1. Save each char in a file by itself, using an editor that recognizes the character, such as WordPad
  2. Load the character in PHP with $char=file_get_contents('path/to/char.txt')
  3. Now that we've bypassed your code editor entirely, compare the two. If they're different, your editor might be to blame.
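
When comparing, it helps to look at the raw bytes and code points rather than the rendered glyphs. A minimal sketch, assuming PHP 7.4 or later with mbstring; describeString is a hypothetical helper, and inline strings stand in for the file contents so the example runs on its own:

```php
<?php

// Hypothetical helper: describes a string as hex bytes plus its code points.
function describeString(string $s): string
{
    $codePoints = array_map(
        fn ($ch) => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        mb_str_split($s, 1, 'UTF-8')
    );

    return bin2hex($s) . ' [' . implode(' ', $codePoints) . ']';
}

// In practice these would come from file_get_contents('path/to/char.txt').
$char1 = "\u{D8}"; // Ø
$char2 = "v";

echo describeString($char1), PHP_EOL; // c398 [U+00D8]
echo describeString($char2), PHP_EOL; // 76 [U+0076]
var_dump($char1 === $char2);          // bool(false)
```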

Original Solution

You could try converting your characters to their Unicode code points and comparing those values instead of the characters:

$ordUTF8 = function ($char) {
    list(, $ord) = unpack('N', mb_convert_encoding($char, 'UCS-4BE', 'UTF-8'));
    return $ord;
};
$char1 = "Ø";
$char2 = "v";
// 61656 and 61558 in my testing
$isEqual = $ordUTF8($char1) === $ordUTF8($char2);

Live demo. This solution was inspired by this accepted answer

Convert unicode symbols to \uXXXX, not using json_encode

Since your current solution uses the u regex modifier, I'm assuming your input is encoded as UTF-8.

The following solution is definitely not simpler (apart from the regex) and I'm not even sure it's faster, but it's more low-level and shows the actual escaping procedure.

$input = preg_replace_callback('#[^\x00-\x7f]#u', function($m) {
    $utf16 = mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8');
    if (strlen($utf16) <= 2) {
        // Single code unit
        $esc = '\u' . bin2hex($utf16);
    }
    else {
        // Surrogate pair: escape the high and low code units separately
        $esc = '\u' . bin2hex(substr($utf16, 0, 2)) .
               '\u' . bin2hex(substr($utf16, 2, 2));
    }
    return $esc;
}, $input);

One fundamental problem is that PHP doesn't have an ord function that works with UTF-8. You either have to use mb_convert_encoding, or you have to roll your own UTF-8 decoder (see linked question) which would allow for additional optimizations. Two- and three-byte UTF-8 sequences map to a single UTF-16 code unit. Four-byte sequences require two code units (high and low surrogate).
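
For illustration, a hand-rolled decoder along those lines might look like the sketch below. Both function names are hypothetical, the input is assumed to be a single well-formed UTF-8 character, and no validation is attempted:

```php
<?php

// Hypothetical helper: decodes the first character of a well-formed UTF-8 string.
// Malformed input is not detected and gives meaningless results.
function utf8CodePoint(string $char): int
{
    $b0 = ord($char[0]);
    if ($b0 < 0x80) { // 1 byte: 0xxxxxxx
        return $b0;
    }
    if ($b0 < 0xE0) { // 2 bytes: 110xxxxx 10xxxxxx
        return (($b0 & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    }
    if ($b0 < 0xF0) { // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return (($b0 & 0x0F) << 12)
             | ((ord($char[1]) & 0x3F) << 6)
             | (ord($char[2]) & 0x3F);
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return (($b0 & 0x07) << 18)
         | ((ord($char[1]) & 0x3F) << 12)
         | ((ord($char[2]) & 0x3F) << 6)
         | (ord($char[3]) & 0x3F);
}

// Hypothetical helper: formats a code point as one or two \uXXXX escapes.
function escapeCodePoint(int $cp): string
{
    if ($cp < 0x10000) {
        return sprintf('\u%04x', $cp);
    }
    $cp -= 0x10000;
    $high = 0xD800 | ($cp >> 10);   // top 10 bits
    $low  = 0xDC00 | ($cp & 0x3FF); // bottom 10 bits

    return sprintf('\u%04x\u%04x', $high, $low);
}

echo escapeCodePoint(utf8CodePoint("\u{20AC}")), PHP_EOL;  // \u20ac
echo escapeCodePoint(utf8CodePoint("\u{1F600}")), PHP_EOL; // \ud83d\ude00
```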

If you're aiming for simplicity and readability, you probably can't beat the json_encode approach.


