How to get code point number for a given character in a utf-8 string?
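Scott Reynen wrote a function that converts a UTF-8 string into percent-escaped Unicode code points (for example %u0391). I found it in the comments of the PHP documentation. Note that it only handles sequences of up to three bytes, so characters outside the BMP are not decoded correctly.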
function utf8_to_unicode( $str ) {
    $unicode = array();
    $values = array();
    $lookingFor = 1;
    for ( $i = 0; $i < strlen( $str ); $i++ ) {
        $thisValue = ord( $str[ $i ] );
        if ( $thisValue < ord('A') ) {
            // exclude 0-9
            if ( $thisValue >= ord('0') && $thisValue <= ord('9') ) {
                // number
                $unicode[] = chr( $thisValue );
            }
            else {
                $unicode[] = '%'.dechex( $thisValue );
            }
        } else {
            if ( $thisValue < 128 )
                $unicode[] = $str[ $i ];
            else {
                if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;
                $values[] = $thisValue;
                if ( count( $values ) == $lookingFor ) {
                    $number = ( $lookingFor == 3 ) ?
                        ( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):
                        ( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );
                    $number = dechex( $number );
                    $unicode[] = ( strlen( $number ) == 3 ) ? "%u0".$number : "%u".$number;
                    $values = array();
                    $lookingFor = 1;
                } // if
            } // if
        }
    } // for
    return implode( "", $unicode );
} // utf8_to_unicode
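The bit arithmetic in the function above can be sketched in Python as follows. This is a minimal illustration rather than a port: the name utf8_code_points is hypothetical, it returns plain integers instead of %u escapes, it adds the four-byte case, and it assumes the input is already valid UTF-8 (continuation bytes are not checked).

```python
def utf8_code_points(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into code points by hand (assumes valid input)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:        # 0xxxxxxx: one byte, plain ASCII
            length, value = 1, lead
        elif lead < 0xE0:      # 110xxxxx: two bytes, 5 payload bits in the lead
            length, value = 2, lead & 0x1F
        elif lead < 0xF0:      # 1110xxxx: three bytes, 4 payload bits
            length, value = 3, lead & 0x0F
        else:                  # 11110xxx: four bytes, 3 payload bits
            length, value = 4, lead & 0x07
        # Each continuation byte (10xxxxxx) contributes 6 more bits.
        for byte in data[i + 1:i + length]:
            value = (value << 6) | (byte & 0x3F)
        code_points.append(value)
        i += length
    return code_points


print([hex(cp) for cp in utf8_code_points("A¡ሒ😀".encode("utf-8"))])
```

In practice, [ord(ch) for ch in data.decode("utf-8")] gives the same result and also validates the input.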
How to efficiently convert between unicode code points and UTF-8 literals in python?
Actually, I don't think you need to go via UTF-8 at all here. int will give you the code point:
>>> int('00A1', 16)
161
And then it's just chr to get the character:
>>> chr(161)
'¡'
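If the UTF-8 bytes themselves are needed, a round trip might look like the sketch below. The helper names code_point_to_utf8 and utf8_to_code_point are made up for illustration, and the second helper assumes its input encodes exactly one character.

```python
def code_point_to_utf8(hex_code_point: str) -> bytes:
    """Turn a hex code point such as '00A1' into its UTF-8 byte sequence."""
    return chr(int(hex_code_point, 16)).encode("utf-8")


def utf8_to_code_point(data: bytes) -> str:
    """Turn the UTF-8 bytes of a single character back into a hex code point."""
    char = data.decode("utf-8")
    # ord() raises TypeError if the decoded string is not exactly one character.
    return f"{ord(char):04X}"


print(code_point_to_utf8("00A1"))
print(utf8_to_code_point(b"\xc2\xa1"))
```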
UTF-8 to code point
You can use java.nio.charset.CharsetDecoder to do that. You'll need a ByteBuffer and a CharBuffer. Put the data into the ByteBuffer, then use CharsetDecoder.decode(ByteBuffer in, CharBuffer out, boolean endOfInput) to read into the CharBuffer. Then you can get the code point using Character.codePointAt(char[] a, int index). It is important to use this method because if your text has characters outside the BMP, they will be translated into two chars (a surrogate pair), so it's not sufficient to read only one char.
With this method you only need to create the two buffers once; after that, no new objects will be created unless some error occurs.
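For comparison only, the same streaming idea can be sketched in Python with the standard codecs incremental decoder. This is not the Java API described above; the function name iter_code_points is hypothetical. Python strings are indexed by code point, so no surrogate-pair handling is needed here.

```python
import codecs


def iter_code_points(chunks):
    """Yield code points from an iterable of UTF-8 byte chunks.

    A multi-byte sequence may be split across chunk boundaries; the
    incremental decoder buffers the partial bytes until it is complete.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        for char in decoder.decode(chunk):
            yield ord(char)
    # final=True raises UnicodeDecodeError if the input ended mid-sequence.
    for char in decoder.decode(b"", final=True):
        yield ord(char)


data = "a😀".encode("utf-8")
# Split in the middle of the four-byte sequence on purpose.
print(list(iter_code_points([data[:3], data[3:]])))
```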
How can I get the Unicode value of a character in go?
Strings are UTF-8 encoded, so to decode a character from a string to get the rune (Unicode code point), you can use the unicode/utf8 package.
Example:
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "AÅÄÖ"
    for len(str) > 0 {
        r, size := utf8.DecodeRuneInString(str)
        fmt.Printf("%d %v\n", r, size)
        str = str[size:]
    }
}
Result:
65 1
197 2
196 2
214 2
Edit: (To clarify Michael's supplement)
A character such as Ä may be created using different Unicode code points:
Precomposed: Ä (U+00C4)
Using combining diaeresis: A (U+0041) + ¨ (U+0308)
In order to get the precomposed form, one can use the normalization package, golang.org/x/text/unicode/norm. The NFC form (Canonical Decomposition, followed by Canonical Composition) will turn U+0041 + U+0308 into U+00C4:
c := "\u0041\u0308"
r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(c)))
fmt.Printf("%+q", r) // '\u00c4'
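The same normalization step can be sketched in Python with the standard unicodedata module. This is only an illustration of NFC composition, not part of the Go answer.

```python
import unicodedata

decomposed = "A\u0308"                                 # A + combining diaeresis
composed = unicodedata.normalize("NFC", decomposed)    # canonical composition

print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in composed])
```

After normalization the string holds a single code point, so decoding the first rune (or calling ord on the one-character string) yields U+00C4.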
Turning a unicode code point into a unicode character in Python
You can use chr after parsing the number as base-16:
>>> chr(int('1212', 16))
'ሒ'
>>> '\u1212'
'ሒ'
If you're replacing this globally in some string, using re.sub with a substitution function could make this simple:
import re

def replacer(match):
    if match.group(2) == 'u':
        return chr(int(match.group(3), 16))
    elif match.group(2) == 'x':
        return # ...

re.sub(r'(\\(x|u)\{(.*?)\})', replacer, r'\x{abcd} foo \u{1212}')
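A self-contained variant might look like the sketch below. The answer leaves the \x{...} branch unspecified; this sketch assumes, purely for illustration, that it should be treated as a hexadecimal code point just like \u{...}. The names ESCAPE and unescape are hypothetical.

```python
import re

# Matches \x{...} or \u{...} where the braces contain hex digits only.
ESCAPE = re.compile(r"\\([xu])\{([0-9A-Fa-f]+)\}")


def unescape(text: str) -> str:
    """Replace \\x{HEX} and \\u{HEX} escapes with the corresponding character.

    Assumption: both escape kinds denote a code point in hexadecimal.
    chr() raises ValueError for values above 0x10FFFF.
    """
    return ESCAPE.sub(lambda m: chr(int(m.group(2), 16)), text)


print(unescape(r"\x{abcd} foo \u{1212}"))
```

Restricting the pattern to hex digits means malformed escapes are left untouched instead of raising inside int.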
UTF-8 to Unicode Code Points
Converting one character set to another can be done with iconv:
http://php.net/manual/en/function.iconv.php
Note that UTF-8 is already a Unicode encoding; it stores the code points, just in a variable-width form.
Another way is simply using htmlentities with the right character set:
http://php.net/manual/en/function.htmlentities.php
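The character-set conversion idea can be illustrated in Python: converting the text to a fixed-width encoding such as UTF-32 makes every four-byte unit one code point. This is a sketch of the principle, not the PHP iconv call; the function name code_points_via_utf32 is hypothetical.

```python
import struct


def code_points_via_utf32(text: str) -> list[int]:
    """Read code points as big-endian 32-bit integers from UTF-32 bytes."""
    raw = text.encode("utf-32-be")   # the -be variant writes no byte order mark
    return list(struct.unpack(f">{len(raw) // 4}I", raw))


utf8_bytes = "A¡ሒ".encode("utf-8")
print(code_points_via_utf32(utf8_bytes.decode("utf-8")))
```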