How to Get the Code Point Number for a Given Character in a UTF-8 String

How to get the code point number for a given character in a UTF-8 string?

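Scott Reynen wrote a function to convert UTF-8 into Unicode; I found it while looking through the PHP documentation. Note that it returns a string of %uXXXX escapes rather than numeric code points, and it only decodes sequences of up to three bytes, so characters outside the BMP are not handled.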

function utf8_to_unicode( $str ) {

    $unicode = array();
    $values = array();      // bytes collected for the current multi-byte sequence
    $lookingFor = 1;        // expected length of the current sequence in bytes

    for ( $i = 0; $i < strlen( $str ); $i++ ) {
        $thisValue = ord( $str[ $i ] );
        if ( $thisValue < ord('A') ) {
            // exclude 0-9
            if ( $thisValue >= ord('0') && $thisValue <= ord('9') ) {
                // number
                $unicode[] = chr( $thisValue );
            }
            else {
                $unicode[] = '%'.dechex( $thisValue );
            }
        } else {
            if ( $thisValue < 128 )
                $unicode[] = $str[ $i ];
            else {
                // lead bytes below 224 start a 2-byte sequence; anything else is treated as 3 bytes
                if ( count( $values ) == 0 ) $lookingFor = ( $thisValue < 224 ) ? 2 : 3;
                $values[] = $thisValue;
                if ( count( $values ) == $lookingFor ) {
                    $number = ( $lookingFor == 3 ) ?
                        ( ( $values[0] % 16 ) * 4096 ) + ( ( $values[1] % 64 ) * 64 ) + ( $values[2] % 64 ):
                        ( ( $values[0] % 32 ) * 64 ) + ( $values[1] % 64 );
                    // pad to four hex digits so that e.g. U+00E9 becomes %u00e9
                    $unicode[] = "%u".str_pad( dechex( $number ), 4, "0", STR_PAD_LEFT );
                    $values = array();
                    $lookingFor = 1;
                } // if
            } // if
        }
    } // for
    return implode( "", $unicode );

} // utf8_to_unicode
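
For comparison, here is a minimal sketch of the same bit arithmetic in Python, extended to four-byte sequences and returning plain code point numbers. The function name utf8_code_points is made up for this illustration. It checks lead and continuation bytes but does not reject overlong encodings or surrogates, so treat it as a teaching aid rather than a validating decoder.

```python
def utf8_code_points(data: bytes) -> list:
    """Decode UTF-8 bytes into a list of code point numbers by hand."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # 0xxxxxxx: one byte (ASCII)
            length, value = 1, lead
        elif lead >> 5 == 0b110:        # 110xxxxx: two bytes
            length, value = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:       # 1110xxxx: three bytes
            length, value = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:      # 11110xxx: four bytes
            length, value = 4, lead & 0x07
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02x} at offset {i}")

        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for byte in tail:
            if byte >> 6 != 0b10:       # continuation bytes look like 10xxxxxx
                raise ValueError(f"invalid continuation byte 0x{byte:02x}")
            value = (value << 6) | (byte & 0x3F)

        code_points.append(value)
        i += length
    return code_points
```

In everyday Python code the built-in route, [ord(c) for c in data.decode("utf-8")], should give the same numbers for valid input.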

How to efficiently convert between unicode code points and UTF-8 literals in python?

Actually, I don't think you need to go via UTF-8 at all here. Calling int with base 16 gives you the code point:

>>> int('00A1', 16)
161

And then it's just chr:

>>> chr(161)
'¡'
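
If the UTF-8 bytes are needed as well, the round trip can be sketched as follows. The two helper names are hypothetical and exist only for this example; they use nothing beyond int, chr, ord and the str/bytes codecs.

```python
def code_point_to_utf8(hex_code_point: str) -> bytes:
    """Turn a hex code point such as '00A1' into its UTF-8 bytes."""
    return chr(int(hex_code_point, 16)).encode("utf-8")


def utf8_to_code_point(data: bytes) -> int:
    """Turn the UTF-8 bytes of exactly one character back into its code point."""
    text = data.decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected the encoding of exactly one character")
    return ord(text)
```

For example, code_point_to_utf8('00A1') yields the two bytes C2 A1, and utf8_to_code_point maps them back to 161.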

UTF-8 to code point

You can use java.nio.charset.CharsetDecoder to do that. You'll need a ByteBuffer and a CharBuffer. Put the data into the ByteBuffer, then use CharsetDecoder.decode(ByteBuffer in, CharBuffer out, boolean endOfInput) to decode into the CharBuffer. Then you can get the code point using Character.codePointAt(char[] a, int index). It is important to use this method, because characters outside the BMP are decoded into a surrogate pair of two chars, so reading a single char is not sufficient.

With this method you only need to create the two buffers once. After that, no new objects are created unless an error occurs.
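
To make the surrogate-pair point concrete, here is a small sketch of the arithmetic. It is written in Python rather than Java, and both function names are invented for this illustration; in Java, Character.codePointAt performs the combination step for you.

```python
def utf16_units(char: str) -> list:
    """Return the UTF-16 code units (what Java calls chars) for a string."""
    data = char.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def combine_surrogates(high: int, low: int) -> int:
    """Combine a high/low surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

A BMP character such as 'A' produces one unit, while U+1F600 produces the pair D83D DE00, which combine_surrogates maps back to 0x1F600.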

How can I get the Unicode value of a character in Go?

Go strings are UTF-8 encoded, so to decode a character from a string and get its rune (Unicode code point), you can use the unicode/utf8 package.

Example:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    str := "AÅÄÖ"

    for len(str) > 0 {
        r, size := utf8.DecodeRuneInString(str)
        fmt.Printf("%d %v\n", r, size)

        str = str[size:]
    }
}

Result:

65 1
197 2
196 2
214 2

Edit: (To clarify Michael's supplement)

A character such as Ä may be represented by different sequences of Unicode code points:

Precomposed: Ä (U+00C4)

Using combining diaeresis: A (U+0041) + ¨ (U+0308)

In order to get the precomposed form, one can use the normalization package, golang.org/x/text/unicode/norm. The NFC (Canonical Decomposition, followed by Canonical Composition) form will turn U+0041 + U+0308 into U+00C4:

// needs the imports "fmt", "unicode/utf8" and "golang.org/x/text/unicode/norm"
c := "\u0041\u0308"
r, _ := utf8.DecodeRune(norm.NFC.Bytes([]byte(c)))
fmt.Printf("%+q", r) // '\u00c4'
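
The same normalization step can be sketched in Python with the standard unicodedata module. The helper name first_code_point_nfc is hypothetical and only serves this example.

```python
import unicodedata


def first_code_point_nfc(text: str) -> int:
    """Normalize to NFC and return the code point of the first character."""
    composed = unicodedata.normalize("NFC", text)
    return ord(composed[0])
```

Called with 'A' followed by U+0308, it returns 0xC4, the precomposed Ä. Normalizing with "NFD" instead splits U+00C4 back into the two code points.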

Turning a unicode code point into a unicode character in Python

You can use chr after parsing the number as base-16:

>>> chr(int('1212', 16))
'ሒ'
>>> '\u1212'
'ሒ'

If you're replacing this globally in some string, using re.sub with a substitution function could make this simple:

import re

def replacer(match):
    if match.group(2) == 'u':
        return chr(int(match.group(3), 16))
    elif match.group(2) == 'x':
        return # ...

re.sub(r'(\\(x|u)\{(.*?)\})', replacer, r'\x{abcd} foo \u{1212}')
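
A self-contained variant restricted to the \u{...} form might look like the sketch below. The name expand_u_escapes and the choice to leave out-of-range values untouched are assumptions made for this example, and the \x{...} branch is deliberately not covered.

```python
import re

# One to six hex digits inside \u{...}
U_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def expand_u_escapes(text: str) -> str:
    """Replace every \\u{XXXX} escape in text with the character it names."""
    def replacer(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            # Beyond the Unicode range: keep the escape as written.
            return match.group(0)
        return chr(code_point)

    return U_ESCAPE.sub(replacer, text)
```

Restricting the pattern to hex digits means malformed escapes such as \u{xyz} simply do not match and pass through unchanged.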

UTF-8 to Unicode Code Points

Converting one character set to another can be done with iconv:

http://php.net/manual/en/function.iconv.php

Note that UTF-8 is already a Unicode encoding.
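
One way such a conversion exposes code points is to re-encode as UTF-32, where every character occupies exactly four bytes holding its code point. The sketch below shows the idea in Python rather than PHP; the function name is made up for this illustration.

```python
import struct


def code_points_via_utf32(text: str) -> list:
    """Re-encode text as big-endian UTF-32 and read each 4-byte unit as a number."""
    data = text.encode("utf-32-be")  # explicit byte order, so no BOM is written
    return list(struct.unpack(f">{len(data) // 4}I", data))
```

Each unpacked integer is the code point of one character, including characters outside the BMP.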

Another way is simply using htmlentities with the right character set:

http://php.net/manual/en/function.htmlentities.php


