Convert Byte Encoding to Unicode

Convert byte encoding to Unicode

How about this:

x <- "bi<df>chen Z<fc>rcher hello world <c6>"

m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
chars <- lapply(codes, function(x) {
rawToChar(as.raw(strtoi(paste0("0x", substr(x,2,3)))), multiple = TRUE)
})

regmatches(x, m) <- chars

x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"

Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"

Note that you can't make an escaped character by pasting a "\x" to the front of a number. That "\x" isn't really in the string at all; it's just how R chooses to represent the byte on screen. Here we use rawToChar() to turn a number into the character we want.

I tested this on a Mac, so I had to set the encoding to "latin1" to see the correct symbols in the console. A lone byte such as 0xdf isn't valid UTF-8 on its own.
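For comparison, here is a minimal Python sketch of the same idea: find each <hh> token, turn the two hex digits into a byte, and decode that byte as Latin-1. The function name decode_hex_tokens is made up for illustration, and the sketch assumes, as the R answer does, that the bytes are Latin-1.

```python
import re


def decode_hex_tokens(text: str) -> str:
    """Replace each <hh> token with the Latin-1 character for byte 0xhh."""
    return re.sub(
        r"<([0-9a-f]{2})>",
        # bytes.fromhex("df") -> b"\xdf"; Latin-1 maps each byte to one character
        lambda m: bytes.fromhex(m.group(1)).decode("latin-1"),
        text,
    )


x = "bi<df>chen Z<fc>rcher hello world <c6>"
print(decode_hex_tokens(x))
```

Python strings are already Unicode, so no separate Encoding() step is needed here.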

When converting a UTF-8 encoded string from bytes to characters, how does the computer know where a character ends?

The first byte of a multibyte sequence encodes the length of the sequence in the number of leading 1-bits:

  • 0xxxxxxx is a character on its own;
  • 10xxxxxx is a continuation of a multibyte character;
  • 110xxxxx is the first byte of a 2-byte character;
  • 1110xxxx is the first byte of a 3-byte character;
  • 11110xxx is the first byte of a 4-byte character.

Bytes with more than 4 leading 1-bits don't encode valid characters in UTF-8 because the 4-byte sequences already cover more than the entire Unicode range from U+0000 to U+10FFFF.

So, the example posed in the question has one ASCII character and one continuation byte that doesn't encode a character on its own.
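The leading-bit rules above can be sketched in Python roughly as follows. The helper names utf8_sequence_length and split_characters are hypothetical, and the sketch only inspects the first byte of each sequence; it does not check that the following bytes really are 10xxxxxx continuation bytes, which a real decoder must do.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes the sequence starting with this lead byte uses."""
    if (lead & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx
        return 1
    if (lead & 0b1110_0000) == 0b1100_0000:  # 110xxxxx
        return 2
    if (lead & 0b1111_0000) == 0b1110_0000:  # 1110xxxx
        return 3
    if (lead & 0b1111_1000) == 0b1111_0000:  # 11110xxx
        return 4
    # 10xxxxxx (continuation byte) or more than four leading 1-bits
    raise ValueError(f"0x{lead:02x} cannot start a UTF-8 sequence")


def split_characters(data: bytes) -> list:
    """Cut a UTF-8 byte string into one bytes chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_characters("aß€🙂".encode("utf-8")))
```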

Java - Convert byte[] to char[]. The encoding is UTF-16

You can try this:

byte[] b = ...
char[] c = new String(b, "UTF-16").toCharArray();

From String(byte[] bytes, String charsetName):

Constructs a new String by decoding the specified array of bytes using the specified charset.

Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode

You turned a bytes object to a string, which is just a representation of the bytes object. You can obtain the original bytes object by using ast.literal_eval() (credits to Mark Tolonen for the suggestion), then a simple decode() will do the job.

>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'

Since you were the one who generated the strings, using eval() would be safe, but why not be safer?
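A self-contained version of the round trip might look like this. The variable names are illustrative, and bad_str is rebuilt here with str() on the assumption that this is how the broken string was produced.

```python
import ast

original = "Příliš žluťoučký kůň úpěl ďábelské ódy"

# str() on a bytes object yields its repr: the text "b'P\\xc5\\x99...'"
bad_str = str(original.encode("utf-8"))

# literal_eval parses that repr back into a real bytes object,
# which can then be decoded normally.
recovered = ast.literal_eval(bad_str).decode("utf-8")

print(recovered == original)
```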

How can I convert a 4-byte string into a Unicode emoji?

The character has the Unicode code point U+1F642. How text is displayed depends on the encoding, that is, on how a sequence of bytes has to be interpreted:

  • in UTF-8 one character can consist of 8, 16, 24 or 32 bits (1 to 4 bytes); this one is $F0 $9F $99 $82.
  • in UTF-16 one character can consist of 16 or 32 bits (2 or 4 bytes = 1 or 2 Words); this one is $D83D $DE42 (using surrogates).
  • in UTF-32 one character always consists of 32 bits (4 bytes = 1 Cardinal or DWord) and always equals the code point, that is $1F642.

In Delphi, you can use:

  • TEncoding.UTF8.GetString() for UTF-8,
  • TEncoding.Unicode.GetString() if you have UTF-16LE,
  • TEncoding.BigEndianUnicode.GetString() if you have UTF-16BE.

Keep in mind that this is just a character like every letter, symbol and whitespace in this text: it can be selected (e.g. Ctrl+A) and copied to the clipboard (e.g. Ctrl+C). No special care is needed.
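The byte values quoted for U+1F642 can be cross-checked with a few lines of Python. This is only an illustration of the three encodings, not Delphi code; the big-endian codec variants are used so that no byte order mark is prepended.

```python
ch = "\U0001F642"  # the emoji, code point U+1F642

utf8 = ch.encode("utf-8")       # 4 bytes
utf16 = ch.encode("utf-16-be")  # 2 words: a surrogate pair
utf32 = ch.encode("utf-32-be")  # 1 dword: the code point itself

print(utf8.hex(), utf16.hex(), utf32.hex())

# Going the other way: the four UTF-8 bytes decode to the single character.
emoji = bytes([0xF0, 0x9F, 0x99, 0x82]).decode("utf-8")
print(emoji == ch)
```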

Conversion of a unicode character from byte

You should use Encoding.GetString, using the most appropriate encoding.

I don't quite understand your situation fully, but the Encoding class is almost certain to be the way to handle it.

Who is in control of the data here? Your code, or that of your customers? Have you defined what the correct format is?

EDIT: Okay, I've had another look at your code: BitConverter.ToChar returns "A character formed by two bytes beginning at startIndex." If you only want to use one byte, just cast it:

char c = (char) buffer[m_index];

I'm surprised your code has been working at all, as it would be breaking any time the next byte was non-zero.
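The difference between the two approaches can be illustrated in Python. This is a sketch under assumptions: the buffer contents are made up, and the two-byte read is modelled as little-endian, which is the usual byte order on the machines where BitConverter runs.

```python
buffer = bytes([0x41, 0x42])  # hypothetical data: 'A' followed by 'B'

# Casting one byte to a char: the byte value becomes the code point.
one_byte = chr(buffer[0])

# Reading two bytes as one 16-bit char (little-endian assumed):
# 0x41 is the low byte and 0x42 the high byte, giving U+4241.
two_bytes = chr(int.from_bytes(buffer[0:2], "little"))

# The two-byte read only matches the cast when the next byte is zero.
zero_next = chr(int.from_bytes(bytes([0x41, 0x00]), "little"))

print(one_byte, hex(ord(two_bytes)), zero_next)
```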

Converting byte string in unicode string

In strings (or Unicode objects in Python 2), \u has a special meaning, namely "here comes a Unicode character specified by its code point". Hence u"\u0432" results in the character в.

The b'' prefix tells you this is a sequence of 8-bit bytes. A bytes object contains no Unicode characters, so the \u code has no special meaning there. Hence, b"\u0432" is just the sequence of the six bytes \, u, 0, 4, 3 and 2.

Essentially you have an 8-bit string containing not a Unicode character, but the specification of a Unicode character.

You can convert this specification using the unicode_escape codec.

>>> c.decode('unicode_escape')
'в'
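A self-contained version, with c defined explicitly, might look like this. The backslash is doubled in the literal so that the bytes object unambiguously holds the six ASCII bytes described above.

```python
# Six ASCII bytes: backslash, 'u', '0', '4', '3', '2'
c = b"\\u0432"
print(len(c))

# unicode_escape interprets the escape sequence and returns a str
decoded = c.decode("unicode_escape")
print(decoded)
```

Note that unicode_escape treats any non-escaped bytes as Latin-1, so it is best suited to input that is pure ASCII apart from the escapes.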

