Convert byte Encoding to unicode
How about this:
x <- "bi<df>chen Z<fc>rcher hello world <c6>"
m <- gregexpr("<[0-9a-f]{2}>", x)
codes <- regmatches(x, m)
chars <- lapply(codes, function(x) {
  rawToChar(as.raw(strtoi(paste0("0x", substr(x,2,3)))), multiple = TRUE)
})
regmatches(x, m) <- chars
x
# [1] "bi\xdfchen Z\xfcrcher hello world \xc6"
Encoding(x) <- "latin1"
x
# [1] "bißchen Zürcher hello world Æ"
Note that you can't make an escaped character by pasting a "\x" to the front of a number. That "\x" really isn't in the string at all; it's just how R chooses to represent the byte on screen. Here we use rawToChar() to turn a number into the character we want.
I tested this on a Mac, so I had to set the encoding to "latin1" to see the correct symbols in the console. A single byte such as \xdf on its own isn't valid UTF-8.
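For comparison, the same substitution can be sketched in Python. This is only an illustration and not part of the original R answer; the function name decode_hex_tokens is made up here. It parses each <xx> token as a hexadecimal number and decodes the resulting single byte as Latin-1, rather than pasting an escape prefix onto text.

```python
import re

def decode_hex_tokens(text):
    """Replace every <xx> hex token with the Latin-1 character for that byte."""
    def to_char(match):
        # Parse the two hex digits into a number, then decode that one byte.
        byte_value = int(match.group(1), 16)
        return bytes([byte_value]).decode("latin-1")
    return re.sub(r"<([0-9a-f]{2})>", to_char, text)

result = decode_hex_tokens("bi<df>chen Z<fc>rcher hello world <c6>")
print(result)
```

Latin-1 maps each byte value directly to the code point with the same number, which is why one byte per character is enough here.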
When converting a utf-8 encoded string from bytes to characters, how does the computer know where a character ends?
The first byte of a multibyte sequence encodes the length of the sequence in the number of leading 1-bits:
0xxxxxxx is a character on its own;
10xxxxxx is a continuation of a multibyte character;
110xxxxx is the first byte of a 2-byte character;
1110xxxx is the first byte of a 3-byte character;
11110xxx is the first byte of a 4-byte character.
Bytes with more than 4 leading 1-bits don't encode valid characters in UTF-8 because the 4-byte sequences already cover more than the entire Unicode range from U+0000 to U+10FFFF.
So, the example posed in the question has one ASCII character and one continuation byte that doesn't encode a character on its own.
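The rule above can be sketched in Python as a small splitter that looks only at the leading bits of each byte. The names utf8_sequence_length and split_utf8 are invented for this illustration, and the sketch deliberately skips full validation (it does not check that the following bytes really are continuation bytes, or that the input is not truncated).

```python
def utf8_sequence_length(first_byte):
    """Return the sequence length announced by a byte, or 0 for a continuation byte."""
    if first_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: a character on its own
    if first_byte & 0b11000000 == 0b10000000:
        return 0  # 10xxxxxx: continuation byte, cannot start a character
    if first_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if first_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if first_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    raise ValueError("more than 4 leading 1-bits is not valid UTF-8")

def split_utf8(data):
    """Split UTF-8 bytes into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0:
            raise ValueError("unexpected continuation byte at offset %d" % i)
        chars.append(data[i:i + n])
        i += n
    return chars

pieces = split_utf8("aß€".encode("utf-8"))
print(pieces)
```

In practice bytes.decode("utf-8") does all of this, including the validation omitted here; the sketch only shows where the character boundaries come from.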
Java - Convert byte[] to char[]. The encoding is UTF-16
You can try this:
byte[] b = ...
char[] c = new String(b, "UTF-16").toCharArray();
From the documentation of String(byte[] bytes, String charsetName):
Constructs a new String by decoding the specified array of bytes using the specified charset.
Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode
You converted a bytes object to a string with str(), which only gives you the printable representation of the bytes object. You can obtain the original bytes object by using ast.literal_eval() (credits to Mark Tolonen for the suggestion), then a simple decode() will do the job.
>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'
Since you were the one who generated the strings, using eval() would be safe, but why not be safer?
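A self-contained round trip may make the problem easier to see. In this sketch, bad_str is rebuilt from scratch to stand in for the string from the question; the variable names original, encoded and recovered are chosen only for illustration.

```python
import ast

original = "Příliš žluťoučký kůň úpěl ďábelské ódy"
encoded = original.encode("utf-8")   # a real bytes object

# str() on bytes does not decode; it returns the repr, starting with "b'".
bad_str = str(encoded)

# literal_eval parses that repr back into bytes without running arbitrary code.
recovered = ast.literal_eval(bad_str).decode("utf-8")
print(recovered)
```

Unlike eval(), ast.literal_eval() only accepts Python literals, so a malicious string cannot execute code.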
How can I convert a 4-byte string into an unicode emoji?
The character 🙂 has the Unicode code point U+1F642. How text is stored is defined by an encoding, that is, how a sequence of bytes has to be interpreted:
- in UTF-8 one character can consist of 8, 16, 24 or 32 bits (1 to 4 Bytes); this one is $F0 $9F $99 $82.
- in UTF-16 one character can consist of 16 or 32 bits (2 or 4 bytes = 1 or 2 Words); this one is $D83D $DE42 (using surrogates).
- in UTF-32 one character always consists of 32 bits (4 bytes = 1 Cardinal or DWord) and always equals the code point, that is $1F642.
In Delphi, you can use TEncoding.UTF8.GetString() for UTF-8 (or TEncoding.Unicode.GetString() if you have UTF-16LE, and TEncoding.BigEndianUnicode.GetString() if you have UTF-16BE).
Keep in mind that 🙂 is just a character like each letter, symbol and whitespace of this text: it can be selected (e.g. with Ctrl+A) and copied to the clipboard (e.g. with Ctrl+C). No special care is needed.
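The byte values quoted above can be checked with a short Python sketch (Python rather than Delphi, purely as a neutral way to verify the numbers; the variable names are illustrative). It encodes U+1F642 in the three encodings and decodes the four UTF-8 bytes back.

```python
ch = chr(0x1F642)  # the code point from the answer

utf8 = ch.encode("utf-8")        # 4 bytes
utf16 = ch.encode("utf-16-be")   # 2 words: a surrogate pair, big-endian
utf32 = ch.encode("utf-32-be")   # 1 double word equal to the code point

print(utf8.hex(" "), utf16.hex(" "), utf32.hex(" "))

# Going the other way: four UTF-8 bytes become one character.
decoded = bytes([0xF0, 0x9F, 0x99, 0x82]).decode("utf-8")
```

The "-be" codec variants are used so that the output matches the big-endian notation of the word values; the plain "utf-16" and "utf-32" codecs in Python also prepend a byte order mark when encoding.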
Conversion of a unicode character from byte
You should use Encoding.GetString, using the most appropriate encoding.
I don't quite understand your situation fully, but the Encoding class is almost certain to be the way to handle it.
Who is in control of the data here? Your code, or that of your customers? Have you defined what the correct format is?
EDIT: Okay, I've had another look at your code: BitConverter.ToChar returns "A character formed by two bytes beginning at startIndex." If you only want to use one byte, just cast it:
char c = (char) buffer[m_index];
I'm surprised your code has been working at all, as it would be breaking any time the next byte was non-zero.
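Why the two-byte read breaks can be illustrated with a small Python sketch. It is an analogy rather than the C# API: the helper names two_byte_char and one_byte_char and the sample buffer are made up, and little-endian byte order is assumed for the two-byte read.

```python
buffer = bytes([0x41, 0x00, 0x42, 0x43])  # 'A', zero, 'B', 'C'

def two_byte_char(buf, index):
    """Form one character from two bytes, assuming little-endian order."""
    return chr(int.from_bytes(buf[index:index + 2], "little"))

def one_byte_char(buf, index):
    """Form one character from a single byte, like a plain cast."""
    return chr(buf[index])

ok = two_byte_char(buffer, 0)       # next byte is zero, so this is still 'A'
broken = two_byte_char(buffer, 2)   # next byte is 0x43, giving U+4342 instead of 'B'
fixed = one_byte_char(buffer, 2)    # 'B'
```

The two-byte read only gives the expected letter while the following byte happens to be zero, which matches the observation above.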
Converting byte string in unicode string
In strings (or Unicode objects in Python 2), \u has a special meaning, namely saying, "here comes a Unicode character specified by its Unicode ID". Hence u"\u0432" will result in the character в.
The b'' prefix tells you this is a sequence of 8-bit bytes, and a bytes object has no Unicode characters, so the \u code has no special meaning. Hence, b"\u0432" is just the sequence of the bytes \, u, 0, 4, 3 and 2.
Essentially you have an 8-bit string containing not a Unicode character, but the specification of a Unicode character.
You can convert this specification into the actual character by decoding it with the unicode_escape codec.
>>> c.decode('unicode_escape')
'в'
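A self-contained version of the same idea, as a sketch: here c is defined explicitly (in the question it already existed), with the backslash doubled so the bytes literal contains a literal backslash without relying on an unrecognized escape.

```python
# Six separate bytes: backslash, 'u', '0', '4', '3', '2'
c = b"\\u0432"

# unicode_escape interprets the escape sequence spelled out in those bytes.
decoded = c.decode("unicode_escape")
print(len(c), decoded)
```

Note that unicode_escape treats any non-escape bytes as Latin-1, so it is only a good fit when the data is pure ASCII plus escape sequences.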