How to Convert a UTF-8 String into Unicode

How to convert a UTF-8 string into Unicode?

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i = 0; i < utf8String.Length; ++i)
    {
        //Debug.Assert(0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

This is easy, but it would be best to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is code that converts bytes to a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).


Alternatively, if you're sure you know which incorrect encoding was used to produce the string, and that incorrect transformation was lossless (usually the case when the incorrect encoding is a single-byte encoding), then you can simply apply the inverse encoding step to recover the original UTF-8 bytes and then do the correct conversion from UTF-8:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding(1252), Encoding.UTF8);

Python 3.6, utf-8 to unicode conversion, string with double backslashes

You have to encode/decode 4 times to get the desired result:

print(
"Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

# actually any encoding that supports printable ASCII would work, for example utf-8
.encode('ascii')

# unescape the string
# source: https://stackoverflow.com/a/1885197
.decode('unicode-escape')

# latin-1 also works, see https://stackoverflow.com/q/7048745
.encode('iso-8859-1')

# finally
.decode('utf-8')
)


Besides, consider asking your target program (the data source) to emit a different output format (a raw byte array or base64-encoded data, for example), if you can.
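
For instance, if the source could hand you base64 instead of escaped text, decoding becomes a one-liner; here is a minimal sketch (the payload is just the same sample text, base64-encoded for illustration):

import base64

# Hypothetical payload: the same Czech sample, received as base64-encoded UTF-8 bytes.
payload = base64.b64encode("Je-li pro zařazování".encode("utf-8")).decode("ascii")

print(base64.b64decode(payload).decode("utf-8"))  # Je-li pro zařazování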

The unsafe-but-shorter way:

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
print(eval("b'"+st+"'").decode('utf-8'))


There is also ast.literal_eval, but it may not be worth using here.
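
If you do want to avoid eval, a sketch of the same trick using ast.literal_eval (with the same sample string as above) could look like this:

import ast

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
# literal_eval only accepts Python literals, so unlike eval it cannot run arbitrary code.
# (This assumes the input contains no single quotes.)
raw = ast.literal_eval("b'" + st + "'")
print(raw.decode('utf-8'))  # Je-li pro zařazování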

How can I convert all UTF8 Unicode characters in a string to their relevant Codepoints using bash/shell/zsh?

Using perl, works in any shell as long as the arguments are encoded in UTF-8:

$ perl -CA -E 'for my $arg (@ARGV) { say map { my $cp = ord; $cp > 127 ? sprintf "U+%04X", $cp : $_ } split //, $arg }' "My📔"
MyU+1F4D4

Non-ASCII codepoints are printed as U+XXXX (0-padded, more hex digits if needed), ASCII ones as human-readable letters.


Or for maximum speed, a C program that does the same:

// Compile with: gcc -o print_unicode -std=c11 -O -Wall -Wextra print_unicode.c
#include <assert.h>
#include <inttypes.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>

#if __STDC_VERSION__ < 201112L
#error "Need C11 or newer"
#endif
#ifndef __STDC_UTF_32__
#error "Not using Unicode"
#endif

int main(int argc, char **argv) {
    // arguments should be encoded according to locale's character set
    setlocale(LC_CTYPE, "");

    for (int i = 1; i < argc; i++) {
        char *s = argv[i];
        size_t len = strlen(argv[i]);
        mbstate_t state;
        memset(&state, 0, sizeof state);

        while (len > 0) {
            char32_t c;
            size_t rc = mbrtoc32(&c, s, len, &state);
            assert(rc != (size_t)-3);
            if (rc == (size_t)-1) {
                perror("mbrtoc32");
                return EXIT_FAILURE;
            } else if (rc == (size_t)-2) {
                fprintf(stderr, "Argument %d is incomplete!\n", i);
                return EXIT_FAILURE;
            } else if (rc > 0) {
                if (c > 127) {
                    printf("U+%04" PRIXLEAST32, c);
                } else {
                    putchar((char)c);
                }
                s += rc;
                len -= rc;
            }
        }
        putchar('\n');
    }
    return 0;
}

$ ./print_unicode "My📔"
MyU+1F4D4

How to convert string to unicode(UTF-8) string in Swift?

Use this code,

let str = String(UTF8String: strToDecode.cStringUsingEncoding(NSUTF8StringEncoding))

Hope it's helpful.

How to convert a string of utf-8 bytes into a unicode emoji in python

Yes, I encountered the same problem when trying to decode a Facebook message dump. Here's how I solved it:

string = "\u00f0\u009f\u0098\u0086".encode("latin-1").decode("utf-8")
# ''

Here's why:

  1. This emoji takes 4 bytes to encode in UTF-8 (F0 9F 98 86).
  2. Facebook could have used UTF-8 directly in the JSON file, but instead chose to emit printable ASCII only, so it escapes those 4 bytes as \u00F0\u009F\u0098\u0086.
  3. encode("latin-1") is a convenient way to map those code points (each in the range U+0000 to U+00FF) back to the raw bytes.
  4. decode("utf-8") converts the raw bytes into the Unicode character.
Java: how to convert UTF-8 (in literal) to unicode

System.out.println(new String(
    new byte[] { (byte)0xE2, (byte)0x80, (byte)0x93 }, "UTF-8"));

prints an en dash (U+2013), which is what those three bytes encode. It is not clear from your question whether you have those three bytes, or literally the string you have posted. If you have the string, then simply parse it into bytes beforehand, for example like this:

final String[] bstrs = "\\xE2\\x80\\x93".split("\\\\x");
final byte[] bytes = new byte[bstrs.length - 1];
// bstrs[0] is the empty string before the first \x, so start at index 1
for (int i = 1; i < bstrs.length; i++)
    bytes[i - 1] = (byte) Integer.parseInt(bstrs[i], 16);
System.out.println(new String(bytes, "UTF-8"));

Encode UTF8 text to Unicode C#

byte[] utf8Bytes = new byte[text_txt.Length];
for (int i = 0; i < text_txt.Length; ++i)
{
    //Debug.Assert(0 <= text_txt[i] && text_txt[i] <= 255, "the char must be in byte's range");
    utf8Bytes[i] = (byte)text_txt[i];
}
text_txt = Encoding.UTF8.GetString(utf8Bytes, 0, text_txt.Length);

from answer: How to convert a UTF-8 string into Unicode?

How to efficiently convert between unicode code points and UTF-8 literals in python?

Actually, I don't think you need to go via UTF-8 at all here; int will give you the code point:

>>> int('00A1', 16)
161

And then it's just chr:

>>> chr(161)
'¡'
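
Going the other way, from a code point to its UTF-8 bytes (or back to U+XXXX notation), is just as direct; a quick sketch:

cp = 0x00A1
char = chr(cp)                      # '¡'
utf8_bytes = char.encode("utf-8")   # b'\xc2\xa1', the UTF-8 encoding of U+00A1
print(f"U+{ord(char):04X}")         # U+00A1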

When converting a utf-8 encoded string from bytes to characters, how does the computer know where a character ends?

The first byte of a multibyte sequence encodes the length of the sequence in the number of leading 1-bits:

  • 0xxxxxxx is a character on its own;
  • 10xxxxxx is a continuation of a multibyte character;
  • 110xxxxx is the first byte of a 2-byte character;
  • 1110xxxx is the first byte of a 3-byte character;
  • 11110xxx is the first byte of a 4-byte character.

Bytes with more than 4 leading 1-bits don't encode valid characters in UTF-8 because the 4-byte sequences already cover more than the entire Unicode range from U+0000 to U+10FFFF.

So, the example posed in the question has one ASCII character and one continuation byte that doesn't encode a character on its own.
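
As a rough illustration of that rule (not a full decoder, which would also have to validate continuation bytes and reject overlong or out-of-range sequences), the expected length can be read straight off the first byte:

def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with first_byte occupies."""
    if first_byte < 0x80:      # 0xxxxxxx: ASCII, a character on its own
        return 1
    if first_byte < 0xC0:      # 10xxxxxx: continuation byte, never starts a character
        raise ValueError("continuation byte, not the start of a character")
    if first_byte < 0xE0:      # 110xxxxx
        return 2
    if first_byte < 0xF0:      # 1110xxxx
        return 3
    if first_byte < 0xF8:      # 11110xxx
        return 4
    raise ValueError("invalid UTF-8 start byte")

data = "é😆".encode("utf-8")   # b'\xc3\xa9' + b'\xf0\x9f\x98\x86'
i = 0
while i < len(data):
    n = utf8_sequence_length(data[i])
    print(data[i:i + n].decode("utf-8"), "uses", n, "byte(s)")
    i += n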


