How to convert a UTF-8 string into Unicode?
So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within a byte's range, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.
public static string DecodeFromUtf8(this string utf8String)
{
    // Copy the string's code units into UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i = 0; i < utf8String.Length; ++i)
    {
        // Debug.Assert(0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }
    return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
}
DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
This is easy, but it would be best to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
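The same round trip is easy to sketch in Python, with latin-1 standing in for whatever single-byte encoding caused the mistake:

```python
# Minimal sketch of how the mojibake arises and how it is repaired.
original = "déjà"
utf8_bytes = original.encode("utf-8")    # b'd\xc3\xa9j\xc3\xa0'

# The mistake: each UTF-8 byte is read as one Latin-1 character.
mangled = utf8_bytes.decode("latin-1")   # 'dÃ©jÃ\xa0'

# The fix: invert the mistaken decode, then decode correctly.
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired)                          # déjà
```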
Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:
public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // The inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}
UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8);
Python 3.6, utf-8 to unicode conversion, string with double backslashes
You have to encode/decode 4 times to get the desired result:
print(
    "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
    # actually any encoding supporting printable ASCII would work, for example utf-8
    .encode('ascii')
    # unescape the string
    # source: https://stackoverflow.com/a/1885197
    .decode('unicode-escape')
    # latin-1 also works, see https://stackoverflow.com/q/7048745
    .encode('iso-8859-1')
    # finally
    .decode('utf-8')
)
Besides, if you can, consider asking your data source to produce a different output format (a byte array or base64-encoded data, for example).
The unsafe-but-shorter way:
st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
print(eval("b'"+st+"'").decode('utf-8'))
There is also ast.literal_eval, which avoids the dangers of eval, but it may not be worth using here.
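For completeness, here is a sketch of that safer variant, assuming the input contains no quote characters of its own:

```python
import ast

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

# literal_eval only evaluates Python literals, so untrusted input
# cannot execute arbitrary code the way eval can.
raw = ast.literal_eval("b'" + st + "'")
print(raw.decode('utf-8'))  # Je-li pro zařazování
```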
How can I convert all UTF8 Unicode characters in a string to their relevant Codepoints using bash/shell/zsh?
Using perl
, works in any shell as long as the arguments are encoded in UTF-8:
$ perl -CA -E 'for my $arg (@ARGV) { say map { my $cp = ord; $cp > 127 ? sprintf "U+%04X", $cp : $_ } split //, $arg }' "My📔"
MyU+1F4D4
Non-ASCII codepoints are printed as U+XXXX (0-padded, more hex digits if needed), ASCII ones as human-readable letters.
Or for maximum speed, a C program that does the same:
// Compile with: gcc -o print_unicode -std=c11 -O -Wall -Wextra print_unicode.c
#include <assert.h>
#include <inttypes.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <uchar.h>

#if __STDC_VERSION__ < 201112L
#error "Need C11 or newer"
#endif
#ifndef __STDC_UTF_32__
#error "Not using Unicode"
#endif

int main(int argc, char **argv) {
    // Arguments should be encoded according to the locale's character set.
    setlocale(LC_CTYPE, "");
    for (int i = 1; i < argc; i++) {
        char *s = argv[i];
        size_t len = strlen(argv[i]);
        mbstate_t state;
        memset(&state, 0, sizeof state);
        while (len > 0) {
            char32_t c;
            size_t rc = mbrtoc32(&c, s, len, &state);
            assert(rc != (size_t)-3);
            if (rc == (size_t)-1) {
                perror("mbrtoc32");
                return EXIT_FAILURE;
            } else if (rc == (size_t)-2) {
                fprintf(stderr, "Argument %d is incomplete!\n", i);
                return EXIT_FAILURE;
            } else if (rc > 0) {
                if (c > 127) {
                    printf("U+%04" PRIXLEAST32, c);
                } else {
                    putchar((char)c);
                }
                s += rc;
                len -= rc;
            }
        }
        putchar('\n');
    }
    return 0;
}
$ ./print_unicode "My📔"
MyU+1F4D4
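The same mapping is easy to sketch in Python if neither perl nor a C compiler is at hand (the function name here is made up for illustration):

```python
def codepoints(s: str) -> str:
    # Non-ASCII characters become U+XXXX (at least 4 hex digits);
    # ASCII characters pass through unchanged.
    return ''.join(
        f"U+{ord(ch):04X}" if ord(ch) > 127 else ch
        for ch in s
    )

print(codepoints("My📔"))  # MyU+1F4D4
```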
How to convert string to unicode(UTF-8) string in Swift?
Use this code (note this is the old Swift 2 API):
let str = String(UTF8String: strToDecode.cStringUsingEncoding(NSUTF8StringEncoding))
Hope it's helpful.
How to convert a string of utf-8 bytes into a unicode emoji in python
Yes, I encountered the same problem when trying to decode a Facebook message dump. Here's how I solved it:
string = "\u00f0\u009f\u0098\u0086".encode("latin-1").decode("utf-8")
# '😆'
Here's why:
- This emoji takes 4 bytes to encode in UTF-8 (F0 9F 98 86, check at the bottom of this page).
- Facebook could have used UTF-8 for the JSON file, but they instead chose printable ASCII only, so those 4 bytes are encoded as \u00F0\u009F\u0098\u0086.
- encode("latin-1") is a convenient way to convert those code points back to the raw bytes.
- decode("utf-8") converts the raw bytes into a Unicode character.
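Wrapped into a small helper (the function name is made up for illustration):

```python
def fix_mojibake(s: str) -> str:
    # Each code point in s is really one UTF-8 byte; latin-1 maps
    # code points 0-255 to the same byte values, losslessly.
    return s.encode("latin-1").decode("utf-8")

print(fix_mojibake("\u00f0\u009f\u0098\u0086"))  # 😆
```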
Java: how to convert UTF-8 (in literal) to unicode
System.out.println(new String(new byte[] {
(byte)0xE2, (byte)0x80, (byte)0x93 }, "UTF-8"));
prints an en dash (U+2013), which is what those three bytes encode. It is not clear from your question whether you have those three bytes or literally the string you posted. If you have the string, then simply parse it into bytes beforehand, for example like this:
final String[] bstrs = "\\xE2\\x80\\x93".split("\\\\x");
final byte[] bytes = new byte[bstrs.length - 1];
for (int i = 1; i < bstrs.length; i++)
    bytes[i - 1] = (byte) Integer.parseInt(bstrs[i], 16);
System.out.println(new String(bytes, "UTF-8"));
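The same parsing step can be sketched in Python for comparison:

```python
literal = "\\xE2\\x80\\x93"

# Strip the \x escape prefixes, leaving hex digits, then parse as bytes.
raw = bytes.fromhex(literal.replace("\\x", ""))
print(raw.decode("utf-8"))  # an en dash, U+2013
```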
Encode UTF8 text to Unicode C#
byte[] utf8Bytes = new byte[text_txt.Length];
for (int i = 0; i < text_txt.Length; ++i)
{
    // Debug.Assert(0 <= text_txt[i] && text_txt[i] <= 255, "the char must be in byte's range");
    utf8Bytes[i] = (byte)text_txt[i];
}
text_txt = Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
from answer: How to convert a UTF-8 string into Unicode?
How to efficiently convert between unicode code points and UTF-8 literals in python?
Actually, I don't think you need to go via UTF-8 at all here. int will give you the code point:
>>> int('00A1', 16)
161
And then it's just chr
>>> chr(161)
'¡'
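For the other direction, going from a character back to its code point and its UTF-8 bytes, a minimal sketch:

```python
ch = chr(int('00A1', 16))   # '¡'

# Back to the code point, formatted as 4 hex digits.
print(f"{ord(ch):04X}")     # 00A1

# And the actual UTF-8 bytes, if you really need them.
print(ch.encode('utf-8'))   # b'\xc2\xa1'
```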
When converting a utf-8 encoded string from bytes to characters, how does the computer know where a character ends?
The first byte of a multibyte sequence encodes the length of the sequence in the number of leading 1-bits:
- 0xxxxxxx is a character on its own;
- 10xxxxxx is a continuation of a multibyte character;
- 110xxxxx is the first byte of a 2-byte character;
- 1110xxxx is the first byte of a 3-byte character;
- 11110xxx is the first byte of a 4-byte character.
Bytes with more than 4 leading 1-bits don't encode valid characters in UTF-8 because the 4-byte sequences already cover more than the entire Unicode range from U+0000 to U+10FFFF.
So, the example posed in the question has one ASCII character and one continuation byte that doesn't encode a character on its own.
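That rule can be written down directly, as a toy sketch (not a replacement for bytes.decode):

```python
def utf8_length(first_byte: int) -> int:
    # The sequence length is encoded in the leading 1-bits of the first byte.
    if first_byte < 0b10000000:
        return 1  # ASCII, one byte on its own
    if first_byte < 0b11000000:
        raise ValueError("continuation byte, not a sequence start")
    if first_byte < 0b11100000:
        return 2
    if first_byte < 0b11110000:
        return 3
    if first_byte < 0b11111000:
        return 4
    raise ValueError("invalid first byte")

print(utf8_length("é".encode("utf-8")[0]))  # 2
```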