Conversion of strings like \\uXXXX in python

In 2.x:

>>> u'\\u0e4f\\u032f\\u0361\\u0e4f'.decode('unicode-escape')
u'\u0e4f\u032f\u0361\u0e4f'
>>> print u'\\u0e4f\\u032f\\u0361\\u0e4f'.decode('unicode-escape')
๏̯͡๏

In Python 3.8.2, how do I convert a string that contains a '\uxxxx' sequence into utf-8?

Ideally, I would suggest you try to make sure the \ in your input is not escaped in the first place, but should that not be possible, a regex substitution will do (s being your original string):

re.sub(r"\\u([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s)

Find occurrences of \\u followed by four lowercase hex digits and capture the digits in the first group for back-reference. Replace each such sequence with the character corresponding to the int value represented by those four digits: chr(int(m.group(1), 16)).
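
As a quick end-to-end sketch of that substitution (the input string below is made up), the result can then be encoded to UTF-8 bytes:

import re

s = "\\u0e4f\\u032f\\u0361\\u0e4f"  # text containing literal \uXXXX sequences
decoded = re.sub(r"\\u([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s)
print(decoded)                  # ๏̯͡๏
print(decoded.encode("utf-8"))  # the same characters as UTF-8 bytes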

Converting Unicode sequences to a string in Python 3

It appears your input uses the backslash as an escape character; you should unescape the text before passing it to json:

>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש

Don't use 'unicode-escape' encoding on JSON text; it may produce different results:

>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'

'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).
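
The unicode-escape result, by contrast, contains two lone surrogate code points (they are not combined into one character), which is why it cannot even be encoded back to UTF-8. A quick check, just for illustration:

>>> bad = '["\\ud83d\\ude02"]'.encode('ascii', 'strict').decode('unicode-escape')
>>> bad.encode('utf-8')  # raises UnicodeEncodeError: surrogates not allowed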

Python. Replace \\uxxxx to \uxxxx

First: maybe you are not decoding the webpage with the correct charset. If the web server does not supply the charset you might have to find it in the meta tags or make an educated guess. Maybe try a couple of usual charsets and compare the results.

Second: I played around with strings and decoding for a while and it's really frustrating, but I found a possible solution in format():

s = "\\u00f3"
print('{:c}'.format(int(s[2:], 16)))

Formatting the extracted hex value as a Unicode character seems to work.
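
If the text can contain more than one such escape, the same idea can be wrapped in a regex substitution. A sketch along the lines of the snippet above, using a made-up sample string:

import re

s = "Espa\\u00f1a, M\\u00e9xico"  # hypothetical text with several literal \uXXXX escapes
decoded = re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: '{:c}'.format(int(m.group(1), 16)), s)
print(decoded)  # España, México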

How to convert %uXXXX code to plain text in python?

It seems you have a string where the backslash escape usually used to represent non-ASCII characters has been replaced by a percent-sign based convention.

The solution is to replace the percent signs with backslashes - as you have tried - then encode to bytes and decode with the unicode-escape codec. The result will be a Python str.

>>> s = '%u0E1E%u0E1A%u0E40%u0E08%u0E2D%u0E02%u0E27%u0E14%u0E40'
>>> # Encode to latin-1 as it won't lose any information.
>>> result = s.replace('%', '\\').encode('latin-1').decode('unicode-escape')
>>> result
'พบเจอขวดเ'
>>> # Result is longer than we expected
>>> expected = 'พบเจอ'
>>> result == expected
False
>>> expected in result
True
>>> result.startswith(expected)
True

Converting unicode sequence to string in Python3 but allow paths in string

The input is ambiguous; the right answer does not exist in the general case. We could use heuristics that produce an output that looks right most of the time, e.g., a rule such as "if a \uxxxx sequence (6 chars) is part of an existing path, then don't interpret it as a Unicode escape", and the same for \Uxxxxxxxx (10 chars) sequences. For example, an input similar to the one from the question, b"c:\\U0001f60f\\math.dll", can be interpreted differently depending on whether the c:\U0001f60f\math.dll file actually exists on the disk:

#!/usr/bin/env python3
import re
from pathlib import Path

def decode_unicode_escape_if_path_doesnt_exist(m):
    path = m.group(0)
    return path if Path(path).exists() else replace_unicode_escapes(path)

def replace_unicode_escapes(text):
    return re.sub(
        fr"{unicode_escape}+",
        lambda m: m.group(0).encode("latin-1").decode("raw-unicode-escape"),
        text,
    )

input_text = Path('broken.txt').read_text(encoding='ascii')
hex = "[0-9a-fA-F]"
unicode_escape = fr"(?:\\u{hex}{{4}}|\\U{hex}{{8}})"
drive_letter = "[a-zA-Z]"
print(
    re.sub(
        fr"{drive_letter}:\S*{unicode_escape}\S*",
        decode_unicode_escape_if_path_doesnt_exist,
        input_text,
    )
)

Specify the actual encoding of your broken.txt file in the read_text() call if there are non-ASCII characters in the encoded text.

What specific regex to use to extract paths depends on the type of input that you get.

You could complicate the code by trying to substitute one possible Unicode escape sequence at a time, but the number of candidate decodings grows exponentially with the number of candidates: if there are 10 possible Unicode escape sequences in a path, there are 2**10 decoded paths to try.
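
To see what replace_unicode_escapes does on its own, here is a minimal, self-contained check; the path below is invented and assumed not to exist on disk:

import re

hex = "[0-9a-fA-F]"
unicode_escape = fr"(?:\\u{hex}{{4}}|\\U{hex}{{8}})"

def replace_unicode_escapes(text):
    return re.sub(
        fr"{unicode_escape}+",
        lambda m: m.group(0).encode("latin-1").decode("raw-unicode-escape"),
        text,
    )

print(replace_unicode_escapes(r"c:\U0001f60f\math.dll"))  # c:😏\math.dll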

How can I convert strings like \u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167 to Chinese characters

Those are Unicode codepoints already. They represent Chinese characters, but using escape codes that are easier on the developer:

>>> print u'\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167'
小王子:汉法英对照

You do not have to do anything to convert those; the \uxxxx escape form is simply another way to express the same codepoint. See String Literals:

\uxxxx
Character with 16-bit hex value xxxx (Unicode only)

\Uxxxxxxxx
Character with 32-bit hex value xxxxxxxx (Unicode only)

Python interprets those escape codes when reading the source code to construct the unicode value.

If the source of the data is not from Python source code but from the web, you have JSON data instead, which uses the same escape format:

>>> import json
>>> print json.loads('"\u5c0f\u738b\u5b50\u003a\u6c49\u6cd5\u82f1\u5bf9\u7167"')
小王子:汉法英对照

Note that the value then needs to be part of a larger string, one that at least includes quotes to mark this as a JSON string.

Also note that the JSON string escape format differs from Python's when it comes to non-BMP (supplementary) codepoints; JSON treats those like UTF-16 does, by creating a surrogate pair and using two \uxxxx sequences for such a codepoint. In Python you'd use a single \Uhhhhhhhh 32-bit hex escape instead.
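
A small illustration of that difference, in Python 3 syntax:

import json

# JSON escapes a non-BMP codepoint as a UTF-16 surrogate pair of \uxxxx units:
print(json.loads('"\\ud83d\\ude02"'))  # 😂
# A Python 3 source literal expresses the same character with one 32-bit escape:
print('\U0001F602')                    # 😂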

How to convert UTF-8 notation to python unicode notation

You are struggling with the representation of something versus its value.

import re
re.sub(r"u\+([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s)

But for u+00a0 this comes out as \xa0 ...

... which is exactly what you get for the literal \u00a0 as well:

s = "\u00a0"
print(repr(s))

Once you have the proper value as a Unicode string, you can then encode it to UTF-8:

s = "\xa0"
print(s.encode('utf8'))
# b'\xc2\xa0'

So, the final answer here:

import re
s = "u+00a0"
s2 = re.sub(r"u\+([0-9a-f]{4})", lambda m: chr(int(m.group(1), 16)), s)
s_bytes = s2.encode('utf8')  # b'\xc2\xa0'
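
A slight variation of the same substitution, in case the input mixes upper- and lowercase hex digits or contains several u+XXXX tokens (the sample input is made up):

import re

text = "u+0063 u+0061 u+0066 u+00E9"
decoded = re.sub(r"u\+([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), text)
print(decoded)                 # c a f é
print(decoded.encode('utf8'))  # b'c a f \xc3\xa9'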

ast.literal_eval will convert unicode code point from \uxxxx to \\uxxxx, how to avoid?

json.loads("{\"title\": \"\\u5927\"}") will return a dictionary, so you don't need the ast.literal_eval at all.

d = json.loads("{\"title\": \"\\u5927\"}")

print d
{u'title': u'\u5927'}

type(d)
Out[2]: dict

For the full json.loads() JSON-to-Python conversion table, please see the json module documentation.

If you're trying to parse a file, use json.load() without the s like this:

with open('your-file.json') as f:
    # you can change the encoding to the one you need
    print json.load(f, encoding='utf-8')

Test:

from io import StringIO

s = StringIO(u"{\"title\": \"\\u5927\"}")

print json.load(s)
{u'title': u'\u5927'}
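
As an aside, newer Python 3 versions (3.9+) no longer accept an encoding argument to json.load(); open the file with the right encoding instead (a minimal sketch):

import json

with open('your-file.json', encoding='utf-8') as f:
    print(json.load(f))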

Update

The OP has completely changed the JSON that should be parsed; here is another solution: parse the JSON twice:

json.loads(json.loads(u"\"{\\\"title\\\": \\\"\\\\u5927\\\"}\""))
Out[6]: {u'title': u'\u5927'}

This is because the first json.loads() decodes the outer JSON string, yielding the inner JSON text; parsing that with json.loads() again then deserializes it into a dict.

Python unicode strings

The issue is, according to the docs (read down a little bit, between the escape sequences tables), the \u, \U, and \N Unicode escape sequences are only recognized in string literals. That means that once the literal is evaluated in memory, such as in a variable assignment:

s = "\u00A1 ATENCI\u00D3N! \u25C4"

any attempt to encode it with the unicode_escape codec converts it to a bytes object that uses \x escapes where it can:

b'\\xa1 ATENCI\\xd3N! \\u25c4'

Using

b'\\xa1 ATENCI\\xd3N! \\u25c4'.decode("unicode_escape")

will convert it back to '¡ ATENCIÓN! ◄'. This uses the actual (intended) representation of the characters, and not the \uXXXX escape sequences of the original string s.
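
Putting those two steps together, a minimal round trip (just to illustrate what the codec does):

s = "\u00A1 ATENCI\u00D3N! \u25C4"
escaped = s.encode("unicode_escape")     # b'\\xa1 ATENCI\\xd3N! \\u25c4'
print(escaped.decode("unicode_escape"))  # ¡ ATENCIÓN! ◄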

So, what you should do is not mess around with encoding and decoding things. Observe:

print("\u00A1 ATENCI\u00D3N! \u25C4" == '¡ ATENCIÓN! ◄')
True

That's all the comparison you need to do.

For further reading, you may be interested in:

  • How to work with surrogate pairs in Python?
  • Encodings and Unicode from the Python docs.

