Text with Unicode Escape Sequences to Unicode in Python

text with unicode escape sequences to unicode in python

>>> print('test \\u0259'.decode('unicode-escape'))
test ə
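
The call above relies on Python 2, where a plain string has a decode method. In Python 3 the same codec can be reached by converting to bytes first. The following is a minimal sketch, not a definitive recipe: the helper name decode_escapes is hypothetical, and the 'backslashreplace' step is an assumption added so that input which already contains non-ASCII characters survives the round trip.

```python
def decode_escapes(text):
    """Turn literal \\uXXXX sequences in a str into the characters they name."""
    # Non-ASCII characters are first rewritten as escapes themselves,
    # so the whole string becomes pure ASCII bytes.
    as_bytes = text.encode('ascii', 'backslashreplace')
    # The unicode_escape codec then interprets every escape sequence.
    return as_bytes.decode('unicode_escape')


print(decode_escapes('test \\u0259'))
```

Note that this interprets every backslash escape in the input, not only \u sequences, so it is best suited to text where all backslashes are meant as escapes.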

How do I convert unicode escape sequences to unicode characters in a python string

Assuming Python 2 sees the name as a normal (byte) string, you'll first have to decode it to unicode:

>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'

Another way of achieving this:

>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'

Note the "u" in front of the string, signalling that it is unicode. If you print this, the accented letter is shown properly:

>>> print name.decode('latin-1')
Christensen Sköld

BTW: when necessary, you can use the "encode" method to turn the unicode into e.g. a UTF-8 string:

>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
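
The session above is Python 2. As a rough Python 3 equivalent, assuming the name arrives as Latin-1 encoded bytes (the variable names text and utf8_bytes are only illustrative), the same two steps could look like this:

```python
# Latin-1 encoded bytes, as they might arrive from a file or socket.
name = b'Christensen Sk\xf6ld'

# Decode the bytes into a str (Python 3 strings are always Unicode).
text = name.decode('latin-1')
print(text)

# Encode the str back into bytes, this time as UTF-8.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```

In Python 3 there is no "u" prefix in the repr, because every str is already a Unicode string.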

How to escape unicode special chars in string and write it to UTF encoded file

Another solution, not relying on the built-in repr() but rather implementing it from scratch:

import re

orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'

enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)

print(enc)

Differences:

  • Encodes only using \u, never any other sequence, whereas repr() uses about a third of the alphabet (so for example the BEL character will be encoded as \u0007 rather than \a)
  • Upper-case encoding, as specified (\u00FC rather than \u00fc)
  • Does not handle unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
  • It does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:
    enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
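
One possible extension for characters outside plane 0 is sketched below. The answer leaves the representation open, so the choice here is purely an assumption: each such character is written as a UTF-16 surrogate pair, the convention used by Java and JSON. The function name escape_non_ascii is hypothetical.

```python
import re


def escape_non_ascii(text):
    """Escape everything outside printable ASCII as upper-case \\uXXXX."""
    def repl(match):
        cp = ord(match.group(0))
        if cp > 0xFFFF:
            # Assumption: split into a UTF-16 surrogate pair (Java/JSON style).
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            return '\\u%04X\\u%04X' % (high, low)
        return '\\u%04X' % cp

    # Same character class as the original: anything not between space and ~.
    return re.sub(r'[^ -~]', repl, text)


print(escape_non_ascii('Bitte überprüfen'))
```

Because the result is plain ASCII, it can be written to a file opened with encoding='utf-8' (or any ASCII-compatible encoding) without further conversion.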

How to print out strings with unicode escape characters correctly

In a Python string literal, \u00e0 is an escape sequence that Python stores as a single code point, so it prints as 'à'. When the same text is read from a file, it arrives as six separate characters (a backslash, 'u', and four hex digits), which is what the literal '\\u00e0' represents.
A solution is to find each such '\\u00e0' sequence in the string and replace it with the single character '\u00e0'.

Here is some code that will convert the '\\u00e0' in the string into the character it's supposed to be.

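def special_char_fix(string):
    # Work on a list so that characters can be removed and replaced in place.
    string = list(string)
    for pl, char in enumerate(string):
        # Only treat a backslash as an escape when 'u' and four more characters follow.
        if char == '\\' and len(string) >= pl + 6 and string[pl + 1] == 'u':
            # The four hex digits after '\u'.
            val = ''.join([string[pl + k + 2] for k in range(4)])
            # Drop the backslash, the 'u', and the first three hex digits ...
            for k in range(5):
                string.pop(pl)
            # ... then overwrite the last hex digit with the decoded character.
            string[pl] = chr(int(val, 16))
    return ''.join(string)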
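
As an alternative to walking the list by hand, the same replacement can be sketched with a regular expression. This is only one possible approach; the function name replace_u_escapes is hypothetical, and it assumes every escape has exactly four hex digits.

```python
import re


def replace_u_escapes(text):
    """Replace each literal \\uXXXX (four hex digits) with its character."""
    return re.sub(
        r'\\u([0-9a-fA-F]{4})',
        # group(1) holds the four hex digits; convert them to a code point.
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


print(replace_u_escapes('voil\\u00e0'))
```

Backslashes that are not followed by 'u' and four hex digits are left untouched.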

Convert a string into unicode escape sequences

The ord() function returns the Unicode code point of a character. Just format this as \u followed by a 4-digit hex representation of that.

def unicode_escape(s):
    return "".join(map(lambda c: rf"\u{ord(c):04x}", s))
print(unicode_escape("Hello, World!\n"))
# prints \u0048\u0065\u006c\u006c\u006f\u002c\u0020\u0057\u006f\u0072\u006c\u0064\u0021\u000a
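
For code points above U+FFFF, a 4-digit field is too narrow, so the formatted number simply grows to five or more digits. A hedged variant is sketched below: it assumes Python's own literal convention of \U followed by eight hex digits for those characters, and the name unicode_escape_wide is hypothetical.

```python
def unicode_escape_wide(s):
    """Escape every character; use \\UXXXXXXXX beyond the 16-bit range."""
    return "".join(
        rf"\u{ord(c):04x}" if ord(c) <= 0xFFFF else rf"\U{ord(c):08x}"
        for c in s
    )


print(unicode_escape_wide("A\U0001F600"))
```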

encode unicode characters to unicode escape sequences

If you want Java-style Unicode escapes in Python, you could use the JSON format:

>>> import json
>>> import sys
>>> s = u'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A'
>>> json.dump(s, sys.stdout)
"\u00d6rnsk\u00f6ldsvik;SE;Ornskoldsvik;\u00c5ngermanlandsgatan 28 A"

There is also the unicode-escape codec, but you shouldn't use it here: it produces Python-specific escaping (the way Python Unicode string literals look):

>>> print s.encode('unicode-escape')
\xd6rnsk\xf6ldsvik;SE;Ornskoldsvik;\xc5ngermanlandsgatan 28 A
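
Both sessions above are Python 2. A minimal Python 3 sketch of the JSON route follows, relying on json.dumps escaping non-ASCII characters by default (ensure_ascii=True); the variable names escaped and restored are only illustrative.

```python
import json

s = 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A'

# json.dumps escapes every non-ASCII character as \uXXXX by default
# and wraps the result in double quotes.
escaped = json.dumps(s)
print(escaped)

# json.loads reverses the escaping.
restored = json.loads(escaped)
print(restored)
```

Strip the surrounding double quotes (for example with escaped[1:-1]) if only the escaped text is wanted, keeping in mind that JSON also escapes double quotes and backslashes inside the string.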

How to print unicode escape sequence from unicode strings in python(3)?

>>> s = "नमस्ते"
>>> s.encode('utf-8')
b'\xe0\xa4\xa8\xe0\xa4\xae\xe0\xa4\xb8\xe0\xa5\x8d\xe0\xa4\xa4\xe0\xa5\x87'
>>> s.encode('unicode-escape')
b'\\u0928\\u092e\\u0938\\u094d\\u0924\\u0947'
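
The result of encode is a bytes object, which is why the backslashes appear doubled in its repr. To print the escape sequences as plain text, one option is to decode those bytes as ASCII; the variable name escaped is only illustrative.

```python
s = "नमस्ते"

# unicode-escape output is pure ASCII, so decoding it as ASCII is safe.
escaped = s.encode('unicode-escape').decode('ascii')

print(escaped)
```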

How do I convert unicode to unicode-escaped text

You need to encode it again with unicode-escape encoding.

>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'
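
The chained call above is easier to follow when split into its individual steps. The sketch below does that with the same input; the names raw and step1 to step4 are only illustrative.

```python
# Twelve ASCII bytes: the literal text \xe9\x87\x8b, not three raw bytes.
raw = rb'\xe9\x87\x8b'

# 1. Interpret the \x escapes: a str of three characters, U+00E9 U+0087 U+008B.
step1 = raw.decode('unicode-escape')

# 2. Latin-1 maps code points 0-255 straight to bytes: three real bytes.
step2 = step1.encode('latin1')

# 3. Those three bytes are the UTF-8 encoding of a single character.
step3 = step2.decode('utf-8')

# 4. Escape that character again, this time as one \u sequence.
step4 = step3.encode('unicode-escape')

print(step3, step4)
```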

Modified code (binary mode is used to avoid unnecessary encode/decode steps):

with open("input.txt", "rb") as f:
    text = f.read().rstrip()  # rstrip to remove trailing whitespace
    decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))

http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq


