text with unicode escape sequences to unicode in python
>>> print('test \\u0259'.decode('unicode-escape'))
test ə
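The session above relies on Python 2, where str has a decode method. Under Python 3 the same call raises AttributeError, so here is a minimal sketch of one equivalent. It assumes the text is pure ASCII apart from the escapes, and the variable names are only illustrative:

```python
# Python 3: go through bytes first, because only bytes objects have .decode().
escaped = 'test \\u0259'   # a literal backslash, 'u' and four hex digits
decoded = escaped.encode('ascii').decode('unicode_escape')
print(decoded)
```

If the input may already contain non-ASCII characters, encode('ascii') will fail and a different approach is needed.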
How do I convert unicode escape sequences to unicode characters in a python string
Assuming Python sees the name as a normal string, you'll first have to decode it to unicode:
>>> name
'Christensen Sk\xf6ld'
>>> unicode(name, 'latin-1')
u'Christensen Sk\xf6ld'
Another way of achieving this:
>>> name.decode('latin-1')
u'Christensen Sk\xf6ld'
Note the "u" in front of the string, signalling it is unicode. If you print this, the accented letter is shown properly:
>>> print name.decode('latin-1')
Christensen Sköld
BTW: when necessary, you can use the "encode" method to turn the unicode into e.g. a UTF-8 string:
>>> name.decode('latin-1').encode('utf-8')
'Christensen Sk\xc3\xb6ld'
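The sessions above are Python 2. In Python 3 the same idea applies, but the undecoded value is a bytes object and the decoded value is a plain str. A short sketch, assuming the raw data really is Latin-1 (the names raw and utf8 are illustrative):

```python
raw = b'Christensen Sk\xf6ld'    # bytes as read from a Latin-1 source

name = raw.decode('latin-1')     # bytes -> str (Unicode text)
print(name)

utf8 = name.encode('utf-8')      # str -> UTF-8 encoded bytes
print(utf8)
```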
How to escape unicode special chars in string and write it to UTF encoded file
Another solution, not relying on the built-in repr() but rather implementing it from scratch:
import re
orig = 'Bitte überprüfen Sie, ob die Dokumente erfolgreich in System eingereicht wurden, und löschen Sie dann die tatsächlichen Dokumente.'
# Replace every character outside printable ASCII (space through ~) with \uXXXX
enc = re.sub('[^ -~]', lambda m: '\\u%04X' % ord(m[0]), orig)
print(enc)
Differences:
- Encodes only using \u, never any other sequence, whereas repr() uses about a third of the alphabet (so for example the BEL character will be encoded as \u0007 rather than \a)
- Upper-case encoding, as specified (\u00FC rather than \u00fc)
- Does not handle unicode characters outside plane 0 (could be extended easily, given a spec for how those should be represented)
- It does not take care of any pre-existing \u sequences, whereas repr() turns those into \\u; could be extended, perhaps to encode \ as \u005C:
enc = re.sub(r'[^ -[\]-~]', lambda m: '\\u%04X' % ord(m[0]), orig)
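The answer leaves open how characters outside plane 0 should be represented. One possible extension, purely as an assumption, is to emit UTF-16 surrogate pairs the way Java and JSON do. The helper name escape_non_ascii is hypothetical:

```python
import re

def escape_non_ascii(text):
    """Escape everything outside printable ASCII as upper-case \\uXXXX.

    Assumption: code points above U+FFFF become a UTF-16 surrogate pair,
    as in Java or JSON. The original answer does not specify this.
    """
    def repl(match):
        cp = ord(match.group(0))
        if cp > 0xFFFF:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)    # leading surrogate
            low = 0xDC00 + (cp & 0x3FF)   # trailing surrogate
            return '\\u%04X\\u%04X' % (high, low)
        return '\\u%04X' % cp
    return re.sub(r'[^ -~]', repl, text)

print(escape_non_ascii('überprüfen'))
```

The result is plain ASCII, so it can be written to a file opened with encoding='utf-8' without further conversion.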
How to print out strings with unicode escape characters correctly
The \u00e0 in a Python string literal is stored as a single Unicode code point, which is why it prints as 'à'. When you read the same text from a file, no escape processing happens: it arrives as the six separate characters '\\u00e0' (a backslash, 'u' and four hex digits).
A solution is to find each '\\u00e0' in the string and replace it with the character '\u00e0'.
Here is some code that converts the '\\u00e0' in the string into the character it's supposed to be.
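def special_char_fix(string):
    string = list(string)
    for pl, char in enumerate(string):
        if char == '\\':
            # The four hex digits follow the backslash and the 'u'
            val = ''.join([string[pl + k + 2] for k in range(4)])
            # Drop the first five characters of the escape ...
            for k in range(5):
                string.pop(pl)
            # ... and overwrite the remaining one with the real character
            string[pl] = str(chr(int(val, 16)))
    return ''.join(string)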
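The function above assumes every backslash starts a well-formed \uXXXX sequence. As an alternative sketch (not the answer's own method), a regular expression can replace only the sequences that match that shape and leave every other backslash alone. The name unescape_u is hypothetical:

```python
import re

# A literal backslash, 'u', then exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_u(text):
    """Replace each literal \\uXXXX sequence with the character it names."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(unescape_u('voil\\u00e0'))
```

Backslashes that are not followed by u and four hex digits, such as the one in a literal '\\n', pass through unchanged.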
Convert a string into unicode escape sequences
The ord() function returns the Unicode code point of a character. Just format this as \u followed by a 4-digit hex representation of that.
def unicode_escape(s):
    return "".join(map(lambda c: rf"\u{ord(c):04x}", s))
print(unicode_escape("Hello, World!\n"))
# prints \u0048\u0065\u006c\u006c\u006f\u002c\u0020\u0057\u006f\u0072\u006c\u0064\u0021\u000a
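One caveat: for code points above U+FFFF the 04x format yields five or more hex digits, which no longer reads as a valid \u escape. A possible variant, assuming Python's own \UXXXXXXXX form is acceptable for those characters (the name unicode_escape_all is hypothetical):

```python
def unicode_escape_all(s):
    """Escape every character; use \\UXXXXXXXX above the Basic Multilingual Plane."""
    parts = []
    for c in s:
        cp = ord(c)
        if cp <= 0xFFFF:
            parts.append(f"\\u{cp:04x}")
        else:
            parts.append(f"\\U{cp:08x}")
    return "".join(parts)

escaped = unicode_escape_all("Hi \U0001F600")
print(escaped)

# The result round-trips through Python's unicode_escape codec.
restored = escaped.encode("ascii").decode("unicode_escape")
```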
encode unicode characters to unicode escape sequences
If you want to get Unicode escapes in Python similar to Java's, you could use the JSON format:
>>> import json
>>> import sys
>>> s = u'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A'
>>> json.dump(s, sys.stdout)
"\u00d6rnsk\u00f6ldsvik;SE;Ornskoldsvik;\u00c5ngermanlandsgatan 28 A"
There is also the unicode-escape codec, but you shouldn't use it for this: it produces Python-specific escaping (what Python Unicode string literals look like):
>>> print s.encode('unicode-escape')
\xd6rnsk\xf6ldsvik;SE;Ornskoldsvik;\xc5ngermanlandsgatan 28 A
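The sessions above are Python 2. A Python 3 sketch of the JSON approach uses json.dumps, whose ensure_ascii parameter defaults to True. Note that the result includes the surrounding double quotes:

```python
import json

s = 'Örnsköldsvik;SE;Ornskoldsvik;Ångermanlandsgatan 28 A'

escaped = json.dumps(s)          # ensure_ascii=True is the default
print(escaped)

restored = json.loads(escaped)   # turns the escapes back into characters
```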
How to print unicode escape sequence from unicode strings in python(3)?
>>> s = "नमस्ते"
>>> s.encode('utf-8')
b'\xe0\xa4\xa8\xe0\xa4\xae\xe0\xa4\xb8\xe0\xa5\x8d\xe0\xa4\xa4\xe0\xa5\x87'
>>> s.encode('unicode-escape')
b'\\u0928\\u092e\\u0938\\u094d\\u0924\\u0947'
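The encode call returns a bytes object, which is why the interpreter shows a b prefix and doubled backslashes. To print the escape sequences as ordinary text, one option is to decode those bytes as ASCII. A small sketch:

```python
s = "नमस्ते"

# unicode_escape output is always pure ASCII, so decoding it is safe.
escaped = s.encode("unicode_escape").decode("ascii")
print(escaped)
```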
How do I convert unicode to unicode-escaped text
You need to encode it again with the unicode-escape encoding.
>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'
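The one-line chain above is easier to follow when split into steps. This sketch only unpacks the same calls, and the intermediate names are illustrative:

```python
raw = rb'\xe9\x87\x8b'                 # 12 ASCII bytes: literal backslash escapes

step1 = raw.decode('unicode_escape')   # 3 characters: U+00E9, U+0087, U+008B
step2 = step1.encode('latin1')         # 3 bytes: the real UTF-8 encoding
step3 = step2.decode('utf-8')          # 1 character: U+91CB

result = step3.encode('unicode_escape')
print(result)
```

The latin1 step works because Latin-1 maps code points 0 to 255 one-to-one onto byte values.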
Modified code (uses binary mode to avoid unnecessary encode/decode steps):
with open("input.txt", "rb") as f:
    text = f.read().rstrip()  # rstrip to remove trailing whitespace
    decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))
http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq