Convert a Unicode String to a String in Python (Containing Extra Symbols)

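See unicodedata.normalize. NFKD normalization splits each accented character into a base letter plus combining marks, and encoding to ASCII with 'ignore' then discards the marks: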

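import unicodedata

title = u"Klüft skräms inför på fédéral électoral große"
# 'ß' has no decomposition, so it is dropped entirely ("große" becomes "groe").
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
# Python 2: 'Kluft skrams infor pa federal electoral groe'
# Python 3: b'Kluft skrams infor pa federal electoral groe'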
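
On Python 3 the call above returns bytes rather than str. A minimal sketch of a wrapper that returns str, assuming the goal is simply to drop diacritics (the name strip_accents is made up for illustration):

```python
import unicodedata

def strip_accents(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # NFKD decomposes e.g. 'é' into 'e' followed by a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # errors="ignore" silently drops everything non-ASCII, including
    # characters such as 'ß' that have no decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("Klüft skräms inför på fédéral électoral große"))
# Kluft skrams infor pa federal electoral groe
```

Because characters without a decomposition disappear, this is lossy; it suits search keys or slugs better than display text.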

Python: convert strings containing unicode code point back into normal characters

The question: can this string, and others like it, be converted back to the original strings with Unicode characters?

Yes. Suppose file.txt contains:

\u9001\u5206200000

Then read the file as bytes and decode it:

with open("file.txt", "rb") as f:
    content = f.read()
text = content.decode("unicode_escape")
print(text)

Output:

送分200000

To learn more, read the Text Encodings section of the documentation for the built-in codecs module.

How to convert unicode string into normal text in python

You can use the unicode-escape codec to interpret the literal \uXXXX sequences (the doubled backslashes) as the characters they stand for.

Assuming that title is a str, you need to encode it to bytes first and then decode it back to str with unicode-escape.

>>> t = title.encode('utf-8').decode('unicode-escape')
>>> t
'ისრაელი == იერუსალიმი'

If title is a bytes instance you can decode directly:

>>> t = title.decode('unicode-escape')
>>> t
'ისრაელი == იერუსალიმი'
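
One caveat: unicode-escape treats the bytes it is given as Latin-1, so the encode/decode round trip garbles any characters in the string that are already non-ASCII. A sketch of an alternative that rewrites only literal \uXXXX sequences and leaves everything else alone, assuming the escapes always have exactly four hex digits (the function name is illustrative):

```python
import re

def decode_u_escapes(text):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return re.sub(
        r"\\u([0-9a-fA-F]{4})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

# A real 'é' followed by two escaped Georgian letters.
mixed = "caf\u00e9 \\u10d8\\u10e1"
print(decode_u_escapes(mixed))  # café ის

# The codec round trip turns the real 'é' into two Latin-1 characters:
print(mixed.encode("utf-8").decode("unicode-escape"))  # cafÃ© ის
```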

Python: How to translate UTF8 String containing unicode decoded characters (Ok\u00c9 to Oké)

I've found the problem: the encoding/decoding was wrong. The text came in as Windows-1252.

I used

import chardet
chardet.detect(var3.encode())

to detect the proper encoding, and then did a

var3 = 'OK\u00c9'.encode('utf8').decode('Windows-1252').encode('utf8').decode('utf8')

conversion to eventually get it in the right format!
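
Note that the last two calls in that chain, .encode('utf8').decode('utf8'), cancel each other out. For the more common mojibake case, where UTF-8 bytes were wrongly decoded as Windows-1252 and the text shows up as something like OkÃ©, the repair runs in the opposite direction. A sketch under that assumption (the sample string is invented, not taken from the question):

```python
# What 'Oké' looks like after its UTF-8 bytes are read as Windows-1252.
garbled = "OkÃ©"

# Undo the wrong decode to recover the bytes, then decode them as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # Oké
```

"cp1252" is Python's codec name for Windows-1252; the round trip only works if the garbled text really came from that mix-up.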

Convert unicode to string in python

It's not clear if your input is bytes or a string. If it's a string, you can convert to bytes and decode with unicode-escape:

s = "\\u006A\\u0061\\u0064\\u0072\\u006F"

bytes(s, 'utf-8').decode('unicode-escape')
# 'jadro'

If it's already bytes, then just:

b = b"\\u006A\\u0061\\u0064\\u0072\\u006F"

b.decode('unicode-escape')
# 'jadro'

How to convert a unicode character \U0001d403 to Escape sequence in python?

If nothing else, you can build your own map for the string translation: for example:

>>> x = '\U0001d403'
>>> x
'𝐃'
>>> x.translate(str.maketrans({'\U0001d403': 'D'}))
'D'

maketrans can create a mapping of multiple characters, which can be saved to be reused as an argument for many calls to str.translate. Note also that str.translate works for arbitrary strings; the given map will be applied to each character separately.
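
As a sketch of such a reusable table, here is one covering a handful of Mathematical Bold capitals. The variable name and the choice of characters are illustrative only:

```python
# Map a few Mathematical Bold capitals (U+1D400 onwards) to plain ASCII.
bold_to_plain = str.maketrans({
    "\U0001d400": "A",
    "\U0001d401": "B",
    "\U0001d402": "C",
    "\U0001d403": "D",
})

# The same table can be passed to any number of translate() calls;
# characters that are not in the table pass through unchanged.
print("\U0001d401\U0001d400\U0001d403 idea".translate(bold_to_plain))  # BAD idea
```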

How to convert a string containing Unicode characters to UTF-8 in Python?

Add u as a prefix to the string s (needed on Python 2; optional on Python 3, where every str literal is already Unicode), then encode it as UTF-8.

Your code will look like this:

s = u'\u0628\u06cc\u0633\u06a9\u0648\u06cc\u062a'
s_encoded = s.encode('utf-8')
print(s_encoded)

I hope this helps.

Replacing Unicode Characters with actual symbols

Replace them all at once with re.sub:

import re

string = "testing<U+2019> <U+2014> <U+201C>testing<U+1F603>"

result = re.sub(r'<U\+([0-9a-fA-F]{4,6})>', lambda x: chr(int(x.group(1),16)), string)
print(result)

Output:

testing’ — “testing😃

The regular expression matches <U+hhhh> where hhhh can be 4-6 hexadecimal characters. Note that Unicode defines code points from U+0000 to U+10FFFF so this accounts for that. The lambda replacement function converts the string hhhh to an integer using base 16 and then converts that number to a Unicode character.
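
Six hex digits can also spell values above U+10FFFF, for which chr raises ValueError. A sketch of a variant that leaves such markers untouched, assuming that is the desired fallback (the function name is made up):

```python
import re

def expand_codepoints(text):
    """Turn <U+hhhh> markers into characters, skipping out-of-range ones."""
    def replace(match):
        code = int(match.group(1), 16)
        # chr() raises ValueError above U+10FFFF, so keep such markers as-is.
        return chr(code) if code <= 0x10FFFF else match.group(0)
    return re.sub(r"<U\+([0-9a-fA-F]{4,6})>", replace, text)

print(expand_codepoints("testing<U+2019> <U+FFFFFF>"))  # testing’ <U+FFFFFF>
```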

Unicode as String without conversion Python

Here's how to do it the hard way.

ascii_printable = set(unichr(i) for i in range(0x20, 0x7f))

def convert(ch):
    if ch in ascii_printable:
        return ch
    ix = ord(ch)
    if ix < 0x100:
        return '\\x%02x' % ix
    elif ix < 0x10000:
        return '\\u%04x' % ix
    return '\\U%08x' % ix

output = ''.join(convert(ch) for ch in input)

For Python 3 use chr instead of unichr.
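
On Python 3 the built-in unicode_escape codec produces the same \x, \u and \U forms. Unlike the function above, it also escapes backslashes and control characters such as newlines. A sketch, assuming that difference is acceptable:

```python
text = "é ი 😃 ok"

# Encoding yields ASCII-only bytes; decoding them gives back a str
# in which every non-ASCII character is written as an escape.
escaped = text.encode("unicode_escape").decode("ascii")
print(escaped)  # \xe9 \u10d8 \U0001f603 ok
```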


