Removing Unicode \U2026 Like Characters in a String in Python2.7

Python 2.x

>>> s
'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
>>> print(s.decode('unicode_escape').encode('ascii','ignore'))
This is some  text that has to be cleaned! it's annoying!

Python 3.x

>>> s = 'This is some \u03c0 text that has to be cleaned\u2026! it\u0027s annoying!'
>>> s.encode('ascii', 'ignore')
b"This is some  text that has to be cleaned! it's annoying!"
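Note that the Python 3 result above is a bytes object. If you need a str back, one option is to decode the result again. A minimal sketch, reusing the same sample string (the name cleaned is just illustrative):

```python
s = 'This is some \u03c0 text that has to be cleaned\u2026! it\u0027s annoying!'

# encode(..., 'ignore') drops every character that has no ASCII encoding
# and returns bytes; decode('ascii') turns those bytes back into a str.
cleaned = s.encode('ascii', 'ignore').decode('ascii')

print(cleaned)
```

The removed characters leave nothing behind, so the two spaces that surrounded the pi character end up next to each other.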

Replace non-ASCII characters with a single space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

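Note the + there: it makes the pattern match a whole run of consecutive non-ASCII characters, so each run is replaced by a single space.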
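To see the difference between the two approaches, here is a small comparison on a made-up sample string (text, per_char and per_run are illustrative names, not from the question):

```python
import re

# One accented letter followed by two ellipsis characters: a run of
# three consecutive non-ASCII characters.
text = 'caf\u00e9\u2026\u2026 ok'

# Character by character: every non-ASCII character becomes its own space.
per_char = ''.join(c if ord(c) < 128 else ' ' for c in text)

# Regex with +: the whole run of non-ASCII characters becomes one space.
per_run = re.sub(r'[^\x00-\x7F]+', ' ', text)

print(repr(per_char))
print(repr(per_run))
```

In both cases the ordinary space that was already in the text is kept, so the regex version still ends up with two spaces between the words here.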

Special Unicode Characters are not removed in Python 3

You can remove the characters with str.strip. Note that strip only removes them from the start and end of each string, not from the middle.

>>> keys=['\u202cABCD', '\u202cXYZ\u202c']
>>> for key in keys:
...     print(key)
... 
ABCD
XYZ‬
>>> newkeys=[key.strip('\u202c') for key in keys]
>>> print(keys)
['\u202cABCD', '\u202cXYZ\u202c']
>>> print(newkeys)
['ABCD', 'XYZ']
>>>

I tried one of your methods, and it does work for me:

>>> keys
['\u202cABCD', '\u202cXYZ\u202c']
>>> newkeys=[]
>>> for key in keys:
...     newkeys += [key.replace('\u202c', '')]
... 
>>> newkeys
['ABCD', 'XYZ']
>>>
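If the invisible character can also appear in the middle of a key, strip is not enough. The sketch below compares strip and replace, and adds a more general filter based on the Unicode category. The third key and the helper drop_format_chars are assumptions added for illustration; U+202C belongs to category Cf (format).

```python
import unicodedata

# The last key has the mark embedded in the middle (hypothetical input).
keys = ['\u202cABCD', '\u202cXYZ\u202c', 'AB\u202cCD']

# strip() only touches the ends, so the embedded mark survives.
stripped = [k.strip('\u202c') for k in keys]

# replace() removes the character wherever it occurs.
replaced = [k.replace('\u202c', '') for k in keys]

def drop_format_chars(text):
    """Remove every character whose Unicode category is Cf (format)."""
    return ''.join(c for c in text if unicodedata.category(c) != 'Cf')

no_format = [drop_format_chars(k) for k in keys]

print(stripped)
print(replaced)
print(no_format)
```

The category filter is broader than replace: it would also drop other format characters such as zero-width joiners, which may or may not be what you want.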

Remove unicode characters python

In [10]: from unicodedata import normalize

In [11]: out_text = normalize('NFKD', input_text).encode('ascii','ignore')

Try this.

Edit

normalize(form, unistr) returns the normal form form for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'. If you want to know more about NFKD, see the reference at the end of this answer.

In [12]: u = unichr(40960) + u'abcd' + unichr(1972)
In [13]: u.encode('utf-8')
Out[13]: '\xea\x80\x80abcd\xde\xb4'
In [14]: u
Out[14]: u'\ua000abcd\u07b4'
In [16]: u.encode('ascii', 'ignore')
Out[16]: 'abcd'

The code above shows what encode('ascii', 'ignore') does: every character that cannot be encoded as ASCII is silently dropped.

Ref : https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
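The session above is Python 2 (unichr, byte-string output). As a rough Python 3 sketch of why the NFKD step matters, compare the result with and without normalization on a made-up sample (input_text here is an assumed value, not from the question):

```python
from unicodedata import normalize

input_text = 'Caf\u00e9 na\u00efve \u2026 \u03c0'

# Without normalization the accented letters are dropped entirely.
plain = input_text.encode('ascii', 'ignore').decode('ascii')

# NFKD splits accented letters into base letter + combining mark and
# expands compatibility characters (the ellipsis becomes three dots),
# so only the combining marks and truly non-ASCII characters are lost.
out_text = normalize('NFKD', input_text).encode('ascii', 'ignore').decode('ascii')

print(repr(plain))
print(repr(out_text))
```

Characters with no ASCII decomposition, such as the Greek pi at the end, are still removed in both cases.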

Removing Unicode \uxxxx in String from JSON Using Regex

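While the data is still raw text from the file, \u2019 is a literal escape sequence: a backslash, a u and four hex digits. Once json has loaded it, that sequence becomes a single Unicode character, so a regex that looks for the escape sequence no longer matches.

So you have to apply your regex before loading the line into json, and then it works: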

tweet = json.loads(removeunicode(line))

Of course, this processes the entire raw line. You can also remove non-ASCII characters from the decoded text by checking the character code like this (note that it is not strictly equivalent):

 text = "".join([x for x in tweet['text'] if ord(x)<128])
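Here is a self-contained sketch of both routes. The question's removeunicode function is not shown, so the version below is only a guess at what it might look like, and the sample line is made up:

```python
import json
import re

def removeunicode(raw_line):
    # Hypothetical stand-in for the asker's function: delete literal
    # \uXXXX escape sequences from the raw, not yet parsed, JSON text.
    return re.sub(r'\\u[0-9a-fA-F]{4}', '', raw_line)

# Raw string: the backslashes are literal, as they would be in the file.
line = r'{"text": "it\u2019s a caf\u00e9"}'

# Route 1: clean the raw line, then parse it.
tweet = json.loads(removeunicode(line))

# Route 2: parse first, then filter the decoded text by code point.
decoded = json.loads(line)
filtered = ''.join(x for x in decoded['text'] if ord(x) < 128)

print(tweet['text'])
print(filtered)
```

The two routes agree on this sample, but not in general: the regex also deletes escapes that stand for ASCII characters (for example \u0027), and it leaves alone any non-ASCII characters that appear unescaped in the file.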

Set of hidden unicode characters in a string

The character \x7f is the ASCII control character DEL, which explains why your attempts did not work. To remove all "special" ASCII characters, decode the bytes (see the bytes.decode documentation) and keep only the printable characters:

import string
a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
print(repr(a))
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if i in string.printable)))

Or use this if you don't want to import string:

a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
print(repr(a))
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if 31 < ord(i) < 127 or i in '\r\n')))
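One thing to be aware of is that the two filters above are not quite the same: string.printable also contains tab, vertical tab and form feed, while the numeric range check only lets \r and \n through. A small sketch on an invented sample string shows the difference:

```python
import string

# Sample containing a tab, a DEL, a vertical tab and a newline.
sample = 'tab\there\x7f\x0bend\n'

# string.printable keeps all ASCII whitespace, including \t and \x0b.
with_printable = ''.join(c for c in sample if c in string.printable)

# The range check keeps only codes 32..126 plus \r and \n.
with_range = ''.join(c for c in sample if 31 < ord(c) < 127 or c in '\r\n')

print(repr(with_printable))
print(repr(with_range))
```

Both versions drop the DEL character; pick whichever set of whitespace you actually want to keep.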

Strip special characters from string, retain, alphabets, numbers and punctuation marks

Solved with the following (Python 2):

print (newstring.decode('unicode_escape').encode('ascii','ignore'))

Output:

Q18. On a scale from 0 to 10 where 0 means not at all interested' and 10 means very interested', how interested are you in helping to address problems that affect poor people in poor countries?
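The line above relies on Python 2, where str has a decode method. In Python 3 a rough equivalent could look like the sketch below. It assumes the input is a str holding literal backslash escapes and nothing but ASCII (otherwise the first encode raises UnicodeEncodeError); the sample value of newstring is made up.

```python
# Literal backslashes followed by u2019 / u2026, as read from a file.
newstring = 'it\\u2019s annoying\\u2026!'

# Step 1: turn the literal \uXXXX escapes into real characters.
decoded = newstring.encode('ascii').decode('unicode_escape')

# Step 2: drop everything that is not ASCII and go back to str.
cleaned = decoded.encode('ascii', 'ignore').decode('ascii')

print(cleaned)
```

As in the output above, typographic quotes are removed rather than replaced with a plain apostrophe.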
