Removing unicode \u2026 like characters in a string in python2.7
Python 2.x
>>> s
'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
>>> print(s.decode('unicode_escape').encode('ascii','ignore'))
This is some  text that has to be cleaned! it's annoying!
Python 3.x
>>> s = 'This is some \u03c0 text that has to be cleaned\u2026! it\u0027s annoying!'
>>> s.encode('ascii', 'ignore')
b"This is some  text that has to be cleaned! it's annoying!"
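In Python 3, str has no decode method, so the Python 2 approach needs a round trip through bytes when the string contains literal backslash escapes rather than real Unicode characters. A minimal sketch, assuming the input holds only ASCII characters plus \uXXXX escape sequences (the function name strip_escaped_unicode is hypothetical, not from the original answer):

```python
def strip_escaped_unicode(text):
    # Assumption: text is pure ASCII with literal backslash-u escapes in it.
    # Step 1: turn the escapes into real characters.
    decoded = text.encode('ascii').decode('unicode_escape')
    # Step 2: drop everything that is not ASCII and return a str again.
    return decoded.encode('ascii', 'ignore').decode('ascii')

s = 'This is some \\u03c0 text that has to be cleaned\\u2026! it\\u0027s annoying!'
cleaned = strip_escaped_unicode(s)
print(cleaned)
```

Note that the spaces on both sides of the removed character are kept, so the result contains a double space after "some".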
Replace non-ASCII characters with a single space
Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)
Note the + there.
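The difference between the two approaches can be seen side by side. This is only a sketch; the function names replace_each and replace_runs are made up for illustration:

```python
import re

def replace_each(text):
    # One space per non-ASCII character.
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)

def replace_runs(text):
    # One space per run of consecutive non-ASCII characters.
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'a\u00e9\u00e9b'  # two non-ASCII characters in a row
print(repr(replace_each(sample)))  # 'a  b'
print(repr(replace_runs(sample)))  # 'a b'
```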
Special Unicode Characters are not removed in Python 3
You can just strip the characters from both ends of each string using strip.
>>> keys=['\u202cABCD', '\u202cXYZ\u202c']
>>> for key in keys:
...     print(key)
...
ABCD
XYZ
>>> newkeys=[key.strip('\u202c') for key in keys]
>>> print(keys)
['\u202cABCD', '\u202cXYZ\u202c']
>>> print(newkeys)
['ABCD', 'XYZ']
>>>
I tried one of your methods, and it does work for me:
>>> keys
['\u202cABCD', '\u202cXYZ\u202c']
>>> newkeys=[]
>>> for key in keys:
...     newkeys += [key.replace('\u202c', '')]
...
>>> newkeys
['ABCD', 'XYZ']
>>>
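The two answers behave differently when the character also appears in the middle of a string: strip only trims the ends, while replace removes every occurrence. If other invisible formatting characters might be present too, one possible generalization is to filter on the Unicode category. This is a sketch, and remove_format_chars is a hypothetical helper, not something from the question:

```python
import unicodedata

def remove_format_chars(text):
    # Drop every character in Unicode category Cf (format), wherever it occurs.
    # U+202C (POP DIRECTIONAL FORMATTING) belongs to that category.
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')

key = '\u202cAB\u202cCD\u202c'
print(ascii(key.strip('\u202c')))        # 'AB\u202cCD'
print(ascii(key.replace('\u202c', '')))  # 'ABCD'
print(ascii(remove_format_chars(key)))   # 'ABCD'
```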
Remove unicode characters python
In [10]: from unicodedata import normalize
In [11]: out_text = normalize('NFKD', input_text).encode('ascii','ignore')
Try this.
Edit
normalize(form, unistr) returns the normal form form for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'. If you want to know more about NFKD, see the reference at the end of this answer.
In [12]: u = unichr(40960) + u'abcd' + unichr(1972)
In [13]: u.encode('utf-8')
Out[13]: '\xea\x80\x80abcd\xde\xb4'
In [14]: u
Out[14]: u'\ua000abcd\u07b4'
In [16]: u.encode('ascii', 'ignore')
Out[16]: 'abcd'
The code above shows what encode('ascii','ignore') does: characters that cannot be represented in ASCII are silently dropped.
Ref : https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
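The session above is Python 2 (unichr, u'' literals). A rough Python 3 equivalent, showing why normalizing first matters: NFKD splits an accented letter into a base letter plus a combining mark, so the base letter survives the ASCII encoding. The helper name to_ascii is an assumption for illustration:

```python
from unicodedata import normalize

def to_ascii(text):
    # NFKD decomposes accented letters into base letter + combining mark;
    # the combining mark is non-ASCII and is dropped by 'ignore'.
    return normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

word = 'caf\u00e9'
print(to_ascii(word))                                   # cafe
print(word.encode('ascii', 'ignore').decode('ascii'))   # caf
```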
Removing Unicode \uxxxx in String from JSON Using Regex
When the data is in a text file, \u2019 is a literal sequence of characters in the string. But once loaded with json, it becomes a single Unicode character, and the replacement doesn't work anymore.
So you have to apply your regex before loading into json, and then it works:
tweet = json.loads(removeunicode(line))
Of course, this processes the entire raw line. You can also remove non-ASCII characters from the decoded text by checking each character code like this (note that it is not strictly equivalent):
text = "".join([x for x in tweet['text'] if ord(x)<128])
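To make the "not strictly equivalent" remark concrete, here is a small sketch of both orders of operation. The function remove_unicode_escapes is a guess at what the asker's removeunicode might look like (its body is not shown in the question), and the sample lines are invented:

```python
import json
import re

def remove_unicode_escapes(raw_line):
    # Hypothetical stand-in for removeunicode: delete literal \uXXXX
    # sequences from the raw text, before any JSON decoding happens.
    return re.sub(r'\\u[0-9a-fA-F]{4}', '', raw_line)

def keep_ascii(text):
    # Filter the decoded text by character code instead.
    return ''.join(ch for ch in text if ord(ch) < 128)

line = '{"text": "it\\u2019s fine"}'
print(json.loads(remove_unicode_escapes(line))['text'])  # its fine
print(keep_ascii(json.loads(line)['text']))              # its fine

# The two differ when an escape encodes an ASCII character (\u0041 is "A"):
line2 = '{"text": "\\u0041BC"}'
print(json.loads(remove_unicode_escapes(line2))['text'])  # BC
print(keep_ascii(json.loads(line2)['text']))              # ABC
```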
Set of hidden unicode characters in a string
The character \x7f is the ASCII character DEL, which explains why your attempts did not work. To remove all "special" ASCII characters, use this code (see the bytes.decode documentation for the decoding step):
import string
a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
print(repr(a))
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if i in string.printable)))
or this if you don't want to import string:
a = b'There is a set of hidden character here =>\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f\x7f <= but i will get printed in console'
print(repr(a))
print(repr(''.join(i for i in a.decode('ascii', 'ignore') if 31 < ord(i) < 127 or i in '\r\n')))
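Another way to express the same filter is a translation table passed to str.translate, which avoids the per-character generator. This is a sketch under the same assumption as the second variant above (drop codes 0-31 and 127, keep \r and \n); the names CONTROL_CHARS and remove_control_chars are made up:

```python
# Map each ASCII control character (0-31 and 127) to None so that
# str.translate deletes it; \r and \n are left out of the table and kept.
CONTROL_CHARS = {code: None for code in list(range(32)) + [127]
                 if chr(code) not in '\r\n'}

def remove_control_chars(text):
    return text.translate(CONTROL_CHARS)

raw = b'hidden =>\x7f\x7f\x7f<= shown\n'
result = remove_control_chars(raw.decode('ascii', 'ignore'))
print(repr(result))  # 'hidden =><= shown\n'
```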
Strip special characters from string, retain, alphabets, numbers and punctuation marks
Solved,
print (newstring.decode('unicode_escape').encode('ascii','ignore'))
Output:
Q18. On a scale from 0 to 10 where 0 means not at all interested' and 10 means very interested', how interested are you in helping to address problems that affect poor people in poor countries?