How to Remove Non-ASCII Characters But Leave Periods and Spaces

How can I remove non-ASCII characters but leave periods and spaces?

You can drop every character that is not in string.printable (digits, ASCII letters, ASCII punctuation, and whitespace) with filter. On Python 2, where filter on a str returns a str, it looks like this:

>>> s = "some\x00string. with\x15 funny characters"
>>> import string
>>> printable = set(string.printable)
>>> filter(lambda x: x in printable, s)
'somestring. with funny characters'

string.printable on my machine contains:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

EDIT: On Python 3, filter returns an iterator rather than a string. The correct way to get a string back is:

''.join(filter(lambda x: x in printable, s))
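For reference, here is a self-contained Python 3 sketch of the same idea. The helper name keep_printable is made up for illustration. Note that string.printable holds only ASCII characters, so accented letters are dropped as well.

```python
import string

# A set gives constant-time membership tests; string.printable is ASCII only.
PRINTABLE = set(string.printable)

def keep_printable(s):
    """Return s without the characters that are missing from string.printable."""
    return ''.join(ch for ch in s if ch in PRINTABLE)

print(keep_printable("some\x00string. with\x15 funny characters"))
# somestring. with funny characters
```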

Replace non-ASCII characters with a single space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
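The difference between the two approaches is easiest to see side by side. This is a minimal sketch; the function names replace_each and replace_runs are hypothetical.

```python
import re

def replace_each(text):
    """Replace every non-ASCII character with its own space."""
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)

def replace_runs(text):
    """Replace each run of consecutive non-ASCII characters with one space."""
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ab\u00e9\u00e9cd'        # 'abéécd'
print(repr(replace_each(sample)))  # 'ab  cd' (two spaces)
print(repr(replace_runs(sample)))  # 'ab cd' (one space)
```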

How to efficiently remove non-ASCII characters and numbers, but keep accented ASCII characters

Here's a way that might help (Python 3.4):

import unicodedata
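def remove_nonlatin(s):
    # Keep only characters whose Unicode name starts with LATIN, DIGIT, or SPACE.
    # The '' default avoids a ValueError for unnamed characters such as '\n'.
    s = (ch for ch in s
         if unicodedata.name(ch, '').startswith(('LATIN', 'DIGIT', 'SPACE')))
    return ''.join(s)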

>>> s = 'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
>>> remove_nonlatin(s)
'awëerwqbròn 1990 23x4  23 we wes mexicqué'

This looks up the Unicode name of each character in the string and keeps the characters whose names start with LATIN, DIGIT, or SPACE.

For example, this would match:

>>> unicodedata.name('S')
'LATIN CAPITAL LETTER S'

And this would not:

>>> unicodedata.name('م')
'ARABIC LETTER MEEM'

I'm reasonably sure that all Latin characters have Unicode names starting with 'LATIN', so this should filter out other writing systems while keeping digits and spaces. There's no convenient one-liner for punctuation, so in this example exclamation points and the like are filtered out too.
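If ASCII punctuation should survive as well, one option is to allow it explicitly through string.punctuation. This is only a sketch under that assumption; the function name remove_nonlatin_keep_punct is invented here, and non-ASCII punctuation such as curly quotes is still removed.

```python
import string
import unicodedata

# ASCII punctuation only: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
ASCII_PUNCTUATION = set(string.punctuation)

def remove_nonlatin_keep_punct(s):
    """Keep Latin letters, digits, spaces, and ASCII punctuation."""
    return ''.join(
        ch for ch in s
        if ch in ASCII_PUNCTUATION
        or unicodedata.name(ch, '').startswith(('LATIN', 'DIGIT', 'SPACE'))
    )

print(repr(remove_nonlatin_keep_punct('h\u00e9llo, w\u00f6rld! \u0645')))
```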

You could presumably filter by code point by using something like ord(c) < 0x250, though you may get some things that you aren't expecting. Or, you could try filtering by unicodedata.category. However, the 'letter' category includes letters from a lot of scripts, so you will still end up with some of these: 'م'.
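A quick sketch of both alternatives shows the kind of surprises mentioned above. The function names are hypothetical; the sample string is chosen so that each filter lets through something the name-based filter would reject.

```python
import unicodedata

def keep_below_0x250(s):
    """Keep characters whose code point is below U+0250."""
    return ''.join(ch for ch in s if ord(ch) < 0x250)

def keep_letters(s):
    """Keep characters whose general category is a letter (L*)."""
    return ''.join(ch for ch in s if unicodedata.category(ch).startswith('L'))

sample = 'na\u00efve \u00bd \u0645'    # 'naïve ½ م'
print(repr(keep_below_0x250(sample)))  # the fraction ½ (U+00BD) survives
print(repr(keep_letters(sample)))      # the Arabic letter survives
```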

Strip out invalid and non-ASCII characters from my string in Python

This is not a very general solution, but it might work for your input:

''.join(i for i in text.split() if '<0x' not in i)  # '<phone_number><![CDATA[0145236243]]></phone_number>'

Using regex

re.sub(r'(<0x\w*>)|\s', '', text)  # '<phone_number><![CDATA[0145236243]]></phone_number>'
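To try both variants, some input is needed. The original input is not shown here, so the value of text below is an assumption: an XML fragment with a whitespace-separated <0x..> marker in it.

```python
import re

# Hypothetical input: the asker's real string is not shown, so this is assumed.
text = '<phone_number><![CDATA[0145236243]]> <0x0C> </phone_number>'

# Variant 1: drop every whitespace-separated token that contains '<0x'.
by_split = ''.join(tok for tok in text.split() if '<0x' not in tok)

# Variant 2: remove the <0x..> markers themselves, plus all whitespace.
by_regex = re.sub(r'(<0x\w*>)|\s', '', text)

print(by_split)
print(by_regex)
```

The two variants agree on this input, but they are not equivalent in general: the split version throws away the whole token around a marker, while the regex removes only the marker.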

Python: removing specific non-ASCII characters from a string

'[x+]4 gur Id lú gal sik-kát ⌈ x x ⌉ [……………]'.encode().decode('ascii', errors='ignore')

out:

'[x+]4 gur Id l gal sik-kt  x x  []'

Use encode to convert the string to bytes, then decode the bytes as ASCII and ignore the errors.
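A slightly more direct variant encodes straight to ASCII and ignores the errors at that step instead. The wrapper name strip_non_ascii is made up for this sketch.

```python
def strip_non_ascii(s):
    """Drop every character that cannot be encoded as ASCII."""
    return s.encode('ascii', errors='ignore').decode('ascii')

print(strip_non_ascii('l\u00fa gal sik-k\u00e1t'))  # l gal sik-kt
```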

I think you should use re.sub :

import re

text = "[x+]4 gur Id lú gal sik-kát ⌈ x x ⌉ [……………]"

re.sub('[…⌉⌈]', '', text)  # removes every character listed inside the [] class

out:

'[x+]4 gur Id lú gal sik-kát  x x  []'
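When the set of unwanted characters is fixed, str.translate is an alternative to the regular expression. This is a different technique from the re.sub call above, shown only as a sketch; the names UNWANTED and drop_unwanted are hypothetical.

```python
# Characters to delete; extend this string as needed.
UNWANTED = '…⌉⌈'

# str.maketrans with three arguments maps every character in the third to None.
DELETE_TABLE = str.maketrans('', '', UNWANTED)

def drop_unwanted(text):
    """Remove only the characters listed in UNWANTED, keeping accented letters."""
    return text.translate(DELETE_TABLE)

text = "[x+]4 gur Id lú gal sik-kát ⌈ x x ⌉ [……………]"
print(drop_unwanted(text))
```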

How to remove non-ASCII characters (non-keyboard special characters) from text in Hive

You can use

regexp_replace('123Abh¿½ï¿½ï¿½ï¿½ï¿½v streeÁÉÍÓt', '[^\\x{0000}-\\x7E]+', '')

Here,

  • [^ - start of a negated character class that matches any chars but
    • \x{0000}-\x7E - chars from NULL to ~ char in the ASCII table
  • ]+ - end of the class, match one or more times.

What if I need to remove all special characters apart from spaces and hyphens? - In this case, you need to use

regexp_replace('123Abh¿½ï¿½ï¿½ï¿½ï¿½v streeÁÉÍÓt', '[^\\w\\s-]|_', '')

Here, [^\w\s-]|_ matches any single character other than a letter, digit, underscore, whitespace, or hyphen, or else an underscore (\w matches underscores, so the underscore has to be added back through the | alternation operator).
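The two patterns can be tried outside Hive with Python's re module, with two adjustments that are assumptions of this sketch: Python does not accept the \x{0000} form, so \x00 is used instead, and re.ASCII is passed so that \w and \s match only ASCII characters, which is how Java regular expressions (used by Hive) behave by default.

```python
import re

def remove_non_ascii(s):
    """Python analogue of the first Hive pattern: drop runs outside NUL..~."""
    return re.sub(r'[^\x00-\x7E]+', '', s)

def keep_word_space_hyphen(s):
    """Python analogue of the second pattern: keep ASCII letters, digits,
    whitespace, and hyphens; drop everything else, including underscores."""
    return re.sub(r'[^\w\s-]|_', '', s, flags=re.ASCII)

print(remove_non_ascii('123Abh\u00bf\u00bdv stree\u00c1\u00c9t'))  # 123Abhv streeet
print(keep_word_space_hyphen('a_b-c d!\u00e9'))                    # ab-c d
```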

How to keep only alphanumeric and space, and also ignore non-ASCII?

re.sub(r'[^A-Za-z0-9 ]+', '', s)

(Edit) To clarify:
The [] defines a character class. The ^ negates it. A-Za-z is the English alphabet, 0-9 is the digits, and the trailing space is a literal space. The + matches one or more such characters in a row, so every run of characters that are not A-Z, a-z, 0-9, or space is replaced with the empty string.
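As a runnable sketch (the function name keep_alnum_space is hypothetical), note that accented letters are removed along with the punctuation, because they are outside A-Z and a-z:

```python
import re

def keep_alnum_space(s):
    """Keep only ASCII letters, ASCII digits, and the space character."""
    return re.sub(r'[^A-Za-z0-9 ]+', '', s)

print(keep_alnum_space('H\u00e9llo, World! 123'))  # Hllo World 123
```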


