Replace Non-Ascii Characters with a Single Space

Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:

return ''.join([i if ord(i) < 128 else ' ' for i in text])

This handles characters one by one and would still use one space per character replaced.

Your regular expression should just replace consecutive non-ASCII characters with a space:

re.sub(r'[^\x00-\x7F]+',' ', text)

Note the + there.
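To make the difference concrete, here is a minimal sketch that runs both variants on the same input. The function names `replace_each` and `replace_runs` and the sample string are illustrative, not part of the original question.

```python
import re


def replace_each(text):
    # One space per non-ASCII character (the conditional-expression variant).
    return "".join(ch if ord(ch) < 128 else " " for ch in text)


def replace_runs(text):
    # One space per run of consecutive non-ASCII characters (note the +).
    return re.sub(r"[^\x00-\x7F]+", " ", text)


sample = "ABC\u00e9\u00e8\u00eaDEF"  # three accented letters between ABC and DEF
print(repr(replace_each(sample)))  # 'ABC   DEF'
print(repr(replace_runs(sample)))  # 'ABC DEF'
```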

Python - Replace non-ascii character in string (»)

To replace part of a string with str.replace() in Python 2, where a plain string literal is a byte string, first decode the string, then replace the text, and finally encode the result back to bytes:

>>> a = "hi »"
>>> a.decode('utf-8').replace("»".decode('utf-8'), "").encode('utf-8')
'hi '

You may also use the following regex to remove all the non-ascii characters from the string:

>>> import re
>>> re.sub(r'[^\x00-\x7f]',r'', 'hi »')
'hi '
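The decode/encode round trip above is specific to Python 2 byte strings. As a hedged sketch of the Python 3 equivalent, where str is already Unicode, the same results can be obtained directly. The variable names are illustrative.

```python
import re

text = "hi \u00bb"  # "hi" followed by a space and the » character

# Remove one specific character; no decode/encode step is needed on a str.
removed = text.replace("\u00bb", "")

# Remove every non-ASCII character with a regex.
stripped = re.sub(r"[^\x00-\x7f]", "", text)

# Alternative: drop anything the ASCII codec cannot encode.
ascii_only = text.encode("ascii", "ignore").decode("ascii")

print(repr(removed), repr(stripped), repr(ascii_only))  # 'hi ' 'hi ' 'hi '
```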

Replacing non-ASCII characters or specific ASCII character with a space in file

You can just add them into the character class in your regex. For example, to remove non-ASCII characters, plus \031 and (say) characters in the range a-e, you would write:

perl -pi -e 's/[[:^ascii:]\031a-e]/ /g'

Edited to add:

For your new requirement:

I have to replace Non ASCII characters with DEC 128 and above with the exception of DEC 145 – 148 and DEC 150-151 with space.

You can write:

perl -pi -e 's/[^[:ascii:]\x91-\x94\x96\x97]/ /g; s/\031/ /g;'

(Note the change from [:^ascii:] "non-ASCII characters" to [:ascii:] "ASCII characters", and the change from [...] "any of the characters ..." to [^...] "any character other than ...".)
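For comparison, the same character-class logic can be sketched in Python. This assumes the file content is handled as raw single-byte data (bytes), as the Perl one-liner does by default. The names `PATTERN` and `scrub` are hypothetical.

```python
import re

# Replace every byte that is neither ASCII nor one of the allowed
# bytes 0x91-0x94, 0x96, 0x97; also replace octal 031 (0x19).
PATTERN = re.compile(rb"[^\x00-\x7f\x91-\x94\x96\x97]|\x19")


def scrub(data: bytes) -> bytes:
    return PATTERN.sub(b" ", data)


print(scrub(b"a\x91b\x95c\x19d\xe9"))  # b'a\x91b c d '
```

Working on bytes keeps the decimal values from the requirement (145 to 148, 150 to 151) directly comparable with the hex escapes in the pattern.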

How to remove from Python Dictionary Non ASCII characters and replacing with spaces

Building on the answer from this question, you can use re.sub, removing non-ASCII characters and replacing them with a space.

>>> import re
>>> {k : re.sub(r'[^\x00-\x7F]',' ', v) for k, v in a.items()}
{'age': '12 ', 'name': 'pks '}

This should work on Python 3.x as well as Python 2.x.
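A self-contained version of the snippet above might look like this. The sample dictionary is made up for illustration, and the isinstance guard for non-string values is an addition that the original answer does not include.

```python
import re

# Hypothetical input: two string values with trailing non-ASCII characters
# and one non-string value.
a = {"name": "pks\u00e9", "age": "12\u00a0", "id": 7}

cleaned = {
    k: re.sub(r"[^\x00-\x7F]", " ", v) if isinstance(v, str) else v
    for k, v in a.items()
}
print(cleaned)  # {'name': 'pks ', 'age': '12 ', 'id': 7}
```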

Delete space and replace non-ASCII characters in filenames via a loop with a makefile

You said you wanted to change all non-ASCII characters to -. However, based on your attempt, it seems you only want to replace with - those characters that are not digits or "plain" letters (by plain I mean unaccented, not fancy, ...).

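# ${var//pattern/replacement} is a bash extension, so the recipe must run under bash
# (adjust the path if bash lives elsewhere).
SHELL := /bin/bash

cleanfigures:
	for f in *; \
	do \
		ext="$${f##*.}" ; \
		base="$${f%.*}" ; \
		newbase="$${base//[^a-zA-Z0-9 ]/-}" ; \
		echo "$$f" "$${newbase// /}.$$ext" ; \
	done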
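The transformation the recipe applies to each filename can be sketched as a small Python function, which may be easier to test than the shell expansions. The name `clean_name` is hypothetical, and the handling of names without a dot is an assumption that differs slightly from the shell version.

```python
import re


def clean_name(filename):
    # Split at the last dot, like ${f%.*} and ${f##*.}.
    base, dot, ext = filename.rpartition(".")
    if not dot:  # no extension: treat the whole name as the base
        base, ext = filename, ""
    # Replace everything except ASCII letters, digits and spaces with "-",
    # then delete the spaces, like ${base//[^a-zA-Z0-9 ]/-} and ${newbase// /}.
    newbase = re.sub(r"[^a-zA-Z0-9 ]", "-", base).replace(" ", "")
    return newbase + dot + ext


print(clean_name("figure \u00e9 1.png"))  # figure-1.png
```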

How-to remove non-ascii characters and append a space in the field where the non-ascii characters were using a Perl one-liner?

Remove each pair of non-ASCII characters and add one space after the field that follows them.

The pattern uses the two non-ASCII characters and the next run of three spaces as a delimiter pair.

 #  s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g

[^[:ascii:]]{2}
( .*? [ ]{3} )

Perl test case

$/ = undef;
$str = <DATA>;
$str =~ s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g;
print $str;

__DATA__
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR ìPO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR

Output >>

SPAM EATER       PO BOX 5555          FAKE STREET
FOO BAR PO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
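The same substitution can be sketched with Python's re module. The sample line below is constructed for illustration, with fields separated by three spaces and two non-ASCII characters in front of the second field. It is not the original test data.

```python
import re

# Two non-ASCII characters, then lazily capture up to and including
# the next run of three spaces.
PAIR = re.compile(r"[^\x00-\x7f]{2}(.*?[ ]{3})")

line = "FOO BAR   \u00c3\u00acPO BOX 1234   LOLLERCOASTER VILLAGE"
fixed = PAIR.sub(r"\1 ", line)
print(repr(fixed))  # 'FOO BAR   PO BOX 1234    LOLLERCOASTER VILLAGE'
```

The captured field is written back followed by one extra space, so the field after it starts one column later than a plain three-space separator would place it.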

remove ascii character and replace with non-ascii

Decimal 100 is "d", and decimal 135 is "ç" (c with cedilla) in extended-ASCII code pages such as IBM850.

Setting a to all values:

a="$(printf "$(printf '\\x%x' {95..105} 135 135 135 {130..140} )")"

Both of these work:

echo "$a"| tr '\144' '\207'
echo "$a"| sed -e $'s/\144/\207/g' # Note the $

If you want to see these characters, write them to a file and open it with the IBM850 encoding. In a text editor that supports that encoding you will see the following (ç three times in a row, and the d changed to ç as well):

_`abcçefghiçççéâäàåçêëèïî
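As a cross-check that does not need a special editor, the same byte sequence can be built and decoded in Python. This is a sketch that assumes Python's built-in cp850 codec is an acceptable stand-in for IBM850.

```python
# Same byte values as the printf command: 95..105, three times 135, 130..140.
data = bytes(list(range(95, 106)) + [135] * 3 + list(range(130, 141)))

# Swap byte 100 ("d", octal 144) for byte 135 (octal 207), like tr and sed do.
swapped = data.replace(b"d", b"\x87")

print(swapped.decode("cp850"))  # _`abcçefghiçççéâäàåçêëèïî
```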

UTF-8

For UTF-8, things are different.

The cedilla's Unicode code point is decimal 231 (hex E7), which UTF-8 encodes as the two bytes C3 A7. It is output with this:

$ printf $'\U0E7'
ç

Getting the UTF-8 encoding of values above 127 (0x7F) and up to 255 (0xFF) can be tricky because some Bash versions encode them incorrectly (the comment in the function below mentions 4.2). This function converts a value to the correct character:

function chr_utf8 {
    local val
    [[ ${2?Missing Ordinal Value} -lt 0x80000000 ]] || return 1

    if [[ ${2} -lt 0x100 && ${2} -ge 0x80 ]]; then
        # bash 4.2 incorrectly encodes
        # \U000000ff as \xff so encode manually
        printf -v val "\\%03o\%03o" $(( (${2}>>6)|0xc0 )) $(( (${2}&0x3f)|0x80 ))
    else
        printf -v val '\\U%08x' "${2}"
    fi
    printf -v ${1?Missing Dest Variable} ${val}
}

chr_utf8 a 231
echo "$a"
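The bit arithmetic in the manual branch of the function is the standard two-byte UTF-8 encoding. A short Python sketch makes the steps visible and checks them against the built-in encoder. The function name `utf8_two_byte` is hypothetical.

```python
def utf8_two_byte(code_point):
    # Two-byte UTF-8 covers code points 0x80 through 0x7FF.
    if not 0x80 <= code_point < 0x800:
        raise ValueError("code point outside the two-byte UTF-8 range")
    lead = (code_point >> 6) | 0xC0    # 110xxxxx: the upper bits
    cont = (code_point & 0x3F) | 0x80  # 10xxxxxx: the lower six bits
    return bytes([lead, cont])


encoded = utf8_two_byte(231)
print(encoded)                  # b'\xc3\xa7'
print(encoded.decode("utf-8"))  # ç
```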

Conclusion

The solution was actually very simple:

echo "aadddcc" | sed $'s/d/\U0E7/g'       # echo $'\U0E7' should output ç
aaçççcc

Check that echo $'\U0E7' prints ç; if it does not, you need the function above.

How to remove non-ASCII characters and space from column names

One way using pandas.Series.str.replace and findall:

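# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df.columns = ["".join(l) for l in df.columns.str.replace(r"\s", "_", regex=True).str.findall(r"[\w\d]+")]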
print(df)

Output:

Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
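One caveat: in Python 3, \w also matches non-ASCII letters such as ä, so the pattern above keeps them. The following standard-library sketch applies the same two steps per column name but restricts \w to ASCII with re.ASCII. The function name `clean_column` and the sample names are made up for illustration.

```python
import re


def clean_column(name):
    # Step 1: turn each whitespace character into an underscore.
    underscored = re.sub(r"\s", "_", name)
    # Step 2: keep only runs of ASCII letters, digits and underscores.
    return "".join(re.findall(r"\w+", underscored, flags=re.ASCII))


columns = ["Col1name\u2122", "Col 2 name", "Col3  n\u00e4me"]
print([clean_column(c) for c in columns])
# ['Col1name', 'Col_2_name', 'Col3__nme']
```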

