Replace non-ASCII characters with a single space
Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)
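Note the + there: it matches a whole run of consecutive non-ASCII characters, so each run is replaced by a single space.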
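To make the difference concrete, here is a minimal sketch that puts the two approaches side by side. The function names and the sample string are made up for illustration; only the two expressions come from the answer.

```python
import re


def replace_each(text):
    # One space per non-ASCII character (the conditional-expression approach).
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)


def replace_runs(text):
    # One space per run of consecutive non-ASCII characters (note the +).
    return re.sub(r'[^\x00-\x7F]+', ' ', text)


sample = "ab\u00e9\u00e8cd"  # hypothetical input: two adjacent non-ASCII characters
print(repr(replace_each(sample)))  # two spaces between "ab" and "cd"
print(repr(replace_runs(sample)))  # one space between "ab" and "cd"
```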
Python - Replace non-ascii character in string (»)
In Python 2, to replace content with the str.replace() method on a byte string, you first need to decode the string, then replace the text, and encode it back to the original encoding:
>>> a = "hi »"
>>> a.decode('utf-8').replace("»".decode('utf-8'), "").encode('utf-8')
'hi '
You may also use the following regex to remove all the non-ascii characters from the string:
>>> import re
>>> re.sub(r'[^\x00-\x7f]',r'', 'hi »')
'hi '
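The session above relies on str.decode(), which only exists in Python 2. As a rough sketch of the same two operations in Python 3, where str is already Unicode and no decode/encode round trip is needed (the variable names here are illustrative):

```python
import re

a = "hi \u00bb"  # the string "hi »"

# Remove one specific character directly; no decoding is required in Python 3.
removed = a.replace("\u00bb", "")

# Remove every non-ASCII character with a regex.
stripped = re.sub(r'[^\x00-\x7f]', '', a)

print(repr(removed))
print(repr(stripped))
```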
Replacing non-ASCII characters or specific ASCII character with a space in file
You can just add them to the character class in your regex. For example, to replace non-ASCII characters, plus \031 and (say) characters in the range a-e, with a space, you would write:
perl -pi -e 's/[[:^ascii:]\031a-e]/ /g'
Edited to add:
For your new requirement:
I have to replace Non ASCII characters with DEC 128 and above with the exception of DEC 145 – 148 and DEC 150-151 with space.
You can write:
perl -pi -e 's/[^[:ascii:]\x91-\x94\x96\x97]/ /g; s/\031/ /g;'
(Note the change from [:^ascii:] "non-ASCII characters" to [:ascii:] "ASCII characters", and the change from [...] "any of the characters ..." to [^...] "any character other than ...".)
How to remove from Python Dictionary Non ASCII characters and replacing with spaces
Building on the answer from this question, you can use re.sub to replace non-ASCII characters with a space:
>>> import re
>>> {k : re.sub(r'[^\x00-\x7F]',' ', v) for k, v in a.items()}
{'age': '12 ', 'name': 'pks '}
This should work on Python 3.x as well as Python 2.x.
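The snippet above assumes every value in the dictionary is a string. If some values might not be, one possible variation is to leave them untouched, as in this sketch; the sample dictionary and the helper name are invented for illustration.

```python
import re


def clean_values(d):
    # Replace non-ASCII characters in string values; pass other values through.
    return {
        k: re.sub(r'[^\x00-\x7F]', ' ', v) if isinstance(v, str) else v
        for k, v in d.items()
    }


a = {'name': 'pks\u00e9', 'age': 12}  # hypothetical data with a non-string value
print(clean_values(a))
```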
Delete space and replace non-ASCII characters in filenames via a loop with a makefile
You said you wanted to change all non-ASCII characters to -. However, based on your attempt, it seems you only want to transform to - those characters which are not digits or "plain" letters (by plain I mean unaccented, not fancy, ...).
cleanfigures:
	for f in *; \
	do \
	    ext="$${f##*.}" ; \
	    base="$${f%.*}" ; \
	    newbase="$${base//[^a-zA-Z0-9 ]/-}" ; \
	    echo "$$f" "$${newbase// /}.$$ext" ; \
	done
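The renaming rule in that recipe can also be sketched as a small Python function, which makes it easy to try on a few names before touching any files. The function name and sample filenames are hypothetical. Unlike the shell version, this sketch leaves a name without a dot free of any added extension.

```python
import re


def clean_name(filename):
    # Split at the last dot, like ${f%.*} and ${f##*.} in the recipe.
    base, dot, ext = filename.rpartition('.')
    if not dot:
        # No extension: treat the whole name as the base.
        base, ext = filename, ''
    # Replace anything that is not an ASCII letter, digit, or space with "-",
    # then drop the spaces.
    newbase = re.sub(r'[^a-zA-Z0-9 ]', '-', base).replace(' ', '')
    return newbase + dot + ext


print(clean_name("my fig\u00fcre 1.png"))
print(clean_name("plot_a.b.pdf"))
```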
How-to remove non-ascii characters and append a space in the field where the non-ascii characters were using a Perl one-liner?
This takes out the two non-ASCII characters and adds one space after the field.
It uses the non-ASCII characters and the run of three spaces as a delimiter pair.
# s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g
[^[:ascii:]]{2}
( .*? [ ]{3} )
Perl test case
$/ = undef;
$str = <DATA>;
$str =~ s/[^[:ascii:]]{2}(.*?[ ]{3})/$1 /g;
print $str;
__DATA__
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR ìPO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
Output >>
SPAM EATER PO BOX 5555 FAKE STREET
FOO BAR PO BOX 1234 LOLLERCOASTER VILLAGE
LOL MAN PO BOX 9876 NEXT DOOR
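The same substitution can be tried in Python, where the pattern carries over almost unchanged. In this sketch the sample line is made up: it assumes two non-ASCII marker characters in front of the field and three spaces after it, which is what the regex expects.

```python
import re

# Two non-ASCII characters, then (lazily) everything up to a run of three spaces.
PATTERN = re.compile(r'[^\x00-\x7f]{2}(.*?[ ]{3})')


def strip_marker(text):
    # Keep the captured field and append one extra space after it.
    return PATTERN.sub(r'\1 ', text)


line = "FOO BAR \u00ec\u00ecPO BOX 1234   LOLLERCOASTER VILLAGE"  # hypothetical record
print(repr(strip_marker(line)))
```

Dropping two characters and adding one space shortens the line by one character; whether that preserves a fixed-width layout depends on how the original characters were encoded.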
remove ascii character and replace with non-ascii
Decimal 100 is a "d", and 135 is "ç" (c with cedilla) in extended ASCII code pages such as IBM850.
Setting a to a range of test values:
a="$(printf "$(printf '\\x%x' {95..105} 135 135 135 {130..140} )")"
Both of these work:
echo "$a"| tr '\144' '\207'
echo "$a"| sed -e $'s/\144/\207/g' # Note the $
If you want to see these characters, write the output to a file and open it with the IBM850 encoding. In a text editor with that capability you will see (three times a cedilla ç, and the d changed as well):
_`abcçefghiçççéâäàåçêëèïî
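As a cross-check that does not need a text editor, the same byte-level replacement can be sketched in Python and decoded with the cp850 codec, which is Python's name for IBM850. The variable names are illustrative.

```python
# Bytes 95..105, three times byte 135, then bytes 130..140, as in the shell example.
raw = bytes(range(95, 106)) + b'\x87' * 3 + bytes(range(130, 141))

# Replace byte 100 ("d", octal 144) with byte 135 (octal 207).
replaced = raw.replace(b'\x64', b'\x87')

# Interpret the bytes as IBM850 to see the characters.
print(replaced.decode('cp850'))
```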
UTF-8
For UTF-8, things are different.
The Unicode code point of ç is decimal 231 (hex E7), which UTF-8 encodes as the two bytes C3 A7; it is output with this:
$ printf $'\U0E7'
ç
Getting the UTF-8 encoding of values above 127 (7F) and up to 255 (FF) can be tricky because some Bash versions mis-encode them. This function converts a value to the correct character:
function chr_utf8 {
    local val
    [[ ${2?Missing Ordinal Value} -lt 0x80000000 ]] || return 1
    if [[ ${2} -lt 0x100 && ${2} -ge 0x80 ]]; then
        # bash 4.2 incorrectly encodes
        # \U000000ff as \xff so encode manually
        printf -v val "\\%03o\%03o" $(( (${2}>>6)|0xc0 )) $(( (${2}&0x3f)|0x80 ))
    else
        printf -v val '\\U%08x' "${2}"
    fi
    printf -v ${1?Missing Dest Variable} ${val}
}
chr_utf8 a 231
echo "$a"
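The manual branch of that function builds a two-byte UTF-8 sequence with shifts and masks. The arithmetic can be checked with a short Python sketch; the function name is hypothetical, and the range check reflects the fact that this two-byte form covers code points 0x80 through 0x7FF.

```python
def utf8_two_bytes(code_point):
    # Two-byte UTF-8: 110xxxxx 10xxxxxx, valid for code points 0x80..0x7FF.
    if not 0x80 <= code_point < 0x800:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = (code_point >> 6) | 0xC0           # top bits go into the lead byte
    continuation = (code_point & 0x3F) | 0x80  # low six bits go into the continuation byte
    return bytes([lead, continuation])


encoded = utf8_two_bytes(231)
print(encoded.hex())            # the two bytes for code point 231
print(encoded.decode('utf-8'))  # the character they encode
```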
Conclusion
The solution was actually very simple:
echo "aadddcc" | sed $'s/d/\U0E7/g' # echo $'\U0E7' should output ç
aaçççcc
Test that you get a ç from echo $'\U0E7'; if not, you need the function above.
How to remove non-ASCII characters and space from column names
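One way, using pandas.Series.str.replace and findall (regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df.columns = ["".join(l) for l in df.columns.str.replace(r"\s", "_", regex=True).str.findall(r"[\w\d]+")]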
print(df)
Output:
Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
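The same two steps can be sketched without pandas, which may help when checking the cleaning rule on a plain list of names. The column names below are invented, and the re.ASCII flag is an addition not in the answer: it restricts \w to ASCII letters, digits, and underscore, so accented letters are dropped as well as symbols.

```python
import re


def clean_column(name):
    # Step 1: turn every whitespace character into an underscore.
    underscored = re.sub(r'\s', '_', name)
    # Step 2: keep only runs of ASCII word characters and join them.
    return ''.join(re.findall(r'\w+', underscored, flags=re.ASCII))


columns = ['Col1name', 'Col 2 name', 'Col3 \u00ae name']  # hypothetical headers
print([clean_column(c) for c in columns])
```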