Remove Non-ASCII Characters from Pandas Column

Remove non-ASCII characters from pandas column

Your code fails because you are not applying the check to each character: you are applying it per word, and ord errors out because it only accepts a single character. You would need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

You can also simplify the join using a chained comparison:

  ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
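
For example, on a small made-up frame (the DB_user values below are placeholders, not from the question), the cleanup looks like this:

import pandas as pd

# Hypothetical sample data: 'é' is above 126, '\t' is below 32
df = pd.DataFrame({"DB_user": ["alice", "caf\u00e9", "b\tob"]})

df["DB_user"] = df["DB_user"].apply(
    lambda x: ''.join([i if 32 <= ord(i) <= 126 else " " for i in x]))

print(df["DB_user"].tolist())  # ['alice', 'caf ', 'b ob']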

You could also use string.printable to filter the chars:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))

The fastest is to use translate:

from string import maketrans  # Python 2: maketrans comes from the string module

del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(del_chars, " " * len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))

Interestingly that is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)

Remove non-ascii characters from CSV using pandas

You can write the DataFrame out, read the file back in, and use a regular expression to strip out the non-ASCII characters before rewriting it:

import re

df.to_csv(csvFile, index=False)

with open(csvFile) as f:
    new_text = re.sub(r'[^\x00-\x7F]+', '', f.read())

with open(csvFile, 'w') as f:
    f.write(new_text)
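
If the round trip through the file is not required, a similar cleanup can be done in memory before writing; this is just a sketch (reusing the csvFile name from the snippet above), and unlike editing the file it does not touch the header row:

# Sketch: strip non-ASCII from string cells in memory, then write once.
# df.replace(..., regex=True) only touches string values; other dtypes pass through.
clean = df.replace(r'[^\x00-\x7F]+', '', regex=True)
clean.to_csv(csvFile, index=False)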

Remove non-ASCII characters from string columns in pandas

In general, to remove non-ascii characters, use str.encode with errors='ignore':

df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')

To perform this on multiple string columns, use

u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
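
As a quick illustration on made-up data (the column names and values below are hypothetical), 'ignore' simply drops the characters that cannot be encoded:

import pandas as pd

# Hypothetical frame mixing a string column and a numeric one
df = pd.DataFrame({"col": ["héllo", "wörld", "plain"], "n": [1, 2, 3]})

u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

print(df["col"].tolist())  # ['hllo', 'wrld', 'plain']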

That still won't handle the null characters in your columns, though. For those, you can replace them with a regex:

df2 = df.replace(r'\W+', '', regex=True)

How to remove non-ASCII characters and space from column names

One way using pandas.Series.str.replace and findall:

df.columns = ["".join(l) for l in df.columns.str.replace("\s", "_").str.findall("[\w\d]+")]
print(df)

Output:

Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
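
To see it on a non-empty set of headers, here is a small made-up frame (the original column names are not shown in the question, so these are guesses chosen to reproduce the output above):

import pandas as pd

# Hypothetical messy headers with spaces and non-ASCII symbols
df = pd.DataFrame(columns=["Col1®name", "Col 2 name", "Col3 ® name", "Col4 £ name"])

df.columns = ["".join(l) for l in df.columns.str.replace(r"\s", "_", regex=True).str.findall(r"[\w\d]+")]
print(df.columns.tolist())
# ['Col1name', 'Col_2_name', 'Col3__name', 'Col4__name']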

How do I remove non-ASCII characters (e.g. ᧕¿µ´‡»Ž®ºÏƒ¶¹) from text in pandas dataframe columns?

Option 1 - if you know the complete set of non-ascii characters:

df
Out[36]:
      col1  col2
0  aa᧕¿µbb  abcd
1      hf4  efgh
2      xxx  ijk9

df.replace(regex=True, to_replace=['᧕', '¿', 'µ'], value='')
Out[37]:
   col1  col2
0  aabb  abcd
1   hf4  efgh
2   xxx  ijk9

Option 2 - if you can't specify the whole set of non-ascii characters:

Consider using string.printable:

String of ASCII characters which are considered printable.

from string import printable

printable
Out[38]: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

df.applymap(lambda y: ''.join(filter(lambda x: x in printable, y)))
Out[14]:
   col1  col2
0  aabb  abcd
1   hf4  efgh
2   xxx  ijk9

Note that if an element in the DataFrame is all-non-ascii, it will be replaced with just ''.
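
Also note that applymap assumes every cell is a string; if the frame mixes in numbers or NaN, one way around that (just a sketch) is to guard the lambda:

from string import printable

st = set(printable)

# Sketch: only filter cells that are actually strings; leave numbers/NaN alone
df = df.applymap(lambda y: ''.join(c for c in y if c in st) if isinstance(y, str) else y)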

How to remove special characters from a column of dataframe using module re?

As this answer shows, you can use map() with a lambda function that will assemble and return any expression you like:

import re

df['E'] = df['B'].map(lambda x: re.sub(r'\W+', '', x))

lambda simply defines anonymous functions. You can leave them anonymous, or assign them to a reference like any other object. my_function = lambda x: x.my_method(3) is equivalent to def my_function(x): return x.my_method(3).
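
In other words, the one-liner above could just as well be written with a named function; the name strip_non_word below is made up for the example:

import re

# Equivalent spellings of the same cleaning step
strip_non_word = lambda x: re.sub(r'\W+', '', x)

def strip_non_word_def(x):
    return re.sub(r'\W+', '', x)

df['E'] = df['B'].map(strip_non_word)      # same result as the inline lambda
df['E'] = df['B'].map(strip_non_word_def)  # and as the def version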

Pandas: How to remove characters that include non-English characters?

You can use regex to remove designated characters from your strings:

import re
import pandas as pd

records = [{'name':'Foo الÙجيرة'}, {'name':'Battery ÁÁÁ'}]
df = pd.DataFrame.from_records(records)

# Allow alphanumeric characters and spaces (add additional characters as needed)
pattern = re.compile('[^A-Za-z0-9 ]+')
def clean_text(string):
    return pattern.sub('', string)

# Apply to your df
df['clean_name'] = df['name'].apply(clean_text)

          name clean_name
0  Foo الÙجيرة        Foo
1  Battery ÁÁÁ    Battery

For more solutions, you can read this SO Q: Python, remove all non-alphabet chars from string

I need to replace non-ASCII characters in pandas data frame column in python 2.7

I have found a solution myself. It might look clumsy, but it works perfectly in my case:

    df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))

I had to replace NaN values before running that code.

That operation leaves only ASCII symbols, which can then be easily replaced:

    def replace_apostrophy(text):
        return text.replace("a\u0302\u20acTM", "'")
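
For reference, on Python 3 the same idea could look like the sketch below (the answer above targets Python 2.7; the function name here is made up):

    import unicodedata

    def normalize_to_ascii(text):
        # NFKD-decompose, then turn anything non-ASCII into \uXXXX escape sequences
        return unicodedata.normalize('NFKD', text).encode('ascii', 'backslashreplace').decode('ascii')

    # NaN values would still need to be handled first, as noted above
    df["text"] = df["text"].fillna("").apply(normalize_to_ascii)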

Hope this helps someone.


