Remove non-ASCII characters from pandas column
Your code fails because you are applying ord per word rather than per character; ord raises an error because it only accepts a single character. You would need:
df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
You can also simplify the join using a chained comparison:
''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
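A runnable sketch of the chained-comparison filter, using a made-up two-row frame (the control characters \x07 and \x80 are placeholders for whatever non-ASCII bytes appear in your data):

```python
import pandas as pd

df = pd.DataFrame({"DB_user": ["jo\x07hn", "ann\x80a"]})
# Keep printable ASCII (32-126); replace everything else with a space
df["DB_user"] = df["DB_user"].apply(
    lambda x: "".join(i if 32 <= ord(i) <= 126 else " " for i in x)
)
print(df["DB_user"].tolist())  # ['jo hn', 'ann a']
```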
You could also use string.printable to filter the chars:
from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))
The fastest is to use translate:
from string import maketrans  # Python 2

del_chars = "".join(chr(i) for i in range(32) + range(127, 256))
trans = maketrans(del_chars, " " * len(del_chars))
df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))
Interestingly, that is faster than:
df['DB_user'] = df["DB_user"].str.translate(trans)
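In Python 3 the module-level string.maketrans is gone; the equivalent is the built-in str.maketrans, which also accepts a dict of code points, so no paired equal-length strings are needed. A sketch (note that code points at 256 and above pass through unchanged with this table):

```python
import pandas as pd

# Map control chars (0-31) and high bytes (127-255) to a space
trans = str.maketrans({i: " " for i in list(range(32)) + list(range(127, 256))})

df = pd.DataFrame({"DB_user": ["jo\x07hn", "ann\xe9a"]})
df["DB_user"] = df["DB_user"].apply(lambda s: s.translate(trans))
print(df["DB_user"].tolist())  # ['jo hn', 'ann a']
```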
Remove non-ascii characters from CSV using pandas
You can write the DataFrame to CSV, read the file back, and use a regular expression to strip out non-ASCII characters:
import re

df.to_csv(csvFile, index=False)

with open(csvFile) as f:
    new_text = re.sub(r'[^\x00-\x7F]+', '', f.read())

with open(csvFile, 'w') as f:
    f.write(new_text)
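The same regex can be applied in memory before writing, which avoids the write-read-rewrite round trip. A sketch, assuming the columns are strings:

```python
import pandas as pd

df = pd.DataFrame({"a": ["caf\xe9", "ok"]})
# Strip runs of non-ASCII characters in place, then write once
cleaned = df.replace(r"[^\x00-\x7F]+", "", regex=True)
print(cleaned["a"].tolist())  # ['caf', 'ok']
```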
Remove non-ASCII characters from string columns in pandas
In general, to remove non-ASCII characters, use str.encode with errors='ignore':
df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')
To perform this on multiple string columns, use
u = df.select_dtypes(object)
df[u.columns] = u.apply(
lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
Although that still won't handle the null characters in your columns. For those, you can replace them using a regex:
df2 = df.replace(r'\W+', '', regex=True)
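The two steps above can be chained on a small made-up frame: the encode/decode pass drops non-ASCII characters, and the regex pass then removes remaining non-word characters such as nulls:

```python
import pandas as pd

df = pd.DataFrame({"col": ["f\x00oo\xe9!", "bar"]})
df["col"] = df["col"].str.encode("ascii", "ignore").str.decode("ascii")  # drops the é
df2 = df.replace(r"\W+", "", regex=True)  # drops the null byte and the '!'
print(df2["col"].tolist())  # ['foo', 'bar']
```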
How to remove non-ASCII characters and space from column names
One way using pandas.Series.str.replace and findall:
df.columns = ["".join(l) for l in df.columns.str.replace(r"\s", "_", regex=True).str.findall(r"[\w\d]+")]
print(df)
Output:
Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
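An equivalent sketch using two chained regex replacements instead of findall plus join; the column names here are invented stand-ins, and the character class is ASCII-strict on purpose (a plain \w would keep non-ASCII letters, since Python's re is Unicode-aware on str):

```python
import pandas as pd

df = pd.DataFrame(columns=["Col1name", "Col 2 name", "Col3\xa3 name"])
df.columns = (df.columns
              .str.replace(r"\s+", "_", regex=True)          # spaces -> underscores
              .str.replace(r"[^A-Za-z0-9_]", "", regex=True))  # drop everything else
print(list(df.columns))  # ['Col1name', 'Col_2_name', 'Col3_name']
```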
How do I remove non-ascii characters (e.g ᧕¿µ´‡»Ž®ºÏƒ¶¹) from texts in pandas dataframe columns?
Option 1 - if you know the complete set of non-ascii characters:
df
Out[36]:
col1 col2
0 aa᧕¿µbb abcd
1 hf4 efgh
2 xxx ijk9
df.replace(regex=True, to_replace=['᧕', '¿', 'µ'], value='')
Out[37]:
  col1 col2
0 aabb abcd
1  hf4 efgh
2  xxx ijk9
Option 2 - if you can't specify the whole set of non-ascii characters:
Consider using string.printable:
String of ASCII characters which are considered printable.
from string import printable
printable
Out[38]: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
df.applymap(lambda y: ''.join(filter(lambda x: x in printable, y)))
Out[14]:
  col1 col2
0 aabb abcd
1  hf4 efgh
2  xxx ijk9
Note that if an element in the DataFrame is all-non-ascii, it will be replaced with just ''.
How to remove special characters from a column of dataframe using module re?
As this answer shows, you can use map() with a lambda function that will assemble and return any expression you like:
import re

df['E'] = df['B'].map(lambda x: re.sub(r'\W+', '', x))
lambda simply defines anonymous functions. You can leave them anonymous, or assign them to a reference like any other object: my_function = lambda x: x.my_method(3) is equivalent to def my_function(x): return x.my_method(3).
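The same substitution written with a named function instead of a lambda, on a small invented frame:

```python
import re
import pandas as pd

df = pd.DataFrame({"B": ["a,b!", "c d"]})

def strip_nonword(x):
    # Remove every run of non-word characters
    return re.sub(r"\W+", "", x)

df["E"] = df["B"].map(strip_nonword)
print(df["E"].tolist())  # ['ab', 'cd']
```

A named function is usually preferable once the body grows beyond one expression, since it can be tested and reused on its own.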
Pandas: How to remove character that include non english characters?
You can use regex to remove designated characters from your strings:
import re
import pandas as pd
records = [{'name':'Foo الÙجيرة'}, {'name':'Battery ÁÁÁ'}]
df = pd.DataFrame.from_records(records)
# Allow alpha numeric and spaces (add additional characters as needed)
pattern = re.compile('[^A-Za-z0-9 ]+')  # note: 'A-z' would wrongly include [, \, ], ^, _, `

def clean_text(string):
    return pattern.sub('', string)
# Apply to your df
df['clean_name'] = df['name'].apply(clean_text)
name clean_name
0 Foo الÙجيرة Foo
1 Battery ÁÁÁ Battery
For more solutions, you can read this SO Q: Python, remove all non-alphabet chars from string
I need to replace non-ASCII characters in pandas data frame column in python 2.7
I have found a solution myself. It might look clumsy, but works perfectly in my case:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))
I had to replace nan values before running that code.
That operation gives me ASCII symbols only, which can then be easily replaced:
def replace_apostrophy(text):
    return text.replace("a\u0302\u20acTM", "'")
Hope this would help someone.
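A Python 3 sketch of the same normalize-then-encode step; in Python 3, encode returns bytes, so a decode back to str is needed (in Python 2.7, as in the answer above, encode on unicode returned a str directly). The sample strings are invented:

```python
import unicodedata
import pandas as pd

df = pd.DataFrame({"text": ["caf\u00e9", "na\u00efve"]})
# NFKD splits accented letters into base char + combining mark;
# backslashreplace turns the non-ASCII remainder into \uXXXX escapes
df["text"] = df["text"].apply(
    lambda t: unicodedata.normalize("NFKD", t)
                         .encode("ascii", "backslashreplace")
                         .decode("ascii")
)
print(df["text"].tolist())  # ['cafe\\u0301', 'nai\\u0308ve']
```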