Pythonic/Efficient Way to Strip Whitespace from Every Pandas Data Frame Cell That Has a Stringlike Object in It

Is there a way to trim/strip whitespace in multiple columns of a pandas dataframe?

Use DataFrame.apply with a list of columns:

cols = ['col_1', 'col_2', 'col_4']
df[cols] = df[cols].apply(lambda x: x.str.strip())

Or process only the object-dtype columns, which usually (though not necessarily) hold strings:

cols = df.select_dtypes(object).columns
df[cols] = df[cols].apply(lambda x: x.str.strip())
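Since an object column can also hold numbers, lists or other Python objects, here is a minimal sketch that strips only the cells that really are strings and passes everything else through unchanged. The helper names (strip_string_cells, strip_if_str) and the sample frame are made up for illustration; it goes cell by cell with Series.map, so it will be slower than .str.strip() on large frames.

```python
import pandas as pd


def strip_string_cells(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with whitespace stripped from str cells only."""

    def strip_if_str(value):
        # Only real strings are stripped; numbers, None, lists, etc. pass through.
        return value.strip() if isinstance(value, str) else value

    # Series.map applies the helper to every cell of each column.
    return df.apply(lambda col: col.map(strip_if_str))


# Hypothetical sample data, not taken from the question.
example = pd.DataFrame({
    "col_1": ["  a ", "b  "],
    "col_2": [" x", 7],      # mixed column: one string, one integer
    "col_3": [1.5, 2.5],     # numeric column, left untouched
})
cleaned = strip_string_cells(example)
```

The original frame is not modified; assign the result back (df = strip_string_cells(df)) if you want to keep it.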

Remove whitespace from list of strings with pandas/python

If I understand correctly, some of your DataFrame cells hold list values.

The file_name.json content is below:

[
{
"key1": "value1 ",
"key2": "2",
"key3": ["a", "b 2 ", " exp white space 210"]
},
{
"key1": "value1 ",
"key2": "2",
"key3": []
}
]

Possible solution in this case is the following:

import pandas as pd
import re

df = pd.read_json("file_name.json")


def cleanup_data(value):
    # strip both ends and collapse internal runs of whitespace to one space
    if value and type(value) is list:
        return [re.sub(r'\s+', ' ', x.strip()) for x in value]
    elif value and type(value) is str:
        return re.sub(r'\s+', ' ', value.strip())
    else:
        return value

# apply cleanup function to all cells in dataframe
# (pandas 2.1 deprecated applymap in favor of DataFrame.map)
df = df.applymap(cleanup_data)

df

Returns

     key1  key2                           key3
0  value1     2  [a, b 2, exp white space 210]
1  value1     2                             []
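If the lists might also contain non-string items (an assumption; the sample file only has strings), x.strip() would raise AttributeError on them. A rough variant that checks each list item, and builds the frame from inline records so it runs without file_name.json, could look like this. The names normalize and cleanup_cell and the integer 3 in the data are made up for the example.

```python
import re

import pandas as pd


def normalize(text):
    # Trim both ends and collapse internal runs of whitespace to one space.
    return re.sub(r"\s+", " ", text.strip())


def cleanup_cell(value):
    if isinstance(value, str):
        return normalize(value)
    if isinstance(value, list):
        # Non-string items inside the list are kept as they are.
        return [normalize(x) if isinstance(x, str) else x for x in value]
    return value


# Inline stand-in for the JSON file; the integer 3 is a hypothetical non-string item.
records = [
    {"key1": "value1 ", "key2": "2", "key3": ["a", "b   2 ", 3]},
    {"key1": "value1 ", "key2": "2", "key3": []},
]
df = pd.DataFrame(records)

# Column-wise Series.map sidesteps the applymap/map naming difference between pandas versions.
cleaned = df.apply(lambda col: col.map(cleanup_cell))
```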

Strip / trim all strings of a dataframe

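You can use DataFrame.select_dtypes to select the object-dtype columns and then apply str.strip to each of them.

Notice: This assumes the object columns hold only strings. Cells containing dicts or lists also have dtype object, and .str.strip() returns NaN for any cell that is not a string.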

df_obj = df.select_dtypes(['object'])
print(df_obj)
     0
0    a
1    c

df[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
print(df)

   0   1
0  a  10
1  c   5

But if only a few columns need it, call str.strip on each column directly:

df[0] = df[0].str.strip()
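To see why the dtype caveat matters, here is a small sketch with a made-up mixed Series: the .str accessor turns the list and the integer into NaN, while an isinstance check inside map leaves them alone.

```python
import pandas as pd

# Hypothetical object Series holding a string, a list and an integer.
mixed = pd.Series(["  a ", ["x "], 10])

# .str.strip() only handles strings; the other cells come back as NaN.
stripped = mixed.str.strip()

# Checking the type per cell keeps non-string values intact.
kept = mixed.map(lambda v: v.strip() if isinstance(v, str) else v)
```

So if a column may hold anything other than strings, the map version is the safer choice, at the cost of some speed.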

How can I strip the whitespace from Pandas DataFrame headers?

You can pass a function to the rename method. The str.strip() method should do what you want:

In [5]: df
Out[5]:
   Year  Month  Value
0     1      2      3

[1 rows x 3 columns]

In [6]: df.rename(columns=lambda x: x.strip())
Out[6]:
   Year  Month  Value
0     1      2      3

[1 rows x 3 columns]

Note that this returns a new DataFrame, which is what you see printed, but the column names of the original df are not changed. To keep the changes, either use rename in a method chain or re-assign the df variable:

df = df.rename(columns=lambda x: x.strip())
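A runnable sketch of the same idea, with the stray spaces written out explicitly. Which headers carry padding is an assumption made up for this example. It also shows the vectorized df.columns.str.strip() alternative, which assumes every header is a string.

```python
import pandas as pd

# Hypothetical padded headers: "Month " and " Value".
df = pd.DataFrame([[1, 2, 3]], columns=["Year", "Month ", " Value"])

# rename returns a new frame; df itself still has the padded names.
renamed = df.rename(columns=str.strip)
before = list(df.columns)

# Alternative: strip through the Index .str accessor and assign back.
df.columns = df.columns.str.strip()
after = list(df.columns)
```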

Removing spaces from a nested list of objects with pandas

We can create a lambda function that strips the spaces from the string values in a dictionary, then map it over every dictionary in each list of the details column:

strip = lambda d: {k: v.strip() if isinstance(v, str) else v for k, v in d.items()}
df['details'] = df['details'].map(lambda L: [strip(d) for d in L])

Result

>>> df.to_dict('records')

[{'name': 'Book1',
  'details': [{'id': 30278752,
               'isbn': '1594634025',
               'isbn13': '9781594634024',
               'text_reviews_count': 417,
               'work_reviews_count': 3313007,
               'work_text_reviews_count': 109912,
               'average_rating': '3.92'}]},
 {'name': 'Book2',
  'details': [{'id': 34006942,
               'isbn': '1501173219',
               'isbn13': '9781501173219',
               'text_reviews_count': 565,
               'work_reviews_count': 2142280,
               'work_text_reviews_count': 75053,
               'average_rating': '4.33'}]}]
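For a self-contained run, here is a sketch of the same approach on a tiny frame. The padding around the isbn and average_rating values is invented for the example, and strip_dict_values is just a named version of the lambda.

```python
import pandas as pd


def strip_dict_values(d):
    # Strip string values only; ints and other types are kept as they are.
    return {k: v.strip() if isinstance(v, str) else v for k, v in d.items()}


# Hypothetical input: one book whose details list holds one padded dict.
df = pd.DataFrame({
    "name": ["Book1"],
    "details": [[{"id": 30278752, "isbn": " 1594634025 ", "average_rating": "3.92 "}]],
})

# Each cell of 'details' is a list of dicts, so clean every dict in the list.
df["details"] = df["details"].map(lambda items: [strip_dict_values(d) for d in items])
records = df.to_dict("records")
```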

Wanted: function to remove whitespace from column headers that is robust to column headers not being strings

You could use a list comprehension, which is quite unusual when working with Pandas as it's usually more efficient to apply built-in Pandas functions (as you've done). But for something as simple as fixing column names, this should be fine:

df = pd.DataFrame(columns=[1, 2, 'A '])
df.columns = [col.strip() if isinstance(col, str) else col for col in df.columns]

Results:

In [75]: df.columns
Out[75]: Index([1, 2, 'A'], dtype='object')
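If you prefer something reusable that does not modify the frame in place, the same isinstance check can go into rename. This is only a sketch; strip_headers is a name made up here.

```python
import pandas as pd


def strip_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose string column labels are stripped."""
    # Non-string labels (ints, tuples, ...) are returned unchanged.
    return df.rename(columns=lambda c: c.strip() if isinstance(c, str) else c)


df = pd.DataFrame(columns=[1, 2, "A "])
clean = strip_headers(df)
```

Because rename returns a new object, this also fits into a method chain via df.pipe(strip_headers).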

