Remove Unwanted Parts from Strings in a Column

Remove unwanted parts from strings in a column

data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))
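A minimal sketch of the line above on made-up data (the sample values are assumptions, since the original column is not shown). It also shows a vectorized variant through the `.str` accessor, which should give the same result here:

```python
import pandas as pd

# Hypothetical sample data; the original question's data is not shown.
data = pd.DataFrame({'result': ['+23A', '-5b', '42', '+7C']})

# lstrip/rstrip treat their argument as a set of characters to strip,
# not as a literal prefix or suffix.
data['result'] = data['result'].map(lambda x: x.lstrip('+-').rstrip('aAbBcC'))

# Vectorized alternative using the .str accessor.
alt = pd.Series(['+23A', '-5b', '42', '+7C']).str.lstrip('+-').str.rstrip('aAbBcC')
```

Unlike the lambda, the `.str` version passes missing values through instead of raising.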

Remove unwanted part of strings in a column with Pandas

Use Series.replace:

df['Message'] = df['Message'].replace(r'^.*\]\s*', '', regex=True)
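To illustrate what the pattern does, here is a small sketch with invented messages (the sample strings are assumptions). Because `.*` is greedy, everything up to the last `]` and any whitespace after it is removed:

```python
import pandas as pd

# Hypothetical log-style messages for illustration only.
df = pd.DataFrame({'Message': [
    '[2021-01-01] [INFO] started',
    '[WARN]   disk low',
    'no brackets',
]})

# Remove everything up to and including the last ']' plus trailing whitespace.
df['Message'] = df['Message'].replace(r'^.*\]\s*', '', regex=True)
```

Rows without a closing bracket are left unchanged.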

Remove unwanted parts from strings in Dataframe

If every passenger has their title, then you can use str.split + explode, then select every second item starting from the first item, then groupby the index and join back:

out = df['Passengers'].str.split(',').explode()[::2].groupby(level=0).agg(', '.join)

or use str.split alone and apply a lambda that does the selection and the join:

out = df['Passengers'].str.split(',').apply(lambda x: ', '.join(x[::2]))

Output:

0                 Sally Muller,  Mark Smith,  John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia...
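A self-contained sketch of both variants on invented rows (the passenger strings are assumptions). It splits on ', ' rather than ',' so that the names carry no leading space, and it relies on every name being followed by exactly one title:

```python
import pandas as pd

# Hypothetical data in "name, title, name, title, ..." form.
df = pd.DataFrame({'Passengers': [
    'Sally Muller, President, Mark Smith, Vicepresident',
    'John Doe, Chief of Staff',
]})

# Split on comma + space so the pieces have no leading whitespace.
parts = df['Passengers'].str.split(', ')

# Variant 1: explode, keep every second item, regroup by original row.
out_explode = parts.explode()[::2].groupby(level=0).agg(', '.join)

# Variant 2: slice each list directly and join.
out_apply = parts.apply(lambda x: ', '.join(x[::2]))
```

The explode variant only works because each row contributes an even number of items; the apply variant slices per row and does not depend on that.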

Edit:

If not everyone has a title, you can create a set of titles, split each row, and filter the titles out. If the order of the names within each row doesn't matter, you can use a set difference and cast each set to a list in a list comprehension:

titles = {'President', 'Vicepresident', 'Chief of Staff', 'Special Effects', 'Vice Chief of Staff'}

out = pd.Series([list(set(x.split(', ')) - titles) for x in df['Passengers']])

If order matters, then you can use a nested list comprehension:

out = pd.Series([[i for i in x.split(', ') if i not in titles] for x in df['Passengers']])
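As a rough sketch of the order-preserving version on invented rows (the sample passengers are assumptions), followed by joining the lists back into strings with `.str.join`:

```python
import pandas as pd

titles = {'President', 'Vicepresident', 'Chief of Staff',
          'Special Effects', 'Vice Chief of Staff'}

# Hypothetical data where only some passengers have a title.
df = pd.DataFrame({'Passengers': [
    'Sally Muller, President, Mark Smith',
    'John Doe, Chief of Staff, Peter Parker, Special Effects',
]})

# Keep the original order and drop any item that is a known title.
out = pd.Series([[i for i in x.split(', ') if i not in titles]
                 for x in df['Passengers']])

# Optionally turn each list back into a comma-separated string.
joined = out.str.join(', ')
```

Note that the new Series gets a fresh 0..n-1 index; pass `index=df.index` to `pd.Series` if the original index needs to be kept.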

Create a column removing unwanted parts of strings based on condition

One possibility could be to use map:

df['macro_genres'] = df['Genres'].astype(str).map(lambda x: x.split(';')[0])

Output:

>>> df
                      Genres   macro_genres      Last Updated
0                    Finance        Finance    March 10, 2018
1                     Arcade         Arcade      May 24, 2018
2                   Business       Business    April 11, 2018
3                Photography    Photography  November 6, 2014
4  Entertainment;Brain Games  Entertainment     March 9, 2018
5                    Medical        Medical      May 17, 2018
6                      Tools          Tools      June 3, 2016
7         Casual;Brain Games         Casual    April 10, 2016
8                    Medical        Medical     July 16, 2018
9              Entertainment  Entertainment      May 17, 2017

Runtime Comparison on 1k dataframe:

#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
535 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#str split method (slowest)
>>> %timeit -n 1000 df['Genres'].str.split(';').str[0]
1.36 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
527 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Runtime Comparison on 10k dataframe:

#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
3.62 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#str split method (slowest)
>>> %timeit -n 1000 df['Genres'].str.split(';').str[0]
10 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
3.47 ms ± 59.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Runtime Comparison on 50k dataframe:

#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
17 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#map method
>>> %timeit -n 1000 df['Genres'].map(lambda x : x.split(';')[0])
16.7 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Runtime Comparison on 100k dataframe:

#apply method
>>> %timeit -n 1000 df['Genres'].apply(lambda x: x.split(';')[0])
34.1 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#map method
>>> %timeit -n 1000 df['Genres'].astype(str).map(lambda x : x.split(';')[0])
35.5 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Remove unwanted parts from strings in a range of columns

Part #1

You can define '-' to be a NaN value when reading the data into your DataFrame by passing na_values to your pd.read_csv() call.

See the pd.read_csv() documentation for details on na_values.
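A minimal sketch of this, reading from an in-memory string instead of a file (the CSV content is invented for illustration):

```python
import io
import pandas as pd

# Hypothetical CSV where '-' marks a missing value.
csv_text = "a,b\n1,-\n-,4\n"

# Treat '-' as NaN while parsing.
df = pd.read_csv(io.StringIO(csv_text), na_values=['-'])
```

With the dashes parsed as NaN, both columns come out numeric rather than as strings.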

Part #2

As earlier suggested by MaxU you can use .replace() like this:

df.replace(r'[\s\-,\.]+', r'', regex=True, inplace=True)

Note that this will not have any effect on non-strings.
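A small sketch of that behaviour on an invented frame with one string column and one numeric column (column names and values are assumptions). It returns a new frame instead of using inplace=True:

```python
import pandas as pd

# Hypothetical frame mixing strings and numbers.
df = pd.DataFrame({
    'text': ['1,000.5', 'a - b', 'x y'],
    'num': [1.5, 2.0, 3.25],
})

# Strip whitespace, hyphens, commas and dots from string cells only.
cleaned = df.replace(r'[\s\-,\.]+', '', regex=True)
```

The numeric column keeps its values, including the decimal points, because the regex is only applied to string cells.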

Hope this helps!

How to remove unwanted dots from strings in pandas column?

Try:

df["parts"] = df["parts"].str.replace(r"\.*\d+", "", regex=True)
print(df)

Prints:

         parts
0 mouse.pad.v
1 key.board.c
2 pen.color.r

Input dataframe:

               parts
0 mouse.pad.v.1.2
1 key.board.1.0.c30
2 pen.color.4.32.r

rstrip() the unwanted parts from string column

Assuming that the fraction would always start the column's value, we can use str.extract here as follows:

df['pct'] = df['colname'].str.extract(r'^(\d+/\d+)')
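In place of the missing demo, here is a short sketch on invented values (the column contents are assumptions). It passes expand=False so that str.extract returns a Series rather than a one-column DataFrame:

```python
import pandas as pd

# Hypothetical values; only some start with a fraction.
df = pd.DataFrame({'colname': ['3/4 of total', '10/25 (approx)', 'none']})

# Capture a leading "digits/digits" fraction; non-matching rows become NaN.
df['pct'] = df['colname'].str.extract(r'^(\d+/\d+)', expand=False)
```

Rows that do not start with a fraction end up as NaN, which may need handling afterwards.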


Removing characters from a string in pandas

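Use replace and assign the result back (calling it with inplace=True on a selected column may not modify the original DataFrame):

temp_dataframe['PPI'] = temp_dataframe['PPI'].replace('PPI/', '', regex=True)

or str.replace, which also returns a new Series that has to be assigned back:

temp_dataframe['PPI'] = temp_dataframe['PPI'].str.replace('PPI/', '', regex=False)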

Pandas DataFrame: remove unwanted parts from strings before and after what I want to keep

I believe you need to split and select the second element of each list:

data_cleaner['Project ID'] = data_cleaner['Project ID'].str.split('/').str[1]

Or extract by regex. The pattern /(\d{4})/ captures a four-digit number between two slashes:

data_cleaner['Project ID'] = data_cleaner['Project ID'].str.extract(r'/(\d{4})/', expand=False)

print (data_cleaner)
Project ID
0 2013
1 2013
2 2013
3 2013
4 2013
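The two approaches differ when the middle part is not a four-digit number. A sketch on invented IDs (the ID format is an assumption, since the original input is not shown):

```python
import pandas as pd

# Hypothetical IDs of the form prefix/year/suffix.
data_cleaner = pd.DataFrame({'Project ID': ['ENG/2013/0041', 'OPS/2013/17', 'OPS/20/17']})

# Split on '/' and take the second piece, whatever it contains.
by_split = data_cleaner['Project ID'].str.split('/').str[1]

# Extract only a four-digit number enclosed in slashes; otherwise NaN.
by_regex = data_cleaner['Project ID'].str.extract(r'/(\d{4})/', expand=False)
```

Here the split keeps '20' for the last row, while the regex yields NaN, so the regex is the stricter choice when the format may vary.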

