Filter Pandas Dataframe by Substring Criteria

How to filter rows containing a string pattern from a Pandas dataframe

In [3]: df[df['ids'].str.contains("ball")]
Out[3]:
     ids  vals
0  aball     1
1  bball     2
3  fball     4
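
The input frame is not shown above, so here is a minimal self-contained sketch of the same filter. The row at index 2 and the variable names `mask`, `result` and `result_ci` are assumptions made for illustration; only rows 0, 1 and 3 come from the output above.

```python
import pandas as pd

# Hypothetical input: only rows 0, 1 and 3 are known from the output above.
df = pd.DataFrame({"ids": ["aball", "bball", "cnut", "fball"],
                   "vals": [1, 2, 3, 4]})

# Boolean mask: True where "ball" occurs anywhere in the value.
# na=False makes missing values count as "no match" instead of NaN.
mask = df["ids"].str.contains("ball", na=False)
result = df[mask]

# regex=False treats the pattern as a literal string; case=False ignores case.
result_ci = df[df["ids"].str.contains("BALL", case=False, regex=False, na=False)]
```

Passing `na=False` is worth the habit: a mask that contains NaN cannot be used for boolean indexing.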

Pandas filter dataframe columns through substring match

You can apply a function to each row (axis=1):

>>> df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

str.contains expects a single pattern (a string or regex) as its first argument, pat, not a Series, so it cannot compare two columns row by row.
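
As an alternative to apply, the same row-wise check can be sketched with a plain list comprehension over the two columns, which is usually faster on larger frames. Row 0 of the data below is invented (the original input is not shown), and `mask` and `result` are illustrative names.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 match the output above, row 0 is invented.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Clarke"],
                   "Age": [11, 12, 13],
                   "Fname": ["Tom", "Bob", "clarke"]})

# Pair up the two columns and test, case-insensitively,
# whether Name occurs inside Fname on the same row.
mask = [name.lower() in fname.lower()
        for name, fname in zip(df["Name"], df["Fname"])]
result = df[mask]
```

This assumes both columns hold strings with no missing values; a NaN would raise an AttributeError on `.lower()`.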

Filter pandas dataframe if value of column is within a string

You can use .apply together with the in operator:

s = "ZA1127B.48"

print(df[df.apply(lambda x: x.Part_Number in s, axis=1)])

Prints:

  Part_Number
0       A1127
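
Since only one column is involved, the check can also be written against that column directly with Series.map. This is a sketch with made-up part numbers; only "A1127" is known from the output above.

```python
import pandas as pd

s = "ZA1127B.48"

# Hypothetical part numbers: only "A1127" appears in the output above.
df = pd.DataFrame({"Part_Number": ["A1127", "B2290", "C0481"]})

# Keep the rows whose Part_Number occurs somewhere inside the string s.
result = df[df["Part_Number"].map(lambda part: part in s)]
```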

Filter and replace substring in Pandas

Select only the rows you need and call Series.str.replace on those; this is faster than running the replacement on the whole column without filtering:

m = df['name'].str.contains('Al', na=False)
df.loc[m, 'sport'] = df.loc[m, 'sport'].str.replace('large', 'L', regex=True)
print (df)
    name            sport
0    Bob     tennis small
1   Jane  football medium
2  Alice     basketball L
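
A self-contained sketch of the masked replacement follows. The fourth row ("Carl") is an assumption added to show that rows outside the mask keep their value even when they contain the search string; `regex=False` is used here because 'large' is a plain literal.

```python
import pandas as pd

# The first three rows mirror the example; the "Carl" row is hypothetical.
df = pd.DataFrame({"name": ["Bob", "Jane", "Alice", "Carl"],
                   "sport": ["tennis small", "football medium",
                             "basketball large", "rugby large"]})

# Rows whose name contains "Al" (missing names count as no match).
m = df["name"].str.contains("Al", na=False)

# Replace only within the selected rows and write back to the same rows.
df.loc[m, "sport"] = df.loc[m, "sport"].str.replace("large", "L", regex=False)
```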

#100 matched values from 30k
df = pd.DataFrame({'name': ['Bob','Jane','alice'] * 9900 + ['Bob', 'Jane', 'Alice'] * 100, 
                   'sport': ['tennis small','football medium', 'basketball large'] * 10000})

print (df)
        name             sport
0        Bob      tennis small
1       Jane   football medium
2      alice  basketball large
3        Bob      tennis small
4       Jane   football medium
     ...               ...
29995   Jane   football medium
29996  Alice  basketball large
29997    Bob      tennis small
29998   Jane   football medium
29999  Alice  basketball large

[30000 rows x 2 columns]

In [76]: %%timeit
    ...: m = df['name'].str.contains('Al')
    ...: df.loc[m, 'sport'] = df.loc[m, 'sport'].str.replace('large', 'L', regex=True)
    ...: 
    ...: 
14.6 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [77]: %%timeit
    ...: df.loc[df.name.str.contains('Al'), 'sport'] = df.sport.str.replace('large', 'L')
    ...: 
    ...: 
34.8 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [78]: %%timeit
    ...: df['sport'] = np.where(df['name'].str.contains('Al'), df['sport'].str.replace('large', 'L', regex=True),  df['sport'])
    ...: 
    ...: 
35 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#10k matched values from 30k
df = pd.DataFrame({'name': ['Bob', 'Jane','Alice'] * 10000, 
                   'sport': ['tennis small', 'football medium', 'basketball large'] * 10000})



In [80]: %%timeit
    ...: m = df['name'].str.contains('Al')
    ...: df.loc[m, 'sport'] = df.loc[m, 'sport'].str.replace('large', 'L', regex=True)
    ...: 
    ...: 
22.2 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [81]: %%timeit
    ...: df.loc[df.name.str.contains('Al'), 'sport'] = df.sport.str.replace('large', 'L')
    ...: 
    ...: 
34 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [82]: %%timeit
    ...: df['sport'] = np.where(df['name'].str.contains('Al'), df['sport'].str.replace('large', 'L', regex=True),  df['sport'])
    ...: 
    ...: 
34.9 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Filtering DataFrame by list of substrings

str.contains accepts a regular expression, so join the alternatives with the regex alternation operator |:

df[df['menu_item'].str.contains('fresh|spaghetti')]

Example Input:

          menu_item
0        fresh fish
1      fresher fish
2           lasagna
3     spaghetti o's
4  something edible

Example Output:

       menu_item
0     fresh fish
1   fresher fish
3  spaghetti o's
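
When the substrings live in a list, the pattern can be built with '|'.join. The sketch below uses the example input from above; the names `substrings`, `pattern` and `result` are illustrative, and it assumes none of the substrings contains regex metacharacters.

```python
import pandas as pd

df = pd.DataFrame({"menu_item": ["fresh fish", "fresher fish", "lasagna",
                                 "spaghetti o's", "something edible"]})

substrings = ["fresh", "spaghetti"]   # hypothetical list of search terms
pattern = "|".join(substrings)        # builds "fresh|spaghetti"

# A row is kept if any of the alternatives occurs in menu_item.
result = df[df["menu_item"].str.contains(pattern, na=False)]
```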

Pandas how to filter for multiple substrings in series

Use re.escape() to escape regex metacharacters automatically, so you don't have to escape each string in the list searchfor by hand (its definition stays unchanged):

import re

searchfor = ['.F1', '.N1', '.FW', '.SP']            # no need to escape each string

pattern = '|'.join(map(re.escape, searchfor))       # use re.escape() with map()

mask = (df["id"].str.contains(pattern))

re.escape() will escape each string for you:

print(pattern)

\.F1|\.N1|\.FW|\.SP
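
To see why the escaping matters, here is a small comparison on made-up ids (the column values are assumptions; only `searchfor` comes from the answer). Without escaping, the dot matches any character, so "AXF1" is selected as well.

```python
import re
import pandas as pd

# Hypothetical ids: "AXF1" contains "F1" but not the literal text ".F1".
df = pd.DataFrame({"id": ["A.F1", "AXF1", "B.SP", "C.ZZ"]})
searchfor = [".F1", ".N1", ".FW", ".SP"]

escaped = "|".join(map(re.escape, searchfor))   # dots are matched literally
unescaped = "|".join(searchfor)                 # dots match any character

with_escape = df[df["id"].str.contains(escaped)]
without_escape = df[df["id"].str.contains(unescaped)]
```

Here `with_escape` keeps only the ids with a literal ".F1" or ".SP", while `without_escape` also picks up "AXF1".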

Pandas: Using df.eval with string variables as conditional filtering

The error is caused by the + inside the eval argument: you are trying to add the DataFrame column values to boolean_arg. What you are looking for is:

def select_twenty(input_df, column_name, boolean_arg, value):
    evaluated = input_df[input_df.eval(column_name + boolean_arg + value)]
    return evaluated

print(select_twenty(df, "A", ">", "20"))

       A
20    21
21    22
22    23
23    24
24    25
..   ...
195  196
196  197
197  198
198  199
199  200

[180 rows x 1 columns]
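
If the operator arrives as a string, an alternative that avoids building an expression for eval is to look it up in a small table of functions from the standard operator module. This is a different technique from the answer above, sketched under assumptions: the frame is rebuilt as A = 1..200 to match the printed output, and `OPS` and `select_rows` are hypothetical names.

```python
import operator
import pandas as pd

# Assumed input, consistent with the output above: A runs from 1 to 200.
df = pd.DataFrame({"A": range(1, 201)})

# Hypothetical whitelist mapping operator symbols to comparison functions.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}

def select_rows(input_df, column_name, op_symbol, value):
    """Return the rows where `column_name <op_symbol> value` holds."""
    compare = OPS[op_symbol]          # raises KeyError for unknown symbols
    return input_df[compare(input_df[column_name], value)]

result = select_rows(df, "A", ">", 20)
```

The value is passed as a number rather than a string, and an unsupported operator fails immediately with a KeyError instead of being evaluated.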