Pandas: Filter Rows of Dataframe with Operator Chaining

I'm not entirely sure what you want, but anyway: "chained" filtering is done by "chaining" the criteria in the boolean index. Note that the parentheses around each criterion are required, because & binds more tightly than comparison operators such as ==.

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

If you want to chain methods, you can add your own mask method and use it. (Note that this monkey-patch shadows the built-in pandas.DataFrame.mask, which has different semantics, so in practice you may want a different name.)

In [89]: import pandas, numpy as np

In [90]: def mask(df, key, value):
   ....:     return df[df[key] == value]
   ....:

In [92]: pandas.DataFrame.mask = mask

In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4, 4)), index=list('abcd'), columns=list('ABCD'))

In [95]: df.loc['d', 'A'] = df.loc['a', 'A']  # .ix has been removed from pandas; use .loc

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [97]: df.mask('A', 1)
Out[97]:
   A  B  C  D
a  1  4  9  1
d  1  3  9  6

In [98]: df.mask('A', 1).mask('D', 6)
Out[98]:
   A  B  C  D
d  1  3  9  6
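
As a side note, modern pandas supports this kind of chaining out of the box: .loc accepts a callable that receives the intermediate frame, so no monkey-patching is needed. A minimal sketch (data hard-coded to match the frame above):

import pandas as pd

df = pd.DataFrame({'A': [1, 4, 5, 1], 'B': [4, 5, 5, 3],
                   'C': [9, 0, 1, 9], 'D': [1, 2, 0, 6]}, index=list('abcd'))

# Each .loc[callable] is evaluated against the frame produced by the
# previous step, so the filters chain without intermediate variables.
out = df.loc[lambda d: d['A'] == 1].loc[lambda d: d['D'] == 6]
print(out)
#    A  B  C  D
# d  1  3  9  6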

Filter rows where any row fulfills the condition in each group, with pandas method chaining

You can apply groupby followed by filter to get the output.

df.groupby('user_id').filter(lambda x: (x['created_per_week'] != 0).any())

   user_id  is_manually  created_per_week
0       10         True                59
1       10        False                90
2       33         True                 0
3       33        False                64
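
If you prefer to stay with boolean-mask chaining (GroupBy.filter can be slow when there are many groups), an equivalent sketch using transform on the same columns:

mask = (df['created_per_week'] != 0).groupby(df['user_id']).transform('any')
out = df[mask]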

Filter pandas with operator chain from list

Use boolean indexing with a boolean mask created by np.all:

print (df)
   A  B   C
0  5  8  10
1  5  4   1
2  7  5   6
3  6  6   0
4  3  4   1

thr = [3, 6, 9]

df = df[np.all(df.values > np.array(thr), axis=1)]
print (df)
   A  B   C
0  5  8  10

A pandas solution uses DataFrame.gt (>) combined with DataFrame.all:

df = df[df.gt(thr).all(axis=1)]
print (df)
   A  B   C
0  5  8  10

And a solution with a list comprehension:

masks = [df.iloc[:, i] > j for i, j in enumerate(thr)]
df = df[pd.concat(masks, axis=1).all(axis=1)]

Alternative:

df = df[np.logical_and.reduce(masks)]

Explanation:

First compare all values with the np.array (thr must have the same length as the number of columns):

print (df.values > np.array(thr))
[[ True  True  True]
 [ True False False]
 [ True False False]
 [ True False False]
 [False False False]]

Then check whether all values per row are True with numpy.all:

print (np.all(df.values > np.array(thr), axis=1))
[ True False False False False]
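
For reference, a self-contained version of the above (the data is hard-coded to match the printed frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 5, 7, 6, 3], 'B': [8, 4, 5, 6, 4], 'C': [10, 1, 6, 0, 1]})
thr = [3, 6, 9]

# DataFrame.gt broadcasts the list across columns: A > 3, B > 6, C > 9
out = df[df.gt(thr).all(axis=1)]
print(out)
#    A  B   C
# 0  5  8  10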

Filter rows from a DataFrame with matching pairs of strings

You can group by "ID" and the condition, use transform with the nunique method to count the number of unique "Period" values per group, and keep the rows with more than 1 unique "Period":

out = df[df.groupby(['ID', (df["Period"].str.contains("0 Month") | df["Period"].str.contains("3 Month"))])['Period'].transform('nunique') > 1]

Note that instead of | you can use isin:

out = df[df.groupby(['ID', df["Period"].isin(['0 Month', '3 Month'])])['Period'].transform('nunique') > 1]

or combine the strings to match inside str.contains:

out = df[df.groupby(['ID', df["Period"].str.contains('0|3')])['Period'].transform('nunique') > 1]

Output:

   ID   Period
0   1  0 Month
1   2  0 Month
3   1  3 Month
4   2  3 Month
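
Since the question's input frame isn't shown here, a hypothetical input that reproduces this output (an ID that has only one of the two periods is dropped):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 1, 2],
                   'Period': ['0 Month', '0 Month', '0 Month', '3 Month', '3 Month']})

cond = df['Period'].isin(['0 Month', '3 Month'])
# ID 3 appears with only '0 Month', so its group has a single unique
# Period and is filtered out.
out = df[df.groupby(['ID', cond])['Period'].transform('nunique') > 1]
print(out)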

Efficient way to apply multiple filters to pandas DataFrame or Series

Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]:
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]:
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: from operator import ge, le

In [15]: def b(x, col, op, n):
   ....:     return op(x[col], n)
   ....:

In [16]: def f(x, *b):
   ....:     return x[np.logical_and(*b)]
   ....:

In [17]: b1 = b(df, 'col1', ge, 1)

In [18]: b2 = b(df, 'col1', le, 1)

In [19]: f(df, b1, b2)
Out[19]:
   col1  col2
1     1    11

Update: pandas 0.13 added a query method for these kinds of use cases; assuming column names are valid identifiers, the following works (and can be more efficient for large frames, since it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11
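
query also understands Python's chained comparisons, so the same filter can be written more compactly (same frame as above):

df.query('1 <= col1 <= 1')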

Pandas Method Chaining: getting KeyError on calculated column

If you access a column that doesn't yet exist at the time assign's arguments are evaluated, that access must go through a lambda:

import pandas as pd

dfs = pd.read_html('https://www.collegepollarchive.com/football/ap/seasons.cfm?seasonid=2019')
df = dfs[0][['Team (FPV)', 'Rank', 'Pts']].copy()
df['Year'] = 2016
df['Type'] = 'final'
df = df.assign(rank_int=pd.to_numeric(df['Rank'], errors='coerce').fillna(0).astype(int),
               gprank=df.groupby(['Year', 'Type'])['Pts'].rank(ascending=0, method='min'),
               ck_rank=lambda x: x['gprank'].sub(x['rank_int']))
print(df)

Output:

            Team (FPV)  Rank   Pts  Year   Type  rank_int  gprank  ck_rank
 0            LSU (62)     1  1550  2016  final         1     1.0      0.0
 1             Clemson     2  1487  2016  final         2     2.0      0.0
 2          Ohio State     3  1426  2016  final         3     3.0      0.0
 3             Georgia     4  1336  2016  final         4     4.0      0.0
 4              Oregon     5  1249  2016  final         5     5.0      0.0
 5             Florida     6  1211  2016  final         6     6.0      0.0
 6            Oklahoma     7  1179  2016  final         7     7.0      0.0
 7             Alabama     8  1159  2016  final         8     8.0      0.0
 8          Penn State     9  1038  2016  final         9     9.0      0.0
 9           Minnesota    10   952  2016  final        10    10.0      0.0
10           Wisconsin    11   883  2016  final        11    11.0      0.0
11          Notre Dame    12   879  2016  final        12    12.0      0.0
12              Baylor    13   827  2016  final        13    13.0      0.0
13              Auburn    14   726  2016  final        14    14.0      0.0
14                Iowa    15   699  2016  final        15    15.0      0.0
15                Utah    16   543  2016  final        16    16.0      0.0
16             Memphis    17   528  2016  final        17    17.0      0.0
17            Michigan    18   468  2016  final        18    18.0      0.0
18   Appalachian State    19   466  2016  final        19    19.0      0.0
19                Navy    20   415  2016  final        20    20.0      0.0
20          Cincinnati    21   343  2016  final        21    21.0      0.0
21           Air Force    22   209  2016  final        22    22.0      0.0
22         Boise State    23   188  2016  final        23    23.0      0.0
23                 UCF    24    78  2016  final        24    24.0      0.0
24               Texas    25    69  2016  final        25    25.0      0.0
25           Texas A&M    RV    54  2016  final         0    26.0     26.0
26    Florida Atlantic    RV    46  2016  final         0    27.0     27.0
27          Washington    RV    39  2016  final         0    28.0     28.0
28            Virginia    RV    28  2016  final         0    29.0     29.0
29                 USC    RV    16  2016  final         0    30.0     30.0
30     San Diego State    RV    13  2016  final         0    31.0     31.0
31       Arizona State    RV    12  2016  final         0    32.0     32.0
32                 SMU    RV    10  2016  final         0    33.0     33.0
33           Tennessee    RV     8  2016  final         0    34.0     34.0
34          California    RV     6  2016  final         0    35.0     35.0
35        Kansas State    RV     2  2016  final         0    36.0     36.0
36            Kentucky    RV     2  2016  final         0    36.0     36.0
37           Louisiana    RV     2  2016  final         0    36.0     36.0
38      Louisiana Tech    RV     2  2016  final         0    36.0     36.0
39  North Dakota State    RV     2  2016  final         0    36.0     36.0
40              Hawaii    NR     0  2016  final         0    41.0     41.0
41          Louisville    NR     0  2016  final         0    41.0     41.0
42      Oklahoma State    NR     0  2016  final         0    41.0     41.0
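
To isolate the rule, a minimal sketch with toy data: referencing a just-assigned column directly raises a KeyError, because assign's keyword arguments are evaluated against the original frame, while a lambda is called with the intermediate frame that already contains the new column.

import pandas as pd

df = pd.DataFrame({'Pts': [1550, 1487, 1426]})

# df['gprank'] would raise a KeyError here; the lambda defers the lookup
# until the intermediate frame carrying gprank exists.
out = df.assign(gprank=df['Pts'].rank(ascending=False, method='min'),
                ck_rank=lambda x: x['gprank'] - 1)
print(out)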

Filter dataframe rows based on return value of foo() applied to first column

TLDR

mask = df.apply(lambda row: foo(row['Path']), axis=1)
res: pd.DataFrame = df[mask]

Solution

To filter the rows of a DataFrame according to the return value of foo(s: str) -> bool applied to the values contained in column Path of each row, the solution is to generate a mask with pandas.DataFrame.apply().

How does a mask work?

The mask works as follows: given a dataframe df: pd.DataFrame and a boolean mask: pd.Series, accessing with square brackets df[mask] results in a new DataFrame containing only the rows corresponding to a True value in the mask series.

How to get the mask

Since df.apply(function, axis, ...) takes a function as input, one would be tempted to pass foo() directly as the argument of apply(), but this is wrong.
The function argument of apply() must be a function taking a pd.Series as its argument, not a string; therefore the correct way to get the mask is the following, where axis=1 indicates that the lambda is applied to every row of the dataframe rather than to every column.

mask = df.apply(lambda row: foo(row['Path']), axis=1)
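
For completeness, a self-contained sketch with a hypothetical foo (here: keep paths ending in .csv). Note that when the predicate only needs one column, applying it to that column directly is equivalent and avoids the row-wise pass:

import pandas as pd

def foo(path: str) -> bool:
    # Hypothetical predicate, for illustration only.
    return path.endswith('.csv')

df = pd.DataFrame({'Path': ['a.csv', 'b.txt', 'c.csv'], 'Size': [1, 2, 3]})

mask = df.apply(lambda row: foo(row['Path']), axis=1)  # row-wise, as above
# mask = df['Path'].apply(foo)                         # column-wise equivalent
res = df[mask]
print(res)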

