Pandas: Filter Rows of Dataframe with Operator Chaining

I'm not entirely sure what you want, but anyway: "chained" filtering is done by "chaining" the criteria in the boolean index. Note that the parentheses around each criterion are required, because & binds more tightly than comparison operators such as ==.

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [99]: df[(df.A == 1) & (df.D == 6)]
Out[99]:
   A  B  C  D
d  1  3  9  6

If you want to chain methods, you can add your own mask method and use it. (Note that this monkey-patch shadows the built-in pandas.DataFrame.mask, which has different semantics, so in practice you may want a different name.)

In [89]: import pandas, numpy as np

In [90]: def mask(df, key, value):
   ....:     return df[df[key] == value]
   ....:

In [92]: pandas.DataFrame.mask = mask

In [93]: df = pandas.DataFrame(np.random.randint(0, 10, (4, 4)), index=list('abcd'), columns=list('ABCD'))

In [95]: df.loc['d', 'A'] = df.loc['a', 'A']  # .ix has been removed from pandas; use .loc

In [96]: df
Out[96]:
   A  B  C  D
a  1  4  9  1
b  4  5  0  2
c  5  5  1  0
d  1  3  9  6

In [97]: df.mask('A', 1)
Out[97]:
   A  B  C  D
a  1  4  9  1
d  1  3  9  6

In [98]: df.mask('A', 1).mask('D', 6)
Out[98]:
   A  B  C  D
d  1  3  9  6
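
As a side note, modern pandas supports this kind of chaining out of the box: .loc accepts a callable that receives the intermediate frame, so no monkey-patching is needed. A minimal sketch (data hard-coded to match the frame above):

import pandas as pd

df = pd.DataFrame({'A': [1, 4, 5, 1], 'B': [4, 5, 5, 3],
                   'C': [9, 0, 1, 9], 'D': [1, 2, 0, 6]}, index=list('abcd'))

# Each .loc[callable] is evaluated against the frame produced by the
# previous step, so the filters chain without intermediate variables.
out = df.loc[lambda d: d['A'] == 1].loc[lambda d: d['D'] == 6]
print(out)
#    A  B  C  D
# d  1  3  9  6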

Filter rows where any row fulfills the condition in each group, with pandas method chaining

You can apply groupby followed by filter to get the output.

df.groupby('user_id').filter(lambda x: (x['created_per_week'] != 0).any())

   user_id  is_manually  created_per_week
0       10         True                59
1       10        False                90
2       33         True                 0
3       33        False                64
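
If you prefer to stay with boolean-mask chaining (GroupBy.filter can be slow when there are many groups), an equivalent sketch using transform on the same columns:

mask = (df['created_per_week'] != 0).groupby(df['user_id']).transform('any')
out = df[mask]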

Filter pandas with operator chain from list

Use boolean indexing with a boolean mask created by np.all:

print (df)
   A  B   C
0  5  8  10
1  5  4   1
2  7  5   6
3  6  6   0
4  3  4   1

thr = [3, 6, 9]

df = df[np.all(df.values > np.array(thr), axis=1)]
print (df)
   A  B   C
0  5  8  10

A pandas solution uses DataFrame.gt (>) combined with DataFrame.all:

df = df[df.gt(thr).all(axis=1)]
print (df)
   A  B   C
0  5  8  10

And a solution with a list comprehension:

masks = [df.iloc[:, i] > j for i, j in enumerate(thr)]
df = df[pd.concat(masks, axis=1).all(axis=1)]

Alternative:

df = df[np.logical_and.reduce(masks)]

Explanation:

First compare all values with the np.array (thr must have the same length as the number of columns):

print (df.values > np.array(thr))
[[ True  True  True]
 [ True False False]
 [ True False False]
 [ True False False]
 [False False False]]

Then check whether all values per row are True with numpy.all:

print (np.all(df.values > np.array(thr), axis=1))
[ True False False False False]
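
For reference, a self-contained version of the above (the data is hard-coded to match the printed frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 5, 7, 6, 3], 'B': [8, 4, 5, 6, 4], 'C': [10, 1, 6, 0, 1]})
thr = [3, 6, 9]

# DataFrame.gt broadcasts the list across columns: A > 3, B > 6, C > 9
out = df[df.gt(thr).all(axis=1)]
print(out)
#    A  B   C
# 0  5  8  10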

Filter rows from a DataFrame with matching pairs of strings

You can group by "ID" and the condition, use transform with the nunique method to count the number of unique "Period" values per group, and keep the rows with more than 1 unique "Period":

out = df[df.groupby(['ID', (df["Period"].str.contains("0 Month") | df["Period"].str.contains("3 Month"))])['Period'].transform('nunique') > 1]

Note that instead of | you can use isin:

out = df[df.groupby(['ID', df["Period"].isin(['0 Month', '3 Month'])])['Period'].transform('nunique') > 1]

or combine the strings to match inside str.contains:

out = df[df.groupby(['ID', df["Period"].str.contains('0|3')])['Period'].transform('nunique') > 1]

Output:

   ID   Period
0   1  0 Month
1   2  0 Month
3   1  3 Month
4   2  3 Month
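
Since the question's input frame isn't shown here, a hypothetical input that reproduces this output (an ID that has only one of the two periods is dropped):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 1, 2],
                   'Period': ['0 Month', '0 Month', '0 Month', '3 Month', '3 Month']})

cond = df['Period'].isin(['0 Month', '3 Month'])
# ID 3 appears with only '0 Month', so its group has a single unique
# Period and is filtered out.
out = df[df.groupby(['ID', cond])['Period'].transform('nunique') > 1]
print(out)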

Efficient way to apply multiple filters to pandas DataFrame or Series

Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]:
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]:
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: from operator import ge, le

In [15]: def b(x, col, op, n):
   ....:     return op(x[col], n)
   ....:

In [16]: def f(x, *b):
   ....:     return x[np.logical_and(*b)]
   ....:

In [17]: b1 = b(df, 'col1', ge, 1)

In [18]: b2 = b(df, 'col1', le, 1)

In [19]: f(df, b1, b2)
Out[19]:
   col1  col2
1     1    11

Update: pandas 0.13 added a query method for these kinds of use cases; assuming column names are valid identifiers, the following works (and can be more efficient for large frames, since it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11
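
query also understands Python's chained comparisons, so the same filter can be written more compactly (same frame as above):

df.query('1 <= col1 <= 1')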

Pandas Method Chaining: getting KeyError on calculated column

If you access a column that doesn't yet exist at the time assign's arguments are evaluated, that access must go through a lambda:

import pandas as pd

dfs = pd.read_html('https://www.collegepollarchive.com/football/ap/seasons.cfm?seasonid=2019')
df = dfs[0][['Team (FPV)', 'Rank', 'Pts']].copy()
df['Year'] = 2016
df['Type'] = 'final'
df = df.assign(rank_int=pd.to_numeric(df['Rank'], errors='coerce').fillna(0).astype(int),
               gprank=df.groupby(['Year', 'Type'])['Pts'].rank(ascending=0, method='min'),
               ck_rank=lambda x: x['gprank'].sub(x['rank_int']))
print(df)

Output:

            Team (FPV)  Rank   Pts  Year   Type  rank_int  gprank  ck_rank
 0            LSU (62)     1  1550  2016  final         1     1.0      0.0
 1             Clemson     2  1487  2016  final         2     2.0      0.0
 2          Ohio State     3  1426  2016  final         3     3.0      0.0
 3             Georgia     4  1336  2016  final         4     4.0      0.0
 4              Oregon     5  1249  2016  final         5     5.0      0.0
 5             Florida     6  1211  2016  final         6     6.0      0.0
 6            Oklahoma     7  1179  2016  final         7     7.0      0.0
 7             Alabama     8  1159  2016  final         8     8.0      0.0
 8          Penn State     9  1038  2016  final         9     9.0      0.0
 9           Minnesota    10   952  2016  final        10    10.0      0.0
10           Wisconsin    11   883  2016  final        11    11.0      0.0
11          Notre Dame    12   879  2016  final        12    12.0      0.0
12              Baylor    13   827  2016  final        13    13.0      0.0
13              Auburn    14   726  2016  final        14    14.0      0.0
14                Iowa    15   699  2016  final        15    15.0      0.0
15                Utah    16   543  2016  final        16    16.0      0.0
16             Memphis    17   528  2016  final        17    17.0      0.0
17            Michigan    18   468  2016  final        18    18.0      0.0
18   Appalachian State    19   466  2016  final        19    19.0      0.0
19                Navy    20   415  2016  final        20    20.0      0.0
20          Cincinnati    21   343  2016  final        21    21.0      0.0
21           Air Force    22   209  2016  final        22    22.0      0.0
22         Boise State    23   188  2016  final        23    23.0      0.0
23                 UCF    24    78  2016  final        24    24.0      0.0
24               Texas    25    69  2016  final        25    25.0      0.0
25           Texas A&M    RV    54  2016  final         0    26.0     26.0
26    Florida Atlantic    RV    46  2016  final         0    27.0     27.0
27          Washington    RV    39  2016  final         0    28.0     28.0
28            Virginia    RV    28  2016  final         0    29.0     29.0
29                 USC    RV    16  2016  final         0    30.0     30.0
30     San Diego State    RV    13  2016  final         0    31.0     31.0
31       Arizona State    RV    12  2016  final         0    32.0     32.0
32                 SMU    RV    10  2016  final         0    33.0     33.0
33           Tennessee    RV     8  2016  final         0    34.0     34.0
34          California    RV     6  2016  final         0    35.0     35.0
35        Kansas State    RV     2  2016  final         0    36.0     36.0
36            Kentucky    RV     2  2016  final         0    36.0     36.0
37           Louisiana    RV     2  2016  final         0    36.0     36.0
38      Louisiana Tech    RV     2  2016  final         0    36.0     36.0
39  North Dakota State    RV     2  2016  final         0    36.0     36.0
40              Hawaii    NR     0  2016  final         0    41.0     41.0
41          Louisville    NR     0  2016  final         0    41.0     41.0
42      Oklahoma State    NR     0  2016  final         0    41.0     41.0
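
To isolate the rule, a minimal sketch with toy data: referencing a just-assigned column directly raises a KeyError, because assign's keyword arguments are evaluated against the original frame, while a lambda is called with the intermediate frame that already contains the new column.

import pandas as pd

df = pd.DataFrame({'Pts': [1550, 1487, 1426]})

# df['gprank'] would raise a KeyError here; the lambda defers the lookup
# until the intermediate frame carrying gprank exists.
out = df.assign(gprank=df['Pts'].rank(ascending=False, method='min'),
                ck_rank=lambda x: x['gprank'] - 1)
print(out)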

Filter dataframe rows based on return value of foo() applied to first column

TLDR

mask = df.apply(lambda row: foo(row['Path']), axis=1)
res: pd.DataFrame = df[mask]

Solution

To filter the rows of a DataFrame according to the return value of foo(s: str) -> bool applied to the values contained in column Path of each row, the solution is to generate a mask with pandas.DataFrame.apply().

How does a mask work?

The mask works as follows: given a dataframe df: pd.DataFrame and a boolean mask: pd.Series, accessing with square brackets df[mask] results in a new DataFrame containing only the rows corresponding to a True value in the mask series.

How to get the mask

Since df.apply(function, axis, ...) takes a function as input, one would be tempted to pass foo() directly as the argument of apply(), but this is wrong.
The function argument of apply() must be a function taking a pd.Series as its argument, not a string; therefore the correct way to get the mask is the following, where axis=1 indicates that the lambda is applied to every row of the dataframe rather than to every column.

mask = df.apply(lambda row: foo(row['Path']), axis=1)
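
For completeness, a self-contained sketch with a hypothetical foo (here: keep paths ending in .csv). Note that when the predicate only needs one column, applying it to that column directly is equivalent and avoids the row-wise pass:

import pandas as pd

def foo(path: str) -> bool:
    # Hypothetical predicate, for illustration only.
    return path.endswith('.csv')

df = pd.DataFrame({'Path': ['a.csv', 'b.txt', 'c.csv'], 'Size': [1, 2, 3]})

mask = df.apply(lambda row: foo(row['Path']), axis=1)  # row-wise, as above
# mask = df['Path'].apply(foo)                         # column-wise equivalent
res = df[mask]
print(res)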

