How to Filter Data Frame with Conditions of Two Columns

Efficient way to apply multiple filters to pandas DataFrame or Series

Pandas (and NumPy) allow boolean indexing, which is much more efficient than iterating over rows:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]:
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]:
   col1  col2
1     1    11
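To make the session above self-contained, here is a sketch with a small hypothetical frame reconstructed from the outputs (values assumed from the console session):

```python
import pandas as pd

# Hypothetical frame matching the console session above
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})

# Boolean indexing: the mask is a Series of True/False aligned on the index
mask = (df['col1'] >= 1) & (df['col1'] <= 1)
print(df[mask])
```

Each comparison produces a boolean Series; `&` combines them elementwise, and indexing with the result keeps only the True rows.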

If you want to write helper functions for this, consider something along these lines:

In [14]: import numpy as np; from operator import ge, le

In [15]: def b(x, col, op, n):
    ...:     return op(x[col], n)

In [16]: def f(x, *b):
    ...:     return x[np.logical_and(*b)]

In [17]: b1 = b(df, 'col1', ge, 1)

In [18]: b2 = b(df, 'col1', le, 1)

In [19]: f(df, b1, b2)
Out[19]:
   col1  col2
1     1    11

Update: pandas 0.13 added a query method for these kinds of use cases. Assuming column names are valid identifiers, the following works (and can be more efficient for large frames, since it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
col1 col2
1 1 11
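The two chained comparisons can also be written with Series.between, which is inclusive on both bounds by default (sample frame reconstructed from the session above):

```python
import pandas as pd

# Hypothetical data matching the console session
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})

# between(1, 1) is inclusive on both ends, so here it matches col1 == 1
print(df[df['col1'].between(1, 1)])
```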

Pandas: Filtering multiple conditions

Use parentheses, because of operator precedence (& binds tighter than comparison operators):

temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]

Alternatively, create conditions on separate rows:

cond1 = df["bin"] == 3    
cond2 = df["days since"] > 7
cond3 = ~df["Def"]

temp2 = df[cond1 & cond2 & cond3]
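When the number of conditions grows, the separate-masks style scales well: you can fold any list of masks together with reduce. A minimal sketch, with sample data assumed to match the answer's frame:

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hypothetical frame with the columns used above
df = pd.DataFrame({'Def': [True, False, False],
                   'days since': [8, 14, 2],
                   'bin': [3, 3, 3]})

conds = [df['bin'] == 3, df['days since'] > 7, ~df['Def']]

# reduce chains the & operator over any number of masks;
# np.logical_and.reduce(conds) is a NumPy equivalent
mask = reduce(lambda a, b: a & b, conds)
print(df[mask])
```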

Sample:

df = pd.DataFrame({'Def': [True]*2 + [False]*4,
                   'days since': [7, 8, 9, 14, 2, 13],
                   'bin': [1, 3, 5, 3, 3, 3]})

print (df)
     Def  bin  days since
0   True    1           7
1   True    3           8
2  False    5           9
3  False    3          14
4  False    3           2
5  False    3          13

temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
     Def  bin  days since
3  False    3          14
5  False    3          13

Filtering two columns of a dataframe with filter

Rather than using filter, I would suggest a more idiomatic way to proceed.

Suppose you want to filter on the word "Mortar":

# Simply define two filtering masks, since one column contains lists of strings,
# whereas the other one simply contains strings
mask1 = df["Name"].apply(lambda x: "Mortar".casefold() in "".join(x).casefold())
mask2 = df["NAME_FILE"].apply(lambda x: "Mortar".casefold() in x.casefold())
# If a filtered word is present in either column of the same row,
# then the whole row should be kept
print(df.loc[mask1 | mask2, :])

Name NAME_FILE
0 [ Verbundmörtel , Compound Mortar , Malta p... AdhesiveCoveringPlaster_2
^^^^^^
1 [ StoLevell In Absolute , StoLevell In Absolu... AdhesiveMortarLevellInForAEVERO_720
^^^^^^
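The same idea can be expressed with the vectorized str accessor instead of apply; a sketch under the assumption that Name holds lists of strings and NAME_FILE holds plain strings, with made-up sample rows mirroring the output above:

```python
import pandas as pd

# Hypothetical data: one column of lists of strings, one of plain strings
df = pd.DataFrame({
    'Name': [['Verbundmörtel', 'Compound Mortar'], ['StoLevell In Absolute']],
    'NAME_FILE': ['AdhesiveCoveringPlaster_2', 'AdhesiveMortarLevellInForAEVERO_720'],
})

word = 'Mortar'
# str.join flattens each list into one string; case=False makes the
# substring match case-insensitive in both columns
mask1 = df['Name'].str.join(' ').str.contains(word, case=False)
mask2 = df['NAME_FILE'].str.contains(word, case=False)
print(df[mask1 | mask2])
```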

Pandas filtering based on 2 different columns conditions

Use this:

data = data.loc[ ~((data.Name == 'RACHEL') & (data.Job == 'CHEF')) ]

You want to remove all rows where both Name == 'RACHEL' and Job == 'CHEF', so just write that condition and invert it with ~ to filter them out.
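By De Morgan's laws, inverting the conjunction is the same as keeping rows where at least one condition fails. A quick sketch on hypothetical data showing both forms agree:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'Name': ['RACHEL', 'RACHEL', 'MONICA'],
                   'Job': ['CHEF', 'TEACHER', 'CHEF']})

dropped = df.loc[~((df.Name == 'RACHEL') & (df.Job == 'CHEF'))]
# Equivalent by De Morgan: (Name != 'RACHEL') | (Job != 'CHEF')
same = df.loc[(df.Name != 'RACHEL') | (df.Job != 'CHEF')]
assert dropped.equals(same)
print(dropped)
```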

Dataframe filtering with multiple conditions on different columns

Using the answer by Corralien and taking advantage of what's written by tiitinha, and considering the possibility to have some NaN values, here is how I put it all together:

df.replace([np.nan], np.inf, inplace=True)

condlist = [df['A'].between(-100, 100) | (df['A'] == np.inf),
            df['B'].between(-100, 100) | (df['B'] == np.inf),
            df['C'].between(-70, 70) | (df['C'] == np.inf),
            df['D'].between(100, 300) | (df['D'] == np.inf),
            df['E'].between(100, 300) | (df['E'] == np.inf)]

(Note the parentheses around each == comparison: | binds tighter than ==, so without them the expression parses incorrectly.)

To get the total number of failed parameters for each item:

bool_df = ~pd.concat(condlist, axis=1).astype('bool')

df['#Fails'] = bool_df.sum(axis=1)

To know who are the parameters out of limits, for each item:

df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
.loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)

This way I get two columns with the desired results.
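The fail-counting step can be checked on a tiny frame; a sketch with two made-up parameter columns (names A and D taken from the condlist above, limits as stated there, NaN treated as "not measured" and so passing):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; NaN means "not measured"
df = pd.DataFrame({'A': [0, 500, np.nan], 'D': [200, 50, 150]})
df = df.replace(np.nan, np.inf)

# Parentheses around == are required: | binds tighter than ==
condlist = [df['A'].between(-100, 100) | (df['A'] == np.inf),
            df['D'].between(100, 300) | (df['D'] == np.inf)]

# Invert the pass-masks, then count the True (failed) cells per row
bool_df = ~pd.concat(condlist, axis=1)
df['#Fails'] = bool_df.sum(axis=1)
print(df)
```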

Filter data frame on multiple conditions

Here's how I would do this.

data = yf.download('spy', start='1990-01-01', end='2000-01-01')

Output:

print(data)
                  Open       High         Low       Close  Adj Close   Volume
Date
1993-01-29    43.96875   43.96875   43.750000   43.937500  25.438084  1003200
1993-02-01    43.96875   44.25000   43.968750   44.250000  25.619017   480500
1993-02-02    44.21875   44.37500   44.125000   44.343750  25.673300   201300
1993-02-03    44.40625   44.84375   44.375000   44.812500  25.944689   529400
1993-02-04    44.96875   45.09375   44.468750   45.000000  26.053246   531500
...                ...        ...         ...         ...        ...      ...
1999-12-27   146.50000  146.78125  145.062500  146.281250  96.697540  2691000
1999-12-28   145.87500  146.50000  145.484375  146.187500  96.635559  4084500
1999-12-29   146.31250  146.81250  145.312500  146.812500  97.048737  3001000
1999-12-30   147.12500  147.56250  146.187500  146.640625  96.935066  3641300
1999-12-31   146.84375  147.50000  146.250000  146.875000  97.090034  3172700

[1749 rows x 6 columns]
  1. Filter out the data to only get the rows where 'Open' > 10 and 'Close' < 50

    df = data[(data['Open'] > 10) & (data['Close'] < 50)]

Output:

print(df)
                 Open      High        Low      Close  Adj Close   Volume
Date
1993-01-29  43.968750  43.96875  43.750000  43.937500  25.438084  1003200
1993-02-01  43.968750  44.25000  43.968750  44.250000  25.619017   480500
1993-02-02  44.218750  44.37500  44.125000  44.343750  25.673300   201300
1993-02-03  44.406250  44.84375  44.375000  44.812500  25.944689   529400
1993-02-04  44.968750  45.09375  44.468750  45.000000  26.053246   531500
...               ...       ...        ...        ...        ...      ...
1995-03-17  49.437500  49.62500  49.406250  49.562500  30.364269    89900
1995-03-20  49.625000  49.62500  49.468750  49.562500  30.364269    91700
1995-03-21  49.562500  49.87500  49.359375  49.437500  30.287670   104400
1995-03-22  49.531250  49.53125  49.328125  49.484375  30.316399    74900
1995-03-23  49.421875  49.65625  49.359375  49.515625  30.335543   220500

[543 rows x 6 columns]

  2. Create a column that shifts the dates, essentially creating a column that contains the following date in that filtered dataframe

    df = df.reset_index(drop=False)

    df['Next_Date'] = df['Date'].shift(-1)

Output:

print(df)
           Date       Open      High  ...  Adj Close   Volume  Next_Date
0    1993-01-29  43.968750  43.96875  ...  25.438084  1003200 1993-02-01
1    1993-02-01  43.968750  44.25000  ...  25.619017   480500 1993-02-02
2    1993-02-02  44.218750  44.37500  ...  25.673300   201300 1993-02-03
3    1993-02-03  44.406250  44.84375  ...  25.944689   529400 1993-02-04
4    1993-02-04  44.968750  45.09375  ...  26.053246   531500 1993-02-05
..          ...        ...       ...  ...        ...      ...        ...
538  1995-03-17  49.437500  49.62500  ...  30.364269    89900 1995-03-20
539  1995-03-20  49.625000  49.62500  ...  30.364269    91700 1995-03-21
540  1995-03-21  49.562500  49.87500  ...  30.287670   104400 1995-03-22
541  1995-03-22  49.531250  49.53125  ...  30.316399    74900 1995-03-23
542  1995-03-23  49.421875  49.65625  ...  30.335543   220500        NaT

[543 rows x 8 columns]

  3. Get the difference in days between each row's date and the following-date column we just created

    df['Difference'] = (df['Next_Date'] - df['Date']).dt.days

Output:

print(df)
           Date       Open      High  ...   Volume  Next_Date  Difference
0    1993-01-29  43.968750  43.96875  ...  1003200 1993-02-01         3.0
1    1993-02-01  43.968750  44.25000  ...   480500 1993-02-02         1.0
2    1993-02-02  44.218750  44.37500  ...   201300 1993-02-03         1.0
3    1993-02-03  44.406250  44.84375  ...   529400 1993-02-04         1.0
4    1993-02-04  44.968750  45.09375  ...   531500 1993-02-05         1.0
..          ...        ...       ...  ...      ...        ...         ...
538  1995-03-17  49.437500  49.62500  ...    89900 1995-03-20         3.0
539  1995-03-20  49.625000  49.62500  ...    91700 1995-03-21         1.0
540  1995-03-21  49.562500  49.87500  ...   104400 1995-03-22         1.0
541  1995-03-22  49.531250  49.53125  ...    74900 1995-03-23         1.0
542  1995-03-23  49.421875  49.65625  ...   220500        NaT         NaN

[543 rows x 9 columns]

  4. Filter on that difference of days by your "n_days"

    n_days = 2

    df = df[df['Difference'] <= n_days]

Output:

print(df)
           Date      Open       High  ...   Volume  Next_Date  Difference
1    1993-02-01  43.96875  44.250000  ...   480500 1993-02-02         1.0
2    1993-02-02  44.21875  44.375000  ...   201300 1993-02-03         1.0
3    1993-02-03  44.40625  44.843750  ...   529400 1993-02-04         1.0
4    1993-02-04  44.96875  45.093750  ...   531500 1993-02-05         1.0
6    1993-02-08  44.96875  45.125000  ...   596100 1993-02-09         1.0
..          ...       ...        ...  ...      ...        ...         ...
536  1995-03-15  49.50000  49.578125  ...   278500 1995-03-16         1.0
537  1995-03-16  49.43750  49.812500  ...    20400 1995-03-17         1.0
539  1995-03-20  49.62500  49.625000  ...    91700 1995-03-21         1.0
540  1995-03-21  49.56250  49.875000  ...   104400 1995-03-22         1.0
541  1995-03-22  49.53125  49.531250  ...    74900 1995-03-23         1.0

[430 rows x 9 columns]

Full Code:

import yfinance as yf

data = yf.download('spy', start='1990-01-01', end='2000-01-01')
n_days = 2

df = data[(data['Open'] > 10) & (data['Close'] < 50)]
df = df.reset_index(drop=False)
df['Next_Date'] = df['Date'].shift(-1)
df['Difference'] = (df['Next_Date'] - df['Date']).dt.days

df = df[df['Difference'] <= n_days]
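The pipeline above can be exercised without a network call; a sketch on a small synthetic frame (column names and index name chosen to mimic the yfinance download, values made up):

```python
import pandas as pd

# Synthetic stand-in for the downloaded data: a DatetimeIndex named 'Date'
idx = pd.to_datetime(['1993-02-01', '1993-02-02', '1993-02-05', '1993-02-08'])
data = pd.DataFrame({'Open': [44.0, 44.2, 60.0, 44.9],
                     'Close': [44.2, 44.3, 61.0, 45.0]},
                    index=pd.Index(idx, name='Date'))

n_days = 2

# 1. price filter  2. next-date column  3. day gap  4. gap filter
df = data[(data['Open'] > 10) & (data['Close'] < 50)]
df = df.reset_index(drop=False)
df['Next_Date'] = df['Date'].shift(-1)
df['Difference'] = (df['Next_Date'] - df['Date']).dt.days
df = df[df['Difference'] <= n_days]
print(df)
```

Only 1993-02-01 survives here: 1993-02-05 fails the price filter, so the gap after 1993-02-02 becomes 6 days, and the last row's NaN difference is dropped by the comparison.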

Pandas: Filter correctly Dataframe columns considering multiple conditions

You have an operator precedence issue: in Python, the | operator has higher precedence than ==, so wrapping the comparison expressions in parentheses solves your problem. Also, since the funny, useful and cool columns are str type, use the string '1' instead of the number 1:

filtered_data = df[(df['star_rating'] >= 3) & ((df['funny']=='1') | (df['useful']=='1') | (df['cool']=='1'))]


Besides using |, you can also compare multiple columns in one go and then check condition with any:

filtered_data = df[(df['star_rating'] >= 3) & df[['funny', 'useful', 'cool']].eq('1').any(axis=1)]
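A sketch of the eq/any variant on hypothetical data (column names and the string-typed vote columns assumed from the question):

```python
import pandas as pd

# Hypothetical data: star_rating numeric, vote columns stored as strings
df = pd.DataFrame({'star_rating': [5, 4, 2],
                   'funny': ['1', '0', '1'],
                   'useful': ['0', '0', '0'],
                   'cool': ['0', '1', '1']})

# eq('1') compares all three columns at once; any(axis=1) keeps a row
# if at least one of them equals '1'
out = df[(df['star_rating'] >= 3) & df[['funny', 'useful', 'cool']].eq('1').any(axis=1)]
print(out)
```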

Filter a dataframe based on condition in columns selected by name pattern

You can filter multiple columns at once using if_all:

library(dplyr)

df %>%
  filter(if_all(matches("_qvalue"), ~ . < 0.05))

In this case I apply the filtering condition `. < 0.05` to all columns whose names match _qvalue.

Your second approach can also work if you group by ID first and then use all inside filter:

df_ID = df %>% mutate(ID = 1:n())

df_ID %>%
  select(contains("qval"), ID) %>%
  gather(variable, value, -ID) %>%
  group_by(ID) %>%
  filter(all(value < 0.05)) %>%
  semi_join(df_ID, by = "ID")
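For comparison, an equivalent of the if_all pattern in pandas (column names made up for illustration) selects the matching columns with filter(regex=...) and requires every one to pass with all(axis=1):

```python
import pandas as pd

# Hypothetical frame with two q-value columns and one other column
df = pd.DataFrame({'x_qvalue': [0.01, 0.2, 0.03],
                   'y_qvalue': [0.04, 0.01, 0.5],
                   'gene': ['a', 'b', 'c']})

# Keep rows where every *_qvalue column is below 0.05
mask = (df.filter(regex='_qvalue') < 0.05).all(axis=1)
print(df[mask])
```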

