How to Filter Data Frame with Conditions of Two Columns

Efficient way to apply multiple filters to pandas DataFrame or Series

Pandas (and NumPy) allow boolean indexing, which is much more efficient than iterating over rows:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]:
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]:
   col1  col2
1     1    11
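To make the session above self-contained, here is a sketch with a small hypothetical frame reconstructed from the outputs (values assumed from the console session):

```python
import pandas as pd

# Hypothetical frame matching the console session above
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})

# Boolean indexing: the mask is a Series of True/False aligned on the index
mask = (df['col1'] >= 1) & (df['col1'] <= 1)
print(df[mask])
```

Each comparison produces a boolean Series; `&` combines them elementwise, and indexing with the result keeps only the True rows.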

If you want to write helper functions for this, consider something along these lines:

In [14]: import numpy as np; from operator import ge, le

In [15]: def b(x, col, op, n):
    ...:     return op(x[col], n)

In [16]: def f(x, *b):
    ...:     return x[np.logical_and(*b)]

In [17]: b1 = b(df, 'col1', ge, 1)

In [18]: b2 = b(df, 'col1', le, 1)

In [19]: f(df, b1, b2)
Out[19]:
   col1  col2
1     1    11

Update: pandas 0.13 added a query method for these kinds of use cases. Assuming column names are valid identifiers, the following works (and can be more efficient for large frames, since it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
col1 col2
1 1 11
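The two chained comparisons can also be written with Series.between, which is inclusive on both bounds by default (sample frame reconstructed from the session above):

```python
import pandas as pd

# Hypothetical data matching the console session
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})

# between(1, 1) is inclusive on both ends, so here it matches col1 == 1
print(df[df['col1'].between(1, 1)])
```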

Pandas: Filtering multiple conditions

Use parentheses, because of operator precedence (& binds tighter than comparison operators):

temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]

Alternatively, create conditions on separate rows:

cond1 = df["bin"] == 3    
cond2 = df["days since"] > 7
cond3 = ~df["Def"]

temp2 = df[cond1 & cond2 & cond3]
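When the number of conditions grows, the separate-masks style scales well: you can fold any list of masks together with reduce. A minimal sketch, with sample data assumed to match the answer's frame:

```python
from functools import reduce

import numpy as np
import pandas as pd

# Hypothetical frame with the columns used above
df = pd.DataFrame({'Def': [True, False, False],
                   'days since': [8, 14, 2],
                   'bin': [3, 3, 3]})

conds = [df['bin'] == 3, df['days since'] > 7, ~df['Def']]

# reduce chains the & operator over any number of masks;
# np.logical_and.reduce(conds) is a NumPy equivalent
mask = reduce(lambda a, b: a & b, conds)
print(df[mask])
```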

Sample:

df = pd.DataFrame({'Def': [True]*2 + [False]*4,
                   'days since': [7, 8, 9, 14, 2, 13],
                   'bin': [1, 3, 5, 3, 3, 3]})

print (df)
     Def  bin  days since
0   True    1           7
1   True    3           8
2  False    5           9
3  False    3          14
4  False    3           2
5  False    3          13

temp2 = df[~df["Def"] & (df["days since"] > 7) & (df["bin"] == 3)]
print (temp2)
     Def  bin  days since
3  False    3          14
5  False    3          13

Filtering two columns of a dataframe with filter

Rather than using filter, I would suggest a more idiomatic way to proceed.

Suppose you want to filter on the word "Mortar":

# Simply define two filtering masks, since one column contains lists of strings,
# whereas the other one simply contains strings
mask1 = df["Name"].apply(lambda x: "Mortar".casefold() in "".join(x).casefold())
mask2 = df["NAME_FILE"].apply(lambda x: "Mortar".casefold() in x.casefold())
# If a filtered word is present in either column of the same row,
# then the whole row should be kept
print(df.loc[mask1 | mask2, :])

Name NAME_FILE
0 [ Verbundmörtel , Compound Mortar , Malta p... AdhesiveCoveringPlaster_2
^^^^^^
1 [ StoLevell In Absolute , StoLevell In Absolu... AdhesiveMortarLevellInForAEVERO_720
^^^^^^
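The same idea can be expressed with the vectorized str accessor instead of apply; a sketch under the assumption that Name holds lists of strings and NAME_FILE holds plain strings, with made-up sample rows mirroring the output above:

```python
import pandas as pd

# Hypothetical data: one column of lists of strings, one of plain strings
df = pd.DataFrame({
    'Name': [['Verbundmörtel', 'Compound Mortar'], ['StoLevell In Absolute']],
    'NAME_FILE': ['AdhesiveCoveringPlaster_2', 'AdhesiveMortarLevellInForAEVERO_720'],
})

word = 'Mortar'
# str.join flattens each list into one string; case=False makes the
# substring match case-insensitive in both columns
mask1 = df['Name'].str.join(' ').str.contains(word, case=False)
mask2 = df['NAME_FILE'].str.contains(word, case=False)
print(df[mask1 | mask2])
```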

Pandas filtering based on 2 different columns conditions

Use this:

data = data.loc[ ~((data.Name == 'RACHEL') & (data.Job == 'CHEF')) ]

You want to remove all rows where both Name == 'RACHEL' and Job == 'CHEF', so just write that condition and invert it with ~ to filter them out.
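By De Morgan's laws, inverting the conjunction is the same as keeping rows where at least one condition fails. A quick sketch on hypothetical data showing both forms agree:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'Name': ['RACHEL', 'RACHEL', 'MONICA'],
                   'Job': ['CHEF', 'TEACHER', 'CHEF']})

dropped = df.loc[~((df.Name == 'RACHEL') & (df.Job == 'CHEF'))]
# Equivalent by De Morgan: (Name != 'RACHEL') | (Job != 'CHEF')
same = df.loc[(df.Name != 'RACHEL') | (df.Job != 'CHEF')]
assert dropped.equals(same)
print(dropped)
```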

Dataframe filtering with multiple conditions on different columns

Using the answer by Corralien and taking advantage of what's written by tiitinha, and considering the possibility to have some NaN values, here is how I put it all together:

df.replace([np.nan], np.inf, inplace=True)

condlist = [df['A'].between(-100, 100) | (df['A'] == np.inf),
            df['B'].between(-100, 100) | (df['B'] == np.inf),
            df['C'].between(-70, 70) | (df['C'] == np.inf),
            df['D'].between(100, 300) | (df['D'] == np.inf),
            df['E'].between(100, 300) | (df['E'] == np.inf)]

(Note the parentheses around each == comparison: | binds tighter than ==, so without them the expression parses incorrectly.)

To get the total number of failed parameters for each item:

bool_df = ~pd.concat(condlist, axis=1).astype('bool')

df['#Fails'] = bool_df.sum(axis=1)

To know who are the parameters out of limits, for each item:

df['Fail'] = pd.concat(condlist, axis=1).melt(ignore_index=False) \
.loc[lambda x: ~x['value']].groupby(level=0)['variable'].apply(list)

This way I get two columns with the desired results.
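The fail-counting step can be checked on a tiny frame; a sketch with two made-up parameter columns (names A and D taken from the condlist above, limits as stated there, NaN treated as "not measured" and so passing):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; NaN means "not measured"
df = pd.DataFrame({'A': [0, 500, np.nan], 'D': [200, 50, 150]})
df = df.replace(np.nan, np.inf)

# Parentheses around == are required: | binds tighter than ==
condlist = [df['A'].between(-100, 100) | (df['A'] == np.inf),
            df['D'].between(100, 300) | (df['D'] == np.inf)]

# Invert the pass-masks, then count the True (failed) cells per row
bool_df = ~pd.concat(condlist, axis=1)
df['#Fails'] = bool_df.sum(axis=1)
print(df)
```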

Filter data frame on multiple conditions

Here's how I would do this.

data = yf.download('spy', start='1990-01-01', end='2000-01-01')

Output:

print(data)
                  Open       High         Low       Close  Adj Close   Volume
Date
1993-01-29    43.96875   43.96875   43.750000   43.937500  25.438084  1003200
1993-02-01    43.96875   44.25000   43.968750   44.250000  25.619017   480500
1993-02-02    44.21875   44.37500   44.125000   44.343750  25.673300   201300
1993-02-03    44.40625   44.84375   44.375000   44.812500  25.944689   529400
1993-02-04    44.96875   45.09375   44.468750   45.000000  26.053246   531500
...                ...        ...         ...         ...        ...      ...
1999-12-27   146.50000  146.78125  145.062500  146.281250  96.697540  2691000
1999-12-28   145.87500  146.50000  145.484375  146.187500  96.635559  4084500
1999-12-29   146.31250  146.81250  145.312500  146.812500  97.048737  3001000
1999-12-30   147.12500  147.56250  146.187500  146.640625  96.935066  3641300
1999-12-31   146.84375  147.50000  146.250000  146.875000  97.090034  3172700

[1749 rows x 6 columns]
  1. Filter out the data to only get the rows where 'Open' > 10 and 'Close' < 50

    df = data[(data['Open'] > 10) & (data['Close'] < 50)]

Output:

print(df)
                 Open      High        Low      Close  Adj Close   Volume
Date
1993-01-29  43.968750  43.96875  43.750000  43.937500  25.438084  1003200
1993-02-01  43.968750  44.25000  43.968750  44.250000  25.619017   480500
1993-02-02  44.218750  44.37500  44.125000  44.343750  25.673300   201300
1993-02-03  44.406250  44.84375  44.375000  44.812500  25.944689   529400
1993-02-04  44.968750  45.09375  44.468750  45.000000  26.053246   531500
...               ...       ...        ...        ...        ...      ...
1995-03-17  49.437500  49.62500  49.406250  49.562500  30.364269    89900
1995-03-20  49.625000  49.62500  49.468750  49.562500  30.364269    91700
1995-03-21  49.562500  49.87500  49.359375  49.437500  30.287670   104400
1995-03-22  49.531250  49.53125  49.328125  49.484375  30.316399    74900
1995-03-23  49.421875  49.65625  49.359375  49.515625  30.335543   220500

[543 rows x 6 columns]

  2. Create a column that shifts the dates, essentially creating a column that contains the following date in that filtered dataframe

    df = df.reset_index(drop=False)

    df['Next_Date'] = df['Date'].shift(-1)

Output:

print(df)
           Date       Open      High  ...  Adj Close   Volume  Next_Date
0    1993-01-29  43.968750  43.96875  ...  25.438084  1003200 1993-02-01
1    1993-02-01  43.968750  44.25000  ...  25.619017   480500 1993-02-02
2    1993-02-02  44.218750  44.37500  ...  25.673300   201300 1993-02-03
3    1993-02-03  44.406250  44.84375  ...  25.944689   529400 1993-02-04
4    1993-02-04  44.968750  45.09375  ...  26.053246   531500 1993-02-05
..          ...        ...       ...  ...        ...      ...        ...
538  1995-03-17  49.437500  49.62500  ...  30.364269    89900 1995-03-20
539  1995-03-20  49.625000  49.62500  ...  30.364269    91700 1995-03-21
540  1995-03-21  49.562500  49.87500  ...  30.287670   104400 1995-03-22
541  1995-03-22  49.531250  49.53125  ...  30.316399    74900 1995-03-23
542  1995-03-23  49.421875  49.65625  ...  30.335543   220500        NaT

[543 rows x 8 columns]

  3. Get the difference in days between each row's date and the following-date column we just created

    df['Difference'] = (df['Next_Date'] - df['Date']).dt.days

Output:

print(df)
           Date       Open      High  ...   Volume  Next_Date  Difference
0    1993-01-29  43.968750  43.96875  ...  1003200 1993-02-01         3.0
1    1993-02-01  43.968750  44.25000  ...   480500 1993-02-02         1.0
2    1993-02-02  44.218750  44.37500  ...   201300 1993-02-03         1.0
3    1993-02-03  44.406250  44.84375  ...   529400 1993-02-04         1.0
4    1993-02-04  44.968750  45.09375  ...   531500 1993-02-05         1.0
..          ...        ...       ...  ...      ...        ...         ...
538  1995-03-17  49.437500  49.62500  ...    89900 1995-03-20         3.0
539  1995-03-20  49.625000  49.62500  ...    91700 1995-03-21         1.0
540  1995-03-21  49.562500  49.87500  ...   104400 1995-03-22         1.0
541  1995-03-22  49.531250  49.53125  ...    74900 1995-03-23         1.0
542  1995-03-23  49.421875  49.65625  ...   220500        NaT         NaN

[543 rows x 9 columns]

  4. Filter on that difference of days by your "n_days"

    n_days = 2

    df = df[df['Difference'] <= n_days]

Output:

print(df)
           Date      Open       High  ...   Volume  Next_Date  Difference
1    1993-02-01  43.96875  44.250000  ...   480500 1993-02-02         1.0
2    1993-02-02  44.21875  44.375000  ...   201300 1993-02-03         1.0
3    1993-02-03  44.40625  44.843750  ...   529400 1993-02-04         1.0
4    1993-02-04  44.96875  45.093750  ...   531500 1993-02-05         1.0
6    1993-02-08  44.96875  45.125000  ...   596100 1993-02-09         1.0
..          ...       ...        ...  ...      ...        ...         ...
536  1995-03-15  49.50000  49.578125  ...   278500 1995-03-16         1.0
537  1995-03-16  49.43750  49.812500  ...    20400 1995-03-17         1.0
539  1995-03-20  49.62500  49.625000  ...    91700 1995-03-21         1.0
540  1995-03-21  49.56250  49.875000  ...   104400 1995-03-22         1.0
541  1995-03-22  49.53125  49.531250  ...    74900 1995-03-23         1.0

[430 rows x 9 columns]

Full Code:

import yfinance as yf

data = yf.download('spy', start='1990-01-01', end='2000-01-01')
n_days = 2

df = data[(data['Open'] > 10) & (data['Close'] < 50)]
df = df.reset_index(drop=False)
df['Next_Date'] = df['Date'].shift(-1)
df['Difference'] = (df['Next_Date'] - df['Date']).dt.days

df = df[df['Difference'] <= n_days]
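The pipeline above can be exercised without a network call; a sketch on a small synthetic frame (column names and index name chosen to mimic the yfinance download, values made up):

```python
import pandas as pd

# Synthetic stand-in for the downloaded data: a DatetimeIndex named 'Date'
idx = pd.to_datetime(['1993-02-01', '1993-02-02', '1993-02-05', '1993-02-08'])
data = pd.DataFrame({'Open': [44.0, 44.2, 60.0, 44.9],
                     'Close': [44.2, 44.3, 61.0, 45.0]},
                    index=pd.Index(idx, name='Date'))

n_days = 2

# 1. price filter  2. next-date column  3. day gap  4. gap filter
df = data[(data['Open'] > 10) & (data['Close'] < 50)]
df = df.reset_index(drop=False)
df['Next_Date'] = df['Date'].shift(-1)
df['Difference'] = (df['Next_Date'] - df['Date']).dt.days
df = df[df['Difference'] <= n_days]
print(df)
```

Only 1993-02-01 survives here: 1993-02-05 fails the price filter, so the gap after 1993-02-02 becomes 6 days, and the last row's NaN difference is dropped by the comparison.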

Pandas: Filter correctly Dataframe columns considering multiple conditions

You have an operator precedence issue: in Python, the | operator has higher precedence than ==, so wrapping the comparison expressions in parentheses solves your problem. Also, since the funny, useful and cool columns are str type, use the string '1' instead of the number 1:

filtered_data = df[(df['star_rating'] >= 3) & ((df['funny']=='1') | (df['useful']=='1') | (df['cool']=='1'))]


Besides using |, you can also compare multiple columns in one go and then check condition with any:

filtered_data = df[(df['star_rating'] >= 3) & df[['funny', 'useful', 'cool']].eq('1').any(axis=1)]
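A sketch of the eq/any variant on hypothetical data (column names and the string-typed vote columns assumed from the question):

```python
import pandas as pd

# Hypothetical data: star_rating numeric, vote columns stored as strings
df = pd.DataFrame({'star_rating': [5, 4, 2],
                   'funny': ['1', '0', '1'],
                   'useful': ['0', '0', '0'],
                   'cool': ['0', '1', '1']})

# eq('1') compares all three columns at once; any(axis=1) keeps a row
# if at least one of them equals '1'
out = df[(df['star_rating'] >= 3) & df[['funny', 'useful', 'cool']].eq('1').any(axis=1)]
print(out)
```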

Filter a dataframe based on condition in columns selected by name pattern

You can filter multiple columns at once using if_all:

library(dplyr)

df %>%
  filter(if_all(matches("_qvalue"), ~ . < 0.05))

In this case I apply the filtering condition `. < 0.05` to all columns whose names match _qvalue.

Your second approach can also work if you group by ID first and then use all inside filter:

df_ID = df %>% mutate(ID = 1:n())

df_ID %>%
  select(contains("qval"), ID) %>%
  gather(variable, value, -ID) %>%
  group_by(ID) %>%
  filter(all(value < 0.05)) %>%
  semi_join(df_ID, by = "ID")
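For comparison, an equivalent of the if_all pattern in pandas (column names made up for illustration) selects the matching columns with filter(regex=...) and requires every one to pass with all(axis=1):

```python
import pandas as pd

# Hypothetical frame with two q-value columns and one other column
df = pd.DataFrame({'x_qvalue': [0.01, 0.2, 0.03],
                   'y_qvalue': [0.04, 0.01, 0.5],
                   'gene': ['a', 'b', 'c']})

# Keep rows where every *_qvalue column is below 0.05
mask = (df.filter(regex='_qvalue') < 0.05).all(axis=1)
print(df[mask])
```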

