Selecting With Complex Criteria From pandas.DataFrame

Selecting with complex criteria from pandas.DataFrame

Sure! Setup:

>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
...                    'B': [randint(1, 9)*10 for x in range(10)],
...                    'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
   A   B    C
0  9  40  300
1  9  70  700
2  5  70  900
3  8  80  900
4  7  50  200
5  9  30  900
6  2  80  700
7  2  80  400
8  5  80  300
9  7  70  800

We can apply column operations and get boolean Series objects:

>>> df["B"] > 50
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
9 True
Name: B
>>> (df["B"] > 50) & (df["C"] == 900)
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False

[Updated to use the new-style .loc.]

And then we can use these to index into the object. For read access, you can chain indices:

>>> df["A"][(df["B"] > 50) & (df["C"] == 900)]
2 5
3 8
Name: A, dtype: int64

but for write access, chained indexing can get you into trouble because of the difference between a view and a copy. Use .loc instead:

>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]
2 5
3 8
Name: A, dtype: int64
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].values
array([5, 8], dtype=int64)
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000
>>> df
      A   B    C
0     9  40  300
1     9  70  700
2  5000  70  900
3  8000  80  900
4     7  50  200
5     9  30  900
6     2  80  700
7     2  80  400
8     5  80  300
9     7  70  800

Note that I accidentally typed == 900 and not != 900, or ~(df["C"] == 900), but I'm too lazy to fix it. Exercise for the reader. :^)
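Spoiling the exercise slightly: either != 900 or ~(df["C"] == 900) works; a sketch of the first form, with the output that follows from the frame above:

>>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"]
1    9
6    2
7    2
8    5
9    7
Name: A, dtype: int64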

Lazy way of selecting rows by multiple conditions

You could use pandas query for this:

df.query('name == "A" and action=="buy"')

  name action class
0    A    buy   sub
5    A    buy  none
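query also understands in and @-prefixed Python variables, which keeps longer conditions readable. A small sketch (names is a hypothetical variable, not part of the question):

names = ['A', 'B']
df.query('name in @names and action == "buy"')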

If you are certain of the columns that will be queried, you can shift them into the index and select via MultiIndexing:

df = df.set_index(['name', 'action'])
df = df.sort_index()  # sorting the index avoids a PerformanceWarning on lookups

df.loc(axis=0)['A', 'buy']

            class
name action
A    buy      sub
     buy     none
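If you later need to slice on only some of the index levels, pd.IndexSlice works on the sorted MultiIndex; a sketch under the same setup:

idx = pd.IndexSlice
df.loc[idx[:, 'buy'], :]   # every name whose action is 'buy'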

Filtering data from a data frame with complex criteria in pandas/Python

This is my approach. Let's begin by treating the "date" column as a datetime object:

df["date"] = pd.to_datetime(df["date"])

Now, we can group the data as you described to find the rows where the minimum discount is reached:

common_groupby = df.groupby(["color", "range", "expecteddate", "vehicle"])

Now, let's find the rows where the minimum discount and the maximum date occur for each group:

source_idx = common_groupby["discount"].idxmin()
target_idx = common_groupby["date"].idxmax()
# Use df.loc[source_idx] to see the rows where the minimum discount is reached

Now, we can add the values of the minimum discount and the respective date (from the source_idx rows) to the correct rows (from target_idx) by using .loc:

df.loc[target_idx, "mindisc"] = df.loc[source_idx, "discount"].values
df.loc[target_idx, "mindate"] = df.loc[source_idx, "date"].values

As you can see, we're only changing the values in the rows at target_idx (the maximum date per group); every other row gets NaN/NaT. This is the output of these operations:

   vehicle expecteddate  range color  discount       date  mindisc    mindate
0      car    2/24/2022    240  blue        70 2022-02-18     60.0 2022-02-17
1      car    2/24/2022    240   red        80 2022-02-18     40.0 2022-02-17
2      car    3/15/2022    240  blue        90 2022-02-18     50.0 2022-02-17
3      car    3/15/2022    240   red        60 2022-02-18     30.0 2022-02-17
4      car    4/20/2022    240  blue        40 2022-02-18     40.0 2022-02-18
5      car    4/20/2022    240   red        50 2022-02-18     45.0 2022-02-17
6      bus    2/24/2022    300  blue       120 2022-02-18    120.0 2022-02-18
7      bus    2/24/2022    300   red       110 2022-02-18    100.0 2022-02-17
8      bus    3/15/2022    300  blue       130 2022-02-18    130.0 2022-02-18
9      bus    3/15/2022    300   red       140 2022-02-18    120.0 2022-02-17
10     bus    4/20/2022    300  blue        80 2022-02-18     80.0 2022-02-18
11     bus    4/20/2022    300   red        90 2022-02-18     30.0 2022-02-17
12     car    2/24/2022    240  blue        60 2022-02-17      NaN        NaT
13     car    2/24/2022    240   red        40 2022-02-17      NaN        NaT
14     car    3/15/2022    240  blue        50 2022-02-17      NaN        NaT
15     car    3/15/2022    240   red        30 2022-02-17      NaN        NaT
16     car    4/20/2022    240  blue        70 2022-02-17      NaN        NaT
17     car    4/20/2022    240   red        45 2022-02-17      NaN        NaT
18     bus    2/24/2022    300  blue       130 2022-02-17      NaN        NaT
19     bus    2/24/2022    300   red       100 2022-02-17      NaN        NaT
20     bus    3/15/2022    300  blue       140 2022-02-17      NaN        NaT
21     bus    3/15/2022    300   red       120 2022-02-17      NaN        NaT
22     bus    4/20/2022    300  blue        90 2022-02-17      NaN        NaT
23     bus    4/20/2022    300   red        30 2022-02-17      NaN        NaT

Another possible solution, trying to improve performance by sorting and then dropping duplicates twice (inspired by this question: Select the max row per group - pandas performance issue):

# Min discount rows (the stable mergesort keeps the original order among ties)
source_df = (df.sort_values(by=["discount"], ascending=True, kind="mergesort")
               .drop_duplicates(["color", "range", "expecteddate", "vehicle"]))
source_df = source_df.rename(columns={"date": "mindate", "discount": "mindisc"})

# Max date rows
target_df = (df.reset_index()
               .sort_values(by=["date"], ascending=False, kind="mergesort")
               .drop_duplicates(["color", "range", "expecteddate", "vehicle"]))

# Put min discount values into max date rows
df.loc[target_df["index"], ["mindisc", "mindate"]] = source_df[["mindisc", "mindate"]].values

Select certain columns based on multiple criteria in pandas

Currently, what you are doing will return every column because of how the conditions are written. str.endswith accepts a tuple, so just put all the suffixes you want to exclude in a single tuple and then filter:

my_df[my_df.columns[~my_df.columns.str.endswith(('_pct','_ln','gibberish','tech'))]]

   id  type  sales  sales_roi  flag
0   1  corp  34567       0.10     0
1   2   smb   2190       0.21     1
2   3   smb   1870       0.22     0
3   4  corp  22000       0.15     1
4   5   mid  10000       0.16     1
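The same selection can also be written with .loc and a boolean column mask, which skips the double my_df.columns lookup (equivalent, just a stylistic alternative):

my_df.loc[:, ~my_df.columns.str.endswith(('_pct', '_ln', 'gibberish', 'tech'))]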

Pandas multiple conditions and get DataFrame

Use & instead of and, and put parentheses around each comparison:

df_result = df[(df.a == 1) & (df.b == 0) & (df.c == 0) & (df.d == 0)]

Alternatively, to avoid using extra brackets, you can use .eq():

df_result = df[df.a.eq(1) & df.b.eq(0) & df.c.eq(0) & df.d.eq(0)]
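When several columns are tested against the same value, a slightly more compact variant (a sketch, equivalent to the above) compares them together and reduces with all(axis=1):

df_result = df[df.a.eq(1) & df[['b', 'c', 'd']].eq(0).all(axis=1)]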

Pandas: Selecting and modifying dataframe based on even more complex criteria

You can use numpy.searchsorted:


import numpy as np
import pandas as pd

labels = np.array(list('NLMH'))
breaks = np.array([1, 10, 50])
pd.DataFrame(
    labels[breaks.searchsorted(df.values)].reshape(df.shape),
    df.index, df.columns)

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N

In Place

labels = np.array(list('NLMH'))
breaks = np.array([1, 10, 50])
# write back in place, keeping the original index and columns
df[:] = labels[breaks.searchsorted(df.values)].reshape(df.shape)
df

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N

Chained pure Pandas approach with pandas.DataFrame.mask


df.mask(df.le(1), 'N').mask(df.gt(1) & df.le(10), 'L') \
  .mask(df.gt(10) & df.le(50), 'M').mask(df.gt(50), 'H')

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N
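For what it's worth (not in the original answer), the same binning can be expressed with pd.cut, which takes the edges declaratively; the default right=True gives right-closed bins, matching searchsorted's default side='left':

import numpy as np
import pandas as pd

# same breaks as above, applied column by column
out = df.apply(lambda col: pd.cut(col, bins=[-np.inf, 1, 10, 50, np.inf],
                                  labels=list('NLMH')))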

Select data based on multiple criteria using Pandas

You simply need to wrap each condition in parentheses inside your .loc, and not repeat a DataFrame filter inside the filter:

First, let's create a crude data sample, since you didn't provide one besides the image:

# creating the values; the first is an ID, the next 4 are the values to compare
check_values = [
    [1, 5, 10, 20, 30],
    [2, 5, 11, 32, 11],
    [3, 10, 10, 20, 20],
    [4, 9, 9, 11, 11],
    [5, 11, 23, 41, 11],
]

# creating column names
check_cols = ['id', 'A', 'B', 'C', 'D']

# making the DataFrame
dfcheck = pd.DataFrame(check_values, columns=check_cols)

# setting the id column as the index, just because
dfcheck.set_index('id', inplace=True)

The solution, where you need to nest each condition inside parenthesis:

dfcheck.loc[(dfcheck['A'] == dfcheck['B']) & (dfcheck['C'] == dfcheck['D'])]
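For the sample above, only the rows where both equalities hold survive:

     A   B   C   D
id
3   10  10  20  20
4    9   9  11  11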

EDIT: What did you miss/do wrong?

Looking at your filter, you're wrapping unnecessary dfMerged[...] lookups inside your parentheses. Here is your code broken across lines (delete everything between the ** markers):

dfequal = dfMerged.loc[(dfMerged['MetCode']==dfMerged['GCD_METCODE'])
    & (**dfMerged[**dfMerged['Zone Code']==dfMerged['GCD_Senior_ZONE']**]**)
    & (**dfMerged[**dfMerged['Municipality Code']==dfMerged['GCD_CSDUID']**]**)]

See how you're searching inside a search when it's not needed? It should be:

dfequal = dfMerged.loc[(dfMerged['MetCode']==dfMerged['GCD_METCODE'])
    & (dfMerged['Zone Code']==dfMerged['GCD_Senior_ZONE'])
    & (dfMerged['Municipality Code']==dfMerged['GCD_CSDUID'])]

Pandas loc multiple conditions

You should separate the two conditions:

df1 = df.loc[~(df['A'] == 'blue') & ~(df['B'] == 'green')]
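By De Morgan's laws, the same selection can also be written as a single negated "or", which some find closer to the original intent:

df1 = df.loc[~((df['A'] == 'blue') | (df['B'] == 'green'))]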

Pandas slicing/selecting with multiple conditions with or statement

The important thing to note is that & is not identical to and; they are different things, so the "or" equivalent of & is |.

Normally & and | are bitwise operators rather than Python's boolean and/or operators.

In pandas these operators are overloaded to work element-wise on Series.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame([[1, 4, 3], [2, 3, 5], [4, 5, 6], [3, 2, 5]],
   ...:                   columns=['a', 'b', 'c'])

In [4]: df
Out[4]:
   a  b  c
0  1  4  3
1  2  3  5
2  4  5  6
3  3  2  5

In [5]: df.loc[(df.a != 1) & (df.b < 5)]
Out[5]:
   a  b  c
1  2  3  5
3  3  2  5

In [6]: df.loc[(df.a != 1) | (df.b < 5)]
Out[6]:
   a  b  c
0  1  4  3
1  2  3  5
2  4  5  6
3  3  2  5
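As a side note (not from the original answer), when the "or" ranges over several values of one column, Series.isin is usually tidier than chaining | comparisons; continuing the session above:

In [7]: df.loc[df.a.isin([2, 3])]
Out[7]:
   a  b  c
1  2  3  5
3  3  2  5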

