Selecting With Complex Criteria From pandas.DataFrame

Selecting with complex criteria from pandas.DataFrame

Sure! Setup:

>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
...                    'B': [randint(1, 9)*10 for x in range(10)],
...                    'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
   A   B    C
0  9  40  300
1  9  70  700
2  5  70  900
3  8  80  900
4  7  50  200
5  9  30  900
6  2  80  700
7  2  80  400
8  5  80  300
9  7  70  800

We can apply column operations and get boolean Series objects:

>>> df["B"] > 50
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
9 True
Name: B
>>> (df["B"] > 50) & (df["C"] == 900)
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False

[Updated to use the new-style .loc.]

And then we can use these to index into the object. For read access, you can chain indices:

>>> df["A"][(df["B"] > 50) & (df["C"] == 900)]
2 5
3 8
Name: A, dtype: int64

but for write access, chained indexing can get you into trouble because of the difference between a view and a copy. Use .loc instead:

>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]
2 5
3 8
Name: A, dtype: int64
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].values
array([5, 8], dtype=int64)
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000
>>> df
      A   B    C
0     9  40  300
1     9  70  700
2  5000  70  900
3  8000  80  900
4     7  50  200
5     9  30  900
6     2  80  700
7     2  80  400
8     5  80  300
9     7  70  800

Note that I accidentally typed == 900 and not != 900, or ~(df["C"] == 900), but I'm too lazy to fix it. Exercise for the reader. :^)
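Spoiling the exercise slightly: either != 900 or ~(df["C"] == 900) works; a sketch of the first form, with the output that follows from the frame above:

>>> df.loc[(df["B"] > 50) & (df["C"] != 900), "A"]
1    9
6    2
7    2
8    5
9    7
Name: A, dtype: int64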

Lazy way of selecting rows by multiple conditions

You could use pandas query for this:

df.query('name == "A" and action=="buy"')

  name action class
0    A    buy   sub
5    A    buy  none
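query also understands in and @-prefixed Python variables, which keeps longer conditions readable. A small sketch (names is a hypothetical variable, not part of the question):

names = ['A', 'B']
df.query('name in @names and action == "buy"')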

If you are certain of the columns that will be queried, you can shift them into the index and select via MultiIndexing:

df = df.set_index(['name', 'action'])
df = df.sort_index()  # sorting the index avoids a PerformanceWarning on lookups

df.loc(axis=0)['A', 'buy']

            class
name action
A    buy      sub
     buy     none
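If you later need to slice on only some of the index levels, pd.IndexSlice works on the sorted MultiIndex; a sketch under the same setup:

idx = pd.IndexSlice
df.loc[idx[:, 'buy'], :]   # every name whose action is 'buy'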

Filtering data from a data frame with complex criteria in pandas/Python

This is my approach. Let's begin by treating the "date" column as a datetime object:

df["date"] = pd.to_datetime(df["date"])

Now, we can group the data as you described to find the rows where the minimum discount is reached:

common_groupby = df.groupby(["color", "range", "expecteddate", "vehicle"])

Now, let's find the rows where the minimum discount and the maximum date occur for each group:

source_idx = common_groupby["discount"].idxmin()
target_idx = common_groupby["date"].idxmax()
# Use df.loc[source_idx] to see the rows where the minimum discount is reached

Now, we can add the values of the minimum discount and the respective date (from the source_idx rows) to the correct rows (from target_idx) by using .loc:

df.loc[target_idx, "mindisc"] = df.loc[source_idx, "discount"].values
df.loc[target_idx, "mindate"] = df.loc[source_idx, "date"].values

As you can see, we're only changing the values in the rows at target_idx (the maximum date per group); every other row gets NaN/NaT. This is the output of these operations:

   vehicle expecteddate  range color  discount       date  mindisc    mindate
0      car    2/24/2022    240  blue        70 2022-02-18     60.0 2022-02-17
1      car    2/24/2022    240   red        80 2022-02-18     40.0 2022-02-17
2      car    3/15/2022    240  blue        90 2022-02-18     50.0 2022-02-17
3      car    3/15/2022    240   red        60 2022-02-18     30.0 2022-02-17
4      car    4/20/2022    240  blue        40 2022-02-18     40.0 2022-02-18
5      car    4/20/2022    240   red        50 2022-02-18     45.0 2022-02-17
6      bus    2/24/2022    300  blue       120 2022-02-18    120.0 2022-02-18
7      bus    2/24/2022    300   red       110 2022-02-18    100.0 2022-02-17
8      bus    3/15/2022    300  blue       130 2022-02-18    130.0 2022-02-18
9      bus    3/15/2022    300   red       140 2022-02-18    120.0 2022-02-17
10     bus    4/20/2022    300  blue        80 2022-02-18     80.0 2022-02-18
11     bus    4/20/2022    300   red        90 2022-02-18     30.0 2022-02-17
12     car    2/24/2022    240  blue        60 2022-02-17      NaN        NaT
13     car    2/24/2022    240   red        40 2022-02-17      NaN        NaT
14     car    3/15/2022    240  blue        50 2022-02-17      NaN        NaT
15     car    3/15/2022    240   red        30 2022-02-17      NaN        NaT
16     car    4/20/2022    240  blue        70 2022-02-17      NaN        NaT
17     car    4/20/2022    240   red        45 2022-02-17      NaN        NaT
18     bus    2/24/2022    300  blue       130 2022-02-17      NaN        NaT
19     bus    2/24/2022    300   red       100 2022-02-17      NaN        NaT
20     bus    3/15/2022    300  blue       140 2022-02-17      NaN        NaT
21     bus    3/15/2022    300   red       120 2022-02-17      NaN        NaT
22     bus    4/20/2022    300  blue        90 2022-02-17      NaN        NaT
23     bus    4/20/2022    300   red        30 2022-02-17      NaN        NaT

Another possible solution, trying to improve performance by sorting and then dropping duplicates twice (inspired by this question: Select the max row per group - pandas performance issue):

# Min discount rows (the stable mergesort keeps the original order among ties)
source_df = (df.sort_values(by=["discount"], ascending=True, kind="mergesort")
               .drop_duplicates(["color", "range", "expecteddate", "vehicle"]))
source_df = source_df.rename(columns={"date": "mindate", "discount": "mindisc"})

# Max date rows
target_df = (df.reset_index()
               .sort_values(by=["date"], ascending=False, kind="mergesort")
               .drop_duplicates(["color", "range", "expecteddate", "vehicle"]))

# Put min discount values into max date rows
df.loc[target_df["index"], ["mindisc", "mindate"]] = source_df[["mindisc", "mindate"]].values

Select certain columns based on multiple criteria in pandas

Currently, what you are doing will return every column because of how the conditions are written. str.endswith accepts a tuple, so just put all the suffixes you want to exclude in a single tuple and then filter:

my_df[my_df.columns[~my_df.columns.str.endswith(('_pct','_ln','gibberish','tech'))]]

   id  type  sales  sales_roi  flag
0   1  corp  34567       0.10     0
1   2   smb   2190       0.21     1
2   3   smb   1870       0.22     0
3   4  corp  22000       0.15     1
4   5   mid  10000       0.16     1
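The same selection can also be written with .loc and a boolean column mask, which skips the double my_df.columns lookup (equivalent, just a stylistic alternative):

my_df.loc[:, ~my_df.columns.str.endswith(('_pct', '_ln', 'gibberish', 'tech'))]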

Pandas multiple conditions and get DataFrame

Use & instead of and, and put parentheses around each comparison:

df_result = df[(df.a == 1) & (df.b == 0) & (df.c == 0) & (df.d == 0)]

Alternatively, to avoid using extra brackets, you can use .eq():

df_result = df[df.a.eq(1) & df.b.eq(0) & df.c.eq(0) & df.d.eq(0)]
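When several columns are tested against the same value, a slightly more compact variant (a sketch, equivalent to the above) compares them together and reduces with all(axis=1):

df_result = df[df.a.eq(1) & df[['b', 'c', 'd']].eq(0).all(axis=1)]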

Pandas: Selecting and modifying dataframe based on even more complex criteria

You can use numpy.searchsorted:


import numpy as np
import pandas as pd

labels = np.array(list('NLMH'))
breaks = np.array([1, 10, 50])
pd.DataFrame(
    labels[breaks.searchsorted(df.values)].reshape(df.shape),
    df.index, df.columns)

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N

In Place

labels = np.array(list('NLMH'))
breaks = np.array([1, 10, 50])
# write back in place, keeping the original index and columns
df[:] = labels[breaks.searchsorted(df.values)].reshape(df.shape)
df

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N

Chained pure Pandas approach with pandas.DataFrame.mask


df.mask(df.le(1), 'N').mask(df.gt(1) & df.le(10), 'L') \
  .mask(df.gt(10) & df.le(50), 'M').mask(df.gt(50), 'H')

   A  B  C
A  N  L  M
B  N  H  L
C  L  M  N
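For what it's worth (not in the original answer), the same binning can be expressed with pd.cut, which takes the edges declaratively; the default right=True gives right-closed bins, matching searchsorted's default side='left':

import numpy as np
import pandas as pd

# same breaks as above, applied column by column
out = df.apply(lambda col: pd.cut(col, bins=[-np.inf, 1, 10, 50, np.inf],
                                  labels=list('NLMH')))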

Select data based on multiple criteria using Pandas

You simply need to wrap each condition in parentheses inside your .loc, and not repeat a DataFrame filter inside the filter:

First, let's create a crude data sample, since you didn't provide one besides the image:

# creating the values; the first is an ID, the next 4 are the values to compare
check_values = [
    [1, 5, 10, 20, 30],
    [2, 5, 11, 32, 11],
    [3, 10, 10, 20, 20],
    [4, 9, 9, 11, 11],
    [5, 11, 23, 41, 11],
]

# creating column names
check_cols = ['id', 'A', 'B', 'C', 'D']

# making the DataFrame
dfcheck = pd.DataFrame(check_values, columns=check_cols)

# setting the id column as the index, just because
dfcheck.set_index('id', inplace=True)

The solution, where you need to nest each condition inside parenthesis:

dfcheck.loc[(dfcheck['A'] == dfcheck['B']) & (dfcheck['C'] == dfcheck['D'])]
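For the sample above, only the rows where both equalities hold survive:

     A   B   C   D
id
3   10  10  20  20
4    9   9  11  11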

EDIT: What did you miss/do wrong?

Looking at your filter, you're wrapping unnecessary dfMerged[...] lookups inside your parentheses. Here is your code broken across lines (delete everything between the ** markers):

dfequal = dfMerged.loc[(dfMerged['MetCode']==dfMerged['GCD_METCODE'])
    & (**dfMerged[**dfMerged['Zone Code']==dfMerged['GCD_Senior_ZONE']**]**)
    & (**dfMerged[**dfMerged['Municipality Code']==dfMerged['GCD_CSDUID']**]**)]

See how you're searching inside a search when it's not needed? It should be:

dfequal = dfMerged.loc[(dfMerged['MetCode']==dfMerged['GCD_METCODE'])
    & (dfMerged['Zone Code']==dfMerged['GCD_Senior_ZONE'])
    & (dfMerged['Municipality Code']==dfMerged['GCD_CSDUID'])]

Pandas loc multiple conditions

You should separate the two conditions:

df1 = df.loc[~(df['A'] == 'blue') & ~(df['B'] == 'green')]
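By De Morgan's laws, the same selection can also be written as a single negated "or", which some find closer to the original intent:

df1 = df.loc[~((df['A'] == 'blue') | (df['B'] == 'green'))]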

Pandas slicing/selecting with multiple conditions with or statement

The important thing to note is that & is not identical to and; they are different things, so the "or" equivalent of & is |.

Normally & and | are bitwise operators rather than Python's boolean and/or operators.

In pandas these operators are overloaded to work element-wise on Series.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame([[1, 4, 3], [2, 3, 5], [4, 5, 6], [3, 2, 5]],
   ...:                   columns=['a', 'b', 'c'])

In [4]: df
Out[4]:
   a  b  c
0  1  4  3
1  2  3  5
2  4  5  6
3  3  2  5

In [5]: df.loc[(df.a != 1) & (df.b < 5)]
Out[5]:
   a  b  c
1  2  3  5
3  3  2  5

In [6]: df.loc[(df.a != 1) | (df.b < 5)]
Out[6]:
   a  b  c
0  1  4  3
1  2  3  5
2  4  5  6
3  3  2  5
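As a side note (not from the original answer), when the "or" ranges over several values of one column, Series.isin is usually tidier than chaining | comparisons; continuing the session above:

In [7]: df.loc[df.a.isin([2, 3])]
Out[7]:
   a  b  c
1  2  3  5
3  3  2  5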

