How to Select Rows in a DataFrame Between Two Values, in Python Pandas

How to select rows in a DataFrame between two values, in Python Pandas?

You should use parentheses () to group each boolean condition and remove ambiguity: in Python, & binds more tightly than comparisons like >= and <=, so without the parentheses the expression fails with a "truth value of a Series is ambiguous" error.

df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
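
As a minimal runnable sketch (the sample prices below are made up for illustration):

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({'closing_price': [98.5, 99.2, 100.0, 100.9, 101.5]})

# Keep only rows with closing_price between 99 and 101 (inclusive)
df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
print(df)

#    closing_price
# 1           99.2
# 2          100.0
# 3          100.9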

Pandas - select all rows between two values when a string is a match
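
The merge below assumes df holds one row per observation (fruit, values) and df2 holds one row per fruit (fruit, min, max). The original question's data is not shown, so this sample input is hypothetical:

import pandas as pd

# Hypothetical inputs matching the column names used below
df = pd.DataFrame({'fruit': ['apple', 'peach', 'apple'],
                   'values': [883, 331, 120]})
df2 = pd.DataFrame({'fruit': ['apple', 'peach'],
                    'min': [467, 307],
                    'max': [947, 618]})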

df3 = df.merge(df2, on='fruit', how='inner')  # Thanks to Henry Ecker for suggesting the inner join
df3 = df3.loc[(df3['min'] < df3['values']) & (df3['max'] > df3['values'])]
df3

Output

    fruit  values  min  max
3   apple     883  467  947
6   apple     805  467  947
9   apple     932  467  947
11  peach     331  307  618
12  apple     665  467  947

If we don't want the min and max columns in the output:

df3 = df3.drop(columns=['min', 'max'])
df3

Output

    fruit  values
3   apple     883
6   apple     805
9   apple     932
11  peach     331
12  apple     665

Select DataFrame rows between two dates

There are two possible solutions:

  • Use a boolean mask, then use df.loc[mask]
  • Set the date column as a DatetimeIndex, then use df[start_date : end_date]

Using a boolean mask:

Ensure df['date'] is a Series with dtype datetime64[ns]:

df['date'] = pd.to_datetime(df['date'])  

Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:

# strictly after start_date and on or before end_date
mask = (df['date'] > start_date) & (df['date'] <= end_date)

Select the sub-DataFrame:

df.loc[mask]

or re-assign it to df:

df = df.loc[mask]

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])

yields

            0         1         2       date
153  0.208875  0.727656  0.037787 2000-06-02
154  0.750800  0.776498  0.237716 2000-06-03
155  0.812008  0.127338  0.397240 2000-06-04
156  0.639937  0.207359  0.533527 2000-06-05
157  0.416998  0.845658  0.872826 2000-06-06
158  0.440069  0.338690  0.847545 2000-06-07
159  0.202354  0.624833  0.740254 2000-06-08
160  0.465746  0.080888  0.155452 2000-06-09
161  0.858232  0.190321  0.432574 2000-06-10

Using a DatetimeIndex:

If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])

yields

                   0         1         2
date
2000-06-01  0.040457  0.326594  0.492136   # <- includes start_date
2000-06-02  0.279323  0.877446  0.464523
2000-06-03  0.328068  0.837669  0.608559
2000-06-04  0.107959  0.678297  0.517435
2000-06-05  0.131555  0.418380  0.025725
2000-06-06  0.999961  0.619517  0.206108
2000-06-07  0.129270  0.024533  0.154769
2000-06-08  0.441010  0.741781  0.470402
2000-06-09  0.682101  0.375660  0.009916
2000-06-10  0.754488  0.352293  0.339337

Note that Python list indexing, e.g. seq[start:end], includes start but excludes end. In contrast, Pandas df.loc[start_date:end_date] includes both endpoints in the result if they are in the index. However, neither start_date nor end_date has to be in the index.
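
A quick illustration of the difference, using generic labels rather than dates:

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
seq = [10, 20, 30]

print(seq[0:1])        # [10] -- Python slicing excludes the stop position
print(s.loc['a':'b'])  # two rows -- label slicing includes both endpoints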


Also note that pd.read_csv has a parse_dates parameter which you can use to parse the date column as datetime64s while reading the file. If you use parse_dates, you will not need df['date'] = pd.to_datetime(df['date']).
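
For instance (the filename here is hypothetical):

import pandas as pd

# Parse the 'date' column to datetime64[ns] while reading the file
df = pd.read_csv('data.csv', parse_dates=['date'])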

Pandas selecting rows with multiple conditions

You can use between. By default, both endpoints are inclusive.
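
The output shown below is consistent with a small DataFrame of standard-normal draws; the seed and shape here are guesses at the original setup:

import numpy as np
import pandas as pd

np.random.seed(0)  # with this seed, row 0 has C = 0.978738
df = pd.DataFrame(np.random.randn(4, 3), columns=['A', 'B', 'C'])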

out = df[df['C'].between(0, 1)]

If you want only one side inclusive, you can specify that as well. For example, the following is right-side inclusive only (the string values for inclusive require pandas 1.3+; older versions take a boolean):

out = df[df['C'].between(0, 1, inclusive='right')]

Output:

          A         B         C
0  1.764052  0.400157  0.978738

Select rows from a pandas DataFrame by two values at the same time from rows in another DataFrame

Normalizing the data makes this, and every subsequent comparison, much easier and cleaner.

Normalize + Apply clean conditions:

df1 = pd.concat([df1.drop(columns=['data']), pd.json_normalize(df1.data)], axis=1)
df2 = pd.concat([df2.drop(columns=['data']), pd.json_normalize(df2.data)], axis=1)
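
As a generic illustration of what pd.json_normalize does here (the column names are invented for the sketch, not taken from the original question):

import pandas as pd

# Hypothetical frame with a 'data' column holding dicts
df1 = pd.DataFrame({'id': [1, 2],
                    'data': [{'min': 10, 'max': 20},
                             {'min': 5, 'max': 15}]})

# Flatten the dicts into top-level columns, keeping the other columns
df1 = pd.concat([df1.drop(columns=['data']), pd.json_normalize(df1['data'])], axis=1)
print(df1)

#    id  min  max
# 0   1   10   20
# 1   2    5   15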

Now the DataFrames look as follows:

df1:


