How to select rows in a DataFrame between two values, in Python Pandas?
You should wrap each comparison in parentheses () so that the boolean vectors are grouped unambiguously before they are combined with &.
df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
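A minimal sketch of the same selection on made-up data; the prices below are hypothetical and only illustrate which rows survive the filter.

```python
import pandas as pd

# Hypothetical closing prices, for illustration only.
df = pd.DataFrame({'closing_price': [98.5, 99.0, 100.2, 101.0, 101.7]})

# Each comparison is parenthesized before the two masks are combined with &.
# Without the parentheses, & would be applied first (to 99 and the column),
# because it has higher precedence than >= and <=.
in_range = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
print(in_range)
```

Both bounds are inclusive here, so the rows priced at exactly 99 and 101 are kept.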
Pandas - select all rows between two values when a string is a match
df3 = df.merge(df2, on='fruit', how='inner') # Thanks to Henry Ecker for suggesting the inner join
df3 = df3.loc[(df3['min'] < df3['values']) & (df3['max'] > df3['values'])]
df3
Output
    fruit  values  min  max
3   apple     883  467  947
6   apple     805  467  947
9   apple     932  467  947
11  peach     331  307  618
12  apple     665  467  947
If we don't want the min and max columns in the output:
df3 = df3.drop(columns=['min', 'max'])
df3
Output
    fruit  values
3   apple     883
6   apple     805
9   apple     932
11  peach     331
12  apple     665
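The answer does not show how df and df2 were built, so here is a self-contained sketch with hypothetical inputs: df holds one value per row, and df2 holds one (min, max) pair per fruit. The column names follow the answer; the data is invented.

```python
import pandas as pd

# Hypothetical inputs; the original df and df2 are not shown in the answer.
df = pd.DataFrame({
    'fruit': ['apple', 'apple', 'peach', 'peach', 'pear'],
    'values': [883, 120, 331, 700, 50],
})
df2 = pd.DataFrame({
    'fruit': ['apple', 'peach'],
    'min': [467, 307],
    'max': [947, 618],
})

# The inner join attaches each fruit's bounds to its rows and drops
# fruits that have no bounds in df2 (pear in this made-up data).
df3 = df.merge(df2, on='fruit', how='inner')

# Keep rows whose value lies strictly between min and max.
df3 = df3.loc[(df3['min'] < df3['values']) & (df3['max'] > df3['values'])]
print(df3.drop(columns=['min', 'max']))
```

Only the apple row with 883 and the peach row with 331 fall inside their fruit's range.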
Select DataFrame rows between two dates
There are two possible solutions:
- Use a boolean mask, then use df.loc[mask]
- Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetimes, np.datetime64s, pd.Timestamps, or even datetime strings:
# greater than the start date and less than or equal to the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
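As a rough check of the point that start_date and end_date may be of several types, the sketch below runs the same mask with strings, datetime.datetime, np.datetime64 and pd.Timestamp values. The helper name select is hypothetical and not part of pandas.

```python
import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2000-1-1', periods=200, freq='D')})


def select(start_date, end_date):
    """Return the rows with start_date < date <= end_date (hypothetical helper)."""
    mask = (df['date'] > start_date) & (df['date'] <= end_date)
    return df.loc[mask]


results = [
    select('2000-6-1', '2000-6-10'),
    select(datetime.datetime(2000, 6, 1), datetime.datetime(2000, 6, 10)),
    select(np.datetime64('2000-06-01'), np.datetime64('2000-06-10')),
    select(pd.Timestamp('2000-06-01'), pd.Timestamp('2000-06-10')),
]

# Every variant selects the same nine rows, 2000-06-02 through 2000-06-10.
print(all(r.equals(results[0]) for r in results))
```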
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
Python list indexing, e.g. seq[start:end], includes start but not end. In contrast, Pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
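The end-point behaviour can be sketched on a small frame sampled every other day, so that some dates are missing from the index. The data is hypothetical, and the sketch assumes the DatetimeIndex is sorted.

```python
import pandas as pd

# Hypothetical data on every other day: Jan 1, 3, 5, ..., 19.
idx = pd.date_range('2000-01-01', periods=10, freq='2D')
df = pd.DataFrame({'x': range(10)}, index=idx)

# Both end-points are in the index, and both are included in the result.
both = df.loc['2000-01-03':'2000-01-07']
print(both)

# Neither end-point is in the index; the rows that fall inside the range
# (Jan 5 and Jan 7) are still returned.
neither = df.loc['2000-01-04':'2000-01-08']
print(neither)
```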
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
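A short sketch of parse_dates, reading from an in-memory string instead of a file; the CSV content and the price column are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = "date,price\n2000-06-01,10.0\n2000-06-02,11.5\n2000-06-03,9.8\n"

# parse_dates converts the named column while reading,
# so no separate pd.to_datetime call is needed.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(df.dtypes)

mask = (df['date'] > '2000-06-01') & (df['date'] <= '2000-06-03')
print(df.loc[mask])
```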
Pandas selecting rows with multiple conditions
You can use between. By default, it is inclusive on both sides.
out = df[df['C'].between(0,1)]
If you want only one side inclusive, you can select that as well. For example, the following is only right-side inclusive:
out = df[df['C'].between(0,1, inclusive='right')]
Output:
A B C
0 1.764052 0.400157 0.978738
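To compare the inclusive options side by side, here is a sketch on hypothetical values that sit exactly on both bounds. It assumes pandas 1.3 or later, where inclusive accepts the strings 'both', 'left', 'right' and 'neither'.

```python
import pandas as pd

# Hypothetical column with values exactly on the bounds 0 and 1.
df = pd.DataFrame({'C': [-0.5, 0.0, 0.4, 1.0, 1.6]})

both = df[df['C'].between(0, 1)]                          # 0 <= C <= 1
left = df[df['C'].between(0, 1, inclusive='left')]        # 0 <= C < 1
right = df[df['C'].between(0, 1, inclusive='right')]      # 0 < C <= 1
neither = df[df['C'].between(0, 1, inclusive='neither')]  # 0 < C < 1

for name, out in [('both', both), ('left', left),
                  ('right', right), ('neither', neither)]:
    print(name, out['C'].tolist())
```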
Select rows from pandas dataframe by two values at the same time from rows in another dataframe
Normalizing the data first should make this comparison, and any later one, much easier and cleaner.
Normalize + Apply clean conditions:
df1 = pd.concat([df1.drop(columns=['data']), pd.json_normalize(df1.data)], axis=1)
df2 = pd.concat([df2.drop(columns=['data']), pd.json_normalize(df2.data)], axis=1)
Now dataframes look as follows:
df1:
| | user_id | time | av | si | am |
|---|---|---|---|---|---|
| 0 | 12 | t1 | 8 | 3 | 2 |
| 1 | 22 | t2 | 8 | 44 | nan |
| 2 | 33 | t3 | 8 | 1 | nan |
| 3 | 44 | t4 | 8 | 22 | nan |
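The original input frames are not shown, so the sketch below assumes that the data column holds one dict per row; that shape, and the values, are hypothetical. It illustrates how the normalize step flattens such a column.

```python
import pandas as pd

# Hypothetical input: a 'data' column holding one dict per row (assumed shape).
df1 = pd.DataFrame({
    'user_id': [12, 22],
    'time': ['t1', 't2'],
    'data': [{'av': 8, 'si': 3, 'am': 2}, {'av': 8, 'si': 44}],
})

# Expand the dicts into columns; a key missing from a row becomes NaN.
expanded = pd.json_normalize(df1['data'].tolist())

# Both frames share the default 0..n-1 index, so concat along axis=1
# lines the rows up correctly.
flat = pd.concat([df1.drop(columns=['data']), expanded], axis=1)
print(flat)
```

If df1 does not have the default index, resetting it before the concat keeps the rows aligned.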