Select DataFrame Rows Between Two Dates

Select DataFrame rows between two dates

There are two possible solutions:

  • Use a boolean mask, then use df.loc[mask]
  • Set the date column as a DatetimeIndex, then use df[start_date : end_date]

Using a boolean mask:

Ensure df['date'] is a Series with dtype datetime64[ns]:

df['date'] = pd.to_datetime(df['date'])  

Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:

#greater than the start date and smaller than the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)

Select the sub-DataFrame:

df.loc[mask]

or re-assign to df:

df = df.loc[mask]

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])

yields

            0         1         2       date
153  0.208875  0.727656  0.037787 2000-06-02
154  0.750800  0.776498  0.237716 2000-06-03
155  0.812008  0.127338  0.397240 2000-06-04
156  0.639937  0.207359  0.533527 2000-06-05
157  0.416998  0.845658  0.872826 2000-06-06
158  0.440069  0.338690  0.847545 2000-06-07
159  0.202354  0.624833  0.740254 2000-06-08
160  0.465746  0.080888  0.155452 2000-06-09
161  0.858232  0.190321  0.432574 2000-06-10
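
As a quick check of the note above that the bounds need not be strings, here is a minimal sketch (same randomly generated df) mixing a pd.Timestamp and a plain datetime.datetime:

import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200, 3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')

start_date = pd.Timestamp('2000-06-01')   # pandas Timestamp
end_date = dt.datetime(2000, 6, 10)       # plain datetime.datetime
mask = (df['date'] > start_date) & (df['date'] <= end_date)
print(df.loc[mask])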

Using a DatetimeIndex:

If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])

yields

                   0         1         2
date
2000-06-01  0.040457  0.326594  0.492136  # <- includes start_date
2000-06-02  0.279323  0.877446  0.464523
2000-06-03  0.328068  0.837669  0.608559
2000-06-04  0.107959  0.678297  0.517435
2000-06-05  0.131555  0.418380  0.025725
2000-06-06  0.999961  0.619517  0.206108
2000-06-07  0.129270  0.024533  0.154769
2000-06-08  0.441010  0.741781  0.470402
2000-06-09  0.682101  0.375660  0.009916
2000-06-10  0.754488  0.352293  0.339337

Note that while Python list indexing, e.g. seq[start:end], includes start but excludes end, pandas df.loc[start_date : end_date] includes both endpoints in the result if they are present in the index. However, neither start_date nor end_date has to actually be in the index.
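
For instance, a short sketch showing that the slice endpoints do not need to be exact index entries (daily index, bounds at noon):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200, 3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index('date')

# neither bound is an exact index entry, yet the slice works on the sorted index
print(df.loc['2000-05-31 12:00':'2000-06-10 12:00'])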


Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
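
For example, a sketch assuming a hypothetical file data.csv that contains a date column:

df = pd.read_csv('data.csv', parse_dates=['date'])
# df['date'] is now datetime64[ns]; no separate pd.to_datetime call is needed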

Pandas: Select all data from Pandas DataFrame between two dates

I used the following test DataFrame:

   Transaction_date  Value
0        2019-12-31     10
1        2020-01-10     10
2        2020-01-15     10
3        2020-01-20     10
4        2020-01-25     10
5        2020-01-28     10
6        2020-01-29     10
7        2020-01-30     10
8        2020-01-31     10
9        2020-02-01     10
10       2020-02-01     10

Start / end dates are:

start_day = '01.01.2020'
end_day = '31.01.2020'

The code is:

# Convert start / end dates to datetime
start_day = pd.to_datetime(start_day)
end_day = pd.to_datetime(end_day)

I noticed that you use the datetime module, which I think is bad practice here.
Instead, use the dedicated pandas function, pd.to_datetime.
It is even clever enough to recognize many commonly used date formats,
so there is no need to specify the format on your own.
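
Since '01.01.2020' is ambiguous (day-first vs. month-first), it may be safer to state the convention explicitly; a small sketch:

start_day = pd.to_datetime(start_day, dayfirst=True)   # 2020-01-01
end_day = pd.to_datetime(end_day, dayfirst=True)       # 2020-01-31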

To get the actual result, run:

df[df['Transaction_date'].between(start_day, end_day)]

The result is:

  Transaction_date  Value
1       2020-01-10     10
2       2020-01-15     10
3       2020-01-20     10
4       2020-01-25     10
5       2020-01-28     10
6       2020-01-29     10
7       2020-01-30     10
8       2020-01-31     10

So as you can see:

  • The date from 2019 has been eliminated.
  • The output contains only January dates, to the very end.
  • Dates from the beginning of February have also been eliminated.

Repeat my example on your computer. Maybe the source of your problem
(that the result does not contain all dates from January) is somewhere
else?

How to select pandas dataframe rows between two dates without knowing the exact time

You can try boolean masking:

df.loc[(df['TIME'].dt.date > start_date.date()) & (df['TIME'].dt.date < end_date.date())]

OR

You can also use boolean masking and between() method:

df[df['TIME'].dt.date.between(start_date.date(), end_date.date())]
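
A self-contained sketch of the second variant, with a made-up TIME column and bounds:

import pandas as pd

df = pd.DataFrame({'TIME': pd.to_datetime(['2021-03-01 08:15:00',
                                           '2021-03-02 23:59:00',
                                           '2021-03-05 00:00:01'])})
start_date = pd.Timestamp('2021-03-01')
end_date = pd.Timestamp('2021-03-02')

# compares calendar dates only, ignoring the time of day
print(df[df['TIME'].dt.date.between(start_date.date(), end_date.date())])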

Select rows between two dates (recent 3-month period)

Following your code, I found that the data type of both start_date and end_date is Series (not a Timestamp like the elements of df['week']). Check with:

type(df['week'][0]), type(start_date)

Because of that, they cannot be compared directly.
You may try this code:

id = start_date.index[0]

start_date = pd.to_datetime(start_date[id])

Note: "id" stores the index label (based on your data, it is 199).
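
Equivalently, the scalar can be pulled out positionally, without looking up the index label first (a sketch):

start_date = pd.to_datetime(start_date.iloc[0])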

Select Pandas dataframe rows between two dates

If your DataFrames are not very big, you can simply join them on a dummy key and then filter to narrow the result down to what you need. See the example below (note that I had to update your example a little to have correct date formatting).

import pandas as pd

rates = {'rate': [0.974, 0.966, 0.996, 0.998, 0.994, 1.006, 1.042, 1.072, 0.954],
         'valid_from': ['31/12/2018', '15/01/2019', '01/02/2019', '01/03/2019', '01/04/2019',
                        '15/04/2019', '01/05/2019', '01/06/2019', '30/06/2019'],
         'valid_to': ['14/01/2019', '31/01/2019', '28/02/2019', '31/03/2019', '14/04/2019',
                      '30/04/2019', '31/05/2019', '29/06/2019', '31/07/2019']}

df1 = pd.DataFrame(rates)
df1['valid_to'] = pd.to_datetime(df1['valid_to'], format='%d/%m/%Y')
df1['valid_from'] = pd.to_datetime(df1['valid_from'], format='%d/%m/%Y')

Then your df1 would be:

    rate valid_from   valid_to
0  0.974 2018-12-31 2019-01-14
1  0.966 2019-01-15 2019-01-31
2  0.996 2019-02-01 2019-02-28
3  0.998 2019-03-01 2019-03-31
4  0.994 2019-04-01 2019-04-14
5  1.006 2019-04-15 2019-04-30
6  1.042 2019-05-01 2019-05-31
7  1.072 2019-06-01 2019-06-29
8  0.954 2019-06-30 2019-07-31

This is your second DataFrame, df2:

data = {'date': ['03/01/2019', '23/01/2019', '27/02/2019', '14/03/2019',
                 '05/04/2019', '30/04/2019', '14/06/2019'],
        'amount': [200, 305, 155, 67, 95, 174, 236]}

df2 = pd.DataFrame(data)
df2['date'] = pd.to_datetime(df2['date'], format='%d/%m/%Y')

Then your df2 would look like the following:

        date  amount
0 2019-01-03     200
1 2019-01-23     305
2 2019-02-27     155
3 2019-03-14      67
4 2019-04-05      95
5 2019-04-30     174
6 2019-06-14     236

Your solution:

df1['key'] = 1
df2['key'] = 1
df_output = pd.merge(df1, df2, on='key').drop('key',axis=1)
df_output = df_output[(df_output['date'] > df_output['valid_from']) & (df_output['date'] <= df_output['valid_to'])]

This is how the result, df_output, would look:

     rate valid_from   valid_to       date  amount
0   0.974 2018-12-31 2019-01-14 2019-01-03     200
8   0.966 2019-01-15 2019-01-31 2019-01-23     305
16  0.996 2019-02-01 2019-02-28 2019-02-27     155
24  0.998 2019-03-01 2019-03-31 2019-03-14      67
32  0.994 2019-04-01 2019-04-14 2019-04-05      95
40  1.006 2019-04-15 2019-04-30 2019-04-30     174
55  1.072 2019-06-01 2019-06-29 2019-06-14     236
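
As a side note, on pandas 1.2+ the dummy key can be avoided with a cross join; a sketch of the same idea:

df_output = pd.merge(df1, df2, how='cross')
df_output = df_output[(df_output['date'] > df_output['valid_from']) &
                      (df_output['date'] <= df_output['valid_to'])]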

How to subset a Pandas dataframe by date when the column also contains hours, minutes, seconds, millis, and timezone

You can use between:

out = df[df['timestamp'].between(start_date, end_date)]

Out[219]:
    foo                           timestamp
2  eggs 2022-04-18 06:38:28.928778887-04:00
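
A self-contained sketch with made-up timezone-aware data; note that the bounds are made timezone-aware as well so the comparison is valid:

import pandas as pd

df = pd.DataFrame({
    'foo': ['spam', 'ham', 'eggs'],
    'timestamp': pd.to_datetime(['2022-04-16 12:00:00.000000001-04:00',
                                 '2022-04-17 23:59:59.999999999-04:00',
                                 '2022-04-18 06:38:28.928778887-04:00']),
})

start_date = pd.Timestamp('2022-04-18', tz='US/Eastern')
end_date = pd.Timestamp('2022-04-19', tz='US/Eastern')

out = df[df['timestamp'].between(start_date, end_date)]
print(out)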

How to select dataframe rows between two datetimes?

pd.date_range

i = pd.date_range('2018-01-01 09:00:00', periods=3, freq='1H')

# Result
DatetimeIndex(['2018-01-01 09:00:00', '2018-01-01 10:00:00',
               '2018-01-01 11:00:00'],
              dtype='datetime64[ns]', freq='H')

to_datetime

df['date'] = pd.to_datetime(df.date)

loc

df.loc[(df.date >= i[0]) & (df.date <= i[-1])]

                 date  msft  nok  aapl  ibm  amzn
1 2018-01-01 09:00:00   112    1   143  130  1298
2 2018-01-01 10:00:00   109   10   185  137  1647
3 2018-01-01 11:00:00    98   11   146  105  1331
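
The same rows can also be picked with between, which is inclusive on both ends by default (a sketch, using the same i and df as above):

df[df['date'].between(i[0], i[-1])]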

Create a row for each year between two dates

You can use a custom function to compute the range then explode the column:

# Ensure to have datetime
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# Create the new column
date_range = lambda x: range(x['date1'].year, x['date2'].year+1)
df = df.assign(year=df.apply(date_range, axis=1)).explode('year', ignore_index=True)

Output:

>>> df
       date1      date2  year
0 2018-01-01 2020-01-01  2018
1 2018-01-01 2020-01-01  2019
2 2018-01-01 2020-01-01  2020
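
For completeness, a self-contained sketch that reproduces the output above from a hypothetical one-row input:

import pandas as pd

df = pd.DataFrame({'date1': ['2018-01-01'], 'date2': ['2020-01-01']})
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])

# one year value per row, spanning date1..date2 inclusive
date_range = lambda x: range(x['date1'].year, x['date2'].year + 1)
df = df.assign(year=df.apply(date_range, axis=1)).explode('year', ignore_index=True)
print(df)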

Select rows between two DatetimeIndex dates

Using query method:

df = pd.read_csv("my_file.csv", index_col=1, parse_dates=True)

In [121]: df.query("'2017-05-30' <= index <= '2017-06-01'")
Out[121]:
                                vm  LoadInt1
time
2017-05-31 10:00:00  abc-webapp-02  3.133333
2017-05-31 10:05:00  abc-webapp-02  0.000000
2017-05-31 10:10:00  abc-webapp-02  0.000000
2017-05-31 10:15:00  abc-webapp-02  0.000000
2017-05-31 10:20:00  abc-webapp-02  0.000000
2017-05-31 10:25:00  abc-webapp-02  0.000000
2017-05-31 10:30:00  abc-webapp-02  0.000000
2017-05-31 10:35:00  abc-webapp-02  0.000000
2017-05-31 10:40:00  abc-webapp-02  0.000000
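
With a sorted DatetimeIndex, DataFrame.truncate is another option (a sketch, same df):

df.truncate(before='2017-05-30', after='2017-06-01')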

