Subset a Dataframe Between 2 Dates

Select DataFrame rows between two dates

There are two possible solutions:

  • Use a boolean mask, then use df.loc[mask]
  • Set the date column as a DatetimeIndex, then use df.loc[start_date : end_date]

Using a boolean mask:

Ensure df['date'] is a Series with dtype datetime64[ns]:

df['date'] = pd.to_datetime(df['date'])  

Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:

# greater than the start date and smaller than or equal to the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)

Select the sub-DataFrame:

df.loc[mask]

or re-assign to df

df = df.loc[mask]

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])

yields

            0         1         2       date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10

Using a DatetimeIndex:

If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])

yields

                   0         1         2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337

Python list indexing, e.g. seq[start:end], includes start but not end. In contrast, Pandas df.loc[start_date : end_date] includes both end-points in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
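The end-point behaviour can be checked with a small sketch. The data below is hypothetical (one row every other day, so some dates are missing from the index), and the index is sorted, which is assumed here for slicing with labels that are not present:

```python
import pandas as pd

# Hypothetical data: one row every other day (Jun 1, 3, 5, 7, 9, 11).
idx = pd.date_range('2000-06-01', periods=6, freq='2D')
df = pd.DataFrame({'value': range(6)}, index=idx)

# Both bounds are in the index, so both rows are included.
both_present = df.loc['2000-06-03':'2000-06-07']

# Neither bound is in the index; the rows that fall inside the range are returned.
neither_present = df.loc['2000-06-02':'2000-06-08']

print(both_present)
print(neither_present)
```

Both selections return the rows for June 3, 5 and 7.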


Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).
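As a minimal sketch of the parse_dates route, the snippet below reads a small in-memory CSV (the CSV text and the column names are made up for illustration) and then applies the same boolean mask as above:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """date,value
2000-05-30,1
2000-06-02,2
2000-06-05,3
2000-06-12,4
"""

# parse_dates converts the named column to a datetime dtype while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
print(df['date'].dtype)

# No separate pd.to_datetime call is needed before building the mask.
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
```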

Pandas: Select all data from Pandas DataFrame between two dates

I used the following test DataFrame:

   Transaction_date  Value
0 2019-12-31 10
1 2020-01-10 10
2 2020-01-15 10
3 2020-01-20 10
4 2020-01-25 10
5 2020-01-28 10
6 2020-01-29 10
7 2020-01-30 10
8 2020-01-31 10
9 2020-02-01 10
10 2020-02-01 10

Start / end dates are:

start_day = '01.01.2020'
end_day = '31.01.2020'

The code is:

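# Convert start / end dates to datetime (the strings are day.month.year)
start_day = pd.to_datetime(start_day, dayfirst=True)
end_day = pd.to_datetime(end_day, dayfirst=True)

I noticed that you use the datetime module, which I think is bad practice here.
Use the dedicated pandasonic function (to_datetime) for this.
It recognizes many commonly used date formats, so you often do not need
to specify one yourself. For day-first strings such as these, pass
dayfirst=True (or an explicit format), because an ambiguous string like
'01.02.2020' is otherwise read month-first.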

To get the actual result, run:

df[df['Transaction_date'].between(start_day, end_day)]

The result is:

  Transaction_date  Value
1 2020-01-10 10
2 2020-01-15 10
3 2020-01-20 10
4 2020-01-25 10
5 2020-01-28 10
6 2020-01-29 10
7 2020-01-30 10
8 2020-01-31 10

So as you can see:

  • The date from 2019 has been eliminated.
  • The output contains only January dates, to the very end.
  • Dates from the beginning of February have also been eliminated.

Repeat my example on your computer. Maybe the source of your problem
(that the result does not contain all dates from January) is somewhere
else?
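If rows at the boundaries go missing, it can help to check how between treats its end-points. The sketch below uses a shortened, hypothetical version of the test DataFrame and assumes pandas 1.3 or later, where the inclusive parameter accepts 'both', 'neither', 'left' or 'right':

```python
import pandas as pd

# Shortened, hypothetical version of the test data.
df = pd.DataFrame({
    'Transaction_date': pd.to_datetime(
        ['2019-12-31', '2020-01-01', '2020-01-15', '2020-01-31', '2020-02-01']),
    'Value': [10, 10, 10, 10, 10],
})

# An explicit format removes any day/month ambiguity.
start_day = pd.to_datetime('01.01.2020', format='%d.%m.%Y')
end_day = pd.to_datetime('31.01.2020', format='%d.%m.%Y')

# Default is inclusive='both': rows equal to either bound are kept.
closed = df[df['Transaction_date'].between(start_day, end_day)]

# inclusive='left' keeps the start bound but drops rows equal to the end bound.
half_open = df[df['Transaction_date'].between(start_day, end_day, inclusive='left')]

print(closed)
print(half_open)
```

With the default, January 31 is part of the result; with inclusive='left' it is not.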

Extract a subset given two dates from a python dataframe with timezone date format

pd.to_datetime with utc=True

You can pass the optional parameter utc=True to pd.to_datetime to convert timezone-aware inputs to UTC. After that, the date column can be compared with start_date and end_date to subset the df.

Here is a step-by-step example:

print(df)

ID date
0 1 2022-02-03 22:01:12+01:00
1 2 2022-02-04 21:11:21+01:00
2 3 2022-02-05 11:11:21+01:00
3 4 2022-02-07 23:01:12+01:00
4 5 2022-02-07 14:31:14+02:00
5 6 2022-02-08 18:12:01+02:00
6 7 2022-02-09 20:21:02+02:00
7 8 2022-02-11 15:41:25+02:00
8 9 2022-02-15 11:21:27+02:00

# Convert to UTC
df['date'] = pd.to_datetime(df['date'], utc=True)
print(df)

ID date
0 1 2022-02-03 21:01:12+00:00
1 2 2022-02-04 20:11:21+00:00
2 3 2022-02-05 10:11:21+00:00
3 4 2022-02-07 22:01:12+00:00
4 5 2022-02-07 12:31:14+00:00
5 6 2022-02-08 16:12:01+00:00
6 7 2022-02-09 18:21:02+00:00
7 8 2022-02-11 13:41:25+00:00
8 9 2022-02-15 09:21:27+00:00

# Filter the rows with boolean indexing
subset = df[df['date'].between('2022-02-03 21:01:12', '2022-02-07 22:01:11')]
print(subset)

ID date
0 1 2022-02-03 21:01:12+00:00
1 2 2022-02-04 20:11:21+00:00
2 3 2022-02-05 10:11:21+00:00
4 5 2022-02-07 12:31:14+00:00
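The same filter can also be written with explicitly timezone-aware bounds, which makes it clear which instant each bound refers to. This is a sketch on three of the rows above; writing the bounds as pd.Timestamp objects is an assumption for illustration, not part of the original answer:

```python
import pandas as pd

# Three rows from the example above, with mixed UTC offsets.
df = pd.DataFrame({
    'ID': [1, 4, 5],
    'date': ['2022-02-03 22:01:12+01:00',
             '2022-02-07 23:01:12+01:00',
             '2022-02-07 14:31:14+02:00'],
})
df['date'] = pd.to_datetime(df['date'], utc=True)

# Bounds given directly in UTC.
start = pd.Timestamp('2022-02-03 21:01:12', tz='UTC')
end = pd.Timestamp('2022-02-07 22:01:11', tz='UTC')
subset = df[df['date'].between(start, end)]

# The same end bound written with a +01:00 offset denotes the same instant.
end_local = pd.Timestamp('2022-02-07 23:01:11+01:00')
subset_local = df[df['date'].between(start, end_local)]

print(subset)
```

Row ID 4 falls one second after the end bound, so only IDs 1 and 5 are selected.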

How to select dataframe rows between two datetimes?

pd.date_range

i = pd.date_range('2018-01-01 09:00:00', periods=3, freq='1H')

# Result
DatetimeIndex(['2018-01-01 09:00:00', '2018-01-01 10:00:00',
'2018-01-01 11:00:00'],
dtype='datetime64[ns]', freq='H')

to_datetime

df['date'] = pd.to_datetime(df.date)

loc

df.loc[(df.date >= i[0]) & (df.date <= i[-1])]

date msft nok aapl ibm amzn
1 2018-01-01 09:00:00 112 1 143 130 1298
2 2018-01-01 10:00:00 109 10 185 137 1647
3 2018-01-01 11:00:00 98 11 146 105 1331
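Put together, the three steps might look like the sketch below. The DataFrame is hypothetical (only the msft values for 09:00 to 11:00 come from the output above), and freq is passed as a pd.Timedelta rather than an alias string, which is a choice made here to avoid depending on how the hourly alias is spelled in a given pandas version:

```python
import pandas as pd

# Hypothetical hourly data; the first and last rows are invented padding.
df = pd.DataFrame({
    'date': ['2018-01-01 08:00:00', '2018-01-01 09:00:00',
             '2018-01-01 10:00:00', '2018-01-01 11:00:00',
             '2018-01-01 12:00:00'],
    'msft': [107, 112, 109, 98, 101],
})

# Three hourly timestamps starting at 09:00.
i = pd.date_range('2018-01-01 09:00:00', periods=3, freq=pd.Timedelta(hours=1))

# Convert the string column, then keep rows between the first and last timestamp.
df['date'] = pd.to_datetime(df['date'])
subset = df.loc[(df['date'] >= i[0]) & (df['date'] <= i[-1])]
print(subset)
```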

How to subset a Pandas dataframe by date when the column also contains hours, minutes, seconds, millis, and timezone

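You can use Series.between, which keeps rows whose timestamp lies between start_date and end_date (both bounds included by default):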

out = df[df['timestamp'].between(start_date,end_date)]
Out[219]:
foo timestamp
2 eggs 2022-04-18 06:38:28.928778887-04:00
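When the bounds are calendar days but the column carries a time of day, a half-open interval (start included, end excluded) is one way to avoid losing or double-counting rows at midnight. The sketch below uses hypothetical rows and bounds, all with the same -04:00 offset:

```python
import pandas as pd

# Hypothetical rows; every timestamp carries the same -04:00 offset.
df = pd.DataFrame({
    'foo': ['spam', 'ham', 'eggs', 'toast'],
    'timestamp': pd.to_datetime([
        '2022-04-16 23:59:59-04:00',
        '2022-04-17 00:00:00-04:00',
        '2022-04-18 06:38:28-04:00',
        '2022-04-19 00:00:00-04:00',
    ]),
})

# Select April 17 and April 18 in full: from midnight on the 17th (included)
# up to midnight on the 19th (excluded).
start_date = pd.Timestamp('2022-04-17 00:00:00-04:00')
end_date = pd.Timestamp('2022-04-19 00:00:00-04:00')

out = df[(df['timestamp'] >= start_date) & (df['timestamp'] < end_date)]
print(out)
```

Only 'ham' and 'eggs' are kept; the row at exactly midnight on April 19 is excluded.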

Select Data frame between two dates of a date column

First you have to convert your start and end dates into a datetime format. Then you can combine several conditions inside df.loc. Note that the strict comparisons (> and <) below exclude both boundary dates; use >= and <= to include them. Do not forget to assign the result back to your df:

import pandas as pd
from datetime import datetime

df['date']=pd.to_datetime(df['date'], format='%m/%d/%y')

date1 = datetime.strptime('2013-03-23', '%Y-%m-%d')
date2 = datetime.strptime('2013-03-25', '%Y-%m-%d')

df = df.loc[(df['date']>date1) & (df['date']<date2)]
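A self-contained version of this answer, with hypothetical sample dates in the same '%m/%d/%y' format, shows the difference between strict and inclusive comparisons:

```python
import pandas as pd
from datetime import datetime

# Hypothetical data using month/day/two-digit-year strings.
df = pd.DataFrame({
    'date': ['03/22/13', '03/23/13', '03/24/13', '03/25/13', '03/26/13'],
    'value': [1, 2, 3, 4, 5],
})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

date1 = datetime.strptime('2013-03-23', '%Y-%m-%d')
date2 = datetime.strptime('2013-03-25', '%Y-%m-%d')

# Strict comparisons drop rows that fall exactly on date1 or date2.
strict = df.loc[(df['date'] > date1) & (df['date'] < date2)]

# Inclusive comparisons keep them.
inclusive = df.loc[(df['date'] >= date1) & (df['date'] <= date2)]

print(strict)
print(inclusive)
```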

Pandas: Selecting DataFrame rows between two dates (Datetime Index)

Option 1:

Sample df:

df
a
2015-07-16 07:14:41 12
2015-07-16 07:14:48 34
2015-07-16 07:14:54 65
2015-07-16 07:15:01 34
2015-07-16 07:15:07 23
2015-07-16 07:15:14 1

It looks like you're trying this without .loc. Slice the index through .loc instead:

df.loc['2015-07-16 07:00:00':'2015-07-16 23:00:00']
a
2015-07-16 07:14:41 12
2015-07-16 07:14:48 34
2015-07-16 07:14:54 65
2015-07-16 07:15:01 34
2015-07-16 07:15:07 23
2015-07-16 07:15:14 1

Option 2:

You can use boolean indexing on the index:

df[(df.index.get_level_values(0) >= '2015-07-16 07:00:00') & (df.index.get_level_values(0) <= '2015-07-16 23:00:00')]
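Both options can be compared side by side in a short sketch. The rows below are hypothetical and deliberately out of order; sorting the index first is an assumption made here so that slicing with labels that are not in the index behaves predictably. For a single-level index, df.index can be compared directly:

```python
import pandas as pd

# Hypothetical rows, deliberately unsorted, with one row on the following day.
idx = pd.to_datetime([
    '2015-07-16 07:15:07',
    '2015-07-16 07:14:41',
    '2015-07-17 09:00:00',
    '2015-07-16 07:14:54',
])
df = pd.DataFrame({'a': [23, 12, 99, 65]}, index=idx)

# Sort the DatetimeIndex before slicing by label.
df = df.sort_index()

# Option 1: label slice through .loc (both bounds inclusive).
by_slice = df.loc['2015-07-16 07:00:00':'2015-07-16 23:00:00']

# Option 2: boolean mask built directly on the index.
by_mask = df[(df.index >= '2015-07-16 07:00:00') & (df.index <= '2015-07-16 23:00:00')]

print(by_slice)
```

Both selections return the three rows from July 16 and leave out the row from July 17.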

R - How to subset a table between two specific dates?

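You need to convert the date column in the file to class Date. DATE1 and DATE2 must be Date objects as well, so that the comparison is done on dates rather than on strings. For example: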

 LValley <- read.table("LValley.txt", header=TRUE,dec=",", sep="\t", stringsAsFactors=FALSE)

date1 <- as.Date(LValley$date, "%d.%m.%Y %H:%M")
Test2007 <- subset(LValley, date1>=DATE1 & date1 <=DATE2)
dim(Test2007)
#[1] 6249 4

