Subset Rows According to a Range of Time

Subset rows according to a range of time

I'd use the lubridate package and the hour() function to make your life easier...

library(lubridate)

with( df , df[ hour( date_time ) >= 2 & hour( date_time ) < 5 , ] )

# date_time loc_id node energy kgco2
#3 2009-02-27 02:05:05 87 103 6.40039 3.43701
#4 2009-02-27 03:05:05 87 103 4.79883 2.57697
#5 2009-02-27 04:05:05 87 103 4.10156 2.20254
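If you'd rather not depend on lubridate, the same hour filter can be sketched in base R. The data frame below is made up for illustration (only the date_time column name comes from the question), and the time zone is assumed to be UTC.

```r
# Hypothetical sample data; only the date_time column mirrors the question.
df <- data.frame(
  date_time = as.POSIXct(
    c("2009-02-27 01:05:05", "2009-02-27 02:05:05",
      "2009-02-27 04:59:59", "2009-02-27 05:00:00"),
    tz = "UTC"
  ),
  energy = c(7.0, 6.4, 4.1, 2.6)
)

# format(x, "%H") returns the hour as a zero-padded string; convert it to integer.
hr <- as.integer(format(df$date_time, "%H"))

# Keep rows from 02:00:00 up to, but not including, 05:00:00.
res <- df[hr >= 2 & hr < 5, ]
print(res)
```

Note that the upper limit is exclusive here, exactly as in the hour() version above, so a timestamp of 05:00:00 is dropped.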

Subset rows according to a range of time (incl. minutes)

In the code below, for each row I calculate the number of seconds since midnight and check whether that value is within the time range in your question, also converted to seconds since midnight. I've included the code to set up the data with a datetime format (and UTC time zone) since the data sample wasn't provided in reproducible form.

1. Set up the data frame

library(lubridate)
library(tidyverse)

dat = read.table(text="date_time time loc_id node energy kgco2
1 2009-02-27 00:11:08 87 103 0.00000 0.00000
2 2009-02-27 01:05:05 87 103 7.00000 3.75900
3 2009-02-27 02:05:05 87 103 6.40039 3.43701
4 2009-02-28 02:10:05 87 103 4.79883 2.57697
5 2009-02-28 04:05:05 87 103 4.10156 2.20254
6 2009-02-28 05:05:05 87 103 2.59961 1.39599
7 2009-03-01 03:20:05 87 103 2.59961 1.39599",
header=TRUE, stringsAsFactors=FALSE)

dat$date_time = as.POSIXct(paste(dat$date_time, dat$time), tz="UTC")
dat = dat %>% select(-time)

2. Helper function to convert hms time strings to seconds since midnight

hms_to_numeric = function(x) {
x = as.POSIXct(paste("2010-01-01", x), tz="UTC")
3600 * hour(x) + 60 * minute(x) + second(x)
}

3. Filter the data to include only rows within the time range

dat %>% 
filter(between(as.numeric(date_time) - as.numeric(as.POSIXct(substr(date_time,1,10), tz="UTC")),
hms_to_numeric("02:05:00"),
hms_to_numeric("03:30:00")))
            date_time loc_id node  energy   kgco2
1 2009-02-27 02:05:05 87 103 6.40039 3.43701
2 2009-02-28 02:10:05 87 103 4.79883 2.57697
3 2009-03-01 03:20:05 87 103 2.59961 1.39599
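Because date_time is stored in UTC, seconds since midnight can also be obtained with modular arithmetic on the underlying numeric value. Here is a minimal base R sketch under that assumption; the timestamps are a subset of the sample data above and the variable names are only illustrative.

```r
# POSIXct stores seconds since 1970-01-01 00:00:00 UTC, so for UTC timestamps
# the remainder modulo 86400 is the number of seconds since midnight.
date_time <- as.POSIXct(
  c("2009-02-27 01:05:05", "2009-02-27 02:05:05",
    "2009-02-28 02:10:05", "2009-02-28 04:05:05",
    "2009-03-01 03:20:05"),
  tz = "UTC"
)
secs <- as.numeric(date_time) %% 86400

# Range limits 02:05:00 and 03:30:00 expressed in seconds since midnight.
lower <- 2 * 3600 + 5 * 60
upper <- 3 * 3600 + 30 * 60

# Inclusive at both ends, like between().
keep <- secs >= lower & secs <= upper
print(date_time[keep])
```

This shortcut only holds for UTC. For a time zone with an offset or daylight saving changes, the offset has to be accounted for before taking the remainder.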

Subset a data frame based on a time sequence

Possibility 1: lexicographic comparison

If all time values are stored as zero-padded 24-hour strings with the same delimiters, such as %H:%M:%S, then a lexicographic comparison can be used to apply the filter.

DF[DF$Date%in%DATES & DF$Time>='00:04:00' & DF$Time<='00:06:00',];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25

Lexicographic solutions are, of course, not ideal, because they do not lend themselves to time-based math, such as adding, subtracting, multiplying, dividing, etc.

Better solutions involve transforming the time values to a numerical type that encodes time durations as an offset from an explicit or unspecified base time. This is how popular date/time libraries encode types, such as boost date_time for C++, Joda-Time for Java, and POSIXct, difftime, and lubridate for R.
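To illustrate why the zero-padding condition matters for lexicographic comparison, and what a numeric encoding buys you, here is a small sketch with made-up time strings. The variable names are only for illustration.

```r
# Unpadded hours break string comparison: "9" sorts after "1".
unpadded <- "9:30:00" < "10:00:00"   # FALSE, although 9:30 is earlier than 10:00
padded   <- "09:30:00" < "10:00:00"  # TRUE once the hour is zero-padded

# Encoded as seconds since midnight, the values compare correctly
# and also support arithmetic.
t1 <- 9 * 3600 + 30 * 60   # 09:30:00
t2 <- 10 * 3600            # 10:00:00
numeric_cmp <- t1 < t2     # TRUE
gap_minutes <- (t2 - t1) / 60

print(c(unpadded = unpadded, padded = padded, numeric_cmp = numeric_cmp))
print(gap_minutes)
```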


Possibility 2: manual numerics

It's possible to parse the strings ourselves to construct numerics representing the time durations, and use numerical comparison to apply the filter.

hmsToDouble <- function(hms) as.double(substr(hms,1,2))*3600 + as.double(substr(hms,4,5))*60 + as.double(substr(hms,7,8));
DF[DF$Date%in%DATES & hmsToDouble(DF$Time)>=hmsToDouble('00:04:00') & hmsToDouble(DF$Time)<=hmsToDouble('00:06:00'),];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25

Possibility 3: POSIXt

We can generate vectors of POSIXt (that is, POSIXct or POSIXlt) values and use vectorized comparisons against these vectors.

DF[DF$Date%in%DATES & DF$DateTime>=as.POSIXct(paste0(DF$Date,' 00:04:00')) & DF$DateTime<=as.POSIXct(paste0(DF$Date,' 00:06:00')),];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25

Possibility 4: difftime

The only built-in time duration data type in R is the difftime type, which can be a little bit finicky to work with. But for this problem, it's fairly straightforward.

DF[DF$Date%in%DATES & as.difftime(DF$Time)>=as.difftime('00:04:00') & as.difftime(DF$Time)<=as.difftime('00:06:00'),];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
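Part of what makes difftime finicky is that as.difftime() falls back on a locale-dependent format and automatically chosen units. A hedged sketch that spells out both arguments, using made-up time strings rather than the DF from the question:

```r
# Parse clock times into difftime values measured in seconds.
# format and units are given explicitly instead of relying on the defaults.
times <- c("00:03:00", "00:04:00", "00:05:30", "00:07:00")
d <- as.difftime(times, format = "%H:%M:%S", units = "secs")

lower <- as.difftime("00:04:00", format = "%H:%M:%S", units = "secs")
upper <- as.difftime("00:06:00", format = "%H:%M:%S", units = "secs")

# Vectorized comparison, inclusive at both ends.
keep <- d >= lower & d <= upper
print(times[keep])

# difftime values convert to plain numbers in whatever unit is requested.
mins <- as.numeric(d, units = "mins")
print(mins)
```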

Possibility 5: lubridate

The lubridate package is widely considered to be the best package for date/time handling in R. It provides a duration type which represents regular time durations, and a period type which allows representing counts of various irregular time units. Historically, date/time libraries have sometimes failed because they lacked an appreciation for the distinction between irregular time periods and regular time durations.

In the following solution, the hms() calls return instances of the period type, hence we are actually comparing separate time units. Incidentally, with respect to the actual storage of the separate time units, lubridate's design is to store the seconds values as the actual payload of the double vector, and the remaining units (minutes, hours, days, months, and years) as slots on the S4 object, which R implements as attributes.

library(lubridate);
DF[DF$Date%in%DATES & hms(DF$Time)>=hms('00:04:00') & hms(DF$Time)<=hms('00:06:00'),];
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
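To make the period type a bit more concrete, the sketch below parses a single made-up time string and reads the units back. It assumes lubridate is installed; p is just an illustrative name.

```r
library(lubridate)

# hms() parses "HH:MM:SS" strings into a Period object.
p <- hms("01:02:03")
is.period(p)

# The individual units can be read back with lubridate's accessors.
hour(p)
minute(p)
second(p)

# period_to_seconds() collapses the period into a single number of seconds,
# which is convenient for numeric range checks.
total <- period_to_seconds(p)
print(total)
```

For periods that contain only hours, minutes, and seconds, the conversion to seconds is exact. Periods with months or years can only be converted approximately, which is the irregularity the answer refers to.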

Subset a dataframe in R within a specific time range

Here's a dplyr solution:

library(dplyr)
dataBase %>%
mutate(date = as.Date(date, format = "%d/%m/%Y")) %>%
filter(date >= as.Date("2020-07-30") & date <= as.Date("2020-08-30"))
a date
V12 -0.23017749 2020-08-28
V13 1.55870831 2020-08-01
V21 0.07050839 2020-08-27
V32 -1.26506123 2020-08-01
V41 -0.44566197 2020-08-28
V43 0.35981383 2020-08-01
V52 0.11068272 2020-08-01

Data:

set.seed(123)
dataBase <- data.frame(a = rnorm(15), date = unlist(read.table(text = '"30/06/2020" "27/08/2020" "30/06/2020" "28/08/2020" "30/06/2020"
"28/08/2020" "30/06/2020" "01/08/2020" "30/06/2020" "01/08/2020"
"01/08/2020" "30/06/2020" "30/06/2020" "01/08/2020" "30/06/2019"')))

Subset column based on a range of time

I downloaded your data and had a look. If I am not mistaken, all you need to do is subset the data on Time.h, keeping the range of hours you want (10 to 23). I used dplyr and did the following, which asks R to pick the rows whose Time.h value is between 10 and 23, inclusive. Your data frame is called mydf here.

library(dplyr)
filter(mydf, between(Time.h, 10, 23))
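between() includes both limits, so the call above is equivalent to the base R comparison sketched below. Since the original data isn't shown, mydf here is a made-up stand-in; only the Time.h column name comes from the answer.

```r
# Hypothetical stand-in for the asker's data; only Time.h is taken from the answer.
mydf <- data.frame(Time.h = c(0, 9, 10, 15, 23), value = 1:5)

# Base R equivalent of filter(mydf, between(Time.h, 10, 23)):
# both the lower and the upper limit are included.
res <- mydf[mydf$Time.h >= 10 & mydf$Time.h <= 23, ]
print(res)
```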

How can I subset a dataframe based on time of day in R?

Using lubridate

library(lubridate)
library(dplyr)   # provides %>% and filter()
df <- data.frame(ID = c(1,2,3,4),Street = c("Saints Road","Saints Road","Saints Road","Saints Road"),Date = c("2020-12-31 23:00:00","2021-01-01 03:00:00","2021-06-01 04:00:00","2021-07-06 22:00:00"))
# Use POSIXct rather than POSIXlt: dplyr does not support POSIXlt columns
df$Date <- as.POSIXct(df$Date, tz = "UTC")

df %>%
filter(hour(Date) >= 3 & hour(Date) <= 21)

Output:

 ID      Street                Date
1 2 Saints Road 2021-01-01 03:00:00
2 3 Saints Road 2021-06-01 04:00:00

Select DataFrame rows between two dates

There are two possible solutions:

  • Use a boolean mask, then use df.loc[mask]
  • Set the date column as a DatetimeIndex, then use df.loc[start_date:end_date]

Using a boolean mask:

Ensure df['date'] is a Series with dtype datetime64[ns]:

df['date'] = pd.to_datetime(df['date'])  

Make a boolean mask. start_date and end_date can be datetime.datetimes,
np.datetime64s, pd.Timestamps, or even datetime strings:

# greater than the start date and less than or equal to the end date
mask = (df['date'] > start_date) & (df['date'] <= end_date)

Select the sub-DataFrame:

df.loc[mask]

or re-assign to df

df = df.loc[mask]

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])

yields

            0         1         2       date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10

Using a DatetimeIndex:

If you are going to do a lot of selections by date, it may be quicker to set the
date column as the index first. Then you can select rows by date using
df.loc[start_date:end_date].

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])

yields

                   0         1         2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337

Python list slicing, e.g. seq[start:end], includes start but not end. In contrast, pandas df.loc[start_date:end_date] includes both endpoints in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.


Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64s. Thus, if you use parse_dates, you would not need to use df['date'] = pd.to_datetime(df['date']).

Subsetting data.table set by date range in R

Why not:

testset[date>="2013-08-02" & date<="2013-11-01"]

R: Subset rows from dataframe based on range of given time

Using dplyr and lubridate gives a solution.

Before we start, make sure all dates are formatted in the same way:

library(dplyr)
library(lubridate)

df1 <- df1 %>%
mutate(Start_Date=ymd(Start_Date), End_Date=dmy(End_Date))

df2 <- df2 %>%
mutate(DateTime=ymd(DateTime))

In your case it's only necessary for your column End_Date.

First I cross join both data frames, since I don't see an easier way to combine them. Assuming df1 and df2 share no column names, merge() returns every combination of their rows.

df3 <- merge(df1, df2, all=TRUE)

Next using filter and between

df3 %>% 
filter(between(DateTime, Start_Date, End_Date)) %>%
select(-c(Start_Date, End_Date))

gives

  Value   DateTime
1 3 2003-01-01
2 3 2003-05-09
3 4 2004-12-31
4 5 2005-01-31
5 5 2005-08-13
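Since df1 and df2 aren't shown here, the following is a self-contained base R sketch of the same idea (cross join, then range filter) on made-up data. All values are invented for illustration; by = NULL asks merge() for the Cartesian product explicitly rather than relying on the absence of shared column names.

```r
# Hypothetical ranges and dates; the column names follow the answer.
df1 <- data.frame(
  Start_Date = as.Date(c("2003-01-01", "2004-01-01")),
  End_Date   = as.Date(c("2003-12-31", "2004-12-31")),
  Value      = c(3, 4)
)
df2 <- data.frame(
  DateTime = as.Date(c("2003-05-09", "2004-12-31", "2006-01-01"))
)

# Cross join: every row of df1 paired with every row of df2.
df3 <- merge(df1, df2, by = NULL)

# Keep the pairs where DateTime falls inside the range, limits included.
in_range <- df3$DateTime >= df3$Start_Date & df3$DateTime <= df3$End_Date
res <- df3[in_range, c("Value", "DateTime")]
print(res)
```

The cross join grows as nrow(df1) * nrow(df2), so for large inputs a non-equi join is likely the better fit.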

Another option using package data.table

library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on = .(Start_Date <= DateTime, End_Date >= DateTime),
.(DateTime, Value)]

yields

     DateTime Value
1: 2003-01-01 3
2: 2003-05-09 3
3: 2004-12-31 4
4: 2005-01-31 5
5: 2005-08-13 5

