Subset rows according to a range of time
I'd use the lubridate package and the hour() function to make your life easier:
require(lubridate)
with(df, df[hour(date_time) >= 2 & hour(date_time) < 5, ])
# date_time loc_id node energy kgco2
#3 2009-02-27 02:05:05 87 103 6.40039 3.43701
#4 2009-02-27 03:05:05 87 103 4.79883 2.57697
#5 2009-02-27 04:05:05 87 103 4.10156 2.20254
Subset rows according to a range of time (incl. minutes)
In the code below, for each row I calculate the number of seconds since midnight and check whether that value is within the time range in your question, also converted to seconds since midnight. I've included the code to set up the data with a datetime format (and UTC time zone) since the data sample wasn't provided in reproducible form.
1. Set up the data frame
library(lubridate)
library(tidyverse)
dat = read.table(text="date_time time loc_id node energy kgco2
1 2009-02-27 00:11:08 87 103 0.00000 0.00000
2 2009-02-27 01:05:05 87 103 7.00000 3.75900
3 2009-02-27 02:05:05 87 103 6.40039 3.43701
4 2009-02-28 02:10:05 87 103 4.79883 2.57697
5 2009-02-28 04:05:05 87 103 4.10156 2.20254
6 2009-02-28 05:05:05 87 103 2.59961 1.39599
7 2009-03-01 03:20:05 87 103 2.59961 1.39599",
header=TRUE, stringsAsFactors=FALSE)
dat$date_time = as.POSIXct(paste(dat$date_time, dat$time), tz="UTC")  # paste(), not paste0(), so date and time are space-separated
dat = dat %>% select(-time)
2. Helper function to convert hms time strings to seconds since midnight
hms_to_numeric = function(x) {
x = as.POSIXct(paste("2010-01-01", x))
3600 * hour(x) + 60 * minute(x) + second(x)
}
3. Filter the data to include only rows within the time range
dat %>%
filter(between(as.numeric(date_time) - as.numeric(as.POSIXct(substr(date_time,1,10), tz="UTC")),
hms_to_numeric("02:05:00"),
hms_to_numeric("03:30:00")))
date_time loc_id node energy kgco2
1 2009-02-27 02:05:05 87 103 6.40039 3.43701
2 2009-02-28 02:10:05 87 103 4.79883 2.57697
3 2009-03-01 03:20:05 87 103 2.59961 1.39599
Subset a data frame based on a time sequence
Possibility 1: lexicographic comparison
If all time values are stored as zero-padded 24-hour strings with the same delimiters, e.g. in %H:%M:%S format, then a lexicographic comparison can be used to apply the filter.
DF[DF$Date %in% DATES & DF$Time >= '00:04:00' & DF$Time <= '00:06:00', ]
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
Lexicographic solutions are, of course, not ideal, because they do not lend themselves to time-based arithmetic such as addition, subtraction, and scaling. Better solutions involve transforming the time values to a numerical type that encodes time durations as an offset from an explicit or unspecified base time. This is how popular date/time libraries encode their types: boost date_time for C++, Joda-Time for Java, and POSIXct, difftime, and lubridate for R.
Possibility 2: manual numerics
It's possible to parse the strings ourselves to construct numerics representing the time durations, and use numerical comparison to apply the filter.
# seconds since midnight from an 'HH:MM:SS' string
hmsToDouble <- function(hms) {
    as.double(substr(hms, 1, 2)) * 3600 +
        as.double(substr(hms, 4, 5)) * 60 +
        as.double(substr(hms, 7, 8))
}
DF[DF$Date %in% DATES &
   hmsToDouble(DF$Time) >= hmsToDouble('00:04:00') &
   hmsToDouble(DF$Time) <= hmsToDouble('00:06:00'), ]
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
Possibility 3: POSIXt
We can generate vectors of POSIXt (that is, POSIXct or POSIXlt) values and use vectorized comparisons against these vectors.
DF[DF$Date %in% DATES &
   DF$DateTime >= as.POSIXct(paste0(DF$Date, ' 00:04:00')) &
   DF$DateTime <= as.POSIXct(paste0(DF$Date, ' 00:06:00')), ]
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
Possibility 4: difftime
The only built-in time duration data type in R is the difftime type, which can be a little bit finicky to work with. But for this problem, it's fairly straightforward.
DF[DF$Date %in% DATES &
   as.difftime(DF$Time) >= as.difftime('00:04:00') &
   as.difftime(DF$Time) <= as.difftime('00:06:00'), ]
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
Possibility 5: lubridate
The lubridate package is widely considered to be the best package for date/time handling in R. It provides a duration type which represents regular time durations, and a period type which allows representing counts of various irregular time units. Historically, date/time libraries have sometimes failed because they lacked an appreciation for the distinction between irregular time periods and regular time durations.
In the following solution, the hms() calls return instances of the period type, hence we are actually comparing separate time units. Incidentally, with respect to the actual storage of those units, lubridate's design is to store the seconds value as the payload of the double vector, and the remaining units (minutes, hours, days, months, and years) as attributes on the object.
library(lubridate)
DF[DF$Date %in% DATES &
   hms(DF$Time) >= hms('00:04:00') &
   hms(DF$Time) <= hms('00:06:00'), ]
## DateTime XXX Time Date
## 1373 2016-04-25 00:06:00 14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00 3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00 2 00:04:00 2016-04-25
subset a dataframe in R within a specific time range
Here's a dplyr solution:
library(dplyr)
dataBase %>%
mutate(date = as.Date(date, format = "%d/%m/%Y")) %>%
filter(date >= "2020-07-30" & date <= "2020-08-30")
a date
V12 -0.23017749 2020-08-28
V13 1.55870831 2020-08-01
V21 0.07050839 2020-08-27
V32 -1.26506123 2020-08-01
V41 -0.44566197 2020-08-28
V43 0.35981383 2020-08-01
V52 0.11068272 2020-08-01
Data:
set.seed(123)
dataBase <- data.frame(a = rnorm(15), date = unlist(read.table(text = '"30/06/2020" "27/08/2020" "30/06/2020" "28/08/2020" "30/06/2020"
"28/08/2020" "30/06/2020" "01/08/2020" "30/06/2020" "01/08/2020"
"01/08/2020" "30/06/2020" "30/06/2020" "01/08/2020" "30/06/2019"')))
Subset column based on a range of time
I downloaded your data and had a look. If I am not mistaken, all you need is to subset the data using Time.h. Here you have a range of time (10-23) you want. I used dplyr and did the following: you are asking R to pick up rows whose Time.h values lie between 10 and 23. Your data frame is called mydf here.
library(dplyr)
filter(mydf, between(Time.h, 10, 23))
How can I subset a dataframe based on time of day in r?
Using lubridate
library(lubridate)
df <- data.frame(
  ID = c(1, 2, 3, 4),
  Street = rep("Saints Road", 4),
  Date = c("2020-12-31 23:00:00", "2021-01-01 03:00:00",
           "2021-06-01 04:00:00", "2021-07-06 22:00:00")
)
df$Date <- as.POSIXct(df$Date)  # POSIXct, not POSIXlt: dplyr verbs don't support POSIXlt columns
df %>%
filter(hour(Date) >= 3 & hour(Date) <= 21)
Output:
ID Street Date
1 2 Saints Road 2021-01-01 03:00:00
2 3 Saints Road 2021-06-01 04:00:00
Select DataFrame rows between two dates
There are two possible solutions:
- Use a boolean mask, then use df.loc[mask]
- Set the date column as a DatetimeIndex, then use df[start_date : end_date]
Using a boolean mask:
Ensure df['date'] is a Series with dtype datetime64[ns]:
df['date'] = pd.to_datetime(df['date'])
Make a boolean mask. start_date and end_date can be datetime.datetime objects, np.datetime64 objects, pd.Timestamps, or even datetime strings:
# greater than start_date and less than or equal to end_date
mask = (df['date'] > start_date) & (df['date'] <= end_date)
Select the sub-DataFrame:
df.loc[mask]
or re-assign to df:
df = df.loc[mask]
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
mask = (df['date'] > '2000-6-1') & (df['date'] <= '2000-6-10')
print(df.loc[mask])
yields
0 1 2 date
153 0.208875 0.727656 0.037787 2000-06-02
154 0.750800 0.776498 0.237716 2000-06-03
155 0.812008 0.127338 0.397240 2000-06-04
156 0.639937 0.207359 0.533527 2000-06-05
157 0.416998 0.845658 0.872826 2000-06-06
158 0.440069 0.338690 0.847545 2000-06-07
159 0.202354 0.624833 0.740254 2000-06-08
160 0.465746 0.080888 0.155452 2000-06-09
161 0.858232 0.190321 0.432574 2000-06-10
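As an aside, the same filter can be written with Series.between, a pandas built-in that is inclusive on both ends by default (a minimal sketch reusing the synthetic frame from above; the left bound starts one day later to match the (start, end] mask):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200, 3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')

# between() is inclusive on both ends by default, so starting one day
# later reproduces the strictly-greater-than-start mask above
subset = df.loc[df['date'].between('2000-6-2', '2000-6-10')]
print(subset)
```

Recent pandas versions also accept an `inclusive` argument ('both', 'left', 'right', 'neither') if you prefer to keep the original bounds.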
Using a DatetimeIndex:
If you are going to do a lot of selections by date, it may be quicker to set the date column as the index first. Then you can select rows by date using df.loc[start_date:end_date].
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2000-6-1':'2000-6-10'])
yields
0 1 2
date
2000-06-01 0.040457 0.326594 0.492136 # <- includes start_date
2000-06-02 0.279323 0.877446 0.464523
2000-06-03 0.328068 0.837669 0.608559
2000-06-04 0.107959 0.678297 0.517435
2000-06-05 0.131555 0.418380 0.025725
2000-06-06 0.999961 0.619517 0.206108
2000-06-07 0.129270 0.024533 0.154769
2000-06-08 0.441010 0.741781 0.470402
2000-06-09 0.682101 0.375660 0.009916
2000-06-10 0.754488 0.352293 0.339337
While Python list indexing, e.g. seq[start:end], includes start but not end, Pandas df.loc[start_date : end_date] includes both endpoints in the result if they are in the index. Neither start_date nor end_date has to be in the index, however.
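A quick sketch of that last point, using the same synthetic frame as above: both slice bounds below fall at noon, so neither is an actual index label, yet .loc still returns every midnight timestamp inside the interval.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((200, 3)))
df['date'] = pd.date_range('2000-1-1', periods=200, freq='D')
df = df.set_index('date')

# neither bound exists in the index; .loc slices a sorted DatetimeIndex
# by where the bounds would fall, keeping the rows inside the interval
subset = df.loc['2000-6-1 12:00':'2000-6-9 12:00']
print(subset)
```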
Also note that pd.read_csv has a parse_dates parameter which you could use to parse the date column as datetime64. Thus, if you use parse_dates, you would not need df['date'] = pd.to_datetime(df['date']).
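For instance, a minimal sketch (the inline CSV here is invented for illustration):

```python
import io
import pandas as pd

csv = io.StringIO("date,value\n2000-06-01,1\n2000-06-02,2\n2000-06-03,3\n")

# parse_dates converts the 'date' column to datetime64 during the read,
# so no separate pd.to_datetime step is needed
df = pd.read_csv(csv, parse_dates=['date'])

subset = df[df['date'] > '2000-06-01']  # strings compare fine against datetime64
print(subset)
```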
Subsetting data.table set by date range in R
Why not:
testset[date >= "2013-08-02" & date <= "2013-11-01"]
R: Subset rows from dataframe based on range of given time
Using dplyr and lubridate gives a solution.
Before we start, make sure all dates are formatted in the same way:
df1 <- df1 %>%
mutate(Start_Date=ymd(Start_Date), End_Date=dmy(End_Date))
df2 <- df2 %>%
mutate(DateTime=ymd(DateTime))
In your case it's only necessary for the column End_Date.
First I cross-join both data frames, since I don't see an easier way to combine them.
df3 <- merge(df1, df2, all=TRUE)
Next, using filter and between:
df3 %>%
filter(between(DateTime, Start_Date, End_Date)) %>%
select(-c(Start_Date, End_Date))
gives
Value DateTime
1 3 2003-01-01
2 3 2003-05-09
3 4 2004-12-31
4 5 2005-01-31
5 5 2005-08-13
Another option, using package data.table:
setDT(df1)
setDT(df2)
df1[df2, on = .(Start_Date <= DateTime, End_Date >= DateTime),
.(DateTime, Value)]
yields
DateTime Value
1: 2003-01-01 3
2: 2003-05-09 3
3: 2004-12-31 4
4: 2005-01-31 5
5: 2005-08-13 5