Subset observations that differ by at least 30 minutes time
Here's what I would do:
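library(data.table)

setDT(DT, key=c("id","datetime")) # invalid selfref with the OP's example data

s = 0L                        # tag counter: 1 for the first kept row per id, 2 for the next, ...
w = DT[, .I[1L], by=id]$V1    # row numbers of the first row of each id
while (length(w)){
  s = s + 1L
  DT[w, tag := s]             # mark the currently selected rows as kept
  # for each kept row, the earliest time the next kept row may have
  m = DT[w, .(id, datetime = datetime+30*60)]
  # rolling join: first row per id at or after that time (NA if there is none)
  w = DT[m, which = TRUE, roll=-Inf]
  w = w[!is.na(w)]
}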
which gives
               datetime          x id  keep tag
 1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
 2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
 3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
 4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
 5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
 6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
 7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
 8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
 9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4
The idea behind it is described by the OP in a comment:
per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n and keep the first that differs by more than 30 mins and so on
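For a single id, the rule quoted above can also be sketched as a plain loop in base R. This is only an illustration, not the data.table approach from the answer: the helper name keep_min_gap and its arguments are made up here, and the timestamps are assumed to be sorted in ascending order.

```r
# Greedy "minimum gap" filter for one group; `times` must be sorted ascending.
# keep_min_gap is a hypothetical helper, not part of any package.
keep_min_gap <- function(times, gap_secs) {
  secs <- as.numeric(times)          # POSIXct -> seconds since the epoch
  keep <- logical(length(secs))
  last_kept <- -Inf                  # guarantees that the first row is kept
  for (i in seq_along(secs)) {
    if (secs[i] >= last_kept + gap_secs) {
      keep[i] <- TRUE
      last_kept <- secs[i]           # later gaps are measured from this row
    }
  }
  keep
}

# Timestamps of id 1 from the output above
times <- as.POSIXct(
  paste("2016-04-28", c("10:20:18", "10:41:34", "10:46:07", "11:00:56", "11:09:11",
                        "11:39:09", "11:50:01", "11:51:46", "11:57:58", "11:58:23")),
  tz = "UTC")
which(keep_min_gap(times, 30 * 60))  # rows 1, 4 and 6, matching the tag column for id 1
```

The loop visits each row once, whereas the while loop in the answer advances all ids at the same time through a rolling join.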
Filter rows by a time threshold
There may be a more elegant way to do it, but this works:
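library(dplyr)

# Returns TRUE for each timestamp at least one hour after the last kept one.
# Assumes dt is sorted in ascending order within each group.
isHourApart <- function(dt) {
  min <- 0        # time of the last kept row, in seconds since the epoch
  keeps <- c()
  for (d in dt) { # iterating over POSIXct yields plain numeric seconds
    if (d >= min + 60 * 60) {
      min <- d
      keeps <- c(keeps, TRUE)
    } else {
      keeps <- c(keeps, FALSE)
    }
  }
  keeps
}

df %>%
  group_by(Species) %>%
  filter(isHourApart(DateTime))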
> df
# A tibble: 5 x 3
# Groups:   Species [2]
  ID    Species DateTime
  <chr> <fct>   <dttm>
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00
Note that the DateTime column is of class POSIXct.
Adaptive division in time intervals for a set of observations
Here is a solution, with a good old for loop:
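# minutes elapsed since the previous observation (0 for the first row);
# assumes DateTime is POSIXct. The units are set explicitly because diff()
# picks them automatically (secs, mins, hours, ...).
df$difftime <- c(0, as.numeric(diff(df$DateTime), units = "mins"))
df$group <- 1
time_in_group <- 0
for (i in seq_len(nrow(df))[-1]) {
  time_in_group <- time_in_group + df$difftime[i]
  if (time_in_group < 10) {
    df$group[i] <- df$group[i-1]
  } else {
    # at least 10 minutes accumulated: start a new group
    time_in_group <- 0
    df$group[i] <- 1 + df$group[i-1]
  }
}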
Identify discrete events based on a time difference of 30 minutes or more in R
Here is an option
library(tidyverse)
df %>%
  mutate(
    timestampUTC = as.POSIXct(timestampUTC),
    # gap to the previous row in minutes (0 for the first row)
    diff = c(0, as.numeric(diff(timestampUTC), units = "mins")),
    # a gap of more than 30 minutes starts a new event
    grp = cumsum(diff > 30)) %>%
  group_by(grp) %>%
  summarise(
    ID = first(ID),
    location = first(location),
    `event start` = first(timestampUTC),
    `event end` = last(timestampUTC))
## A tibble: 7 x 5
#    grp ID             location `event start`       `event end`
#  <int> <fct>          <fct>    <dttm>              <dttm>
#1     0 A69-1601-47272 JB12     2017-10-02 19:23:27 2017-10-02 19:31:46
#2     1 A69-1601-47272 JB12     2017-10-02 23:52:15 2017-10-02 23:55:13
#3     2 A69-1601-47272 JB13     2017-10-03 19:53:50 2017-10-03 19:58:26
#4     3 A69-1601-47280 JB12     2017-10-04 13:15:13 2017-10-04 13:21:39
#5     4 A69-1601-47280 JB12     2017-10-04 19:34:54 2017-10-04 20:21:43
#6     5 A69-1601-47280 JB13     2017-10-05 04:55:48 2017-10-05 05:18:40
#7     6 A69-1601-47280 JB13     2017-10-07 21:24:19 2017-10-07 21:29:25
I've kept some of the intermediate steps (columns) to help with readability and understanding. In short, we convert the timestamps to POSIXct, calculate the time differences in minutes between successive timestamps with diff, and start a new group of observations whenever the gap since the previous timestamp is > 30 minutes. The rest is grouping by grp and summarising entries from the relevant columns.
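To see how the cumsum step turns gaps into group numbers, here is a minimal base R sketch. The minute offsets are invented for illustration and are not the sample data of this answer.

```r
# Hypothetical observation times, in minutes since the first observation
mins <- c(0, 3, 4, 8, 269, 270, 272, 1470)

gap <- c(0, diff(mins))   # gap to the previous observation (0 for the first one)
grp <- cumsum(gap > 30)   # the counter goes up by 1 at every gap longer than 30 minutes

data.frame(mins, gap, grp)
```

Rows that share a grp value belong to the same event, so the first and last timestamp per grp give the start and end of that event.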
The same, more succinct (perhaps at the expense of readability):
df %>%
  group_by(grp = cumsum(c(0, as.numeric(diff(as.POSIXct(timestampUTC)), units = "mins")) > 30)) %>%
  summarise(
    ID = first(ID),
    location = first(location),
    `event start` = first(timestampUTC),
    `event end` = last(timestampUTC)) %>%
  select(-grp)
Sample data
df <- read.table(text =
"timestampUTC location ID
'2017-10-02 19:23:27' JB12 A69-1601-47272
'2017-10-02 19:26:48' JB12 A69-1601-47272
'2017-10-02 19:27:23' JB12 A69-1601-47272
'2017-10-02 19:31:46' JB12 A69-1601-47272
'2017-10-02 23:52:15' JB12 A69-1601-47272
'2017-10-02 23:53:26' JB12 A69-1601-47272
'2017-10-02 23:55:13' JB12 A69-1601-47272
'2017-10-03 19:53:50' JB13 A69-1601-47272
'2017-10-03 19:55:23' JB13 A69-1601-47272
'2017-10-03 19:58:26' JB13 A69-1601-47272
'2017-10-04 13:15:13' JB12 A69-1601-47280
'2017-10-04 13:16:42' JB12 A69-1601-47280
'2017-10-04 13:21:39' JB12 A69-1601-47280
'2017-10-04 19:34:54' JB12 A69-1601-47280
'2017-10-04 19:55:28' JB12 A69-1601-47280
'2017-10-04 20:08:23' JB12 A69-1601-47280
'2017-10-04 20:21:43' JB12 A69-1601-47280
'2017-10-05 04:55:48' JB13 A69-1601-47280
'2017-10-05 04:57:04' JB13 A69-1601-47280
'2017-10-05 05:18:40' JB13 A69-1601-47280
'2017-10-07 21:24:19' JB13 A69-1601-47280
'2017-10-07 21:25:36' JB13 A69-1601-47280
'2017-10-07 21:29:25' JB13 A69-1601-47280", header = TRUE)
data.table time subset vs xts time subset
If you're ok with specifying your range in UTC, you can do:
j[(.index(j) %% 86400) %between% c(10*3600, 16*3600 + 60)]
# +60 because xts includes that minute; you'll need to offset the times
# appropriately to match with xts unless you live in UTC :)
j <- xts(rnorm(10e6),Sys.time()-(10e6:1))
system.time(j[(.index(j) %% 86400) %between% c(10*3600, 16*3600 + 60)])
# user system elapsed
# 1.17 0.08 1.25
# likely faster on your machine as mine takes minutes to run the OP bench
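The modulo trick relies on POSIXct values being stored as seconds since 1970-01-01 00:00:00 UTC, so the remainder after dividing by 86400 is the number of seconds since midnight UTC. A small base R sketch with made-up timestamps, using plain comparisons in place of %between% so that it needs neither xts nor data.table:

```r
# Hypothetical timestamps around the 10:00 to 16:00 window, in UTC
stamps <- as.POSIXct(c("2020-01-01 09:59:59", "2020-01-01 10:00:00",
                       "2020-01-02 16:00:59", "2020-01-02 16:01:01"), tz = "UTC")

secs_of_day <- as.numeric(stamps) %% 86400   # seconds since midnight UTC

# Same bounds as in the answer: 10:00:00 up to and including 16:01:00
in_range <- secs_of_day >= 10 * 3600 & secs_of_day <= 16 * 3600 + 60
in_range   # FALSE TRUE TRUE FALSE
```

For any other time zone the bounds would have to be shifted by the UTC offset, as the comment in the answer points out.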
Is there an R function to filter a dataset in 15 secs interval?
One potential solution is with dplyr, though I am sure there may be better options available, especially with data.table. As suggested by @42- and demonstrated by @Maurits Evers, you can do the following:
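library(dplyr)
d_cor %>%
  arrange(Time) %>%
  mutate(
    # seconds since the previous row; units set explicitly because
    # subtracting POSIXct values picks them automatically
    diff = as.numeric(abs(lag(Time) - Time), units = "secs"),
    diff = ifelse(is.na(diff), 0, diff),
    # index of the fixed 15-second window, counted from the first row
    cumdiff = cumsum(diff) %/% 15,
    x = abs(lag(cumdiff) - cumdiff)) %>%
  filter(is.na(x) | x > 0) %>%
  select(Depth, Time)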
  Depth                Time
1   0.1 2018-06-24 01:26:40
2   0.2 2018-06-24 01:26:56
3   0.1 2018-06-24 01:27:14
4   0.1 2018-06-24 01:27:30
diff holds the difference in seconds between the times of consecutive rows. The first row would be NA (changed to 0).
cumdiff is the cumulative sum of diff after integer division by 15 (cumdiff increases each time a 15-second boundary, measured from the first row, is crossed).
The filter keeps the first row (x = NA) and any additional rows where cumdiff changes (rows where a new 15-second window begins).
Other examples that may be helpful and include data.table:
Filter rows by a time threshold
Subset observations that differ by at least 30 minutes time
Subset time series so that selected rows differs by a certain minimum time
Edit: This solution assigns times to fixed 15-second windows, which causes problems when a diff is greater than 15. In those cases it does not 'reset' and start a new 15-second window at the kept row. Instead, it keeps that time regardless of where it falls within its window. Because of this, kept times can end up close to each other, especially right after such a gap.
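The caveat can be reproduced with a few invented offsets. The sketch below is base R only and uses hypothetical data, not d_cor. It compares the fixed-window rule with a rule that restarts the 15-second clock at every kept row:

```r
# Hypothetical offsets in seconds from the first observation
secs <- c(0, 29, 31, 50)

# Fixed windows, as in the dplyr pipeline: keep the first row of each window
window <- secs %/% 15                    # window index: 0 1 2 3
keep_fixed <- c(TRUE, diff(window) > 0)
secs[keep_fixed]                         # 0 29 31 50: 29 and 31 are only 2 seconds apart

# Alternative: measure the 15 seconds from the last kept row
keep_reset <- logical(length(secs))
last_kept <- -Inf
for (i in seq_along(secs)) {
  if (secs[i] >= last_kept + 15) {
    keep_reset[i] <- TRUE
    last_kept <- secs[i]
  }
}
secs[keep_reset]                         # 0 29 50
```

Which behaviour is wanted depends on the question: fixed windows give one row per 15-second bin, while the reset rule guarantees a minimum spacing between kept rows.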