Subset Observations That Differ by at Least 30 Minutes Time

Here's what I would do:

library(data.table)

setDT(DT, key = c("id", "datetime")) # invalid selfref with the OP's example data

s = 0L
w = DT[, .I[1L], by = id]$V1   # row numbers of the first row per id

while (length(w)) {
  s = s + 1L
  DT[w, tag := s]              # tag the current set of kept rows

  # for each tagged row, roll forward to the first row per id that is
  # at least 30 minutes later; NA means no such row exists
  m = DT[w, .(id, datetime = datetime + 30*60)]
  w = DT[m, which = TRUE, roll = -Inf]
  w = w[!is.na(w)]
}

which gives

               datetime          x id  keep tag
 1: 2016-04-28 10:20:18 0.02461368  1  TRUE   1
 2: 2016-04-28 10:41:34 0.88953932  1 FALSE  NA
 3: 2016-04-28 10:46:07 0.31818101  1 FALSE  NA
 4: 2016-04-28 11:00:56 0.14711365  1  TRUE   2
 5: 2016-04-28 11:09:11 0.54406602  1 FALSE  NA
 6: 2016-04-28 11:39:09 0.69280341  1  TRUE   3
 7: 2016-04-28 11:50:01 0.99426978  1 FALSE  NA
 8: 2016-04-28 11:51:46 0.47779597  1 FALSE  NA
 9: 2016-04-28 11:57:58 0.23162579  1 FALSE  NA
10: 2016-04-28 11:58:23 0.96302423  1 FALSE  NA
11: 2016-04-28 10:13:19 0.21640794  2  TRUE   1
12: 2016-04-28 10:13:44 0.70853047  2 FALSE  NA
13: 2016-04-28 10:36:44 0.75845954  2 FALSE  NA
14: 2016-04-28 10:55:31 0.64050681  2  TRUE   2
15: 2016-04-28 11:00:33 0.90229905  2 FALSE  NA
16: 2016-04-28 11:11:51 0.28915974  2 FALSE  NA
17: 2016-04-28 11:14:14 0.79546742  2 FALSE  NA
18: 2016-04-28 11:26:17 0.69070528  2  TRUE   3
19: 2016-04-28 11:51:02 0.59414202  2 FALSE  NA
20: 2016-04-28 11:56:36 0.65570580  2  TRUE   4

The idea behind it is described by the OP in a comment:

Per id the first row is always kept. The next row that is at least 30 minutes after the first shall also be kept. Let's assume that row to be kept is row 4. Then, compute time differences between row 4 and rows 5:n, keep the first that differs by more than 30 minutes, and so on.
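
For intuition, the same keep rule can be written as a plain per-group loop (a minimal sketch, not the OP's code; keep_30min is a made-up helper name, and datetime is assumed to be sorted POSIXct):

keep_30min <- function(dt) {
  # keep the first timestamp, then each next timestamp at least
  # 30 minutes after the most recently kept one
  keep <- logical(length(dt))
  keep[1] <- TRUE
  last_kept <- dt[1]
  for (i in seq_along(dt)[-1]) {
    if (difftime(dt[i], last_kept, units = "mins") >= 30) {
      keep[i] <- TRUE
      last_kept <- dt[i]
    }
  }
  keep
}

# e.g. DT[, keep2 := keep_30min(datetime), by = id]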


Filter rows by a time threshold

There may be a more elegant way to do it, but this works:

library(dplyr)

# Keep a row only if it is at least one hour (60 * 60 s) after the last
# kept row. Iterating over a POSIXct vector yields numeric seconds, so
# the comparison below is numeric.
isHourApart <- function(dt) {
  lastKept <- 0
  keeps <- logical(0)
  for (d in dt) {
    if (d >= lastKept + 60 * 60) {
      lastKept <- d
      keeps <- c(keeps, TRUE)
    } else {
      keeps <- c(keeps, FALSE)
    }
  }
  keeps
}

df <- df %>%
  group_by(Species) %>%
  filter(isHourApart(DateTime))

> df
# A tibble: 5 x 3
# Groups:   Species [2]
  ID    Species DateTime
  <chr> <fct>   <dttm>
1 P1    A       2015-03-16 18:42:00
2 P3    A       2015-03-16 19:58:00
3 P4    A       2015-03-16 21:02:00
4 P5    B       2015-03-16 21:18:00
5 P9    B       2015-03-16 23:43:00

Note that the DateTime column is of class POSIXct.
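
If DateTime arrives as character rather than POSIXct, convert it before filtering; a small sketch (the format string is an assumption based on the output shown above):

df$DateTime <- as.POSIXct(df$DateTime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")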

Adaptive division in time intervals for a set of observations

Here is a solution, with a good old for loop:

# time difference in minutes between consecutive rows; the units are stated
# explicitly because diff() on POSIXct chooses its units automatically
df$difftime <- c(0, as.numeric(diff(df$DateTime), units = "mins"))

df$group <- 1
time_in_group <- 0
for (i in seq.int(2, nrow(df))) {
  time_in_group <- time_in_group + df$difftime[i]
  if (time_in_group < 10) {
    df$group[i] <- df$group[i - 1]
  } else {
    # 10 minutes have accumulated: start a new group
    time_in_group <- 0
    df$group[i] <- 1 + df$group[i - 1]
  }
}
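
Once group is assigned, the adaptive intervals are easy to inspect; for example (base R only):

# number of observations in each adaptive interval
table(df$group)

# first and last timestamp of each interval
tapply(df$DateTime, df$group, range)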

Identify discrete events based on a time difference of 30 minutes or more in R

Here is an option

library(tidyverse)

df %>%
  mutate(
    timestampUTC = as.POSIXct(timestampUTC),
    # gap to the previous row, in minutes (units stated explicitly)
    diff = c(0, as.numeric(diff(timestampUTC), units = "mins")),
    # every gap over 30 minutes starts a new group
    grp = cumsum(diff > 30)) %>%
  group_by(grp) %>%
  summarise(
    ID = first(ID),
    location = first(location),
    `event start` = first(timestampUTC),
    `event end` = last(timestampUTC))
# A tibble: 7 x 5
#    grp ID             location `event start`       `event end`
#  <int> <fct>          <fct>    <dttm>              <dttm>
#1     0 A69-1601-47272 JB12     2017-10-02 19:23:27 2017-10-02 19:31:46
#2     1 A69-1601-47272 JB12     2017-10-02 23:52:15 2017-10-02 23:55:13
#3     2 A69-1601-47272 JB13     2017-10-03 19:53:50 2017-10-03 19:58:26
#4     3 A69-1601-47280 JB12     2017-10-04 13:15:13 2017-10-04 13:21:39
#5     4 A69-1601-47280 JB12     2017-10-04 19:34:54 2017-10-04 20:21:43
#6     5 A69-1601-47280 JB13     2017-10-05 04:55:48 2017-10-05 05:18:40
#7     6 A69-1601-47280 JB13     2017-10-07 21:24:19 2017-10-07 21:29:25

I've kept some of the intermediate columns to help with readability and understanding. In short, we convert the timestamps to POSIXct, calculate the time difference in minutes between successive timestamps with diff, and start a new group of observations whenever the gap to the previous timestamp is > 30 minutes. The rest is grouping by grp and summarising entries from the relevant columns.
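
The grouping trick is easiest to see in isolation; here is a toy sketch with made-up gaps in minutes:

gaps <- c(0, 3, 1, 268, 2, 45)  # minutes between successive timestamps
cumsum(gaps > 30)
# [1] 0 0 0 1 1 2
# each gap over 30 minutes starts a new group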


The same, more succinct (perhaps at the expense of readability)

df %>%
  group_by(grp = cumsum(c(0, as.numeric(diff(as.POSIXct(timestampUTC)), units = "mins")) > 30)) %>%
  summarise(
    ID = first(ID),
    location = first(location),
    `event start` = first(timestampUTC),
    `event end` = last(timestampUTC)) %>%
  select(-grp)

Sample data

df <- read.table(text =
"timestampUTC location ID
'2017-10-02 19:23:27' JB12 A69-1601-47272
'2017-10-02 19:26:48' JB12 A69-1601-47272
'2017-10-02 19:27:23' JB12 A69-1601-47272
'2017-10-02 19:31:46' JB12 A69-1601-47272
'2017-10-02 23:52:15' JB12 A69-1601-47272
'2017-10-02 23:53:26' JB12 A69-1601-47272
'2017-10-02 23:55:13' JB12 A69-1601-47272
'2017-10-03 19:53:50' JB13 A69-1601-47272
'2017-10-03 19:55:23' JB13 A69-1601-47272
'2017-10-03 19:58:26' JB13 A69-1601-47272
'2017-10-04 13:15:13' JB12 A69-1601-47280
'2017-10-04 13:16:42' JB12 A69-1601-47280
'2017-10-04 13:21:39' JB12 A69-1601-47280
'2017-10-04 19:34:54' JB12 A69-1601-47280
'2017-10-04 19:55:28' JB12 A69-1601-47280
'2017-10-04 20:08:23' JB12 A69-1601-47280
'2017-10-04 20:21:43' JB12 A69-1601-47280
'2017-10-05 04:55:48' JB13 A69-1601-47280
'2017-10-05 04:57:04' JB13 A69-1601-47280
'2017-10-05 05:18:40' JB13 A69-1601-47280
'2017-10-07 21:24:19' JB13 A69-1601-47280
'2017-10-07 21:25:36' JB13 A69-1601-47280
'2017-10-07 21:29:25' JB13 A69-1601-47280", header = TRUE)

data.table time subset vs xts time subset

If you're ok with specifying your range in UTC, you can do:

library(xts)
library(data.table)  # for %between%

# example data on the scale of the OP's benchmark
j <- xts(rnorm(10e6), Sys.time() - (10e6:1))

j[(.index(j) %% 86400) %between% c(10*3600, 16*3600 + 60)]
# +60 because xts includes that minute; you'll need to offset the times
# appropriately to match with xts unless you live in UTC :)

system.time(j[(.index(j) %% 86400) %between% c(10*3600, 16*3600 + 60)])
#  user  system elapsed
#  1.17    0.08    1.25
# likely faster on your machine as mine takes minutes to run the OP bench
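
If your series is stamped in a local timezone rather than UTC, one way to adjust is to shift the window bounds by the UTC offset (a sketch, assuming your platform fills in the gmtoff field of POSIXlt and the shifted window does not wrap past midnight):

off <- as.POSIXlt(Sys.time())$gmtoff  # local offset from UTC, in seconds
j[(.index(j) %% 86400) %between% (c(10*3600, 16*3600 + 60) - off)]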

Is there an R function to filter a dataset in 15-second intervals?

One potential solution is with dplyr, though there may be better options available, especially with data.table. As suggested by @42- and demonstrated by @Maurits Evers, you can do the following:

library(dplyr)

d_cor %>%
  arrange(Time) %>%
  mutate(
    diff = abs(lag(Time) - Time),           # seconds since the previous row
    diff = ifelse(is.na(diff), 0, diff),    # the first row has no predecessor
    cumdiff = cumsum(diff) %/% 15,          # index of the 15-second window
    x = abs(lag(cumdiff) - cumdiff)) %>%    # did the window index change?
  filter(is.na(x) | x > 0) %>%
  select(Depth, Time)

  Depth                Time
1   0.1 2018-06-24 01:26:40
2   0.2 2018-06-24 01:26:56
3   0.1 2018-06-24 01:27:14
4   0.1 2018-06-24 01:27:30

diff holds the time difference in seconds between consecutive rows; the first row's NA is replaced with 0.

cumdiff is the cumulative sum of diff after floor division by 15, so it increases by (at least) 1 once another 15 seconds have accumulated.

The filter keeps the first row (where x is NA) and every row at which cumdiff changes, i.e. rows where at least another 15 seconds have elapsed in the running total.

Other examples that may be helpful and include data.table:

Filter rows by a time threshold

Subset observations that differ by at least 30 minutes time

Subset time series so that selected rows differs by a certain minimum time

Edit: This solution assigns times to fixed 15-second windows measured from the first observation. That causes problems when a gap is greater than 15 seconds: the windowing does not 'reset' at the newly kept row; the row is simply kept in whatever fixed window it happens to fall into. As a result, kept rows can end up close together, particularly right after a large gap.
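
A quick illustration of that caveat, with made-up gaps of 0, 29, and 2 seconds between rows:

gaps <- c(0, 29, 2)
cumsum(gaps) %/% 15
# [1] 0 1 2
# all three rows fall in different 15-second windows, so rows 2 and 3 are
# both kept even though they are only 2 seconds apart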


