In R, How to Split Timestamp Interval Data into Regular Slots

In R, how do I split timestamp interval data into regular slots?

You might also approach this by thinking of each start_time as adding one active event and each end_time as reducing active events by one. This approach lets you identify the active events at any given instant, and it scales well. (I've used something similar to count millions of events and it's basically instantaneous.)

df2 <- df1 %>%
gather(type, time, start_date:end_date) %>%
mutate(event_chg = if_else(type == "start_date", 1, -1)) %>%
arrange(time) %>%
mutate(active_events = cumsum(event_chg))

df2
# A tibble: 4 x 5
# id type time event_chg active_events
# <dbl> <chr> <dttm> <dbl> <dbl>
#1 2 start_date 2018-12-10 13:29:37 1 1
#2 2 end_date 2018-12-10 14:02:37 -1 0
#3 1 start_date 2018-12-10 14:45:51 1 1
#4 1 end_date 2018-12-10 14:59:04 -1 0

ggplot(df2, aes(time, active_events)) + geom_step()

Sample Image

If you want to also assess the active count at regular intervals, you could integrate those intervals into your output data frame like this:

df2b <- df1 %>%
gather(type, time, start_date:end_date) %>%
mutate(event_chg = if_else(type == "start_date", 1, -1)) %>%
# NEW SECTION HERE
bind_rows(data_frame(type = "marker",
time = seq.POSIXt(ymd_h(2018121013, tz = "Australia/Brisbane"),
ymd_h(2018121016, tz = "Australia/Brisbane"),
by = 15*60), # 15 minutes of seconds = 15*60
event_chg = 0)) %>%
# END OF NEW SECTION
arrange(time) %>%
mutate(active_events = cumsum(event_chg))

Then it's possible to plot those counts directly, or filter the output data frame to see them. In this case, event id 1 occurred entirely between two 15-minute intervals.

ggplot(df2b, aes(time, active_events, label = active_events)) + 
geom_step() +
geom_point(data = df2b %>% filter(type == "marker")) +
geom_text(data = df2b %>% filter(type == "marker"), vjust = -0.5)

Sample Image

In R, how do I split & aggregate timestamp interval data with IDs into regular slots?

A tidy solution could be achieved using the tsibble package.

library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#> method from
#> [.quosures rlang
#> c.quosures rlang
#> print.quosures rlang
#> Registered S3 method overwritten by 'rvest':
#> method from
#> read_xml.response xml2
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(tsibble, warn.conflicts = FALSE)

df1 <- tibble(
id = c("a", "b", "c", "c", "c", "d", "e"),
start = c(ymd_hms("2018-12-10 13:01:00"),
ymd_hms("2018-12-10 13:07:00"),
ymd_hms("2018-12-10 14:45:00"),
ymd_hms("2018-12-10 14:48:00"),
ymd_hms("2018-12-10 14:52:00"),
ymd_hms("2018-12-10 14:45:00"),
ymd_hms("2018-12-10 14:45:00")),
end = c(ymd_hms("2018-12-10 13:05:00"),
ymd_hms("2018-12-10 13:17:00"),
ymd_hms("2018-12-10 14:46:00"),
ymd_hms("2018-12-10 14:50:00"),
ymd_hms("2018-12-10 15:01:00"),
ymd_hms("2018-12-10 14:51:00"),
ymd_hms("2018-12-10 15:59:00")))

df1 %>%
mutate(
start = floor_date(start, "15 mins"),
end = floor_date(end, "15 mins")
) %>%
gather("label", "index", start:end) %>%
distinct(id, index) %>%
mutate(date = as_date(index)) %>%
as_tsibble(key = c(id, date), index = index) %>%
fill_gaps() %>%
index_by(index) %>%
summarise(users_mac = n())
#> # A tsibble: 7 x 2 [15m] <UTC>
#> index users_mac
#> <dttm> <int>
#> 1 2018-12-10 13:00:00 2
#> 2 2018-12-10 13:15:00 1
#> 3 2018-12-10 14:45:00 3
#> 4 2018-12-10 15:00:00 2
#> 5 2018-12-10 15:15:00 1
#> 6 2018-12-10 15:30:00 1
#> 7 2018-12-10 15:45:00 1

Created on 2019-05-17 by the reprex package (v0.2.1)

Cut timestamp into numeric slots in R

Considering your dataframe as df we can use cut with breaks of 2 hours.

df$slotnumber <- cut(strptime(df$x, "%H:%M:%S"), breaks = "2 hours", 
labels = paste0("slot", 1:12))

# x slotnumber
#1 01:59:59 slot1
#2 03:59:59 slot2
#3 05:59:59 slot3
#4 07:59:59 slot4
#5 09:59:59 slot5
#6 11:59:59 slot6
#7 13:59:59 slot7
#8 15:59:59 slot8
#9 17:59:59 slot9
#10 19:59:59 slot10
#11 21:59:59 slot11
#12 23:59:59 slot12

data

df <- data.frame(x)

Speeding up a double loop R

Your data is a bit small (and incomplete) to work with, so I've generated my own:

library(tidyverse)
library(lubridate)
library(rlang)
n_items <- 1e6
sample_data <- tibble(
arrival_time = make_date(2018, 11, floor(runif(n_items, 1, 31))) +
dhours(9) + dseconds(floor(runif(n_items, 0, 6 * 60 * 60 + 1))),
attended_time = arrival_time +
dseconds(floor(runif(n_items, 0, 60 * 60 + 1))),
branch_type_client = sample(LETTERS, n_items, replace = TRUE)
)

Now we need to work out the number of people arriving and attending by each whole minute. I'm assuming someone arriving at 13:05:01 doesn't get counted until 13:06:00.

arrived <- sample_data %>% 
count(branch_type_client, time = ceiling_date(arrival_time, "minutes"))

attended <- sample_data %>%
count(branch_type_client, time = ceiling_date(attended_time, "minutes"))

Now we join the two together, fill in all of the desired date sequence, and then work out the cumulative difference between arrived and attended.

all_times <- rep(seq(ymd("2018-11-01"), ymd("2018-11-30"), by = "1 day"), each = 7 * 60 + 1) +
dhours(9) + rep(dminutes(0:(60 * 7)), 30)

queue <- full_join(arrived, attended, by = c("branch_type_client", "time"),
suffix = c("_arrived", "_attended")) %>%
complete(branch_type_client, time = all_times) %>%
replace_na(list(n_arrived = 0, n_attended = 0)) %>%
arrange(branch_type_client, time) %>%
group_by(branch_type_client) %>%
mutate(queue_length = cumsum(n_arrived - n_attended))

If you want one column per branch_type_client, you can use tidyr::spread:

queue_wide <- queue %>% 
select(time, branch_type_client, queue_length) %>%
spread(branch_type_client, queue_length)

This whole thing (including generating the sample data of a million rows) takes about 6 seconds on my 7-year-old laptop with no parallelisation.

Summarize values for overlapping time periods

Here's a tidyverse solution similar to my response to this recent question. I gather to bring the timestamps (Starts and Stops) into one column, with another column specifying which. The Starts add the value and the Stops subtract it, and then we just take the cumulative sum to get values at all the instants when the sum changes.

For 100 records, there won't be any perceivable speed improvement from using data.table; in my experience it starts to make more of a difference around 1M records, especially when grouping is involved.

library(dplyr); library(tidyr)
df2 <- df %>%
gather(type, time, Start:Stop) %>%
mutate(chg = if_else(type == "Start", Value, -Value)) %>%
arrange(time) %>%
mutate(sum = cumsum(chg)) # EDIT: corrected per OP comment

> head(df2)
## A tibble: 6 x 5
# Value type time chg sum
# <dbl> <chr> <dttm> <dbl> <dbl>
#1 1500 Start 1987-04-27 14:13:20 1500 1500
#2 4200 Stop 2019-01-02 08:00:00 -4200 -2700
#3 12610 Start 2019-01-03 16:00:00 12610 9910
#4 520 Start 2019-01-03 16:00:00 520 10430
#5 4200 Start 2019-01-03 16:00:00 4200 14630
#6 27300 Start 2019-01-03 17:00:00 27300 41930

Rounding time to nearest quarter hour

Indeed, an old question with some helpful answers so far. The last one by giraffhere seems to be the most elegant. however, ist not floor_date but round_date which will do the trick:

lubridate::round_date(x, "15 minutes") 

How to count the number of unique labels belonging to one particular column with respect to timestamp by interval of x minutes?

Here is a tidyverse solution:

# Create 2 min breakpoints by which we group times
hm <- function(x) as.POSIXct(x, format = "%H:%M")
breaks <- seq(min(hm(x$timeStamp)), max(hm(x$timeStamp)) + 120, by = '2 min');

library(tidyverse);
x %>%
mutate(
timeStamp = cut(hm(timeStamp), breaks = breaks)) %>%
count(timeStamp, label) %>%
spread(label, n)
## A tibble: 3 x 4
# timeStamp `001_T09_Submit Pa… `002_T05_SearchPat… `003_T04_Ward Lo…
# <fct> <int> <int> <int>
#1 2018-04-13 20:12:00 NA 2 4
#2 2018-04-13 20:14:00 4 4 2
#3 2018-04-13 20:16:00 NA 2 2

Explanation: We create 2 min breakpoints by which we cut the hour+minute component of timeStamp; then count by 2 min-grouped times and label, and spread from long to wide.


Sample data

x <- data.frame(
timeStamp = c("20:12:14","20:12:14","20:13:02","20:13:02","20:13:55","20:13:55","20:14:14","20:14:14","20:14:25","20:14:26","20:14:26","20:14:26","20:15:26","20:15:28","20:15:36","20:15:37","20:16:41","20:16:49","20:17:20","20:17:21"),
label = c("003_T04_Ward Login","003_T04_Ward Login","002_T05_SearchPatient","002_T05_SearchPatient","003_T04_Ward Login","003_T04_Ward Login","002_T05_SearchPatient","002_T05_SearchPatient","001_T09_Submit Payment","001_T09_Submit Payment","001_T09_Submit Payment","001_T09_Submit Payment","002_T05_SearchPatient","002_T05_SearchPatient","003_T04_Ward Login","003_T04_Ward Login","002_T05_SearchPatient","002_T05_SearchPatient","003_T04_Ward Login","003_T04_Ward Login" ))


Related Topics



Leave a reply



Submit