Creating Groups of Consecutive Days Meeting a Given Criteria

In this answer, I'll assume that the "id" field numbers the rows consecutively when sorted by increasing date, as it does in the example data. (Such a column can be created if it does not exist.)

This is an example of a technique described here and here.

1) Join the table to itself on adjacent "id" values. This pairs adjacent rows. Select rows where the "allocation" field has changed. Store the result in a temporary table, also keeping a running index.

SET @idx = 0;
CREATE TEMPORARY TABLE boundaries
SELECT
    (@idx := @idx + 1) AS idx,
    a1.date AS prev_end,
    a2.date AS next_start,
    a1.allocation AS allocation
FROM allocations a1
JOIN allocations a2
    ON (a2.id = a1.id + 1)
WHERE a1.allocation != a2.allocation;

This gives you a table having "the end of the previous period", "the start of the next period", and "the value of 'allocation' in the previous period" in each row:

+------+------------+------------+------------+
| idx  | prev_end   | next_start | allocation |
+------+------------+------------+------------+
|    1 | 2012-01-01 | 2012-01-02 |          0 |
|    2 | 2012-01-02 | 2012-01-03 |          2 |
|    3 | 2012-01-05 | 2012-01-06 |          0 |
+------+------------+------------+------------+

2) We need the start and end of each period in the same row, so we need to combine adjacent rows again. Do this by creating a second temporary table like boundaries but having an idx field 1 greater:

+------+------------+------------+
| idx  | prev_end   | next_start |
+------+------------+------------+
|    2 | 2012-01-01 | 2012-01-02 |
|    3 | 2012-01-02 | 2012-01-03 |
|    4 | 2012-01-05 | 2012-01-06 |
+------+------------+------------+

Now join on the idx field and we get the answer:

SELECT
    boundaries2.next_start AS start,
    boundaries.prev_end AS end,
    boundaries.allocation
FROM boundaries
JOIN boundaries2
    USING(idx);

+------------+------------+------------+
| start      | end        | allocation |
+------------+------------+------------+
| 2012-01-02 | 2012-01-02 |          2 |
| 2012-01-03 | 2012-01-05 |          0 |
+------------+------------+------------+

Note that this answer gets the "internal" periods correctly but misses the two "edge" periods, where allocation = 0 at the beginning and allocation = 5 at the end. Those can be pulled in using UNION clauses, but I wanted to present the core idea without that complication.
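The boundary-pairing idea above can also be sketched outside SQL. The Python below is only an illustration: the six sample rows are an assumption, chosen to be consistent with the boundaries table shown in step 1 (the original example data is not reproduced here), and the function names find_boundaries and internal_periods are hypothetical.

```python
from datetime import date

# Hypothetical sample rows (id, date, allocation), assumed sorted by date with
# consecutive ids; chosen only to be consistent with the boundaries table above.
rows = [
    (1, date(2012, 1, 1), 0),
    (2, date(2012, 1, 2), 2),
    (3, date(2012, 1, 3), 0),
    (4, date(2012, 1, 4), 0),
    (5, date(2012, 1, 5), 0),
    (6, date(2012, 1, 6), 5),
]


def find_boundaries(rows):
    """Pair each row with its successor; keep pairs where allocation changes."""
    return [
        (prev[1], nxt[1], prev[2])  # (prev_end, next_start, previous allocation)
        for prev, nxt in zip(rows, rows[1:])
        if prev[2] != nxt[2]
    ]


def internal_periods(boundaries):
    """Combine adjacent boundaries: a period starts at one boundary's
    next_start and ends at the following boundary's prev_end."""
    return [
        (earlier[1], later[0], later[2])  # (start, end, allocation)
        for earlier, later in zip(boundaries, boundaries[1:])
    ]


boundaries = find_boundaries(rows)
periods = internal_periods(boundaries)
for start, end, allocation in periods:
    print(start, end, allocation)
```

As in the SQL version, the first and last runs (the "edge" periods) are not returned by this sketch.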

Get count of consecutive days meeting a given criteria

This query will yield the counts for each row:

SELECT allocation, d, count(*) OVER (PARTITION BY allocation, part ORDER BY d) AS c
FROM (
    SELECT allocation, d,
           d - row_number() OVER (PARTITION BY allocation ORDER BY d) AS part
    FROM t
)
ORDER BY d;

You can then filter on it to find the counts for a given row:

SELECT c
FROM (
    SELECT allocation, d, count(*) OVER (PARTITION BY allocation, part ORDER BY d) AS c
    FROM (
        SELECT allocation, d,
               d - row_number() OVER (PARTITION BY allocation ORDER BY d) AS part
        FROM t
    )
)
WHERE d = DATE '2015-01-05';

Explanation:

The derived table is used to calculate different "partitions" part for each date and allocation:

SELECT allocation, d,
       d - row_number() OVER (PARTITION BY allocation ORDER BY d) AS part
FROM t

The result is:

allocation  d         part
--------------------------------
Same        01.01.15  31.12.14
Good        02.01.15  01.01.15
Same        03.01.15  01.01.15
Same        04.01.15  01.01.15
Same        05.01.15  01.01.15
Good        06.01.15  04.01.15

The concrete date produced by part is irrelevant. It's just some date that will be the same for each "group" of dates within an allocation. You can then count the number of identical values of (allocation, part) using the count(*) over(...) window function:

SELECT allocation, d, count(*) OVER (PARTITION BY allocation, part ORDER BY d) AS c
FROM (...)
ORDER BY d;

to produce the desired result.

Data

I've used the following table for the example:

CREATE TABLE t AS (
SELECT DATE '2015-01-01' AS d, 'Same' AS allocation FROM dual UNION ALL
SELECT DATE '2015-01-02' AS d, 'Good' AS allocation FROM dual UNION ALL
SELECT DATE '2015-01-03' AS d, 'Same' AS allocation FROM dual UNION ALL
SELECT DATE '2015-01-04' AS d, 'Same' AS allocation FROM dual UNION ALL
SELECT DATE '2015-01-05' AS d, 'Same' AS allocation FROM dual UNION ALL
SELECT DATE '2015-01-06' AS d, 'Good' AS allocation FROM dual
);
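
For readers who want to trace the "date minus row number" trick step by step, here is a minimal Python sketch of the same logic over the six rows of table t. It assumes one row per date; the function name running_counts is hypothetical and not part of the original answer.

```python
from collections import defaultdict
from datetime import date, timedelta

# The same six rows as the example table t: (d, allocation).
t = [
    (date(2015, 1, 1), "Same"),
    (date(2015, 1, 2), "Good"),
    (date(2015, 1, 3), "Same"),
    (date(2015, 1, 4), "Same"),
    (date(2015, 1, 5), "Same"),
    (date(2015, 1, 6), "Good"),
]


def running_counts(rows):
    """Return {d: c}, where c is the running count of consecutive days with
    the same allocation, up to and including d (assumes one row per date)."""
    row_number = defaultdict(int)  # row_number() OVER (PARTITION BY allocation ORDER BY d)
    seen = defaultdict(int)        # count(*) OVER (PARTITION BY allocation, part ORDER BY d)
    counts = {}
    for d, allocation in sorted(rows):
        row_number[allocation] += 1
        # Consecutive dates within one allocation all map to the same "part".
        part = d - timedelta(days=row_number[allocation])
        seen[(allocation, part)] += 1
        counts[d] = seen[(allocation, part)]
    return counts


counts = running_counts(t)
print(counts[date(2015, 1, 5)])
```

The lookup on the last line plays the role of the WHERE d = DATE '2015-01-05' filter in the second query.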

Create ID for specific sequence of consecutive days based on grouping variable in R

Try:

library(dplyr)

mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L))

Output:

# A tibble: 5 x 2
# Groups:   country [2]
  seq.ID country
   <int> <fct>
1      1 Angola
2      2 Angola
3      1 Benin
4      2 Benin
5      3 Benin

You can also use the .keep_all argument in distinct and preserve the first date of each sequence:

mydata %>%
  group_by(country) %>%
  distinct(seq.ID = cumsum(event_date != lag(event_date, default = first(event_date)) + 1L),
           .keep_all = TRUE)

# A tibble: 5 x 3
# Groups:   country [2]
  country event_date seq.ID
  <fct>   <date>      <int>
1 Angola  2017-06-16      1
2 Angola  2017-08-22      2
3 Benin   2019-04-18      1
4 Benin   2018-03-15      2
5 Benin   2016-03-17      3

If you want non-aggregated output with sequence IDs that are distinct across countries, you could do:

mydata %>%
  mutate(
    seq.ID = cumsum(
      (event_date != lag(event_date, default = first(event_date)) + 1L) |
        country != lag(country, default = first(country))
    )
  )

   country event_date seq.ID
1   Angola 2017-06-16      1
2   Angola 2017-06-17      1
3   Angola 2017-06-18      1
4   Angola 2017-08-22      2
5   Angola 2017-08-23      2
6    Benin 2019-04-18      3
7    Benin 2019-04-19      3
8    Benin 2019-04-20      3
9    Benin 2018-03-15      4
10   Benin 2018-03-16      4
11   Benin 2016-03-17      5

Note that there is a typo in your last event_date, which is why the outputs don't correspond 100% to your desired output.
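
The underlying idea (flag every row that breaks the run, then take a cumulative sum of the flags) is not specific to dplyr. Below is a rough Python sketch of the non-aggregated variant, using the rows from the output above; the function name sequence_ids is hypothetical.

```python
from datetime import date, timedelta
from itertools import accumulate

# Rows copied from the non-aggregated output above: (country, event_date).
mydata = [
    ("Angola", date(2017, 6, 16)),
    ("Angola", date(2017, 6, 17)),
    ("Angola", date(2017, 6, 18)),
    ("Angola", date(2017, 8, 22)),
    ("Angola", date(2017, 8, 23)),
    ("Benin", date(2019, 4, 18)),
    ("Benin", date(2019, 4, 19)),
    ("Benin", date(2019, 4, 20)),
    ("Benin", date(2018, 3, 15)),
    ("Benin", date(2018, 3, 16)),
    ("Benin", date(2016, 3, 17)),
]


def sequence_ids(rows):
    """Assign a sequence ID that increases whenever the country changes or
    the date is not exactly one day after the previous row's date."""
    if not rows:
        return []
    one_day = timedelta(days=1)
    breaks = [True]  # the first row always starts a new sequence
    for (prev_country, prev_date), (country, event_date) in zip(rows, rows[1:]):
        breaks.append(country != prev_country or event_date != prev_date + one_day)
    # Cumulative sum of the break flags, like cumsum() in the R code.
    return list(accumulate(int(b) for b in breaks))


seq_ids = sequence_ids(mydata)
print(seq_ids)
```

Like the R code, this relies on the rows already being in the order in which sequences should be detected.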

Determine consecutive days within a given window

I think this is what you're after - thank you for the table structure and sample data, but I suspect you'll have other edge cases you didn't include:

;WITH src AS -- group in case more than one 'Y' on a day
(
    SELECT ID, AlertDate
    FROM #t2
    WHERE Alert = 'Y'
    GROUP BY ID, AlertDate
),
groups AS -- make some islands of consecutive days
(
    SELECT ID, AlertDate, day_group = DATEADD(DAY,
        1-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY AlertDate), AlertDate)
    FROM src
),
agg AS -- get the count and last day in each island
(
    SELECT ID, AlertDate,
        c = COUNT(*) OVER (PARTITION BY ID, day_group),
        m = MAX(AlertDate) OVER (PARTITION BY ID, day_group)
    FROM groups
)
SELECT t1.ID, t1.SpecDate, ConsecutiveAlertDays = agg.c
FROM agg
INNER JOIN #t1 AS t1
    ON agg.m >= DATEADD(DAY, -2, t1.SpecDate)
    AND agg.m <= t1.SpecDate
    AND t1.ID = agg.ID
    AND agg.m = agg.AlertDate
ORDER BY t1.ID, t1.SpecDate;

Results:

ID  SpecDate    ConsecutiveAlertDays
------------------------------------
A   2021-05-10  3
B   2021-05-10  1
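
To make the island logic easier to follow, here is a small Python sketch of the same steps: deduplicate the 'Y' days, build islands of consecutive days, then keep islands whose last day falls within two days before the spec date. The sample rows are invented stand-ins for #t2 and #t1 (the original tables are not shown here), the function names are hypothetical, and the islands are built with a plain adjacency scan rather than the ROW_NUMBER subtraction used in the query.

```python
from datetime import date, timedelta

# Hypothetical stand-ins for #t2 (alerts) and #t1 (spec dates); not the original data.
alerts = [
    ("A", date(2021, 5, 7), "Y"),
    ("A", date(2021, 5, 8), "Y"),
    ("A", date(2021, 5, 8), "Y"),  # a second 'Y' on the same day counts once
    ("A", date(2021, 5, 9), "Y"),
    ("B", date(2021, 5, 8), "N"),
    ("B", date(2021, 5, 10), "Y"),
]
spec_dates = [("A", date(2021, 5, 10)), ("B", date(2021, 5, 10))]


def alert_islands(alerts):
    """Return (id, last_day, length) for each run of consecutive 'Y' days."""
    days = sorted({(id_, d) for id_, d, flag in alerts if flag == "Y"})
    islands = []
    for id_, d in days:
        last = islands[-1] if islands else None
        if last and last[0] == id_ and d - last[1] == timedelta(days=1):
            islands[-1] = (id_, d, last[2] + 1)  # extend the current island
        else:
            islands.append((id_, d, 1))          # start a new island
    return islands


def consecutive_alert_days(alerts, spec_dates, window_days=2):
    """Match each spec date to islands that end within window_days before it."""
    islands = alert_islands(alerts)
    return [
        (id_, spec, length)
        for id_, spec in sorted(spec_dates)
        for island_id, last_day, length in islands
        if island_id == id_
        and spec - timedelta(days=window_days) <= last_day <= spec
    ]


result = consecutive_alert_days(alerts, spec_dates)
for row in result:
    print(*row)
```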

Manipulating dates alongside consecutive results

A dplyr-based solution: create groups of consecutive occurrences of the same value in the result column, then take the first row of each group that meets the criteria:

library(dplyr)

df %>%
  mutate(samp_date = as.Date(samp_date)) %>%
  group_by(idno) %>%
  arrange(samp_date) %>%
  mutate(result_grp = cumsum(as.character(result) != lag(as.character(result), default = ""))) %>%
  group_by(idno, result_grp) %>%
  filter(result == "Negative" & (max(samp_date) - min(samp_date)) >= 30) %>%
  slice(1) %>%
  ungroup() %>%
  select(-result_grp)

# # A tibble: 2 x 3
# idno result samp_date
# <dbl> <ord> <date>
# 1 1.00 Negative 2013-10-29
# 2 2.00 Negative 2015-11-06

Group a (sometimes non consecutive) period of days in Oracle SQL

I think this should work and also be reasonably efficient (it should hit the table only once):

create table t ( d date, v number);

insert into t values (trunc(sysdate), 100);
insert into t values (trunc(sysdate+2), 100);
insert into t values (trunc(sysdate+3), 100);
insert into t values (trunc(sysdate+4), 100);
insert into t values (trunc(sysdate+5), 200);
insert into t values (trunc(sysdate+6), 200);
insert into t values (trunc(sysdate+7), 200);
insert into t values (trunc(sysdate+8), 100);

select min(d), max(d), v
from (
    select d, v,
           sum(gc) over (partition by v order by d) g
    from (
        select d, v,
               (case (d - lag(d) over (partition by v order by d))
                    when 1 then 0
                    else 1
                end) gc
        from t
    )
) group by v, g
order by min(d), v

Note that if you want to run the logic efficiently on a subset of your data, you should add the WHERE clause to the innermost SELECT. Otherwise Oracle will have trouble using any index.
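
The two nested steps of the query (flag a row when it does not directly follow the previous day with the same value, then take a running sum of the flags to number the groups) can be traced in a short Python sketch. The base date is an arbitrary stand-in for trunc(sysdate), and the function name group_periods is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

base = date(2024, 1, 1)  # arbitrary stand-in for trunc(sysdate)
offsets = [(0, 100), (2, 100), (3, 100), (4, 100),
           (5, 200), (6, 200), (7, 200), (8, 100)]
t = [(base + timedelta(days=off), v) for off, v in offsets]


def group_periods(rows):
    """Return (min_d, max_d, v) for each run of consecutive days sharing v."""
    prev_day = {}                # lag(d) over (partition by v order by d)
    group_no = defaultdict(int)  # sum(gc) over (partition by v order by d)
    bounds = {}                  # (v, g) -> [min(d), max(d)]
    for d, v in sorted(rows):
        # gc is 0 when this row directly follows the previous day for v, else 1.
        gc = 0 if v in prev_day and d - prev_day[v] == timedelta(days=1) else 1
        group_no[v] += gc
        prev_day[v] = d
        key = (v, group_no[v])
        if key in bounds:
            bounds[key][1] = d
        else:
            bounds[key] = [d, d]
    return sorted((lo, hi, v) for (v, _), (lo, hi) in bounds.items())


periods = group_periods(t)
for lo, hi, v in periods:
    print(lo, hi, v)
```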

How to check if conditions are met in consecutive days (in R)

Use packages lubridate and dplyr to group hours by day

This should work if you replace your_data with whatever you've called your data.

Are you sure your conditions are correct? It feels like it should be 80 items sold over two days, and at least three hours in which over 7 sales are made.

Note that your problem is a little ambiguous - I've treated it as 'a total of n sales over two days', rather than 'at least n sales per day for two days'. If you want the latter, you'll need to tweak this code.

The %>% operator here is a pipe: it takes whatever precedes it and makes it the first argument of the next function. So x %>% f(y = 2) is equivalent to f(x, y = 2).

# You'll need these packages
require(dplyr)
require(lubridate)

### Creating a 'daily sales' column
your_data %>%
  ## Group the hours into days w/ lubridate::ymd()
  group_by(lubridate::ymd(date_time)) %>%
  ## Create two new columns: items sold per day, and total hours
  ## in the day that sales exceeded 80
  mutate(sales_today = sum(items_sold),
         hours_exceeding = sum(items_sold >= 80)) %>%
  # ungroup and overwrite
  ungroup -> your_data

### Get columns for previous two days
your_data %>%
  group_by(lubridate::ymd(date_time)) %>% # still 'thinking in days'
  ## 'shift' existing columns to compare each day with previous two
  mutate(sales_yesterday = lag(sales_today, n = 1),
         sales_day_before = lag(sales_today, n = 2),
         hours_exceeding_yesterday = lag(hours_exceeding, n = 1),
         hours_exceeding_day_before = lag(hours_exceeding, n = 2)
  ) %>%
  ## Use new columns to get totals for previous two days
  mutate(sales_last_two_days =
           sales_yesterday + sales_day_before,
         hours_exceeding_last_two_days =
           hours_exceeding_yesterday + hours_exceeding_day_before) %>%
  ## Drop unwanted columns
  select(-sales_yesterday, -sales_day_before,
         -hours_exceeding_yesterday, -hours_exceeding_day_before) %>%
  ## Create a logical which indicates that the criteria are met
  mutate(meets_criteria =
           hours_exceeding_last_two_days >= 3 &
           sales_last_two_days >= 7) %>%
  ungroup -> your_data

How to subset consecutive rows if they meet a condition

An approach with data.table which is slightly different from @jlhoward's approach (using the same data):

library(data.table)

setDT(df)
df[, hotday := +(MAX>=44.5 & MIN>=24.5)
][, hw.length := with(rle(hotday), rep(lengths,lengths))
][hotday == 0, hw.length := 0]

This produces a data.table with a heat wave length variable (hw.length) instead of a TRUE/FALSE variable for a specific heat wave length:

> df
    YEAR MONTH DAY  MAX  MIN hotday hw.length
 1: 1989     7  18 45.0 23.5      0         0
 2: 1989     7  19 44.2 26.1      0         0
 3: 1989     7  20 44.7 24.4      0         0
 4: 1989     7  21 44.6 29.5      1         1
 5: 1989     7  22 44.4 31.6      0         0
 6: 1989     7  23 44.2 26.7      0         0
 7: 1989     7  24 44.5 25.0      1         3
 8: 1989     7  25 44.8 26.0      1         3
 9: 1989     7  26 44.8 24.6      1         3
10: 1989     7  27 45.0 24.3      0         0
11: 1989     7  28 44.8 26.0      1         1
12: 1989     7  29 44.4 24.0      0         0
13: 1989     7  30 45.2 25.0      1         1
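
The run-length encoding step (rle plus rep) has a close analogue in Python's itertools.groupby. The sketch below recomputes hotday and hw.length from the MAX and MIN columns of the table above; the function name heat_wave_lengths and its parameter names are hypothetical.

```python
from itertools import groupby

# (MAX, MIN) temperatures copied from the printed table above.
temps = [
    (45.0, 23.5), (44.2, 26.1), (44.7, 24.4), (44.6, 29.5), (44.4, 31.6),
    (44.2, 26.7), (44.5, 25.0), (44.8, 26.0), (44.8, 24.6), (45.0, 24.3),
    (44.8, 26.0), (44.4, 24.0), (45.2, 25.0),
]


def heat_wave_lengths(temps, max_threshold=44.5, min_threshold=24.5):
    """Return (hotday, hw_length): a 0/1 flag per day and, for hot days,
    the length of the run of hot days that the day belongs to."""
    hotday = [int(mx >= max_threshold and mn >= min_threshold) for mx, mn in temps]
    hw_length = []
    for is_hot, run in groupby(hotday):  # consecutive equal flags form one run
        n = len(list(run))
        hw_length.extend([n if is_hot else 0] * n)
    return hotday, hw_length


hotday, hw_length = heat_wave_lengths(temps)
print(hw_length)
```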

