How to Group Timestamps into Islands (Based on Arbitrary Gap)

How to group timestamps into islands (based on arbitrary gap)?

This would do it:

SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
SELECT done
, (lag(done) OVER (ORDER BY done) <= done - interval '2 min') AS step
FROM tbl
) sub
ORDER BY done;

The subquery sub records step as true if the previous row is at least 2 min away - sorted by the timestamp column done itself in this case.

The outer query adds a rolling count of steps, effectively the group number (grp) - combining the aggregate FILTER clause with another window function.

db<>fiddle here

Related:

  • Query to find all timestamps more than a certain interval apart
  • How to label groups in postgresql when group belonging depends on the preceding line?
  • Select longest continuous sequence
  • Grouping or Window

About the aggregate FILTER clause:

  • How can I simplify this game statistics query?
  • Conditional lead/lag function PostgreSQL?

How to form groups of consecutive dates allowing for a given maximum gap?

Counting the gaps (greater than given tolerance) in a second window function forms the group numbers you are after:

SELECT person_id, contact_day
, count(*) FILTER (WHERE gap > 3) OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM (
SELECT person_id, contact_day
, contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
FROM mydata
) sub
ORDER BY person_id, contact_day; -- optional

db<>fiddle here

About the aggregate FILTER clause:

  • Aggregate columns with additional (distinct) filters

It's short and intuitive, and typically fastest. See:

  • For absolute performance, is SUM faster or COUNT?

It's the classic topic of "gaps and islands". Once you know to look for the tag gaps-and-islands, you'll find plenty of related or nigh identical questions and answers like:

  • Select longest continuous sequence
  • How to group timestamps into islands (based on arbitrary gap)?
  • How to label groups in postgresql when group belonging depends on the preceding line?

Etc.

I tagged accordingly now.

How to merge intervals based on a minimum gap of dates?

This work in Postgres 14. Not tested for Redshift.

SELECT ptid, visit_start_date, visit_end_date, admin_startdate, admin_enddate
, min(admin_startdate) OVER (PARTITION BY visit_id, admin) AS episode_startdate
, max(admin_enddate) OVER (PARTITION BY visit_id, admin) AS episode_enddate
FROM (
SELECT *, count(*) FILTER (WHERE gap) OVER (PARTITION BY visit_id ORDER BY admin_startdate) AS admin
FROM (
SELECT *, admin_startdate - lag(admin_enddate) OVER (PARTITION BY visit_id ORDER BY admin_startdate) > 2 AS gap
FROM (
SELECT *, dense_rank() OVER (ORDER BY ptid, visit_start_date, visit_end_date) AS visit_id -- optional, to simplify
FROM tbl
) sub1
) sub2
) sub3

db<>fiddle here

The innermost subquery sub1 is only to compute a unique visit_id - which should really be in your table instead of repeating (ptid, visit_start_date, visit_end_date ) over and over. Consider normalizing your design at least that much.

The next subquery sub2 checks for a gap that's greater than two days to the previous row in the same partition.

Subquery sub3 then counts those gaps to identify distinct administration periods (admin)

In the outer SELECT, min(admin_startdate) and max(admin_enddate) per administration period produce the desired episode dates.

See (with assorted links to more):

  • How to group timestamps into islands (based on arbitrary gap)?

Query to find all timestamps more than a certain interval apart

To start a new session after every gap >= 1 hour:

SELECT user_id, count(*) AS distinct_sessions
FROM (
SELECT user_id
,(lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
ORDER BY request_time)
<= request_time - '1h'::interval) AS step -- start new session
FROM tbl
) sub
WHERE step
GROUP BY user_id
ORDER BY user_id;

Assuming request_time NOT NULL.

Explain:

  • In subquery sub, check for every row if a new session begins. Using the third parameter of lag() to provide the default -infinity, which is lower than any timestamp and therefore always starts a new session for the first row.

  • In the outer query count how many times new sessions started. Eliminate step = FALSE and count per user.

Alternative interpretation

If you really wanted to count hours where at least one request happened (I don't think you do, but another answer assumes as much), you would:

SELECT user_id
, count(DISTINCT date_trunc('hour', request_time)) AS hours_with_req
FROM tbl
GROUP BY 1
ORDER BY 1;

Forming groups of spatio-temporally near trajectories in R or PostgreSQL

UPDATE: [after understanding the real question ;-] Finding the equivalence groups of bikes (set, bike_set) is in fact a relational-division problem. Finding the begin and end of the segments (clust) within a set of bikes is basically the same as in the first attempt.

  • The clusters are stored in arrays: (I trust on the clusters not becoming too large)
  • The array is built by a recursive query: every pair of bikes that has one member in common with the current cluster is merged into it.
  • At the end, the arrays contain all bike_id's that happened to be within reach at the particular time.
  • (plus some intermediate rows which need to be suppressed later by the uniq CTE)
  • The rest is more-or-less standard gap-detection in time-series.

NOTE: the code trusts on (bike2 > bike1). This is needed to keep the array sorted and thus canonical. The actual content is not guaranteed to be canonical because the order of addition in the recursive query cannot be guaranteed. This may need some extra work.


CREATE TABLE nearness
( bike1 INTEGER NOT NULL
, bike2 INTEGER NOT NULL
, stamp timestamp NOT NULL
, near boolean
, PRIMARY KEY(bike1,bike2,stamp)
);
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
(1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE) -- <<-- these False-records serve no pupose
,(4,5, '2016-05-28 11:00:00', FALSE) -- <<-- result would be the same without them
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
;

-- Recursive union-find to glue together sets of bike_ids
-- ,occuring at the same moment.
-- Sets are represented as {ordered,unique} arrays here
WITH RECURSIVE wood AS (
WITH omg AS (
SELECT bike1 ,bike2,stamp
, row_number() OVER(ORDER BY bike1,bike2,stamp) AS seq
, ARRAY[bike1,bike2]::integer[] AS arr
FROM nearness n WHERE near = True
)
-- Find all existing combinations of bikes
SELECT o1.stamp, o1.seq
, ARRAY[o1.bike1,o1.bike2]::integer[] AS arr
FROM omg o1
UNION ALL
SELECT o2.stamp, o2.seq -- avoid duplicates inside the array
, CASE when o2.bike1 = ANY(w.arr) THEN w.arr || o2.bike2
ELSE w.arr || o2.bike1 END AS arr
FROM omg o2
JOIN wood w
ON o2.stamp = w.stamp AND o2.seq > w.seq
AND (o2.bike1 = ANY(w.arr) OR o2.bike2 = ANY(w.arr))
AND NOT (o2.bike1 = ANY(w.arr) AND o2.bike2 = ANY(w.arr))
)
, uniq AS ( -- suppress partial sets caused by the recursive union-find buildup
SELECT * FROM wood w
WHERE NOT EXISTS (SELECT * FROM wood nx
WHERE nx.stamp = w.stamp
AND nx.arr @> w.arr AND nx.arr <> w.arr -- contains but not equal
)
)
, xsets AS ( -- make unique sets of bikes
SELECT DISTINCT arr
-- , MIN(seq) AS grp
FROM uniq
GROUP BY arr
)
, sets AS ( -- enumerate the sets of bikes
SELECT arr
, row_number() OVER () AS setnum
FROM xsets
)
, drag AS ( -- Detect beginning and end of segments of consecutive observations
SELECT u.* -- within a constant set of bike_ids
-- Edge-detection begin of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp < u.stamp
AND nx.stamp >= u.stamp - '5 sec'::interval
) AS is_first
-- Edge-detection end of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp > u.stamp
AND nx.stamp <= u.stamp + '5 sec'::interval
) AS is_last
, row_number() OVER(ORDER BY arr,stamp) AS nseq
FROM uniq u
)
, top AS ( -- id and groupnum for the start of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_first
)
, bot AS ( -- id and groupnum for the end of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_last
)
SELECT w.seq as orgseq -- results, please ...
, w.stamp
, g0.clust AS clust
, row_number() OVER(www) AS rn
, s.setnum, s.arr AS bike_set
FROM drag w
JOIN sets s ON s.arr = w.arr
JOIN top g0 ON g0.nseq <= w.seq
JOIN bot g1 ON g1.nseq >= w.seq AND g1.clust = g0.clust
WINDOW www AS (PARTITION BY g1.clust ORDER BY w.stamp)
ORDER BY g1.clust, w.stamp
;

Result:


 orgseq |        stamp        | clust | rn | setnum | bike_set 
--------+---------------------+-------+----+--------+----------
1 | 2016-05-28 11:00:00 | 1 | 1 | 1 | {1,2}
4 | 2016-05-28 11:00:20 | 3 | 1 | 1 | {1,2}
5 | 2016-05-28 11:00:25 | 3 | 2 | 1 | {1,2}
6 | 2016-05-28 11:00:05 | 4 | 1 | 3 | {1,2,3}
7 | 2016-05-28 11:00:10 | 4 | 2 | 3 | {1,2,3}
8 | 2016-05-28 11:00:10 | 4 | 3 | 2 | {4,5}
(6 rows)

Correct way to eliminate duplicates from a window query

You need more sophistication if version downgrades (or NULL values?) are possible:

SELECT min(version) AS version, min(date), max(date)
FROM (
SELECT version, date
, count(*) FILTER (WHERE step IS NOT FALSE) OVER (ORDER BY date) AS grp
FROM (
SELECT version, date
, lag(version) OVER (ORDER BY date) <> version AS step
FROM cluster_info
WHERE cluster_id = '0f4ce21e-0d08-11ec-b209-0242c0a8c004'
ORDER BY date
) sub1
) sub2
GROUP BY grp;

db<>fiddle here (sample data extended with version downgrade and unknown version)

See (with detailed explanation and links to more):

  • Select longest continuous sequence
  • How to group timestamps into islands (based on arbitrary gap)?


Related Topics



Leave a reply



Submit