How to group timestamps into islands (based on arbitrary gap)?
This would do it:
SELECT done, count(*) FILTER (WHERE step) OVER (ORDER BY done) AS grp
FROM (
SELECT done
, (lag(done) OVER (ORDER BY done) <= done - interval '2 min') AS step
FROM tbl
) sub
ORDER BY done;
The subquery sub
records step
as true
if the previous row is at least 2 min away - sorted by the timestamp column done
itself in this case.
The outer query adds a rolling count of steps, effectively the group number (grp
) - combining the aggregate FILTER
clause with another window function.
db<>fiddle here
Related:
- Query to find all timestamps more than a certain interval apart
- How to label groups in postgresql when group belonging depends on the preceding line?
- Select longest continuous sequence
- Grouping or Window
About the aggregate FILTER
clause:
- How can I simplify this game statistics query?
- Conditional lead/lag function PostgreSQL?
How to form groups of consecutive dates allowing for a given maximum gap?
Counting the gaps (greater than given tolerance) in a second window function forms the group numbers you are after:
SELECT person_id, contact_day
, count(*) FILTER (WHERE gap > 3) OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM (
SELECT person_id, contact_day
, contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
FROM mydata
) sub
ORDER BY person_id, contact_day; -- optional
db<>fiddle here
About the aggregate FILTER
clause:
- Aggregate columns with additional (distinct) filters
It's short and intuitive, and typically fastest. See:
- For absolute performance, is SUM faster or COUNT?
It's the classic topic of "gaps and islands". Once you know to look for the tag gaps-and-islands, you'll find plenty of related or nigh identical questions and answers like:
- Select longest continuous sequence
- How to group timestamps into islands (based on arbitrary gap)?
- How to label groups in postgresql when group belonging depends on the preceding line?
Etc.
I tagged accordingly now.
How to merge intervals based on a minimum gap of dates?
This work in Postgres 14. Not tested for Redshift.
SELECT ptid, visit_start_date, visit_end_date, admin_startdate, admin_enddate
, min(admin_startdate) OVER (PARTITION BY visit_id, admin) AS episode_startdate
, max(admin_enddate) OVER (PARTITION BY visit_id, admin) AS episode_enddate
FROM (
SELECT *, count(*) FILTER (WHERE gap) OVER (PARTITION BY visit_id ORDER BY admin_startdate) AS admin
FROM (
SELECT *, admin_startdate - lag(admin_enddate) OVER (PARTITION BY visit_id ORDER BY admin_startdate) > 2 AS gap
FROM (
SELECT *, dense_rank() OVER (ORDER BY ptid, visit_start_date, visit_end_date) AS visit_id -- optional, to simplify
FROM tbl
) sub1
) sub2
) sub3
db<>fiddle here
The innermost subquery sub1
is only to compute a unique visit_id
- which should really be in your table instead of repeating (ptid, visit_start_date, visit_end_date )
over and over. Consider normalizing your design at least that much.
The next subquery sub2
checks for a gap that's greater than two days to the previous row in the same partition.
Subquery sub3
then counts those gaps to identify distinct administration periods (admin
)
In the outer SELECT
, min(admin_startdate)
and max(admin_enddate)
per administration period produce the desired episode dates.
See (with assorted links to more):
- How to group timestamps into islands (based on arbitrary gap)?
Query to find all timestamps more than a certain interval apart
To start a new session after every gap >= 1 hour:
SELECT user_id, count(*) AS distinct_sessions
FROM (
SELECT user_id
,(lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
ORDER BY request_time)
<= request_time - '1h'::interval) AS step -- start new session
FROM tbl
) sub
WHERE step
GROUP BY user_id
ORDER BY user_id;
Assuming request_time NOT NULL
.
Explain:
In subquery
sub
, check for every row if a new session begins. Using the third parameter oflag()
to provide the default-infinity
, which is lower than any timestamp and therefore always starts a new session for the first row.In the outer query count how many times new sessions started. Eliminate
step = FALSE
and count per user.
Alternative interpretation
If you really wanted to count hours where at least one request happened (I don't think you do, but another answer assumes as much), you would:
SELECT user_id
, count(DISTINCT date_trunc('hour', request_time)) AS hours_with_req
FROM tbl
GROUP BY 1
ORDER BY 1;
Forming groups of spatio-temporally near trajectories in R or PostgreSQL
UPDATE: [after understanding the real question ;-] Finding the equivalence groups of bikes (set, bike_set) is in fact a relational-division problem. Finding the begin and end of the segments (clust) within a set of bikes is basically the same as in the first attempt.
- The clusters are stored in arrays: (I trust on the clusters not becoming too large)
- The array is built by a recursive query: every pair of bikes that has one member in common with the current cluster is merged into it.
- At the end, the arrays contain all bike_id's that happened to be within reach at the particular time.
- (plus some intermediate rows which need to be suppressed later by the
uniq
CTE) - The rest is more-or-less standard gap-detection in time-series.
NOTE: the code trusts on (bike2 > bike1)
. This is needed to keep the array sorted and thus canonical. The actual content is not guaranteed to be canonical because the order of addition in the recursive query cannot be guaranteed. This may need some extra work.
CREATE TABLE nearness
( bike1 INTEGER NOT NULL
, bike2 INTEGER NOT NULL
, stamp timestamp NOT NULL
, near boolean
, PRIMARY KEY(bike1,bike2,stamp)
);
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
(1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE) -- <<-- these False-records serve no pupose
,(4,5, '2016-05-28 11:00:00', FALSE) -- <<-- result would be the same without them
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
;
-- Recursive union-find to glue together sets of bike_ids
-- ,occuring at the same moment.
-- Sets are represented as {ordered,unique} arrays here
WITH RECURSIVE wood AS (
WITH omg AS (
SELECT bike1 ,bike2,stamp
, row_number() OVER(ORDER BY bike1,bike2,stamp) AS seq
, ARRAY[bike1,bike2]::integer[] AS arr
FROM nearness n WHERE near = True
)
-- Find all existing combinations of bikes
SELECT o1.stamp, o1.seq
, ARRAY[o1.bike1,o1.bike2]::integer[] AS arr
FROM omg o1
UNION ALL
SELECT o2.stamp, o2.seq -- avoid duplicates inside the array
, CASE when o2.bike1 = ANY(w.arr) THEN w.arr || o2.bike2
ELSE w.arr || o2.bike1 END AS arr
FROM omg o2
JOIN wood w
ON o2.stamp = w.stamp AND o2.seq > w.seq
AND (o2.bike1 = ANY(w.arr) OR o2.bike2 = ANY(w.arr))
AND NOT (o2.bike1 = ANY(w.arr) AND o2.bike2 = ANY(w.arr))
)
, uniq AS ( -- suppress partial sets caused by the recursive union-find buildup
SELECT * FROM wood w
WHERE NOT EXISTS (SELECT * FROM wood nx
WHERE nx.stamp = w.stamp
AND nx.arr @> w.arr AND nx.arr <> w.arr -- contains but not equal
)
)
, xsets AS ( -- make unique sets of bikes
SELECT DISTINCT arr
-- , MIN(seq) AS grp
FROM uniq
GROUP BY arr
)
, sets AS ( -- enumerate the sets of bikes
SELECT arr
, row_number() OVER () AS setnum
FROM xsets
)
, drag AS ( -- Detect beginning and end of segments of consecutive observations
SELECT u.* -- within a constant set of bike_ids
-- Edge-detection begin of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp < u.stamp
AND nx.stamp >= u.stamp - '5 sec'::interval
) AS is_first
-- Edge-detection end of group
, NOT EXISTS (SELECT * FROM uniq nx
WHERE nx.arr = u.arr
AND nx.stamp > u.stamp
AND nx.stamp <= u.stamp + '5 sec'::interval
) AS is_last
, row_number() OVER(ORDER BY arr,stamp) AS nseq
FROM uniq u
)
, top AS ( -- id and groupnum for the start of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_first
)
, bot AS ( -- id and groupnum for the end of a group
SELECT nseq
, row_number() OVER () AS clust
FROM drag
WHERE is_last
)
SELECT w.seq as orgseq -- results, please ...
, w.stamp
, g0.clust AS clust
, row_number() OVER(www) AS rn
, s.setnum, s.arr AS bike_set
FROM drag w
JOIN sets s ON s.arr = w.arr
JOIN top g0 ON g0.nseq <= w.seq
JOIN bot g1 ON g1.nseq >= w.seq AND g1.clust = g0.clust
WINDOW www AS (PARTITION BY g1.clust ORDER BY w.stamp)
ORDER BY g1.clust, w.stamp
;
Result:
orgseq | stamp | clust | rn | setnum | bike_set
--------+---------------------+-------+----+--------+----------
1 | 2016-05-28 11:00:00 | 1 | 1 | 1 | {1,2}
4 | 2016-05-28 11:00:20 | 3 | 1 | 1 | {1,2}
5 | 2016-05-28 11:00:25 | 3 | 2 | 1 | {1,2}
6 | 2016-05-28 11:00:05 | 4 | 1 | 3 | {1,2,3}
7 | 2016-05-28 11:00:10 | 4 | 2 | 3 | {1,2,3}
8 | 2016-05-28 11:00:10 | 4 | 3 | 2 | {4,5}
(6 rows)
Correct way to eliminate duplicates from a window query
You need more sophistication if version downgrades (or NULL values?) are possible:
SELECT min(version) AS version, min(date), max(date)
FROM (
SELECT version, date
, count(*) FILTER (WHERE step IS NOT FALSE) OVER (ORDER BY date) AS grp
FROM (
SELECT version, date
, lag(version) OVER (ORDER BY date) <> version AS step
FROM cluster_info
WHERE cluster_id = '0f4ce21e-0d08-11ec-b209-0242c0a8c004'
ORDER BY date
) sub1
) sub2
GROUP BY grp;
db<>fiddle here (sample data extended with version downgrade and unknown version)
See (with detailed explanation and links to more):
- Select longest continuous sequence
- How to group timestamps into islands (based on arbitrary gap)?
Related Topics
Sqlite Order by Date1530019888000
Group Datetime into 5,15,30 and 60 Minute Intervals
What's the Difference Between a Table Scan and a Clustered Index Scan
Bigquery SQL for 28-Day Sliding Window Aggregate (Without Writing 28 Lines of SQL)
How to Drop a Default Value or Similar Constraint in T-Sql
SQL Query Joins Multiple Tables - Too Slow (8 Tables)
Union the Results of Multiple Stored Procedures
Rails Find Record with Zero Has_Many Records Associated
How to Delete the Top 1000 Rows from a Table Using SQL Server 2008
How to Execute a Native SQL Script in JPA/Hibernate
How to Select the First N Rows of Each Group
Find Duplicate Rows with Postgresql
Division (/) Not Giving My Answer in Postgresql