Finding Simultaneous Events in a Database Between Times

Calculating working time with overlapping events (SQL)

As suggested by posters, indexing made a significant difference - creating an index on wka_StartTime and wka_EndTime sorted it.

(sorry, couldn't see how to mark the comments made by others as an answer!)
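For reference, such an index could look like this (the posts don't name the table, so WorkAllocation below is a hypothetical stand-in):

CREATE INDEX idx_wka_times ON WorkAllocation (wka_StartTime, wka_EndTime); -- hypothetical table name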

Data between two datetimes

This is my preferred way. It starts with a list of all the times when calls start or stop, and then calculates the number of simultaneous calls:

with t as (
    select starttime as thetime, 1 as isstart, 0 as isend
    from calls
    union all
    select endtime, 0 as isstart, 1 as isend
    from calls
),
t2 as (
    select thetime,
        row_number() over (order by thetime) as numcalls,
        isstart, isend,
        row_number() over (partition by isstart order by thetime) as cumstartsorends
    from t
),
t3 as (
    select thetime,
        (case when isstart = 1 then cumstartsorends
              else numcalls - cumstartsorends
         end) as numstarts,
        (case when isend = 1 then cumstartsorends
              else numcalls - cumstartsorends
         end) as numends
    from t2
)
select year(thetime) as yr, month(thetime) as mon,
    max(numstarts - numends) as maxcum
from t3
group by year(thetime), month(thetime)

What this query really wants to do is a cumulative sum at each time (start time or end time). However, cumulative window sums are not supported until SQL Server 2012.

So it uses a trick. It calculates the cumulative number of starts and the cumulative number of ends, but each of these is only available on its own rows (start records or end records). To fill in the other value, it also calculates the overall running count of rows and subtracts. As a result, every timestamp carries both the cumulative number of starts and the cumulative number of ends, and the difference between them is the number of concurrent calls.
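On SQL Server 2012 or later, the trick is unnecessary: a windowed SUM computes the running total directly. A minimal sketch, assuming the same calls table:

with t as (
    select starttime as thetime, 1 as inc from calls
    union all
    select endtime, -1 as inc from calls
),
t2 as (
    select thetime,
        sum(inc) over (order by thetime) as numconcurrent -- running starts minus ends
    from t
)
select year(thetime) as yr, month(thetime) as mon,
    max(numconcurrent) as maxcum
from t2
group by year(thetime), month(thetime)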

Calculate number of concurrent events in SQL

1.) Your query did not catch all overlaps - this was already fixed by the other answers.

2.) The data type of your columns starttime and endtime is timestamp. So your WHERE clause is slightly wrong, too:

BETWEEN '2011-11-02' AND '2011-11-03'

This would include '2011-11-03 00:00'. The upper border has to be excluded.

3.) Removed the mixed-case syntax without double quotes. Unquoted identifiers are cast to lower case automatically. To put it simply: best not to use mixed-case identifiers in PostgreSQL at all.

4.) Transformed the query to use explicit JOIN which is always preferable. Actually, I made it a LEFT [OUTER] JOIN, because I want to count calls that overlap with no other calls, too.

5.) Simplified the syntax a bit to arrive at this base query:

SELECT t1.sid, count(*) AS ct
FROM calls_nov t1
LEFT JOIN calls_nov t2 ON t1.starttime <= t2.endtime
                      AND t1.endtime >= t2.starttime
WHERE t1.starttime >= '2011-11-02 0:0'::timestamp
  AND t1.starttime <  '2011-11-03 0:0'::timestamp
GROUP BY 1
ORDER BY 2 DESC;

This query is extremely slow for a big table, because every row starting on '2011-11-02' has to be compared to every row in the whole table, which leads to (almost) O(n²) cost.



Faster

We can drastically cut down the cost by pre-selecting possible candidates. Select only the columns and rows you need. I do this with two CTEs.

  1. Select calls starting on the day in question. -> CTE x
  2. Calculate the latest end of those calls. (subquery in CTE y)
  3. Select only calls that overlap with the total range of CTE x. -> CTE y
  4. The final query is much faster than querying the huge underlying table.

WITH x AS (
    SELECT sid, starttime, endtime
    FROM calls_nov
    WHERE starttime >= '2011-11-02 0:0'
      AND starttime <  '2011-11-03 0:0'
), y AS (
    SELECT starttime, endtime
    FROM calls_nov
    WHERE endtime >= '2011-11-02 0:0'
      AND starttime <= (SELECT max(endtime) AS max_endtime FROM x)
)
SELECT x.sid, count(*) AS count_overlaps
FROM x
LEFT JOIN y ON x.starttime <= y.endtime
           AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;


Faster yet

I have a real-life table of 350,000 rows with overlapping start/end timestamps similar to yours, which I used for a quick benchmark: PostgreSQL 8.4, scarce resources because it is a test DB, indexes on starttime and endtime. (An index on the ID column is irrelevant here.) Tested with EXPLAIN ANALYZE, best of 5.

Base query:

Total runtime: 476994.774 ms

CTE variant:

Total runtime: 4199.788 ms -- that's > factor 100.

After adding a multicolumn index of the form:

CREATE INDEX start_end_index on calls_nov (starttime, endtime);

Total runtime: 4159.367 ms



Ultimate Speed

If that is not enough, there is a way to speed it up yet another order of magnitude. Instead of the CTEs above, materialize the temp tables and - this is the crucial point - create an index on the second one. Could look like this:

Execute as one transaction:

CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT sid, starttime, endtime
FROM calls_nov
WHERE starttime >= '2011-11-02 0:0'
  AND starttime <  '2011-11-03 0:0';

CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= '2011-11-02 0:0'
  AND starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime); -- this is where the magic happens

SELECT x.sid, count(*) AS ct
FROM x
LEFT JOIN y ON x.starttime <= y.endtime
           AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;

Read about temporary tables in the manual.



Ultimate solution

  • Create a plpgsql function that encapsulates the magic.

  • Diagnose the typical size of your temp tables. Create them standalone and measure:

      SELECT pg_size_pretty(pg_total_relation_size('tmp_tbl'));
  • If they are bigger than your setting for temp_buffers, temporarily set it high enough inside your function to hold both temporary tables in RAM. It is a major speedup if you don't have to swap to disk. (The setting must be applied before the first use of temporary tables in the session to have any effect.)

CREATE OR REPLACE FUNCTION f_call_overlaps(date)
RETURNS TABLE (sid varchar, ct integer) AS
$BODY$
DECLARE
    _from timestamp := $1::timestamp;
    _to   timestamp := ($1 + 1)::timestamp;
BEGIN

SET temp_buffers = '64MB'; -- example value; more RAM for temp tables

CREATE TEMP TABLE x ON COMMIT DROP AS
SELECT c.sid, starttime, endtime -- avoid naming conflict with OUT param
FROM calls_nov c
WHERE starttime >= _from
  AND starttime < _to;

CREATE TEMP TABLE y ON COMMIT DROP AS
SELECT starttime, endtime
FROM calls_nov
WHERE endtime >= _from
  AND starttime <= (SELECT max(endtime) FROM x);

CREATE INDEX y_idx ON y (starttime, endtime);

RETURN QUERY
SELECT x.sid, count(*)::int -- AS ct
FROM x
LEFT JOIN y ON x.starttime <= y.endtime AND x.endtime >= y.starttime
GROUP BY 1
ORDER BY 2 DESC;

END;
$BODY$ LANGUAGE plpgsql;

Call:

SELECT * FROM f_call_overlaps('2011-11-02') -- just name your date

Total runtime: 138.169 ms -- that's factor 3000



What else can you do to speed it up?

General performance optimization.

CLUSTER calls_nov USING starttime_index; -- this also vacuums the table fully

ANALYZE calls_nov;

Finding number of concurrent events given start and end times

You're looking for an interval tree.

  • Construction: O(n log n), where n is the number of intervals
  • Query: O(m+log n), where m is the number of query results and n is the number of intervals
  • Space: O(n)
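If the events live in a database rather than in application memory, the closest SQL analogue is an index built for range overlap; a sketch in PostgreSQL (9.2+), assuming a hypothetical table events(start_tm, stop_tm):

CREATE INDEX events_range_idx ON events USING gist (tsrange(start_tm, stop_tm));

-- Find all events overlapping a given interval:
SELECT *
FROM events
WHERE tsrange(start_tm, stop_tm) && tsrange('2011-11-02', '2011-11-03');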

Finding events within a given timeframe with SQL

I am not quite sure of your schema, so I am guessing slightly; however, the first thing I noticed is that in your CTE x you are selecting MAX(DaysSup) but also grouping by DaysSup, making the MAX redundant.

However, I don't really think this is relevant to your problem. I would personally take a different approach to solving this. I am assuming you have a table along the lines of:

CREATE TABLE rx
(
    PatID INT,
    FillDate DATE,
    DaysSup INT,
    DrugName VARCHAR(50)
)

So you can do something along the lines of:

SELECT rx.PatID,
    rx.FillDate,
    rx.DrugName,
    [DateTaken] = DATEADD(DAY, v.Number, rx.FillDate)
FROM rx
INNER JOIN master..spt_values v
    ON v.Number BETWEEN 0 AND rx.DaysSup
    AND v.Type = 'P'

This will give a list of all the dates a drug has been taken by each patient, rather than a range, so you can then use something like:

WITH x AS
(   SELECT rx.PatID,
        rx.FillDate,
        rx.DrugName,
        [DateTaken] = DATEADD(DAY, v.Number, rx.FillDate)
    FROM rx
    INNER JOIN master..spt_values v
        ON v.Number BETWEEN 0 AND rx.DaysSup
        AND v.Type = 'P'
), y AS
(   SELECT x.PatID,
        x.DateTaken,
        DrugsTaken = COUNT(DISTINCT x.DrugName)
    FROM x
    GROUP BY x.PatID, x.DateTaken
    HAVING COUNT(DISTINCT x.DrugName) >= 7
), z AS
(   SELECT *,
        GroupID = DATEDIFF(DAY, -ROW_NUMBER() OVER(PARTITION BY PatID ORDER BY DateTaken DESC), DateTaken)
    FROM y
)
SELECT z.PatID,
    [MostConcurrent] = MAX(z.DrugsTaken),
    [DateStarted] = MIN(z.DateTaken),
    [DateEnded] = MAX(z.DateTaken)
FROM z
GROUP BY z.PatID, z.GroupID;

The first CTE is covered above; the second simply limits the results to dates with 7 or more distinct drugs. The third CTE groups each patient's rows into runs of consecutive dates, and the final query gets the min and max date for each of those runs.
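The GroupID expression is the classic gaps-and-islands trick: the negated ROW_NUMBER() is implicitly treated as a day offset, so DATEDIFF yields the date's day number plus the row number, which is constant within a run of consecutive dates. A small worked illustration with hypothetical dates, where d stands for the day number of '2011-11-05':

-- DateTaken    rn (ORDER BY DateTaken DESC)    GroupID = day number + rn
-- 2011-11-05   1                               d + 1
-- 2011-11-04   2                               (d - 1) + 2 = d + 1   -- same group
-- 2011-11-03   3                               (d - 2) + 3 = d + 1   -- same group
-- 2011-11-01   4                               (d - 4) + 4 = d       -- gap: new group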

If you need a list of the drugs taken on each of those dates, you can replace the final SELECT above with one that joins back to the CTE x:

SELECT z.PatID,
    x.DrugName,
    [MostConcurrent] = MAX(z.DrugsTaken),
    [DateStarted] = MIN(z.DateTaken),
    [DateEnded] = MAX(z.DateTaken)
FROM z
INNER JOIN x
    ON x.PatID = z.PatID
    AND x.DateTaken = z.DateTaken
GROUP BY z.PatID, z.GroupID, x.DrugName;

Oracle SQL - Calculating Number of Concurrent Events

You can calculate the number of concurrent events by using a relatively simple technique: cumulative aggregation. The idea is to count the number of starts and stops at each point in time; the cumulative sum of starts minus the cumulative sum of stops is then the number of concurrent events.

select tm, sum(isstart) as numstarts, sum(isstop) as numstops,
       (sum(sum(isstart)) over (order by tm nulls last) -
        sum(sum(isstop)) over (order by tm nulls last)
       ) as NumConcurrent
from ((select start_tm as tm, 1 as isstart, 0 as isstop from events)
      union all
      (select stop_tm, 0 as isstart, 1 as isstop from events)
     ) e
group by tm;

This gives you the number of concurrent events for each time in the data (either a start or an end time). You can then extract the maximum value for a day or hour using a WHERE clause and ORDER BY/FETCH FIRST, or aggregation.
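For example, to pull the busiest moment of a given day (FETCH FIRST needs Oracle 12c or later; table and column names are as in the query above, and the date is just an example):

with conc as (
    select tm,
        (sum(sum(isstart)) over (order by tm nulls last) -
         sum(sum(isstop)) over (order by tm nulls last)
        ) as numconcurrent
    from ((select start_tm as tm, 1 as isstart, 0 as isstop from events)
          union all
          (select stop_tm, 0 as isstart, 1 as isstop from events)
         ) e
    group by tm
)
select tm, numconcurrent
from conc
where tm >= date '2011-11-02'
  and tm < date '2011-11-03'
order by numconcurrent desc
fetch first 1 row only;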

Start and end times - how many concurrent events per hour/day/week etc

I think the easiest way would be to use a database or a standalone sort program or subroutine to pick out the start and stop events in ascending order of time and then process them with a simple program or routine.

As you read the events, keep a running counter of open events: add one when you see a start and subtract one when you see a stop.

Then every time you pick up a new event you can work out how long it was since the last event, and you know how many events were open during this time. So if this was e.g. 5 minutes with two events open then you have seen 10 event-minutes of concurrent events.

If you total these event-minutes up and divide by the length of time from the first event to the last event, you get the average number of concurrent events: if you pick a random instant between the first and last event, the expected number of events open at that instant is exactly this average.
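The same sweep can be done directly in SQL with window functions; a sketch in PostgreSQL-flavored SQL, assuming a hypothetical table events(starttime, endtime):

with e as (
    select starttime as tm, 1 as delta from events
    union all
    select endtime, -1 from events
), r as (
    select tm,
        sum(delta) over (order by tm rows unbounded preceding) as open_events, -- running counter
        lead(tm) over (order by tm) as next_tm                                 -- time of next event
    from e
)
select sum(open_events * extract(epoch from next_tm - tm))   -- total event-seconds
       / extract(epoch from max(tm) - min(tm)) as avg_concurrent
from r;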


