Efficiently Querying a Huge Time Series Table for One Row Every 15 Minutes

For 15-minute intervals:

with i as (
    select cf.tagindex, min(dateandtime) as dateandtime
    from contfloattable cf
    group by
        floor(extract(epoch from dateandtime) / 60 / 15),
        cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from contfloattable cf
inner join conttagtable t on cf.tagindex = t.tagindex
inner join i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by cf.dateandtime, t.tagname
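
The grouping expression floor(extract(epoch from dateandtime) / 60 / 15) maps every timestamp to a 15-minute bucket, and min(dateandtime) then picks the first row per bucket and tag. A quick standalone illustration (the sample timestamps are made up):

select ts,
       floor(extract(epoch from ts) / 60 / 15) as bucket
from (values
    (timestamp '2012-01-01 00:07:00'),   -- some bucket n
    (timestamp '2012-01-01 00:14:59'),   -- still bucket n
    (timestamp '2012-01-01 00:15:00')    -- bucket n + 1
) v(ts);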

Show the EXPLAIN output for this query (if it works) so we can try to optimize it. You can post it in this answer.

Explain Output

"Sort  (cost=15102462177.06..15263487805.24 rows=64410251271 width=57)"
" Sort Key: cf.dateandtime, t.tagname"
" CTE i"
" -> HashAggregate (cost=49093252.56..49481978.32 rows=19436288 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Hash Join (cost=270117658.06..1067549320.69 rows=64410251271 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" -> Merge Join (cost=270117116.39..298434544.23 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2741707.02..2790297.74 rows=19436288 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..388725.76 rows=19436288 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"

It looks like you need this index:

create index cf_tag_datetime on contfloattable (tagindex, dateandtime)

Run ANALYZE after creating it. Note that any index on a big table has a significant performance impact on data changes (inserts, updates, deletes), since the index must be updated on every change.
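
Concretely, for the table in question:

analyze contfloattable;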

Update

I added the cf_tag_datetime index (tagindex,dateandtime) and here's the new explain:

"Sort  (cost=15349296514.90..15512953953.25 rows=65462975340 width=57)"
" Sort Key: cf.dateandtime, t.tagname"
" CTE i"
" -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Hash Join (cost=270179293.86..1078141313.22 rows=65462975340 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"

The estimated cost seems to have gone up :( However, if I remove the ORDER BY clause (not exactly what I need, but it would work), there is a big reduction:

"Hash Join  (cost=319669581.62..1127631600.98 rows=65462975340 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" CTE i"
" -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"

I have not yet tried this index, but I will. Stand by.

Looking at it again, I think the inverse index could be even better, as it can be used not only in the Merge Join but also in the final Sort. Note it needs a different name, since cf_tag_datetime is already taken:

create index cf_datetime_tag on contfloattable (dateandtime, tagindex)

Best method to determine if one timestamp lies within 15 minutes of another

SELECT *, kpi_start AND kpi_end AS kpi_startend
FROM (
    SELECT job_no
         , b.actual_start - a.rostered_start BETWEEN interval '- 15 min' AND interval '15 min' AS kpi_start
         , b.actual_end   - a.rostered_end   BETWEEN interval '- 15 min' AND interval '15 min' AS kpi_end
    FROM tbl_a a
    JOIN tbl_b b USING (job_no)
) sub;

This uses more natural boolean values instead of PASS / FAIL, and they are easy to translate if needed.

There is no built-in abs() function for the interval data type, just abs(numeric_type). But the BETWEEN ... AND ... construct is at least as performant and simple enough.
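
If you would rather test an absolute value explicitly, one workaround (an illustrative variant, not part of the original answer) is to compare the epoch seconds of the difference, where abs() does apply; 900 seconds is 15 minutes:

SELECT job_no
     , abs(extract(epoch FROM b.actual_start - a.rostered_start)) <= 900 AS kpi_start
FROM tbl_a a
JOIN tbl_b b USING (job_no);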

SQL - how can I efficiently select multiple closest time series data points

To get multiple points:

select d.* 
from data d
where recorded_at > '2015-01-01 01:01:01'
order by recorded_at
limit 100

To get them efficiently, create an index on data(recorded_at):

create index idx_data_recorded_at on data(recorded_at);

Also, a union would probably not help your cause. Although LIMIT without an ORDER BY is not guaranteed to return any particular row, in practice it tends to return the same row over and over.

EDIT (based on question edit):

For multiple dates, this gets a bit harder. You are probably best off using union all:

(select d.* 
from data d
where recorded_at > '2015-01-01 01:01:01'
order by recorded_at
limit 1
) union all
(select d.*
from data d
where recorded_at > '2015-02-01 01:01:01'
order by recorded_at
limit 1
)

Although complicated, each subquery should be blazingly fast with the right index, so the overall query should be quite fast as well.
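
If the list of target timestamps grows long, writing one branch per date gets unwieldy. A sketch of an alternative (not from the original answer; the VALUES list of targets is hypothetical) uses a LATERAL join so each target still probes idx_data_recorded_at once:

select x.*
from (values
    (timestamp '2015-01-01 01:01:01'),
    (timestamp '2015-02-01 01:01:01')
) t(target)
cross join lateral (
    select d.*
    from data d
    where d.recorded_at > t.target
    order by d.recorded_at
    limit 1
) x;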

generate series at 15 minute interval but skip every other week

You need two loops: one over the first day of each week, and another over the 15-minute intervals within that week. In SQL, nested loops are implemented with a cross join:

select tstamp
from generate_series('2017-01-01', '2017-12-31', '2 week'::interval) as start_of_week
cross join generate_series(start_of_week::date,
                           start_of_week::date + '1 week'::interval,
                           '15 min'::interval) as tstamp
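
One caveat: generate_series includes its upper bound, so each inner series also emits midnight of the first day of the following (skipped) week. If that boundary row is unwanted, a small adjustment (my assumption about the desired output, not part of the original answer) pulls the upper bound in by one step:

select tstamp
from generate_series('2017-01-01', '2017-12-31', '2 week'::interval) as start_of_week
cross join generate_series(start_of_week::date,
                           start_of_week::date + '1 week'::interval - '15 min'::interval,
                           '15 min'::interval) as tstamp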

How to efficiently join two huge tables by nearest timestamp?

What you want is called "as-of join". That joins each timestamp to the nearest value in the other table.

Some time-series databases, like ClickHouse, support this directly; it is the only way to make it fast. An as-of join is quite similar to a merge join, with a few modifications: the engine must scan both tables in timestamp order, and join to the nearest-value row instead of the equal-value row.

I've looked into it briefly and it doesn't look like TimescaleDB supports it, but this post shows a workaround using a lateral join and a covering index. This is likely to have performance similar to your query, because it will use a nested loop and an index-only scan to pick the nearest value for each row.
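
For reference, a minimal sketch of that lateral-join workaround in PostgreSQL. The table and column names (ticks, quotes, ts, price, qty) are hypothetical, and "nearest" is taken to mean the nearest preceding row, the most common as-of definition:

-- Covering index so each probe is an index-only scan (INCLUDE needs PostgreSQL 11+):
create index on quotes (ts) include (price);

select t.ts, t.qty, q.ts as quote_ts, q.price
from ticks t
cross join lateral (
    select q.ts, q.price
    from quotes q
    where q.ts <= t.ts          -- latest quote at or before each tick
    order by q.ts desc
    limit 1
) q;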

RethinkDB - How to group on aggregated time intervals

I ran into a similar problem, and here is what I did:

.group(r.row('date').toEpochTime().sub(r.row('date').toEpochTime().mod(<INTERVAL_IN_SECONDS>)))

What this does is group rows into buckets of <INTERVAL_IN_SECONDS> by rounding each epoch time down to a multiple of the interval.

I don't know if this is the best way to do the task, but it works for me.
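
For comparison, the same round-down-by-modulo bucketing in SQL (illustrative; the events table, ts column, and 900-second interval are assumptions):

select to_timestamp(extract(epoch from ts)::bigint
                    - extract(epoch from ts)::bigint % 900) as bucket_start,
       count(*)
from events
group by 1;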

mysql query - optimizing existing MAX-MIN query for a huge table

Adding suitable indexes will help. Two compound indexes will speed things up significantly:

ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);
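
As a sketch of the kind of query these indexes serve (the original query is not shown here, so the shapes below are illustrative):

-- (error, exportValue) lets MySQL answer this from the index alone:
SELECT MIN(exportValue), MAX(exportValue)
FROM tbl_name
WHERE error = 0;

-- (station, fetchDate) confines the date-range scan to one station:
SELECT MIN(fetchDate), MAX(fetchDate)
FROM tbl_name
WHERE station = 'S01';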

