Efficiently querying a huge time series table for one row every 15 minutes
For 15-minute intervals:
with i as (
    select cf.tagindex, min(dateandtime) as dateandtime
    from contfloattable cf
    group by
        floor(extract(epoch from dateandtime) / 60 / 15),
        cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from contfloattable cf
inner join conttagtable t on cf.tagindex = t.tagindex
inner join i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by cf.dateandtime, t.tagname
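As a side note, on PostgreSQL 14 or later the same 15-minute bucketing can be expressed with date_bin, which is easier to read than the epoch arithmetic (a sketch against the same table and columns):

```sql
-- date_bin truncates each timestamp to a fixed-width 15-minute bucket
select cf.tagindex,
       date_bin('15 minutes', cf.dateandtime, timestamp '2000-01-01') as bucket,
       min(cf.dateandtime) as dateandtime
from contfloattable cf
group by cf.tagindex, bucket;
```

The third argument is the origin the buckets are aligned to; any fixed timestamp works.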
Show the explain output for this query (if it works) so we can try to optimize. You can post it in this answer.
Explain Output
"Sort (cost=15102462177.06..15263487805.24 rows=64410251271 width=57)"
" Sort Key: cf.dateandtime, t.tagname"
" CTE i"
" -> HashAggregate (cost=49093252.56..49481978.32 rows=19436288 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Hash Join (cost=270117658.06..1067549320.69 rows=64410251271 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" -> Merge Join (cost=270117116.39..298434544.23 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2741707.02..2790297.74 rows=19436288 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..388725.76 rows=19436288 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"
It looks like you need this index:
create index cf_tag_datetime on contfloattable (tagindex, dateandtime)
Run ANALYZE after creating it. Note that any index on a big table has a significant performance impact on data changes (inserts etc.), as it must be updated on every change.
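On a table this size, a plain CREATE INDEX also takes a lock that blocks writes for the whole build. PostgreSQL's CREATE INDEX CONCURRENTLY avoids that (slower to build, but writes keep flowing):

```sql
-- builds the index without blocking concurrent inserts/updates
create index concurrently cf_tag_datetime on contfloattable (tagindex, dateandtime);

-- refresh planner statistics afterwards
analyze contfloattable;
```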
Update
I added the cf_tag_datetime index (tagindex, dateandtime) and here's the new explain:
"Sort (cost=15349296514.90..15512953953.25 rows=65462975340 width=57)"
" Sort Key: cf.dateandtime, t.tagname"
" CTE i"
" -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Hash Join (cost=270179293.86..1078141313.22 rows=65462975340 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"
The estimated cost seems to have gone up :( However, if I remove the ORDER BY clause (not exactly what I need, but it would work), there is a big reduction:
"Hash Join (cost=319669581.62..1127631600.98 rows=65462975340 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" CTE i"
" -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"
I have not yet tried this index...will do so though. standby.
Now looking at it again, I think the inverse index could be even better, as it can be used not only in the merge join but also in the final sort:
create index cf_datetime_tag on contfloattable (dateandtime, tagindex)
Best method to determine if one timestamp lies within 15 minutes of another
SELECT *, kpi_start AND kpi_end AS kpi_startend
FROM (
SELECT job_no
, b.actual_start - a.rostered_start BETWEEN interval '- 15 min' AND interval '15 min' AS kpi_start
, b.actual_end - a.rostered_end BETWEEN interval '- 15 min' AND interval '15 min' AS kpi_end
FROM tbl_a a
JOIN tbl_b b USING (job_no)
) sub;
This uses more natural boolean values instead of PASS / FAIL, which are easy to translate. There is no built-in abs() function for the interval data type, only abs(numeric_type). But the BETWEEN ... AND ... construct is at least as performant, and simple enough.
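If an absolute value of an interval is ever needed elsewhere, greatest(x, -x) works as a substitute; the query above could equivalently be written as (a sketch against the same tables):

```sql
-- there is no abs(interval), but greatest(x, -x) computes the same thing
select job_no,
       greatest(b.actual_start - a.rostered_start,
              -(b.actual_start - a.rostered_start)) <= interval '15 min' as kpi_start
from   tbl_a a
join   tbl_b b using (job_no);
```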
SQL - how can I efficiently select multiple closest time series data points
To get multiple points:
select d.*
from data d
where recorded_at > '2015-01-01 01:01:01'
order by recorded_at
limit 100
To get them efficiently, create an index on data(recorded_at):
create index idx_data_recorded_at on data(recorded_at);
Also, a union would probably not help your cause. Although limit without an order by is not guaranteed to return any particular row, it would generally return the same row over and over.
EDIT (based on question edit):
For multiple dates, this gets a bit harder. You are probably best off using union all:
(select d.*
from data d
where recorded_at > '2015-01-01 01:01:01'
order by recorded_at
limit 1
) union all
(select d.*
from data d
where recorded_at > '2015-02-01 01:01:01'
order by recorded_at
limit 1
)
Although complicated, each subquery should be blazingly fast with the right index so the overall query should be quite fast.
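If the list of dates keeps growing, the repeated UNION ALL branches can be generated from a VALUES list with a lateral join instead; each lateral subquery hits the same index in the same way (a sketch, assuming PostgreSQL):

```sql
-- one index-backed "first row after ts" lookup per entry in the VALUES list
select d.*
from (values (timestamp '2015-01-01 01:01:01'),
             (timestamp '2015-02-01 01:01:01')) as t(ts)
cross join lateral (
    select *
    from data
    where recorded_at > t.ts
    order by recorded_at
    limit 1
) d;
```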
generate series at 15 minute interval but skip every other week
You need two loops: one over the first day of each week, and another over the 15-minute intervals inside the week. In SQL, a cross join is used to implement nested loops:
select tstamp
from generate_series('2017-01-01', '2017-12-31', '2 week'::interval) as start_of_week
cross join generate_series(start_of_week::date, start_of_week::date + '1 week'::interval, '15 min'::interval) as tstamp
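Note that generate_series includes its upper bound, so the inner series above also emits the first instant of the following (skipped) week. If that boundary row is unwanted, stop 15 minutes earlier (a sketch):

```sql
select tstamp
from generate_series('2017-01-01'::timestamp, '2017-12-31', '2 week') as start_of_week
cross join generate_series(start_of_week,
                           start_of_week + interval '1 week' - interval '15 min',
                           '15 min') as tstamp;
```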
How to efficiently join two huge tables by nearest timestamp?
What you want is called "as-of join". That joins each timestamp to the nearest value in the other table.
Some time-series databases, like ClickHouse, support this directly. This is the only way to make it fast. It is quite similar to a merge join, with a few modifications: the engine must scan both tables in timestamp order, and join each row to the nearest-value row instead of the equal-value row.
I've looked into it briefly and it doesn't look like TimescaleDB supports it, but this post shows a workaround using a lateral join and a covering index. This is likely to have similar performance to your query, because it will use a nested loop and an index-only scan to pick the nearest value for each row.
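That workaround looks roughly like this in PostgreSQL (a sketch with hypothetical quotes/trades tables; the covering index lets each lookup be an index-only scan):

```sql
-- for each quote, pick the latest trade at or before its timestamp
select q.ts, q.symbol, t.price
from quotes q
left join lateral (
    select price
    from trades t
    where t.symbol = q.symbol
      and t.ts <= q.ts            -- nearest earlier-or-equal row
    order by t.ts desc
    limit 1
) t on true;

-- covering index so the subquery never touches the heap (PostgreSQL 11+):
-- create index on trades (symbol, ts desc) include (price);
```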
RethinkDB - How to group on aggregated time intervals
I actually ran into a similar problem, and here is what I did:
.group(r.row('date').toEpochTime().sub(r.row('date').toEpochTime().mod(<INTERVAL_IN_SECONDS>)))
What this does is group rows by time in buckets of <INTERVAL_IN_SECONDS> seconds.
I don't know if this is the best way for the task, but it works for me.
mysql query - optimizing existing MAX-MIN query for a huge table
Adding suitable indexes will help.
Two compound indexes will speed things up significantly:
ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);