Efficient Time Series Querying in Postgres

The following query produces a row for every widget and every day between the first and last recorded date, carrying the most recent score forward over gaps:

select
  widget_id,
  for_date,
  case
    when score is not null then score
    -- no score recorded for this day: take the most recent earlier score
    else first_value(score) over (partition by widget_id, c order by for_date)
  end as score
from (
  select
    a.widget_id,
    a.for_date,
    s.score,
    -- running count of recorded scores; a recorded day and the gap days
    -- after it share the same value of c, so they form one group
    count(score) over (partition by a.widget_id order by a.for_date) as c
  from (
    -- one row per widget per day between the first and last recorded date
    select widget_id, g.d::date as for_date
    from (
      select distinct widget_id
      from score
    ) s
    cross join
    generate_series(
      (select min(for_date) from score),
      (select max(for_date) from score),
      '1 day'
    ) g(d)
  ) a
  left join
    score s on a.widget_id = s.widget_id and a.for_date = s.for_date
) s
order by widget_id, for_date
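
For reference, a minimal schema and sample data that this query assumes (the column names come from the query itself; the types and values below are only an illustration):

create table score (
  widget_id int,
  for_date  date,
  score     int
);

insert into score values
  (1, date '2024-01-01', 10),
  (1, date '2024-01-04', 20),
  (2, date '2024-01-02', 5);

-- widget 1 comes out as 10, 10, 10, 20 for Jan 1-4;
-- widget 2 comes out as null, 5, 5, 5 (nothing precedes its first score)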

Time Series data on PostgreSQL

Postgres has a very good storage type for time series data: arrays. I remember using them for this purpose 20 years ago. You can transform array data back into a table with the unnest function, and arrays are stored with transparent compression.
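
A minimal sketch of that pattern, with made-up table and column names (one array of readings per sensor per day, expanded back into rows with unnest):

-- hypothetical layout: the readings for one sensor and one day packed into an array
create table sensor_day (
  sensor_id int,
  day       date,
  readings  double precision[]
);

-- expand the arrays back into one row per reading
select
  sensor_id,
  day,
  u.sample_no,    -- position of the reading within the day's array
  u.reading
from sensor_day,
     unnest(readings) with ordinality as u(reading, sample_no);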

TimescaleDB uses this internally, together with massive partitioning and some extensions to the optimizer and planner, so the use of arrays is transparent to users. TimescaleDB is a very good product and a smart solution that extends Postgres's capabilities very well. Postgres itself has only generic compression, but Timescale supports special compression techniques designed for time series data, which are much more effective.

Working with time series data is usually much simpler than working with general data. Time series data (and the queries over it) have clean characteristics, and the data are append-only, so it's not too hard to write a time series database, and there are lots of products of this kind. The benefit of Timescale is that it is an extension of Postgres: if you know Postgres, you know TimescaleDB. And unlike other time series databases it is SQL based, so if you know SQL, you don't need to learn a new special language.

Attention: many SQL databases support temporal data, but that is different from time series data.
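
To illustrate the distinction with two sketched schemas (the names here are only illustrative): temporal data records the period during which a fact was valid, while time series data records measurements taken at successive points in time.

-- temporal data: each row says when a fact was valid
create table product_price (
  product_id int,
  price      numeric,
  valid      daterange    -- e.g. [2024-01-01, 2024-07-01)
);

-- time series data: each row is one measurement at one instant, append-only
create table temperature (
  sensor_id int,
  ts        timestamptz,
  value     double precision
);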

Querying time series in Postgres

One problem with the way you are currently doing it is that it does not generate a
data point in intervals which do not have any sample data. For example, if the
user wants a chart from seconds 0 - 10 in steps of 1, then your chart won't have any
points after 5. Maybe that doesn't matter in your use case though.

Another issue, as you indicated, is that it would be nice to be able to use some kind of
linear interpolation to apportion the measurements in case the resolution of the
requested plot is finer than that of the available data.

To solve the first of these, instead of selecting data purely from the sample table,
we can join together the data with a generated series that matches the user's
request. The latter can be generated using this:

SELECT int4range(rstart, rstart+1) AS srange 
FROM generate_series(0,10,1) AS seq(rstart)

The above query will generate a series of ranges, from 0 to 10 with a step size
of 1. The output looks like this:

 srange
---------
[0,1)
[1,2)
[2,3)
[3,4)
[4,5)
[5,6)
[6,7)
[7,8)
[8,9)
[9,10)
[10,11)
(11 rows)

We can join this to the data table using the && operator, which matches rows whose ranges overlap.
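
For example:

-- && is true when the two ranges share at least one point
select int4range(0, 5) && int4range(3, 8);   -- true
select int4range(0, 5) && int4range(5, 8);   -- false: the upper bound 5 is exclusive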

The second point can be addressed by calculating the proportion of each data row
which falls into each sample window. For instance, a data row covering the range [2,5)
with value 10 overlaps the window [3,4) by 1 of its 3 units, so that window is
credited 10 / 3 ≈ 3.33.

Here is the full query:

SELECT lower(srange) AS t,
       sum(CASE
             -- when data range is fully contained in sample range
             WHEN drange <@ srange THEN value
             -- when data range and sample range overlap, calculate the ratio of the intersection
             -- and use that to apportion the value
             ELSE CAST (value AS DOUBLE PRECISION)
                  * (upper(drange * srange) - lower(drange * srange))
                  / (upper(drange) - lower(drange))
           END) AS value
FROM (
  -- Generate the range to be plotted (the sample ranges).
  -- To change the start / end of the range, change the 1st 2 arguments
  -- of the generate_series. To change the step size change BOTH the 3rd
  -- argument and the amount added to rstart (they must be equal).
  SELECT int4range(rstart, rstart + 1) AS srange
  FROM generate_series(0, 10, 1) AS seq(rstart)
) AS s
LEFT JOIN (
  -- Note the use of the lag window function so that for each row, we get
  -- a range from the previous timestamp up to the current timestamp
  SELECT int4range(coalesce(lag(ts) OVER (ORDER BY ts), 0), ts) AS drange, value
  FROM data
) AS d ON srange && drange
GROUP BY lower(srange)
ORDER BY lower(srange)

Result:

 t  |      value
----+------------------
  0 |                5
  1 |                2
  2 | 3.33333333333333
  3 | 3.33333333333333
  4 | 3.33333333333333
  5 |
  6 |
  7 |
  8 |
  9 |
 10 |
(11 rows)

It is not likely any index will be used on ts in this query as it stands, and
if the data table is large then performance is going to be dreadful.

There are some things you could try to help with this. One suggestion could be
to redesign the data table such that the first column contains the time range of
the data sample, rather than just the ending time, and then you could add a
range index. You could then remove the windowing function from the second
subquery, and hopefully the index can be used.
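
A rough sketch of that redesign, reusing the table and column names from the query above (the index name is made up):

-- store the range each sample covers directly, instead of deriving it with lag()
create table data (
  drange int4range not null,   -- e.g. [previous_ts, current_ts)
  value  integer   not null
);

-- a GiST index on the range column can support the && (overlap) join condition
create index data_drange_idx on data using gist (drange);

-- the second subquery then becomes a plain scan:
--   ... LEFT JOIN data AS d ON srange && d.drange ...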

Read up on range types in the PostgreSQL documentation.

Caveat Emptor: I have not tested this other than on the tiny data sample you supplied.
I have used something similar to this for a somewhat similar purpose though.

Efficiently querying a huge time series table for one row every 15 minutes

For 15-minute intervals:

with i as (
  -- earliest sample for each tag in each 15-minute bucket
  select cf.tagindex, min(dateandtime) as dateandtime
  from contfloattable cf
  group by
    floor(extract(epoch from dateandtime) / 60 / 15),
    cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from
  contfloattable cf
  inner join
  conttagtable t on cf.tagindex = t.tagindex
  inner join
  i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by cf.dateandtime, t.tagname

Show the explain output for this query (if it works) so we can try to optimize. You can post it in this answer.
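
As an aside, on PostgreSQL 14 and later the 15-minute bucketing can also be expressed with date_bin instead of the epoch arithmetic; a sketch using the same tables and columns:

with i as (
  select cf.tagindex, min(cf.dateandtime) as dateandtime
  from contfloattable cf
  group by
    date_bin('15 minutes', cf.dateandtime, timestamp '2000-01-01'),
    cf.tagindex
)
select cf.dateandtime, cf."Val", cf.status, t.tagname
from
  contfloattable cf
  inner join
  conttagtable t on cf.tagindex = t.tagindex
  inner join
  i on i.tagindex = cf.tagindex and i.dateandtime = cf.dateandtime
order by cf.dateandtime, t.tagname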

Explain Output

"Sort  (cost=15102462177.06..15263487805.24 rows=64410251271 width=57)"
" Sort Key: cf.dateandtime, t.tagname"
" CTE i"
" -> HashAggregate (cost=49093252.56..49481978.32 rows=19436288 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Hash Join (cost=270117658.06..1067549320.69 rows=64410251271 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" -> Merge Join (cost=270117116.39..298434544.23 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2741707.02..2790297.74 rows=19436288 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..388725.76 rows=19436288 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"

It looks like you need this index:

create index cf_tag_datetime on contfloattable (tagindex, dateandtime)

Run ANALYZE after creating it. Also note that any index on a big table will have a significant performance impact on data changes (inserts etc.), since it has to be updated on each change.
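
That is, after the create index above:

analyze contfloattable;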

Update

I added the cf_tag_datetime index (tagindex,dateandtime) and here's the new explain:

"Sort  (cost=15349296514.90..15512953953.25 rows=65462975340 width=57)"
" Sort Key: cf.dateandtime, t.tagname"
" CTE i"
" -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Hash Join (cost=270179293.86..1078141313.22 rows=65462975340 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"

The estimate seems to have gone up :( However, if I remove the order by clause (not exactly what I need, but it would work), there is a big reduction:

"Hash Join  (cost=319669581.62..1127631600.98 rows=65462975340 width=57)"
" Hash Cond: (cf.tagindex = t.tagindex)"
" CTE i"
" -> HashAggregate (cost=49093252.56..49490287.76 rows=19851760 width=12)"
" -> Seq Scan on contfloattable cf (cost=0.00..38528881.68 rows=1408582784 width=12)"
" -> Merge Join (cost=270178752.20..298499296.08 rows=1408582784 width=25)"
" Merge Cond: ((i.tagindex = cf.tagindex) AND (i.dateandtime = cf.dateandtime))"
" -> Sort (cost=2803342.82..2852972.22 rows=19851760 width=12)"
" Sort Key: i.tagindex, i.dateandtime"
" -> CTE Scan on i (cost=0.00..397035.20 rows=19851760 width=12)"
" -> Materialize (cost=267375409.37..274418323.29 rows=1408582784 width=21)"
" -> Sort (cost=267375409.37..270896866.33 rows=1408582784 width=21)"
" Sort Key: cf.tagindex, cf.dateandtime"
" -> Seq Scan on contfloattable cf (cost=0.00..24443053.84 rows=1408582784 width=21)"
" -> Hash (cost=335.74..335.74 rows=16474 width=44)"
" -> Seq Scan on conttagtable t (cost=0.00..335.74 rows=16474 width=44)"

I have not yet tried this index... will do so though. Stand by.

Looking at it again, I think the reversed index could be even better, as it can be used not only in the Merge Join but also in the final Sort (it reuses the name cf_tag_datetime, so drop the earlier index first or give this one a different name):

create index cf_tag_datetime on contfloattable (dateandtime, tagindex)

