TimescaleDB: Efficiently Select Last Row

How to efficiently get the last timestamp per asset without a sequential scan in TimescaleDB?

Last point query from TimescaleDB's docs:

https://docs.timescale.com/api/latest/analytics/last/
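
For reference, a minimal last-point query built on the last() aggregate could look like the sketch below; the hypertable name (conditions) and columns (asset_id, time, value) are made up for the example. This is the naive form whose cost the answers below discuss.

-- Hypothetical hypertable: conditions(asset_id, time, value).
-- Returns the latest timestamp (and the value recorded at it) per asset.
-- Without further help this still visits the index of every chunk,
-- which is exactly the cost discussed below.
select asset_id,
       max(time)         as last_time,
       last(value, time) as last_value
from conditions
group by asset_id;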

How to improve the performance of getting the last timestamp in TimescaleDB

The database has to go to the sub-index of each chunk and find the latest timestamp for timeseries_id=x. The database correctly uses the index (as you can see from the EXPLAIN output): it does an index scan, not a full scan, of the sub-index in each of the chunks. But that still means >1000 index scans. No chunks can be pruned, because the planner can't know which chunks contain entries for that specific timeseries_id.

And you have 1300 chunks for only 66M records, i.e. roughly 50k rows per chunk. That's too few rows per chunk. The Timescale docs give the following recommendation:

The key property of choosing the time interval is that the chunk (including indexes) belonging to the most recent interval (or chunks if using space partitions) fit into memory. As such, we typically recommend setting the interval so that these chunk(s) comprise no more than 25% of main memory.

https://docs.timescale.com/latest/using-timescaledb/hypertables#best-practices

Reducing the number of chunks will significantly improve the query performance.
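
A quick sketch of how to do that, assuming a hypertable named conditions (the name is illustrative). Note that set_chunk_time_interval only applies to chunks created after the call, so existing data keeps its current chunk size unless it is re-inserted:

-- Make future chunks cover a week of data instead of many small intervals.
select set_chunk_time_interval('conditions', interval '7 days');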

Additionally, you might gain even more query performance if you use TimescaleDB compression, which reduces the number of chunks that have to be scanned even further; you could segment by timeseries_id (https://docs.timescale.com/latest/api#compression). Or you could define a continuous aggregate that holds the last item per timeseries_id (https://docs.timescale.com/latest/api#continuous-aggregates).
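
A rough sketch of both options, again with illustrative table and column names; the exact compression and continuous-aggregate syntax depends on your TimescaleDB version (and last() inside a continuous aggregate needs a fairly recent release):

-- Compression, segmented by timeseries_id so rows of the same series are
-- stored together and can be read with less I/O.
alter table conditions set (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'timeseries_id'
);
select add_compression_policy('conditions', interval '7 days');

-- A continuous aggregate keeping the latest value per timeseries_id per day;
-- the overall latest row is then read from the most recent bucket.
create materialized view latest_per_series_daily
with (timescaledb.continuous) as
select timeseries_id,
       time_bucket('1 day', time) as bucket,
       last(value, time)          as last_value,
       max(time)                  as last_time
from conditions
group by timeseries_id, bucket;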

TimescaleDB: get first value before and after range

You could use subqueries to achieve this (example query with a range of '2021-04-01 04:04:00' to '2021-04-01 04:07:00'):

SELECT time FROM table
WHERE
(time >= timestamp '2021-04-01 04:04:00' AND time <= timestamp '2021-04-01 04:07:00')
OR time = (SELECT time FROM table WHERE time < timestamp '2021-04-01 04:04:00' ORDER BY time DESC LIMIT 1)
OR time = (SELECT time FROM table WHERE time > timestamp '2021-04-01 04:07:00' ORDER BY time ASC LIMIT 1)
ORDER BY time

Using Timescale to find the latest value per interval

Found a nice working solution to my problem.
It involves four main steps:

  1. getting latest values
    select 
time_bucket('1 second', time + '1 second') as interval,
last(val, db_id) as last_value
from table
where time > <date_start> and time < <date_end>
group by interval
order by interval;

This will produce a table with the latest value per one-second bucket. Note that last() orders by its second argument (db_id here), so within each bucket the value from the row with the highest db_id is returned; this gives you another level of ordering when needed.
e.g.

interval              last_value
2020-01-01 08:39:23 | 197.95
2020-01-01 08:40:38 | 57.95
2020-01-01 08:43:26 | 160.00

Note that I shift the time by one second with + '1 second', since I only want data strictly before a particular second; without this shift, on-the-second data would be counted as part of the last price.
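
To illustrate the shift with a made-up reading at 08:39:23.450 (the values are only for the example):

-- With the shift, the reading lands in the bucket labelled with the next second,
-- so the bucket 08:39:24 contains only data strictly before 08:39:24.
select time_bucket('1 second', timestamp '2020-01-01 08:39:23.450' + '1 second');
-- => 2020-01-01 08:39:24

-- Without the shift it stays in its own second's bucket.
select time_bucket('1 second', timestamp '2020-01-01 08:39:23.450');
-- => 2020-01-01 08:39:23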


  2. creating a table with timestamps per second
    select 
time_bucket_gapfill('1 second', time) as per_second
from table
where time > <date_start> and time < <date_end>
group by per_second
order by per_second;

Here I produce a table where each row has per second timestamps.

e.g.

per_second
2020-01-01 00:00:00.000
2020-01-01 00:00:01.000
2020-01-01 00:00:02.000
2020-01-01 00:00:03.000
2020-01-01 00:00:04.000
2020-01-01 00:00:05.000

  3. join them together and add a value_partition column
select
per_second,
last_value,
sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
from
(
select
time_bucket('1 second', time + '1 second') as interval,
last(val, db_id) as last_value
from table
where time > <date_start> and time < <date_end>
group by interval
) a
right join
(
select
time_bucket_gapfill('1 second', time) as per_second
from table
where time > <date_start> and time < <date_end>
group by per_second
) b
on a.interval = b.per_second

Inspired by this answer, the goal is to have a counter (value_partition) that increments only if the value is not null.

e.g.

per_second              last_value value_partition
2020-01-01 00:00:00.000 NULL 0
2020-01-01 00:00:01.000 15.82 1
2020-01-01 00:00:02.000 NULL 1
2020-01-01 00:00:03.000 NULL 1
2020-01-01 00:00:04.000 NULL 1
2020-01-01 00:00:05.000 NULL 1
2020-01-01 00:00:06.000 NULL 1
2020-01-01 00:00:07.000 NULL 1
2020-01-01 00:00:08.000 NULL 1
2020-01-01 00:00:09.000 NULL 1
2020-01-01 00:00:10.000 15.72 2
2020-01-01 00:00:10.000 14.67 3


  4. filling in the null values
select
per_second,
first_value(last_value) over (partition by value_partition order by per_second) as latest_value
from
(
select
per_second,
last_value,
sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
from
(
select
time_bucket('1 second', time + '1 second') as interval,
last(val, db_id) as last_value
from table
where time > <date_start> and time < <date_end>
group by interval
) a
right join
(
select
time_bucket_gapfill('1 second', time) as per_second
from table
where time > <date_start> and time < <date_end>
group by per_second
) b
on a.interval = b.per_second
) as q

This final step brings everything together: it takes advantage of the value_partition column and fills in the null values accordingly.

e.g.

per_second              latest_value
2020-01-01 00:00:00.000 NULL
2020-01-01 00:00:01.000 15.82
2020-01-01 00:00:02.000 15.82
2020-01-01 00:00:03.000 15.82
2020-01-01 00:00:04.000 15.82
2020-01-01 00:00:05.000 15.82
2020-01-01 00:00:06.000 15.82
2020-01-01 00:00:07.000 15.82
2020-01-01 00:00:08.000 15.82
2020-01-01 00:00:09.000 15.82
2020-01-01 00:00:10.000 15.72
2020-01-01 00:00:10.000 14.67


