TimescaleDB: Efficiently Select Last Row

How to efficiently get the last timestamp per asset without a sequential scan in TimescaleDB?

Last point query from TimescaleDB's docs:

https://docs.timescale.com/api/latest/analytics/last/
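
For reference, a minimal last-point query built on the last() aggregate could look like the sketch below; the hypertable name (conditions) and columns (asset_id, time, value) are made up for the example. This is the naive form whose cost the answers below discuss.

-- Hypothetical hypertable: conditions(asset_id, time, value).
-- Returns the latest timestamp (and the value recorded at it) per asset.
-- Without further help this still visits the index of every chunk,
-- which is exactly the cost discussed below.
select asset_id,
       max(time)         as last_time,
       last(value, time) as last_value
from conditions
group by asset_id;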

How to improve the performance of getting the last timestamp in TimescaleDB

The database has to go to the sub-index of each chunk and find the latest timestamp for timeseries_id=x. The database correctly uses the index (as you can see from the EXPLAIN output): it does an index scan, not a full scan, of the sub-index in each of the chunks. But that still means >1000 index scans. No chunks can be pruned, because the planner can't know which chunks contain entries for that specific timeseries_id.

And you have 1300 chunks for only 66M records, i.e. roughly 50k rows per chunk. That's too few rows per chunk. The Timescale docs give the following recommendation:

The key property of choosing the time interval is that the chunk (including indexes) belonging to the most recent interval (or chunks if using space partitions) fit into memory. As such, we typically recommend setting the interval so that these chunk(s) comprise no more than 25% of main memory.

https://docs.timescale.com/latest/using-timescaledb/hypertables#best-practices

Reducing the number of chunks will significantly improve the query performance.
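
A quick sketch of how to do that, assuming a hypertable named conditions (the name is illustrative). Note that set_chunk_time_interval only applies to chunks created after the call, so existing data keeps its current chunk size unless it is re-inserted:

-- Make future chunks cover a week of data instead of many small intervals.
select set_chunk_time_interval('conditions', interval '7 days');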

Additionally, you might gain even more query performance if you use TimescaleDB compression, which reduces the number of chunks that have to be scanned even further; you could segment by timeseries_id (https://docs.timescale.com/latest/api#compression). Or you could define a continuous aggregate that holds the last item per timeseries_id (https://docs.timescale.com/latest/api#continuous-aggregates).
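
A rough sketch of both options, again with illustrative table and column names; the exact compression and continuous-aggregate syntax depends on your TimescaleDB version (and last() inside a continuous aggregate needs a fairly recent release):

-- Compression, segmented by timeseries_id so rows of the same series are
-- stored together and can be read with less I/O.
alter table conditions set (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'timeseries_id'
);
select add_compression_policy('conditions', interval '7 days');

-- A continuous aggregate keeping the latest value per timeseries_id per day;
-- the overall latest row is then read from the most recent bucket.
create materialized view latest_per_series_daily
with (timescaledb.continuous) as
select timeseries_id,
       time_bucket('1 day', time) as bucket,
       last(value, time)          as last_value,
       max(time)                  as last_time
from conditions
group by timeseries_id, bucket;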

TimescaleDB: get first value before and after range

You could use subqueries to achieve this (example query with a range of '2021-04-01 04:04:00' to '2021-04-01 04:07:00'):

SELECT time FROM table
WHERE
(time >= timestamp '2021-04-01 04:04:00' AND time <= timestamp '2021-04-01 04:07:00')
OR time = (SELECT time FROM table WHERE time < timestamp '2021-04-01 04:04:00' ORDER BY time DESC LIMIT 1)
OR time = (SELECT time FROM table WHERE time > timestamp '2021-04-01 04:07:00' ORDER BY time ASC LIMIT 1)
ORDER BY time

Using Timescale to find the latest value per interval

Found a nice working solution to my problem.
It involves four main steps:

  1. getting latest values
    select 
time_bucket('1 second', time + '1 second') as interval,
last(val, db_id) as last_value
from table
where time > <date_start> and time < <date_end>
group by interval
order by interval;

This will produce a table with the latest value per one-second bucket. Note that last() orders by its second argument (db_id here), so within each bucket the value from the row with the highest db_id is returned; this gives you another level of ordering when needed.
e.g.

interval              last_value
2020-01-01 08:39:23 | 197.95
2020-01-01 08:40:38 | 57.95
2020-01-01 08:43:26 | 160.00

Note that I shift the time by one second with + '1 second', since I only want data strictly before a particular second; without this shift, on-the-second data would be counted as part of the last price.
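
To illustrate the shift with a made-up reading at 08:39:23.450 (the values are only for the example):

-- With the shift, the reading lands in the bucket labelled with the next second,
-- so the bucket 08:39:24 contains only data strictly before 08:39:24.
select time_bucket('1 second', timestamp '2020-01-01 08:39:23.450' + '1 second');
-- => 2020-01-01 08:39:24

-- Without the shift it stays in its own second's bucket.
select time_bucket('1 second', timestamp '2020-01-01 08:39:23.450');
-- => 2020-01-01 08:39:23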


  2. creating a table with timestamps per second
    select 
time_bucket_gapfill('1 second', time) as per_second
from table
where time > <date_start> and time < <date_end>
group by per_second
order by per_second;

Here I produce a table where each row has per second timestamps.

e.g.

per_second
2020-01-01 00:00:00.000
2020-01-01 00:00:01.000
2020-01-01 00:00:02.000
2020-01-01 00:00:03.000
2020-01-01 00:00:04.000
2020-01-01 00:00:05.000

  3. join them together and add a value_partition column
select
per_second,
last_value,
sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
from
(
select
time_bucket('1 second', time + '1 second') as interval,
last(val, db_id) as last_value
from table
where time > <date_start> and time < <date_end>
group by interval
) a
right join
(
select
time_bucket_gapfill('1 second', time) as per_second
from table
where time > <date_start> and time < <date_end>
group by per_second
) b
on a.interval = b.per_second

Inspired by this answer, the goal is to have a counter (value_partition) that increments only if the value is not null.

e.g.

per_second              last_value value_partition
2020-01-01 00:00:00.000 NULL 0
2020-01-01 00:00:01.000 15.82 1
2020-01-01 00:00:02.000 NULL 1
2020-01-01 00:00:03.000 NULL 1
2020-01-01 00:00:04.000 NULL 1
2020-01-01 00:00:05.000 NULL 1
2020-01-01 00:00:06.000 NULL 1
2020-01-01 00:00:07.000 NULL 1
2020-01-01 00:00:08.000 NULL 1
2020-01-01 00:00:09.000 NULL 1
2020-01-01 00:00:10.000 15.72 2
2020-01-01 00:00:10.000 14.67 3


  4. filling in the null values
select
per_second,
first_value(last_value) over (partition by value_partition order by per_second) as latest_value
from
(
select
per_second,
last_value,
sum(case when last_value is null then 0 else 1 end) over (order by per_second) as value_partition
from
(
select
time_bucket('1 second', time + '1 second') as interval,
last(val, db_id) as last_value
from table
where time > <date_start> and time < <date_end>
group by interval
) a
right join
(
select
time_bucket_gapfill('1 second', time) as per_second
from table
where time > <date_start> and time < <date_end>
group by per_second
) b
on a.interval = b.per_second
) as q

This final step brings everything together: it takes advantage of the value_partition column and fills in the null values accordingly.

e.g.

per_second              latest_value
2020-01-01 00:00:00.000 NULL
2020-01-01 00:00:01.000 15.82
2020-01-01 00:00:02.000 15.82
2020-01-01 00:00:03.000 15.82
2020-01-01 00:00:04.000 15.82
2020-01-01 00:00:05.000 15.82
2020-01-01 00:00:06.000 15.82
2020-01-01 00:00:07.000 15.82
2020-01-01 00:00:08.000 15.82
2020-01-01 00:00:09.000 15.82
2020-01-01 00:00:10.000 15.72
2020-01-01 00:00:10.000 14.67


