How to Use Time-Series with SQLite, with Fast Time-Range Queries


First solution

The method (2) detailed in the question seems to work well. In a benchmark, I obtained:

  • naive method, without index: 18 MB database, 86 ms query time
  • naive method, with index: 32 MB database, 12 ms query time
  • method (2): 18 MB database, 12 ms query time

The key point here is to use dt as an INTEGER PRIMARY KEY, so that it is the rowid itself (see also Is an index needed for a primary key in SQLite?), stored in a B-tree, with no additional hidden rowid column. We thus avoid an extra index that would map dt => rowid: here, dt is the rowid.

We also use AUTOINCREMENT, which internally creates a sqlite_sequence table that keeps track of the last ID added. This is useful when inserting: since two events can have the same timestamp in seconds (this could happen even with millisecond or microsecond timestamps, as the OS might truncate the precision), we take the maximum of timestamp*10000 and last_added_ID + 1 to make sure the key is unique:

 MAX(?, IFNULL((SELECT seq FROM sqlite_sequence WHERE name = 'data'), 0) + 1)

(The IFNULL guards the very first insert, when sqlite_sequence has no row yet for the table; without it the whole expression would be NULL.)

Code:

import sqlite3, random, time

db = sqlite3.connect('test.db')
db.execute("CREATE TABLE data(dt INTEGER PRIMARY KEY AUTOINCREMENT, label TEXT);")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases by 1 second with probability ~1%
        t += 1
    # IFNULL handles the very first insert, when sqlite_sequence is still empty
    db.execute("INSERT INTO data(dt, label) VALUES "
               "(MAX(?, IFNULL((SELECT seq FROM sqlite_sequence WHERE name = 'data'), 0) + 1), 'hello');",
               (t*10000,))
db.commit()

# t ranges over a ~10,000 second window
t1, t2 = 1600005000*10000, 1600005100*10000  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
    pass
print(time.time() - start)


Using a WITHOUT ROWID table

Here is another method, using WITHOUT ROWID, which gives an 8 ms query time. We have to implement an auto-incrementing ID ourselves, since AUTOINCREMENT is not available for WITHOUT ROWID tables.

WITHOUT ROWID is useful when we want a PRIMARY KEY(dt, another_column1, another_column2, id) and want to avoid an extra rowid column: instead of one B-tree for rowid and another for (dt, another_column1, ...), there is just one.

db.executescript("""
CREATE TABLE autoinc(num INTEGER); INSERT INTO autoinc(num) VALUES(0);

CREATE TABLE data(dt INTEGER, id INTEGER, label TEXT, PRIMARY KEY(dt, id)) WITHOUT ROWID;

CREATE TRIGGER insert_trigger BEFORE INSERT ON data BEGIN UPDATE autoinc SET num=num+1; END;
""")

t = 1600000000
for i in range(1000*1000):
    if random.randint(0, 100) == 0:  # timestamp increases by 1 second with probability ~1%
        t += 1
    db.execute("INSERT INTO data(dt, id, label) VALUES (?, (SELECT num FROM autoinc), ?);", (t, 'hello'))
db.commit()

# t ranges over a ~10,000 second window
t1, t2 = 1600005000, 1600005100  # time range of width 100 seconds (i.e. 1%)
start = time.time()
for _ in db.execute("SELECT 1 FROM data WHERE dt BETWEEN ? AND ?", (t1, t2)):
    pass
print(time.time() - start)


Roughly-sorted UUID

More generally, the problem is linked to having IDs that are "roughly-sorted" by datetime. More about this:

  • ULID (Universally Unique Lexicographically Sortable Identifier)
  • Snowflake
  • MongoDB ObjectId

All these methods use an ID which is:

[---- timestamp ----][---- random and/or incremental ----]
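
A minimal sketch of that layout in Python (the 48/16 bit split is an illustrative assumption, not taken from any of the specs above):

import time, random

def rough_id():
    # high bits: millisecond timestamp; low bits: randomness, so two IDs
    # created in the same millisecond still (almost always) differ
    return (int(time.time() * 1000) << 16) | random.getrandbits(16)

Such IDs sort roughly by creation time, so a B-tree on them supports fast time-range scans.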

How to query rows by intraday time range in SQLite

First, keep the SQLite Date And Time Functions documentation handy.

Something like:

WHERE strftime("%H:%M",yourdate) between "10:00" and "13:00"

should accomplish the task.
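
For instance, from Python, assuming a hypothetical events table with a text datetime column yourdate (both names are placeholders):

import sqlite3

db = sqlite3.connect('test.db')
# yourdate must be in a format the date functions understand,
# e.g. '2023-05-01 12:30:00'
rows = db.execute(
    "SELECT * FROM events "
    "WHERE strftime('%H:%M', yourdate) BETWEEN '10:00' AND '13:00'"
).fetchall()

Note that such a query filters on an expression, so an ordinary index on yourdate cannot serve it; expect a full scan unless you index the expression itself.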

Select data within a time period in SQLite

SELECT TimeStamp,
       strftime('%Y', TimeStamp) AS "Year",
       strftime('%m', TimeStamp) AS "Month",
       strftime('%d', TimeStamp) AS "Day",
       strftime('%M', TimeStamp) AS "Minute",
       strftime('%H', TimeStamp) AS "Hours"
FROM temp
WHERE CAST(strftime('%H', TimeStamp) AS int) BETWEEN 20 AND 21

strftime returns a string, but the comparison is against integers; cast the string to an integer to solve that.

Column of increasing integers: why is an index needed to speed up the query?

The short answer is that SQLite doesn't know that your table is sorted by column t, so it has to scan the whole table to extract the data.

When you add an index on the column, column t is sorted in the index, so SQLite can skip the first million rows and then stream the rows of interest to you. You are extracting 20% of the rows, and the query returns in 15 ms / 73 ms = 21% of the time. The smaller that fraction, the larger the benefit you derive from the index.

If column t is unique, consider using it as the primary key, as you would get the index for "free". If you can bound the number of rows sharing the same t, you can use (t, offset) as the primary key, where offset might be a tinyint. The point is that size(primary key index) + size(t index) would be larger than size((t, offset) index). If t were in ms or ns instead of s, it might be unique in practice, or you could fiddle with it when it was not (and just truncate to second resolution when you need the data).

If you don't need a primary key (as a unique index), leave it out and just keep the non-unique index on t. Without a primary key, you can still identify a unique row by its rowid, or when all the columns together make the row unique. If you create the table WITHOUT ROWID, you can still use LIMIT to operate on identical rows.
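
A sketch of the (t, offset) primary key idea, combined with WITHOUT ROWID (the table and column names here are assumptions, not from the question):

import sqlite3

db = sqlite3.connect('test.db')
db.execute("""
    CREATE TABLE samples(
        t      INTEGER NOT NULL,  -- unix timestamp, second resolution
        offset INTEGER NOT NULL,  -- distinguishes rows sharing the same second
        txt    TEXT,
        PRIMARY KEY (t, offset)
    ) WITHOUT ROWID;
""")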

You could also use data warehouse techniques: if you don't need per-record data, store it at a coarser granularity (one record per minute, or per hour, and GROUP_CONCAT the text column).
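
For example, a per-minute roll-up of the hypothetical samples table sketched above:

db.execute("""
    CREATE TABLE samples_by_minute AS
    SELECT t / 60 AS minute,            -- integer division buckets by minute
           COUNT(*) AS n,
           GROUP_CONCAT(txt) AS txts
    FROM samples
    GROUP BY t / 60;
""")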

Finally, there are databases that are optimized for time-series data. They may, for instance, only allow you to remove the oldest data or append new data, but not make any changes. This lets such a system store the data pre-sorted (MySQL, by the way, calls this feature an index-ordered table). Since the data cannot change, such a database may run-length or delta compress data by column, so that it stores only the differences between rows.

How to select a fixed number of data points spread evenly over a time range

You can use the NTILE() window function to divide the result set into 50 groups based on the column Timestamp, and then, with aggregation, pick one row from each group with the MAX() or MIN() aggregate function:

WITH cte AS (
    SELECT *, NTILE(50) OVER (ORDER BY Timestamp) nt
    FROM Battery2
    WHERE Timestamp >= datetime('now', '-1 hour')
)
SELECT MAX(Timestamp) AS Timestamp, Voltage
FROM cte
GROUP BY nt;
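
Window functions such as NTILE() require SQLite 3.25 or newer. A minimal sketch of running this from Python, assuming the Battery2 table from the question (the database file name is made up):

import sqlite3

db = sqlite3.connect('battery.db')
rows = db.execute("""
    WITH cte AS (
        SELECT *, NTILE(50) OVER (ORDER BY Timestamp) nt
        FROM Battery2
        WHERE Timestamp >= datetime('now', '-1 hour')
    )
    SELECT MAX(Timestamp) AS Timestamp, Voltage
    FROM cte
    GROUP BY nt;
""").fetchall()  # at most 50 rows, spread evenly over the last hour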

Can this SQLite query be made much faster?

You need a clustered index or, if you are using a version of SQLite which doesn't support one, a covering index.

SQLite 3.8.2 and above

Use this in SQLite 3.8.2 and above:

create table recording (
    camera_id integer references camera (id) not null,

    sample_file_bytes integer not null check (sample_file_bytes > 0),

    -- The starting time of the recording, in 90 kHz units since
    -- 1970-01-01 00:00:00 UTC.
    start_time_90k integer not null check (start_time_90k >= 0),

    -- The duration of the recording, in 90 kHz units.
    duration_90k integer not null
        check (duration_90k >= 0 and duration_90k < 5*60*90000),

    video_samples integer not null check (video_samples > 0),
    video_sync_samples integer not null check (video_sync_samples > 0),
    video_sample_entry_id integer references video_sample_entry (id),

    -- here is the magic
    primary key (camera_id, start_time_90k)
) WITHOUT ROWID;

Earlier Versions

In earlier versions of SQLite, you can use this sort of thing to create a covering index. It should allow SQLite to pull the data values from the index, avoiding a separate page fetch for each row:

create index recording_camera_start on recording (
    camera_id, start_time_90k,
    sample_file_bytes, duration_90k, video_samples, video_sync_samples,
    video_sample_entry_id
);

Discussion

The cost is likely to be I/O (despite your saying it wasn't), because I/O still consumes CPU: data must be copied to and from the bus.

Without a clustered index, rows are inserted in rowid order, which need not be any sensible order for your queries. That means that for each 26-byte row you request, the system may have to fetch a 4 KB page from the SD card, which is a lot of overhead.

With a limit of 8 cameras, even a simple clustered index on id, ensuring rows appear on disk in insertion order, would probably give about a 10x speed increase, since each fetched page would contain the next 10-20 rows that are going to be required.

A clustered index on both camera and time should ensure that each page fetched contains 100 or more rows.
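
As an illustration of the access pattern this clustered key serves (the query and parameter values are my assumptions, not from the question):

import sqlite3

db = sqlite3.connect('recordings.db')  # hypothetical database file
# All rows for one camera within one time window sit on adjacent pages,
# because the table itself is ordered by (camera_id, start_time_90k).
rows = db.execute("""
    SELECT start_time_90k, duration_90k, sample_file_bytes
    FROM recording
    WHERE camera_id = ?
      AND start_time_90k BETWEEN ? AND ?
""", (1, 0, 90000 * 3600)).fetchall()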

Match whole day in SQLite Timestamp

When doing date-range queries on a field that has a time component, always do this:

where YourDateField >= the first day of your range
and YourDateField < the day after the last day of your range.

Everything else is a detail. As long as you have a date or datetime datatype, don't worry about the format; it only matters for display.

It may look inefficient because it takes some extra typing, but the queries will run faster than ones that wrap the field in a function, like this:

where somefunction(YourDateField) = something.
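
For example, to match the whole day of 2023-05-01 (the table and column names, and the date, are placeholders):

import sqlite3

db = sqlite3.connect('test.db')
rows = db.execute(
    "SELECT * FROM events "
    "WHERE YourDateField >= ? AND YourDateField < ?",
    ('2023-05-01', '2023-05-02'),  # half-open range covering all of 2023-05-01
).fetchall()

Because the filter is on the bare column, an index on YourDateField can serve this query.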

SQLite Time range filter?

You can try strftime('%H:%M', time_string).

So with your code:

WHERE strftime('%H:%M', initial_time) >= strftime('%H:%M', my_init_value)
  AND strftime('%H:%M', final_time) <= strftime('%H:%M', my_final_value)

See the SQLite Date And Time Functions documentation for more information.


