Sqlite: Count Slow on Big Tables

Why is SQLite taking forever on a simple count query?

I finally figured out the problem. Running a VACUUM command on the database solved the issue. I had run .dbinfo to confirm that the page_size multiplied by the number of pages added up to approximately the file size. That, combined with the fact that I hadn't deleted anything from the database (only inserted), led me to assume that I didn't need to vacuum (or de-fragment).

But it looks like the re-organization that vacuum does also makes a dramatic difference in the speed of the count queries (which now finish in milliseconds as I've seen reported elsewhere).
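
For reference, here is the sequence described above as it would look in the sqlite3 shell; .dbinfo reports the page size and page count, and my_table stands in for the actual table name:

.dbinfo
VACUUM;
SELECT count(*) FROM my_table;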

Count on join of big tables with conditions is slow

Your query, rewritten and 100% equivalent:

SELECT count(*)
FROM product_categories pc
JOIN customers c USING (organization_id)
JOIN total_sales ts ON ts.customer_id = c.id
JOIN performance_analyses pa ON pa.total_sales_id = ts.id
WHERE pc.organization_id = 3
AND c.active -- boolean can be used directly
AND c.visible
AND ts.product_category_id = pc.id
AND ts.period_id = 193
AND pa.size > 0;

Another answer advises moving all conditions into JOIN clauses and ordering the tables in the FROM list. That may matter for some other RDBMS with a comparatively primitive query planner. While it doesn't hurt in Postgres either, it also has no effect on performance for your query, assuming the default server configuration. The manual:

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN)
is semantically the same as listing the input relations in FROM, so it
does not constrain the join order.

Bold emphasis mine. There is more, read the manual.

The key setting is join_collapse_limit (default 8). The Postgres query planner will rearrange your 4 tables in whatever way it expects to be fastest, no matter how you order the tables or whether you write conditions as WHERE or JOIN clauses. No difference whatsoever. (The same is not true for some other join types, which cannot be rearranged freely.)

The important point is that these different join possibilities give
semantically equivalent results but might have hugely different
execution costs. Therefore, the planner will explore all of them to
try to find the most efficient query plan.
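
For reference, you can inspect the setting and, purely for experimentation, pin the written join order; a sketch, not something this query needs:

SHOW join_collapse_limit;      -- default: 8
SET join_collapse_limit = 1;   -- session only: join in the order written
RESET join_collapse_limit;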

Related:

  • Sample Query to show Cardinality estimation error in PostgreSQL
  • A: Slow fulltext search due to wildly inaccurate row estimates

Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side. And columns of the subquery are not visible for the rest of the query. A join can multiply rows with duplicate values and columns are visible.
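
A minimal illustration of that difference, using hypothetical tables t(x) and dup(x, y), where dup holds two rows with x = 1 and t holds one:

SELECT t.*        FROM t WHERE x IN (SELECT x FROM dup);  -- one row; dup's columns are not visible
SELECT t.*, dup.y FROM t JOIN dup USING (x);              -- two rows; dup.y is visible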

Your simple subqueries dig up a single unique column in both cases, so there is no effective difference in this case - except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.

Your query

Indexes

product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.

customers has 6,970 rows. Indexes start to make sense. But your query uses 4,988 of them according to the EXPLAIN output. Only an index-only scan on an index much less wide than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index:

CREATE INDEX index_customers_on_organization_id ON customers (organization_id, id)
WHERE active AND visible;

I appended id to allow index-only scans. The column is otherwise useless in the index for this query.

total_sales has 7,104,441 rows. Indexes are very important. I suggest:

CREATE INDEX index_total_sales_on_product_category_customer_id
ON total_sales (period_id, product_category_id, customer_id, id);

Again, aiming for an index-only scan. This is the most important one.

You can delete the completely redundant index index_total_sales_on_product_category_id.
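
If you decide to drop the two indexes flagged above, the statements are simply:

DROP INDEX index_product_categories_on_organization_id;
DROP INDEX index_total_sales_on_product_category_id;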

performance_analyses has 1,012,346 rows. Indexes are very important.
I would suggest another partial index with the condition size > 0:

CREATE INDEX index_performance_analyses_on_status_id
ON performance_analyses (total_sales_id)
WHERE size > 0;

However:

Rows Removed by Filter: 0

It seems like this condition serves no purpose. Are there any rows where size > 0 is not true?

After creating these indexes you need to ANALYZE the tables.
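
In its simplest form, that's just one statement per table touched above:

ANALYZE customers;
ANALYZE total_sales;
ANALYZE performance_analyses;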

Table statistics

Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loop joins chosen here would be fine for the small row counts the planner expected, but not for the numbers actually returned. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to review your settings for autovacuum, and probably also per-table settings for your two big tables, performance_analyses and total_sales.

According to your comment, you already ran VACUUM and ANALYZE, which made the query slower. That doesn't make a lot of sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Else try pg_repack.

With all the fishy statistics and bad plans, I would consider running a complete vacuumdb -fz yourdb on your DB. That rewrites all tables and indexes in pristine condition, but it's not something to use on a regular basis. It's also expensive and will lock your DB for an extended period of time!
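
The per-table equivalent from inside psql would be the following (with the shell command above handling the whole database at once):

VACUUM FULL ANALYZE total_sales;
VACUUM FULL ANALYZE performance_analyses;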

While you are at it, have a look at the cost settings of your DB as well.
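
For instance, random_page_cost (default 4.0) is commonly lowered for SSD storage; purely as a sketch of how to inspect and adjust such a setting (the value below is an assumption, not a measured recommendation):

SHOW random_page_cost;
ALTER DATABASE yourdb SET random_page_cost = 1.1;  -- or SET random_page_cost = 1.1; for the current session only
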
Related:

  • Keep PostgreSQL from sometimes choosing a bad query plan
  • Postgres Slow Queries - Autovacuum frequency

Very slow queries in a large database even with indexes

SQLite might be choosing the wrong index. I suspect that level has only a small number of distinct values, which usually makes it a poor choice for an index on its own. Try making it a composite index:

CREATE INDEX users_level_index_login on users(level, login);
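
To check which index SQLite actually picks before and after the change, EXPLAIN QUERY PLAN helps; the query shape below is an assumption, since the original query isn't shown in the excerpt:

EXPLAIN QUERY PLAN
SELECT * FROM users WHERE level = 2 AND login = 'someone';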

Is count(*) constant time in SQLite, and if not what are alternatives?

SQLite has a special optimization for COUNT(*) without a WHERE clause, where it goes through the table's B-tree pages and counts entries without actually loading the records.
However, this still requires that all the table's data (except overflow pages for large records) is visited, so the runtime is still O(n).

SQLite does not store a separate record count in the database because that would make all changes slower.
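
As for alternatives: one workaround that is often suggested (it is not part of the answer above) is to maintain the count yourself in a one-row side table, kept current by triggers; a minimal sketch for a hypothetical table t:

CREATE TABLE row_count (n INTEGER NOT NULL);
INSERT INTO row_count (n) SELECT count(*) FROM t;

CREATE TRIGGER t_count_ins AFTER INSERT ON t BEGIN
  UPDATE row_count SET n = n + 1;
END;

CREATE TRIGGER t_count_del AFTER DELETE ON t BEGIN
  UPDATE row_count SET n = n - 1;
END;

-- Reading the count is now O(1), at the cost of slightly slower writes:
SELECT n FROM row_count;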

Slow Sqlite Select behaviour when trying to select several rows

I think that I've found a solution! I've created an additional index:

create index if not exists ab_b_idx on t1(a, b);

After that, selecting both the a and b columns runs as fast as selecting only the a column.

I found the explanation for this behaviour in the documentation, under Covering indices:

If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.

Update: Search on 100 000 000 rows in table t1 took 550ms.


