Postgres Not Using Index When Index Scan Is Much Better Option

Why is this query not using an index-only scan in PostgreSQL


autovacuum is not running

PostgreSQL index-only scans require some information about which rows are "visible" to current transactions - i.e. not deleted, not old versions of updated rows, and not uncommitted inserts or new versions of updates.

This information is kept in the "visibility map".

The visibility map is maintained by VACUUM, usually in the background by autovacuum workers.

If autovacuum is not keeping up with write activity well, or if autovacuum has been disabled, then index-only scans probably won't be used because PostgreSQL will see that the visibility map does not have data for enough of the table.
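
A quick way to see how much of the table the visibility map currently covers is to compare relallvisible with relpages in pg_class (the table name my_table is just a placeholder):

  -- fraction of the table's pages marked all-visible;
  -- a low ratio makes index-only scans unattractive to the planner
  SELECT relpages, relallvisible
  FROM pg_class
  WHERE relname = 'my_table';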

Turn autovacuum back on. Then manually VACUUM the table to bring it up to date immediately.
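
For example, assuming autovacuum was disabled globally in postgresql.conf (and again using my_table as a placeholder):

  -- re-enable autovacuum and reload the configuration
  ALTER SYSTEM SET autovacuum = on;
  SELECT pg_reload_conf();

  -- then bring the visibility map and statistics up to date right away
  VACUUM ANALYZE my_table;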

BTW, in addition to visibility-map information, VACUUM (including autovacuum) can also set hint bits, which can make SELECTs of recently inserted/updated data faster.

Autovacuum also maintains table statistics that are vital for effective query planning. Turning it off will result in the planner using increasingly stale information.

It is also absolutely vital for preventing an issue called transaction-ID wrap-around, which is an emergency condition that can cause the whole database to go into emergency shut-down until a time-consuming whole-table VACUUM is performed.

Do not turn autovacuum off.

As for why it's sometimes using an index-only scan and sometimes not, a few possibilities:

  • The current random_page_cost setting makes it think that random I/O will be slower than it really is, so it tries harder to avoid it (see the quick check after this list);

  • The table statistics, especially the limit values, are outdated. So it doesn't realise that there's a good chance the value being looked for will be discovered quickly in an index-only scan;

  • The visibility map is outdated, so it thinks an index-only scan will find too many values that will require heap fetches to check, making it slower than other methods especially if it thinks the proportion of values likely to be found is high.
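
As a quick check of the first point - the query here is a placeholder, since the original isn't shown - you can see whether a lower random_page_cost for the current session flips the plan, which is common on SSD storage:

  SHOW random_page_cost;            -- the default is 4.0, tuned for spinning disks
  SET random_page_cost = 1.1;       -- session-only; a common value for SSDs
  EXPLAIN SELECT * FROM my_table WHERE id = 42;   -- placeholder: re-run your own EXPLAIN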

Most of these issues are fixed by leaving autovacuum alone. In fact, on frequently appended tables you should set autovacuum to run much more often than the default so it updates the limit statistics more. (Doing that helps work around PostgreSQL's planner issues with tables where the most frequently queried data is the most recently inserted with an incrementing ID or timestamp that means the most-desired values are never in the table histograms and limit stats).

Go turn autovacuum back on - then turn it up.
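
For one busy, append-mostly table, "turning it up" might look like this (the table name and thresholds are illustrative, not universal recommendations):

  -- vacuum/analyze after ~2% of the table changes,
  -- instead of the defaults of 20% / 10%
  ALTER TABLE my_table SET (
      autovacuum_vacuum_scale_factor  = 0.02,
      autovacuum_analyze_scale_factor = 0.02
  );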

PostgreSQL query not using index in production

Disclaimer

I have used PostgreSQL very little. I'm answering based on my knowledge of SQL Server index usage and execution plans. I ask the PostgreSQL gods for mercy if I get something wrong.

Query Optimizers are Dynamic

You said your query plan has changed from your development to production environments. This is to be expected. Query optimizers are designed to generate the optimum execution plan based on the current data conditions. Under different conditions the optimizer may decide it is more efficient to use a table scan vs an index scan.

When would it be more efficient to use a table scan vs an index scan?

SELECT A, B
FROM someTable
WHERE A = 'SOME VALUE'

Let's say you have a non-clustered index on column A. In this case you are filtering on column A, which could potentially take advantage of the index. This would be efficient if the index is selective enough - basically, how many distinct values make up the index? The database keeps statistics on this selectivity info and uses these statistics when calculating costs for execution plans.
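
In PostgreSQL you can look at that selectivity information yourself through the pg_stats view - a minimal sketch for the someTable / A example above (unquoted identifiers are folded to lower case there):

  -- n_distinct: number of distinct values (negative = fraction of the row count)
  -- most_common_vals / most_common_freqs: what the planner consults for estimates
  SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
  WHERE tablename = 'sometable' AND attname = 'a';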

If you have a million rows in a table, but only 10 possible values for A, then your query would likely return about 100K rows. Because the index is non-clustered, and you are returning a column not included in the index, B, a lookup will need to be performed for each row returned. These look-ups are random-access lookups, which are much more expensive than the sequential reads used by a table scan. At a certain point it becomes more efficient for the database to just perform a table scan rather than an index scan.

This is just one scenario, there are many others. It's hard to know without knowing more about what your data is like, what your indexes look like and how you are trying to access the data.

To answer the original question:

Would PostgreSQL abstain from using indexes if they (or the table) are too big? No. It is more likely that, given the way you are accessing the data, it is less efficient for PostgreSQL to use the index than to use a table scan.

The PostgreSQL FAQ touches on this very subject (see: Why are my queries slow? Why don't they use my indexes?): https://wiki.postgresql.org/wiki/FAQ#Why_are_my_queries_slow.3F_Why_don.27t_they_use_my_indexes.3F

Postgres not using index on two comparison operators

There is not enough info in the question to really know for sure but here are some tips:

The filters you have in the query are very inclusive:
date_of_birth < '2020-01-01' will most likely match all the rows, as there will be only a few 5-month-old babies that own a phone.
phone_number < '989898' will also match most of the rows.

Postgres knows that you are asking it for (almost) the full table, and in this case a seq scan is faster. An index is helpful for picking which pages to read from disk, but there is a cost to using it, so there is no point in using the index if you already know you are going to read all of the pages anyway.

And indeed, here Postgres knows you are reading the full table: (cost=0.00..301.00 rows=1000 width=14), and that is why it chooses a seq scan - it will be faster. If you create a more selective filter like phone_number < '11' (depending on your data distribution, of course!) you should see an index scan.
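
As an illustrative sketch - the table is not named in the question, so people is a placeholder - the same query with a more selective phone_number filter should flip the plan:

  -- very inclusive filters: the planner expects nearly every row, so it picks a seq scan
  EXPLAIN SELECT first_name, last_name
  FROM people
  WHERE date_of_birth < '2020-01-01' AND phone_number < '989898';

  -- a much more selective filter: expect an index scan instead (data-dependent)
  EXPLAIN SELECT first_name, last_name
  FROM people
  WHERE date_of_birth < '2020-01-01' AND phone_number < '11';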

Postgres keeps internal statistics about each column; when creating an execution plan it estimates the number of rows the query will return. The statistics are not perfect, and Postgres assumes that columns are independent. This is by design, to provide the best trade-off between planning time and planning power. So if it estimates that filter1 matches a fraction 0.1 of the rows and filter2 matches a fraction 0.01, it will assume that the number of rows returned is 0.1 * 0.01 * number_of_rows. There are also a number of other statistics available and used. Based on this, Postgres decides whether it is more beneficial to do a seq scan or to use an index (and which index).

In this case Postgres has to read the table anyway, because it needs to fetch the first_name and last_name columns from disk - they are not included in the index(es).

A way to get a faster query (depending on your usage pattern!) is to create a covering index. You have 4 columns involved in the query:
first_name, last_name, date_of_birth, phone_number. If you create an index like:
btree (date_of_birth, phone_number, first_name, last_name) Postgres will always be able to run an index-only scan for this query and never have to visit the table itself. But mind that this index can get large, and it will only help if you can fit it in memory, so be careful with that.
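
Written out as DDL (the table name people is an assumption, since the question doesn't give one):

  -- a covering btree: every column the query touches is in the index itself
  CREATE INDEX people_covering_idx
      ON people (date_of_birth, phone_number, first_name, last_name);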

You did not say which Postgres version you are using, but starting with 11 (if I remember correctly - for sure later than 10) you are able to INCLUDE columns in indexes. This is a very cool feature. If you only ever filter on phone number and date of birth, you could do, for example:
btree (date_of_birth, phone_number) INCLUDE (first_name, last_name) and get index-only scans here with a smaller index.
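
Spelled out, again with people as a placeholder table name:

  -- PostgreSQL 11+: key columns are used for filtering, INCLUDE columns are just
  -- stored in the leaf pages so the query can still be answered from the index alone
  CREATE INDEX people_dob_phone_idx
      ON people (date_of_birth, phone_number)
      INCLUDE (first_name, last_name);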

If this filter on phone_number and date_of_birth is a very common one, you can consider creating extended (multi-column) statistics on both columns. That should allow Postgres to create better query plans. It will not change anything in this case, as the plan with the seq scan is already optimal, but it may help with different filter values.
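
A minimal sketch of that (extended statistics exist since PostgreSQL 10; the statistics name and table name are placeholders):

  -- tell the planner to track how dependent these two columns are on each other
  CREATE STATISTICS people_phone_dob_stats (dependencies)
      ON phone_number, date_of_birth
      FROM people;

  ANALYZE people;   -- the new statistics are only populated by ANALYZE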

These last two tips depend on the types of the columns, which were not added to the question:

If you have a column like date_of_birth it may be beneficial to look into a BRIN index.
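
For example (again with people as a placeholder, and keeping in mind a BRIN index only pays off when values in the column roughly follow the physical row order):

  -- a BRIN index is tiny: it stores min/max per block range instead of one entry per row
  CREATE INDEX people_dob_brin ON people USING brin (date_of_birth);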

Also mind that with date/time columns, asking date_of_birth < '2020-01-01' means you are asking for all people born from 2020 back to the beginning of time :) Depending on the column type it MAY be beneficial to provide a lower bound, e.g. date_of_birth < '2020-01-01' AND date_of_birth > '1900-01-01'. But you will need to test this on a large dataset to see if you do see a difference.

How to make postgres not use a particular index?

You can probably disable the index by doing some dummy arithmetic on the indexed column.

 ...AND "chaindata_tokentransfer"."chain_id" + 0 = 1...

If you put that into production, make sure to add a code comment on why you are doing such an odd thing.

I'm curious why it chooses to use that index, despite apparently knowing how astonishingly awful it is. If you show the plan for the query with the index disabled, maybe we could figure that out.

If the dummy arithmetic doesn't work, what you could do is start a transaction, drop the index, execute the query (or just the EXPLAIN of it), then roll back the drop. That is probably not something you want to do often in production (especially since the table will be locked from when the index is dropped until the rollback. Also because you might accidentally commit!), but getting the plan is probably worth doing once.
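
A sketch of that trick - the index name here is hypothetical, and the query is a stand-in for the real one:

  BEGIN;
  DROP INDEX chaindata_tokentransfer_chain_id_idx;   -- hypothetical index name
  EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM chaindata_tokentransfer WHERE chain_id = 1;   -- stand-in query
  ROLLBACK;   -- the index comes back exactly as it was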


