Understanding Bitmap Indexes in Postgresql

Understanding bitmap indexes in postgresql

The bitmap of pages is created dynamically for each query. It is not cached or re-used, and is discarded at the end of the bitmap index scan.

It doesn't make sense to create the page bitmap in advance because its contents depend on the query predicates.

Say you're searching for x=1 and y=2. You have b-tree indexes on x and y. PostgreSQL doesn't combine x and y into a bitmap then search the bitmap. It scans index x for the page address of all pages with x=1 and makes a bitmap where the pages that might contain x=1 are true. Then it scans y looking for the page addresses where y might equal 2, making a bitmap from that. Then it ANDs them to find pages where both x=1 and y=2 might be true. Finally, it scans the table its self, reading only the pages that might contain candidate values, reading each page and keeping only the rows where x=1 and y=2.

Now, if you're looking for something like a cached, pre-built bitmap index, there is such a thing in PostgreSQL 9.5: BRIN indexes. These are intended for very large tables, and provide a way to find ranges of the table that can be skipped over because they're known not to contain a desired value.

When does PostgreSQL automatically create a Bitmap index for a table?

Postgres does not have "Bitmap Indexes".

What you see there is an "index scan" that uses a bitmap while scanning the index.

Tom Lane's answer on the mailing list explains it quite well:

A plain Index Scan fetches one tuple-pointer at a time from the index, and immediately visits that tuple in the table. A bitmap scan fetches all the tuple-pointers from the index in one go, sorts them using an in-memory "bitmap" data structure, and then visits the table tuples in physical tuple-location order.

See also this question https://dba.stackexchange.com/questions/119386 for a more detailed explanation.

Reason to perform Bitmap index scan to Index-only scan?

Until the table has been vacuumed, it would be an index-only scan in name only. Each tuple would have to be verified in the heap. The planner knows that, and penalizes the index only scan accordingly. Once the table has been vacuumed and all the pages set to "all visible", then the planner starts preferring the index-only scan.

Why is Bitmap Scan faster than Index Scan when fetching a moderately large percentage of the table in PostgreSQL?

The backbone of Postgres storage is made up of data pages of 8 kilobytes in a typical installation. Each data page typically holds many tuples. Read the details of physical storage in the manual.

The "bitmap" in a bitmap scan is a way to collect tuple pointers in buckets of data pages. Index sort order is necessarily lost in this process in favor of physical sort order. In "lossy mode" (which only occurs if the result is so huge or workmem so small that even the tiny bitmap won't fit) only block numbers are kept and corresponding tuple indexes discarded.

After that, each data page is only visited once from storage to extract (possibly) multiple tuples and in physical sequence, which also matters for some types of storage. In lossy mode, tuples from each identified page have to be filtered by rechecking the index condition; else tuples can be retrieved directly using collected tuple indexes.

In an index scan, each page may have to be visited multiple times if multiple tuples end up to be stored in the same data page. The actual process is even more sophisticated. Related:

  • Postgres not using index when index scan is much better option
  • Index usage on a temporary table
  • “Recheck Cond:” line in query plans with a bitmap index scan
  • How do I decompose ctid into page and row numbers?
  • Keep PostgreSQL from sometimes choosing a bad query plan

To your questions:

  1. The sorting of the index is lost due to collecting hits and reading them out data page by data page.

  2. Hence, the result has to be sorted again, in an added sort step - if the sort order is required (by ORDER BY, for instance).

  3. The number of data pages that have to be read is the most prominent factor for overall performance. Bitmap index scans reduce that number to a minimum. With faster storage, the benefit of bitmap index scans gets smaller. That's why accurate cost settings are crucial for the query planner to make good decisions.

Is it possible to use bitmap index with more than two possible values?

PostgreSQL does not have a bitmap index, it can do bitmap index scans over regular B-tree indexes.

For that it is not important how many values the indexed column (or columns) can have.

This is how it works:

  • The index is scanned for the search condition.

  • Rather than visiting the table for each row found, PostgreSQL builds a bitmap. This bitmap normally has one bit per table row, and the rows are sorted in physical address (ctid) order. The value of the bit indicates if that row matches the search condition or not (which has nothing to do with the value range of the indexed columns).

    If work_mem is too small to contain a bitmap with one bit per row, PostgreSQL degrades to storing one bit per 8KB page. This shows up as “lossy” entries in the EXPLAIN (ANALYZE) output and will lead to false positive hits in the next step which affects performance.

  • During a second step, the bitmap heap scan, PostgreSQL visits the table and fetches (and re-checks, if necessary) all rows that show a hit in the bitmap.

The advantages of a bitmap index scan are:

  • Even if many rows are selected, PostgreSQL has to visit each table block only once, and the rows are visited in physical order.

  • Several bitmap index scans on the same table can be combined with a “bitmap AND” or a “bitmap OR” before the table is scanned. This can handle ORs efficiently and combine several conditions with AND if each of the conditions alone is not selective enough to warrant an index scan.

What is a bitmap index?

"Bitmap index scan"

Postgres doesn't have "bitmap indexes" per se. A "bitmap index scan" is an index access method that's allowed for certain index types (including default btree indexes). It's especially useful to combine multiple index lookups. The manual:

An index access method can support "plain" index scans, "bitmap" index scans, or both.

You can disable bitmap-scanning (for debugging purposes only!) by setting:

SET enable_bitmapscan = FALSE;

Optimize query performance

With long lists, joining to a derived table is often faster than a lengthy IN expression. You can use VALUES or unnest() for that purpose. Or even a temporary table, possibly with indexes. See:

  • Query table by indexes from integer array
SELECT created_on, sum(pain) AS sum_pain
FROM unnest('{11,12,33,…,5351}'::int[]) AS f(fixin_id)
JOIN bug_snapshots USING (fixin_id)
WHERE status_id IN (2, 7, 5, 3)
AND created_on >= '2013-10-08 16:42:26.994994-0700'::timestamptz
AND created_on <= '2013-11-07 15:42:26.994994-0800'::timestamptz
AND pain < 999
GROUP BY created_on
ORDER BY created_on;

A partial multicolumn index would probably help (a lot). That depends on details like data distribution, load, stable query conditions, etc. Most importantly, the selectivity of WHERE expressions: a partial index typically only makes sense if many or most rows are excluded. Something like:

CREATE INDEX bug_snapshots_part_idx ON bug_snapshots (fixin_id, created_on, pain)
WHERE status_id IN (2, 7, 5, 3)
AND pain < 999;

The sequence of columns in the index matters. That is also true for your primary key, btw, which implements another multicolumn index. See:

  • Is a composite index also good for queries on the first field?

db<>fiddle here

sqlfiddle

Performance testing in fiddles is hardly reliable. Run your own tests! Also there were many improvements to Postgres since 2013, when this answer was written!

timestamp [without time zone]

One more thing: bug_snapshots.created_on is type timestamp. Timestamps are interpreted according to your current time zone setting.

But in the query you try to compare to literals with time zone (timestamptz). This would work with an explicit cast:

WHERE  created_on >= '2013-10-08 16:42:26.994994-0700'::timestamptz

Your literal would be cast do timestamptz and translated to your local time zone accordingly. However, since you don't provide the data type, Postgres casts your literal to the matching type timestamp (not timestamptz) ignoring the time zone offset. Most probably not your intention!

Consider this test:

SELECT min(created_on), max(created_on)
FROM bug_snapshots
WHERE created_on >= '2013-10-08 16:42:26.994994-0700'
AND created_on <= '2013-11-07 15:42:26.994994-0800'

See:

  • Ignoring time zones altogether in Rails and PostgreSQL

What is the difference between Index-Only and Bitmap Index Scan in PostgreSQL?

The Bitmap Index Scan happens when the result set will have a high selectivity rate with respect to the search conditions (i.e., there is a high percentage of rows that satisfy the search criteria). In this case, the planner will plan to scan the entire index, forming a bitmap of which pages on disk to pull the data out from (which happens during the Bitmap Heap Scan step). This is better than a Sequential Scan because it only scans the relevant pages on disk, skipping the pages that it knows relevant data does not exist. Depending on the statistics available to the optimizer, it may not be advantageous to do an Index Scan or an Index-Only Scan, but it is still better than a Sequential Scan.

To complete the answer to the question, an Index-Only Scan is a scan of the index that will pull the relevant data without having to visit the actual table. This is because the relevant data is already in the index. Take, for example, this table:

postgres=# create table foo (id int primary key, name text);
CREATE TABLE
postgres=# insert into foo values (generate_series(1,1000000),'foo');
INSERT 0 1000000

There is an index on the id column of this table, and suppose we call the following query:

postgres=# EXPLAIN ANALYZE SELECT * FROM foo WHERE id < 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Index Scan using foo_pkey on foo (cost=0.42..10.25 rows=104 width=8) (actual time=0.012..1.027 rows=99 loops=1)
Index Cond: (id < 100)
Planning Time: 0.190 ms
Execution Time: 2.067 ms
(4 rows)

This query results in an Index scan because it scans the index for the rows that have id < 100, and then visits the actual table on disk to pull the other columns included in the * portion of the SELECT query.

However, suppose we call the following query (notice SELECT id instead of SELECT *):

postgres=# EXPLAIN ANALYZE SELECT id FROM foo WHERE id < 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Index Only Scan using foo_pkey on foo (cost=0.42..10.25 rows=104 width=4) (actual time=0.019..0.996 rows=99 loops=1)
Index Cond: (id < 100)
Heap Fetches: 99
Planning Time: 0.098 ms
Execution Time: 1.980 ms
(5 rows)

This results in an Index-only scan because only the id column is requested, and that is included (naturally) in the index, so there's no need to visit the actual table on disk to retrieve anything else. This saves time, but its occurrence is very limited.

To answer your question about limiting to 20 results, the limiting occurs after the Bitmap Index Scan has occurred, so the runtime will still be the same whether you limit to 20, 40, or some other value. In the case of an Index/Index-Only Scan, the executor will stop scanning after it has acquired enough rows as specified by the LIMIT clause. In your case, with the Bitmap Heap Scan, this isn’t possible

What is a Bitmap heap scan in a query plan?

The best explanation comes from Tom Lane, which is the algorithm's author unless I'm mistaking. See also the wikipedia article.

In short, it's a bit like a seq scan. The difference is that, rather than visiting every disk page, a bitmap index scan ANDs and ORs applicable indexes together, and only visits the disk pages that it needs to.

This is different from an index scan, where the index is visited row by row in order -- meaning a disk page may get visited multiple times.


Re: the question in your comment... Yep, that's exactly it.

An index scan will go through rows one by one, opening disk pages again and again, as many times as necessary (some will of course stay in memory, but you get the point).

A bitmap index scan will sequentially open a short-list of disk pages, and grab every applicable row in each one (hence the so-called recheck cond you see in query plans).

Note, as an aside, how clustering/row order affects the associated costs with either method. If rows are all over the place in a random order, a bitmap index will be cheaper. (And, in fact, if they're really all over the place, a seq scan will be cheapest, since a bitmap index scan is not without some overhead.)



Related Topics



Leave a reply



Submit