How to Convert a Bunch of Boolean Columns to a Single Bitmap in Postgresql

Can I convert a bunch of boolean columns to a single bitmap in PostgreSQL?

For a bitmask, the type bitstring would be the better choice. Could look like this then:

SELECT BoolD::int::bit
|| BoolC::int::bit
|| BoolB::int::bit
|| BoolA::int::bit
FROM tbl;

true converts to 1, false to 0. You can simply concatenate bits to a bitstring.

Cast bit(n) to integer

It seems you need an integer as result - there is a simple & fast way:

SELECT (BoolD::int::bit
|| BoolC::int::bit
|| BoolB::int::bit
|| BoolA::int::bit)::bit(4)::int
FROM tbl;

Be sure to read the fine print in the chapter "Bit String Functions and Operators" of the manual.


I came up with two more ideas and put together a quick test / reference with 10k rows to sum it all up.

Test setup:

CREATE TEMP TABLE t (boola bool, boolb bool, boolc bool, boold bool);
INSERT INTO t
SELECT random()::int::bool
, random()::int::bool
, random()::int::bool
, random()::int::bool
FROM generate_series(1,10000);

Demo:

SELECT  CASE WHEN boold THEN 1 ELSE 0 END
+ (CASE WHEN boolc THEN 1 ELSE 0 END << 1)
+ (CASE WHEN boolb THEN 1 ELSE 0 END << 2)
+ (CASE WHEN boola THEN 1 ELSE 0 END << 3) AS andriy

, boold::int
+ (boolc::int << 1)
+ (boolb::int << 2)
+ (boola::int << 3) AS mike

, (boola::int::bit
|| boolb::int::bit
|| boolc::int::bit
|| boold::int::bit)::bit(4)::int AS erwin1

, boold::int
| (boolc::int << 1)
| (boolb::int << 2)
| (boola::int << 3) AS erwin2

, (((
boola::int << 1)
| boolb::int << 1)
| boolc::int << 1)
| boold::int AS erwin3
FROM t
LIMIT 15;

You could also use | (bitwise OR) instead of the + operator.

Individual test runs show basically the same performance for all five methods.

Performance Tuning: Create index for boolean column

For a query like this, a partial index covering only unsynced rows would serve best.

CREATE INDEX ON tbl (id) WHERE sync_done = FALSE;

However, for a use case like this, other synchronization methods may be preferable to begin with:

  • Have a look at LISTEN / NOTIFY.
  • Or use a trigger in combination with dblink or a foreign data wrapper like postgres_fdw (preferably).
  • Or one of the many available replication methods.

    Streaming Replication was added with Postgres 9.0 and has become increasingly popular.

Boolean column in index and filter sections of explain in postgres

UPDATED

I cannot see anything wrong with this EXPLAIN except for the fact the you are indexing boolean column (obviously, column has low cardinality field). It might be beneficial to use Partial Index with definition something like:

CREATE INDEX ON yourtable WHERE is_woman = FALSE;

As for the question itself, you have a query with WHERE ... condition. Postgres planner/optimizer decided to use woman_idx index scan instead of sequential one - Index Cond is indicated for index scan.

If you can see Filter statement it means that the plan node checks the condition for each row it scans (in our case each woman_idx scan), and outputs only the ones that pass the condition. For details check EXPLAIN documentation.

Index on boolean columns in time dimension

If is_current_month = true represents more than a few percent of the rows then the index will not be used. 7,000 rows is too few to even bother.

Is there any difference between integer and bit(n) data types for a bitmask?

If you only have a few variables I would consider keeping separate boolean columns.

  • Indexing is easy. In particular also indexes on expressions and partial indexes.
  • Conditions for queries are easy to write and read and meaningful.
  • A boolean column occupies 1 byte (no alignment padding). For only a few variables this occupies the least space.
  • Unlike other options boolean columns allow NULL values for individual bits if you should need that. You can always define columns NOT NULL if you don't.

If you have more than a hand full variables but no more than 32, an integer column may serve best. (Or a bigint for up to 64 variables.)

  • Occupies 4 bytes on disk (may require alignment padding, depending on preceding columns).
  • Very fast indexing for exact matches ( = operator).
  • Handling individual values may be slower / less convenient than with varbit or boolean.

With even more variables, or if you want to manipulate the values a lot, or if you don't have huge tables or disk space / RAM is not an issue, or if you are not sure what to pick, I would consider bit(n) or bit varying(n) (short: varbit(n).

  • Occupies at least 5 bytes (or 8 for very long strings) plus 1 byte for each group of 8 bits (rounded up).
  • You can use bit string functions and operators directly, and some standard SQL functions as well.

For just 3 bits of information, individual boolean columns get by with 3 bytes, an integer needs 4 bytes (maybe additional alignment padding) and a bit string 6 bytes (5 + 1).

For 32 bits of information, an integer still needs 4 bytes (+ padding), a bit string occupies 9 bytes for the same (5 + 4) and boolean columns occupy 32 bytes.

To optimize disk space further you need to understand the storage mechanisms of PostgreSQL, especially data alignment. More in this related answer.

This answer on how to transform the types boolean, bit(n) and integer may be of help, too.

Change varchar to boolean in PostgreSQL

PostgreSQL (unlike Oracle) has a fully-fledged boolean type. Generally, a "yes/no flag" should be boolean. That's the appropriate type!

What about size/storage?

A boolean column occupies 1 byte on disk.

(The manual) about text or character varying:

the storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string

That's at least 2 bytes for a single character.

Actual storage is more complicated than that. There is some fixed overhead per table, page and row, there is special NULL storage and some types require data alignment. See:

  • How many records can I store in 5 MB of PostgreSQL on Heroku?

Encoding UTF8 doesn't make any difference here. Basic ASCII-characters are bit-compatible with other encodings like LATIN-1.

In your case, according to your description, you should keep the NOT NULL constraint you already have - independent of the data type.

Query performance?

Will be slightly better in any case with boolean. Besides being smaller, the logic for boolean is simpler and varchar or text are also generally burdened with COLLATION rules. But don't expect much for something that simple.

Instead of:

WHERE consistency = 'Y'

You could write:

WHERE consistency = true

But rather simplify to just:

WHERE consistency

No further evaluation needed.

Change type

Transforming your table is simple:

ALTER TABLE tbl ALTER consistency TYPE boolean
USING CASE consistency WHEN 'Y' THEN true ELSE false END;

This CASE expression folds everything that is not TRUE ('Y') to FALSE. The NOT NULL constraint just stays.

Matching bitmasks using bitstrings (instead of ints) in SQL

One way would be to just use a bit string on the right side of the expression, too:

WITH test (id, username, roles) AS (
VALUES
(1,'Dave',B'001')
,(2,'Charlie',B'011')
,(3,'Susan',B'101')
,(4,'Nick',B'110')
)
SELECT *, (roles & B'001') AS intersection
FROM test
WHERE (roles & B'001') <> B'000';

Or you can cast an integer 0 to bit(3)

...
WHERE (roles & B'001') <> 0::bit(3);

You may be interested in this related answer that demonstrates a number of ways to convert between boolean, bit string and integer:

Can I convert a bunch of boolean columns to a single bitmap in PostgreSQL?

Be aware that storing the data as integer can save some space. integer needs 4 bytes for up to 32 bit of information, while - I quote the manual at said location:

A bit string value requires 1 byte for each group of 8 bits, plus 5 or
8 bytes overhead
depending on the length of the string [...]



Related Topics



Leave a reply



Submit