How to Partition Postgres Table Using Intermediate Table

Partitioning Postgres table

For a many-to-many relationship you will need a mapping table, partitioning or not.
I wouldn't use an artificial primary key for the mapping table, but the combination of id_doctor and id_patient (they are artificial anyway). The same holds for the appointment table.
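A minimal sketch of such a mapping table, assuming parent tables doctor and patient with primary keys id_doctor and id_patient (table and column names are assumed, not from the question):

CREATE TABLE doctor_patient (
  id_doctor  int NOT NULL REFERENCES doctor (id_doctor)
, id_patient int NOT NULL REFERENCES patient (id_patient)
, PRIMARY KEY (id_doctor, id_patient)  -- composite PK instead of a surrogate key
);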

Since id_doctor is not part of the patient table (and shouldn't be), you cannot partition the patient table per doctor. Why would you want to do that? Partitioning is mostly useful for mass deletions (and to some extent for speeding up sequential scans) — is that your objective?

There is a widespread assumption that bigger tables should be partitioned just because they are big, but that is not the case. Index access to a partitioned table is, if anything, slightly slower than index access to a non-partitioned table. Do you have billions of patients?

postgres bad estimates from partitioned table

The key problem is the same in both queries: the huge underestimate of the join result. In the partitioned case, PostgreSQL materializes intermediate results, which are much bigger than expected and cause temporary files to be written. So increasing work_mem should speed up that case.
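For example, for the current session only (the value is just a placeholder; pick one that fits your RAM and the number of concurrent queries):

SET work_mem = '256MB';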

Join row counts are hard to estimate correctly, so it will be difficult to cure the problem at the root. You can fight the symptoms, though:

  • create indexes on "Z_PROD_STD", "Z_ASTD" and "CT" that include additional columns, so that you get a much faster index-only scan

    For example, to speed up the index scan on "CT" that is repeated 2 million times, you could create an index like

    CREATE INDEX ON "CT" ("DATE_SK", "YID");

    and then VACUUM "CT"; so that the visibility map is current and the index-only scan is actually used.

  • alternatively, set enable_nestloop to off for the duration of the query to reduce the impact of the bad estimate
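    A minimal sketch of that, limited to a single transaction so the setting does not leak into other queries:

    BEGIN;
    SET LOCAL enable_nestloop = off;  -- reverted automatically at COMMIT/ROLLBACK
    -- run the affected query here
    COMMIT;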

How to implement a many-to-many relationship in PostgreSQL?

The SQL DDL (data definition language) statements could look like this:

CREATE TABLE product (
  product_id serial PRIMARY KEY  -- implicit primary key constraint
, product    text NOT NULL
, price      numeric NOT NULL DEFAULT 0
);

CREATE TABLE bill (
  bill_id  serial PRIMARY KEY
, bill     text NOT NULL
, billdate date NOT NULL DEFAULT CURRENT_DATE
);

CREATE TABLE bill_product (
  bill_id    int REFERENCES bill (bill_id) ON UPDATE CASCADE ON DELETE CASCADE
, product_id int REFERENCES product (product_id) ON UPDATE CASCADE
, amount     numeric NOT NULL DEFAULT 1
, CONSTRAINT bill_product_pkey PRIMARY KEY (bill_id, product_id)  -- explicit pk
);

I made a few adjustments:

  • The n:m relationship is normally implemented by a separate table - bill_product in this case.

  • I added serial columns as surrogate primary keys. In Postgres 10 or later, consider an IDENTITY column instead (a sketch follows after this list). See:

    • Safely rename tables using serial primary key columns
    • Auto increment table column
    • https://www.2ndquadrant.com/en/blog/postgresql-10-identity-columns/

    I highly recommend that, because the name of a product is hardly unique (not a good "natural key"). Also, enforcing uniqueness and referencing the column in foreign keys is typically cheaper with a 4-byte integer (or even an 8-byte bigint) than with a string stored as text or varchar.

  • Don't use names of basic data types like date as identifiers. While this is possible, it is bad style and leads to confusing errors and error messages. Use legal, lower case, unquoted identifiers. Never use reserved words and avoid double-quoted mixed case identifiers if you can.

  • "name" is not a good name. I renamed the column of the table product to be product (or product_name or similar). That is a better naming convention. Otherwise, when you join a couple of tables in a query - which you do a lot in a relational database - you end up with multiple columns named "name" and have to use column aliases to sort out the mess. That's not helpful. Another widespread anti-pattern would be just "id" as column name.

    I am not sure what the name of a bill would be. bill_id will probably suffice in this case.

  • price is of data type numeric to store fractional numbers precisely as entered (arbitrary precision type instead of floating point type). If you deal with whole numbers exclusively, make that integer. For example, you could store prices as cents.

  • The amount ("Products" in your question) goes into the linking table bill_product and is of type numeric as well. Again, integer if you deal with whole numbers exclusively.

  • You see the foreign keys in bill_product? I created both to cascade changes: ON UPDATE CASCADE. If a product_id or bill_id should change, the change is cascaded to all depending entries in bill_product and nothing breaks. Those are just references without significance of their own.

    I also used ON DELETE CASCADE for bill_id: If a bill gets deleted, its details die with it.

    Not so for products: You don't want to delete a product that's used in a bill. Postgres will throw an error if you attempt this. You would add another column to product to mark obsolete rows ("soft-delete") instead.

  • All columns in this basic example end up being NOT NULL, so NULL values are not allowed. (Yes, all columns - primary key columns are defined UNIQUE NOT NULL automatically.) That's because NULL values wouldn't make sense in any of these columns. It makes a beginner's life easier. But you won't get away that easily; you need to understand NULL handling anyway. Additional columns might allow NULL values, and functions and joins can introduce NULL values in queries, etc.

  • Read the chapter on CREATE TABLE in the manual.

  • Primary keys are implemented with a unique index on the key columns, which makes queries with conditions on the PK column(s) fast. However, the order of key columns is relevant in multicolumn keys. Since the PK on bill_product is on (bill_id, product_id) in my example, you may want to add another index on just product_id or on (product_id, bill_id) if you have queries looking for a given product_id and no bill_id (see the sketch after this list). See:

    • PostgreSQL composite primary key
    • Is a composite index also good for queries on the first field?
    • Working of indexes in PostgreSQL
  • Read the chapter on indexes in the manual.
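Two of the points above spelled out as a sketch (see the serial/IDENTITY bullet and the primary-key bullet): the IDENTITY variant of product for Postgres 10 or later, and an additional index on bill_product for lookups by product_id:

CREATE TABLE product (
  product_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, product    text NOT NULL
, price      numeric NOT NULL DEFAULT 0
);

CREATE INDEX ON bill_product (product_id);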

Partitioned table query still scanning all partitions

For non-trivial expressions, you have to repeat the condition more or less verbatim in queries to make the Postgres query planner understand that it can rely on the CHECK constraint. Even if it seems redundant!

Per documentation:

With constraint exclusion enabled, the planner will examine the
constraints of each partition and try to prove that the partition need
not be scanned because it could not contain any rows meeting the
query's WHERE clause. When the planner can prove this, it excludes
the partition from the query plan.

The key word is "prove": the planner does not understand complex expressions.
Of course, this has to be met, too:

Ensure that the constraint_exclusion configuration parameter is not
disabled in postgresql.conf. If it is, queries will not be optimized as desired.

Instead of

SELECT * FROM foo WHERE (id = 2);

Try:

SELECT * FROM foo WHERE id % 30 = 2 AND id = 2;
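For illustration, this assumes the partition holding id = 2 was created (old-style, with inheritance) along these lines; only then can the planner match the redundant id % 30 = 2 predicate against the constraint and exclude the other partitions:

CREATE TABLE foo_2 (
    CHECK (id % 30 = 2)  -- the planner can only match this condition verbatim
) INHERITS (foo);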

And:

The default (and recommended) setting of constraint_exclusion is
actually neither on nor off, but an intermediate setting called
partition, which causes the technique to be applied only to queries
that are likely to be working on partitioned tables. The on setting
causes the planner to examine CHECK constraints in all queries, even
simple ones that are unlikely to benefit.

You can experiment with constraint_exclusion = on to see if the planner catches on without the redundant verbatim condition. But you have to weigh the cost and benefit of this setting.
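For the current session only:

SET constraint_exclusion = on;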

The alternative would be simpler conditions for your partitions as already outlined by @harmic.

And no, increasing the STATISTICS target will not help in this case. Only the CHECK constraints and your WHERE conditions in the query matter.

PostgreSQL - Backup and Restore Database Tables with Partitions

If you used zip to compress the output, then you should use unzip to uncompress it, not gunzip; they use different formats/algorithms.

I'd suggest sticking to gzip and gunzip. For instance, if you generated a backup named mybackup.sql, you can gzip it with:

gzip mybackup.sql

It will generate a file named mybackup.sql.gz. Then, to restore, you can use:

gunzip -c mybackup.sql.gz | psql -U postgres

Also, I'd suggest avoiding pgAdmin for the dump. Not that it can't do the job; it's just hard to automate. You can use pg_dumpall just as easily:

pg_dumpall -U postgres -f mybackup.sql

You can also dump and compress without an intermediate file by using a pipe:

pg_dumpall -U postgres | gzip -c > mybackup.sql.gz

BTW, I'd really suggest avoiding pg_dumpall and using pg_dump with the custom format for each database, because that way the result is already compressed and easier to work with later. But pg_dumpall is OK for small databases.
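A sketch of that for a hypothetical database named mydb (the custom format is compressed by default and is restored with pg_restore):

pg_dump -U postgres -Fc -f mydb.dump mydb
pg_restore -U postgres -d mydb mydb.dump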

Select first row in each GROUP BY group?

On databases that support CTEs and window functions:

WITH summary AS (
    SELECT p.id,
           p.customer,
           p.total,
           ROW_NUMBER() OVER (PARTITION BY p.customer
                              ORDER BY p.total DESC) AS rank
    FROM   PURCHASES p
)
SELECT *
FROM   summary
WHERE  rank = 1;

Supported by any database, but you need to add logic to break ties:

SELECT MIN(x.id),  -- change to MAX if you want the highest
       x.customer,
       x.total
FROM   PURCHASES x
JOIN   (SELECT p.customer,
               MAX(total) AS max_total
        FROM   PURCHASES p
        GROUP  BY p.customer) y ON y.customer = x.customer
                               AND y.max_total = x.total
GROUP  BY x.customer, x.total;
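Since the rest of this page is about Postgres, it is worth noting that Postgres also offers DISTINCT ON, which returns one row per group more concisely (a sketch against the same PURCHASES table):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   PURCHASES
ORDER  BY customer, total DESC, id;  -- id breaks ties, like MIN(id) above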

