Sample Query to show Cardinality estimation error in PostgreSQL

This is to answer the comment by @Twelfth as well as the question itself.

Three quotes from this chapter in the manual:

"Controlling the Planner with Explicit JOIN Clauses"

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN)
is semantically the same as listing the input relations in FROM, so it
does not constrain the join order.

...

To force the planner to follow the join order laid out by explicit
JOINs, set the join_collapse_limit run-time parameter to 1. (Other
possible values are discussed below.)

...

Constraining the planner's search in this way is a useful technique
both for reducing planning time and for directing the planner to a
good query plan.

Bold emphasis mine. Conversely, you can abuse the same to direct the query planner to a bad query plan for your testing purposes. Read the whole manual page; it should be instructive.
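To illustrate, a minimal sketch with hypothetical table names - setting the limit to 1 makes the planner execute joins exactly in the order written, for better or worse:

SET join_collapse_limit = 1;  -- planner now follows the written join order
EXPLAIN ANALYZE
SELECT *
FROM   huge_table h                      -- hypothetical tables
JOIN   big_table  b ON b.id = h.big_id
JOIN   tiny_table t ON t.id = b.tiny_id  -- selective table forced last
WHERE  t.flag;
RESET  join_collapse_limit;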

Also, you can force nested loops by disabling alternative join methods one by one (best in your session only). For example:

SET enable_hashjoin = off;

Etc.
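For instance, to leave the planner only nested loops for a test query, a minimal session-local sketch (enable_hashjoin and enable_mergejoin are the standard planner parameters):

SET enable_hashjoin  = off;
SET enable_mergejoin = off;
-- run the query under test here ...
RESET enable_hashjoin;
RESET enable_mergejoin;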

About checking and setting parameters:

  • Query a parameter (postgresql.conf setting) like "max_connections"
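For example - SHOW, current_setting(), and the pg_settings view are the standard interfaces:

SHOW max_connections;
SELECT current_setting('join_collapse_limit');
SELECT name, setting, source
FROM   pg_settings
WHERE  name = 'join_collapse_limit';  -- incl. where the value was set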

Force actual estimation errors

One obvious way would be to disable autovacuum and add or remove rows from the table. Then the query planner is working with outdated statistics. Note that some other commands update statistics as well.
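A sketch, assuming a hypothetical table mytable:

ALTER TABLE mytable SET (autovacuum_enabled = false);
DELETE FROM mytable WHERE random() < 0.5;  -- remove many rows; statistics go stale
-- The planner now works with outdated row counts until something
-- refreshes statistics (ANALYZE, VACUUM, ...).
ANALYZE mytable;  -- refresh statistics again when done testing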

Statistics are stored in the catalog tables pg_class and pg_statistic.

SELECT * FROM pg_class WHERE oid = 'mytable'::regclass;
SELECT * FROM pg_statistic WHERE starelid = 'mytable'::regclass;

This leads me to another option. You could forge entries in these two tables. Superuser privileges required.

You don't strike me as a newcomer, but a warning for the general public: If you break something in the catalog tables, your database (cluster) might go belly-up. You have been warned.
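A minimal sketch of such a forgery for a hypothetical table mytable - faking the row count in pg_class (superuser required; a throwaway database is strongly advised):

UPDATE pg_class
SET    reltuples = 1000000  -- pretend the table holds a million rows
WHERE  oid = 'mytable'::regclass;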

How to optimize a query in Postgres

Try this rewritten version:

SELECT fat.*
FROM   Table1 fat
JOIN   conciliacao_vendas cv USING (empresa_id, chavefato, rede_id)
JOIN   loja lj          ON lj.id = fat.loja_id
JOIN   rede rd          ON rd.id = fat.rede_id
JOIN   bandeira bd      ON bd.id = fat.bandeira_id
JOIN   produto pd       ON pd.id = fat.produto_id
JOIN   loja_extensao le ON le.id = fat.loja_extensao_id
JOIN   conta ct         ON ct.id = fat.conta_id
JOIN   banco bc         ON bc.id = ct.banco_id
LEFT   JOIN modo_captura mc ON mc.id = fat.modo_captura_id
WHERE  cv.controle_upload_arquivo_id = 6906
AND    fat.parcela = 1
ORDER  BY fat.data_venda, fat.data_credito
LIMIT  20;

JOIN syntax and sequence of joins

In particular, I fixed the misleading LEFT JOIN to conciliacao_vendas, which is forced to act as a plain [INNER] JOIN by the later WHERE condition anyway. This should simplify query planning and allow Postgres to eliminate rows earlier in the process, which should make everything a lot cheaper. Related answer with detailed explanation:

  • Explain JOIN vs. LEFT JOIN and WHERE condition performance suggestion in more detail

USING is just a syntactical shorthand.
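For illustration, with hypothetical tables a and b, these two joins are logically the same - USING just merges each join column into a single output column:

SELECT * FROM a JOIN b USING (empresa_id);
SELECT * FROM a JOIN b ON b.empresa_id = a.empresa_id;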

Since many tables are involved in the query and the rewritten query now joins tables in an optimal order, you can fine-tune this with SET LOCAL join_collapse_limit = 1 to save planning overhead and avoid inferior query plans. Run it in a single transaction:

BEGIN;
SET LOCAL join_collapse_limit = 1;
SELECT ...; -- read data here
COMMIT; -- or ROLLBACK;

More about that:

  • Sample Query to show Cardinality estimation error in PostgreSQL
  • The fine manual on Controlling the Planner with Explicit JOIN Clauses

Index

Add some indexes on lookup tables with lots of rows (not necessary for just a couple of dozen), in particular (taken from your query plan):

Seq Scan on public.conta ct ... rows=6771
Seq Scan on public.loja lj ... rows=1568
Seq Scan on public.loja_extensao le ... rows=16394

That's particularly odd, because those columns look like primary key columns and should already have an index ...

So:

CREATE INDEX conta_pkey_idx ON public.conta (id);
CREATE INDEX loja_pkey_idx ON public.loja (id);
CREATE INDEX loja_extensao_pkey_idx ON public.loja_extensao (id);

To make this really fast, a multicolumn index would be of great service:

CREATE INDEX foo ON Table1 (parcela, data_venda, data_credito);

How is TPC-H query performance involving string pattern matching predicates drastically improved in PostgreSQL?

The algorithm used for string pattern matching in queries involving a LIKE clause was modified in PostgreSQL 9.1. Since then, LIKE queries have better selectivity estimation than in earlier PostgreSQL versions.

Why does a slight change in the search term slow down the query so much?

Why?

The reason is this:

Fast query:


-> Hash Left Join (cost=1378.60..2467.48 rows=15 width=79) (actual time=41.759..85.037 rows=1129 loops=1)
...
Filter: (unaccent(((((COALESCE(p.abrev, ''::character varying))::text || ' ('::text) || (COALESCE(p.prenome, ''::character varying))::text) || ')'::text)) ~~* (...)

Slow query:


-> Hash Left Join (cost=1378.60..2467.48 rows=1 width=79) (actual time=35.084..80.209 rows=1129 loops=1)
...
Filter: (unaccent(((((COALESCE(p.abrev, ''::character varying))::text || ' ('::text) || (COALESCE(p.prenome, ''::character varying))::text) || ')'::text)) ~~* unacc (...)

Extending the search pattern by another character causes Postgres to assume yet fewer hits. (Typically, this is a reasonable estimate.) Postgres obviously does not have precise enough statistics (none, actually, keep reading) to expect the same number of hits that you really get.

This causes a switch to a different query plan, which is even less optimal for the actual number of hits rows=1129.

Solution

Assuming current Postgres 9.5, since the version has not been declared.

One way to improve the situation is to create an expression index on the expression in the predicate. This makes Postgres gather statistics for the actual expression, which can help the query even if the index itself is not used for the query. Without the index, there are no statistics for the expression at all. And if done right, the index can be used for the query, which is better still. But there are multiple problems with your current expression:

unaccent(TEXT(coalesce(p.abrev,'')||' ('||coalesce(p.prenome,'')||')')) ilike unaccent('%vicen%')

Consider this updated query, based on some assumptions about your undisclosed table definitions:

SELECT e.id
     , (SELECT count(*)
        FROM   imgitem
        WHERE  tabid = e.id AND tab = 'esp') AS imgs  -- count(*) is faster
     , e.ano, e.mes, e.dia
     , e.ano::text || to_char(e.mes, 'FM"-"00')
                   || to_char(e.dia, 'FM"-"00') AS data
     , pl.pltag, e.inpa, e.det, d.ano AS anodet
     , format('%s (%s)', p.abrev, p.prenome) AS determinador
     , d.tax
     , coalesce(v.val,  v.valf)  || ' ' || vu.unit  AS altura
     , coalesce(v1.val, v1.valf) || ' ' || vu1.unit AS dap
     , d.fam, tf.nome AS família, d.gen, tg.nome AS gênero, d.sp
     , ts.nome AS espécie, d.inf, e.loc, l.nome AS localidade, e.lat, e.lon
FROM   pess p                              -- reorder!
JOIN   det d       ON d.detby   = p.id     -- INNER JOIN!
LEFT   JOIN tax tf ON tf.oldfam = d.fam
LEFT   JOIN tax tg ON tg.oldgen = d.gen
LEFT   JOIN tax ts ON ts.oldsp  = d.sp
LEFT   JOIN tax ti ON ti.oldinf = d.inf    -- unused, see @joop's comment
LEFT   JOIN esp e  ON e.det     = d.id
LEFT   JOIN loc l  ON l.id      = e.loc
LEFT   JOIN var v  ON v.esp     = e.id AND v.key  = 265
LEFT   JOIN varunit vu  ON vu.id  = v.unit
LEFT   JOIN var v1 ON v1.esp    = e.id AND v1.key = 264
LEFT   JOIN varunit vu1 ON vu1.id = v1.unit
LEFT   JOIN pl     ON pl.id     = e.pl
WHERE  f_unaccent(p.abrev)   ILIKE f_unaccent('%' || 'vicenti' || '%')
OR     f_unaccent(p.prenome) ILIKE f_unaccent('%' || 'vicenti' || '%');

Major points

Why f_unaccent()? Because unaccent() can't be indexed. Read this:

  • Does PostgreSQL support "accent insensitive" collations?

I used the function outlined there to allow the following (recommended!) multicolumn functional trigram GIN index:

CREATE INDEX pess_unaccent_nome_trgm_idx ON pess
USING gin (f_unaccent(abrev) gin_trgm_ops, f_unaccent(prenome) gin_trgm_ops);

If you are not familiar with trigram indexes, read this first:

  • PostgreSQL LIKE query performance variations

And possibly:

  • Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

Be sure to run the latest version of Postgres (currently 9.5). There have been substantial improvements to GIN indexes. And you'll be interested in improvements in pg_trgm 1.2, scheduled to be released with the upcoming Postgres 9.6:

  • Trigram search gets much slower as search string gets longer

Prepared statements are a common way to execute queries with parameters (especially with text from user input). Postgres has to find a plan that works best for any given parameter. Add wildcards as constants to the search term like this:

f_unaccent(p.abrev) ILIKE f_unaccent('%' || 'vicenti' || '%')

('vicenti' would be replaced with a parameter.) So Postgres knows we are dealing with a pattern that is neither anchored left nor right - which would allow different strategies. Related answer with more details:

  • Performance impact of empty LIKE in a prepared statement
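A sketch of such a prepared statement, reusing the (assumed) names from the query above - only the search term is a parameter, the wildcards stay constant:

PREPARE search_pess(text) AS
SELECT id
FROM   pess
WHERE  f_unaccent(abrev)   ILIKE f_unaccent('%' || $1 || '%')
OR     f_unaccent(prenome) ILIKE f_unaccent('%' || $1 || '%');

EXECUTE search_pess('vicenti');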

Or maybe re-plan the query for every search term (possibly using dynamic SQL in a function). But make sure planning time isn't eating any possible performance gain.


The WHERE condition on columns of pess contradicts the LEFT JOIN. Postgres is forced to convert that to an INNER JOIN. What's worse, the join comes late in the join tree. And since Postgres cannot reorder your joins (see below), that can become very expensive. Move the table to the first position in the FROM clause to eliminate rows early. The following LEFT JOINs do not eliminate any rows by definition. But with that many tables it is important to move joins that might multiply rows to the end.


You are joining 13 tables, 12 of them with LEFT JOIN, which leaves 12! possible combinations - or 11! * 2! if we take the one LEFT JOIN into account that's really an INNER JOIN. That's too many for Postgres to evaluate all possible permutations for the best query plan. Read about join_collapse_limit:

  • Sample Query to show Cardinality estimation error in PostgreSQL
  • SQL INNER JOIN over multiple tables equal to WHERE syntax

The default setting for join_collapse_limit is 8, which means that Postgres won't try to reorder the tables in your FROM clause when more tables are involved - like your 13 - so the order of tables is relevant.

One way to work around this would be to split the performance-critical part into a CTE, as @joop commented - see the sketch below. Don't set join_collapse_limit much higher, or planning times for queries involving many joined tables will deteriorate.
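A sketch of that idea under the same assumed table definitions - the selective part is computed first (CTEs act as optimization fences in this Postgres version), and the remaining LEFT JOINs only touch the few resulting rows:

WITH hits AS (
   SELECT d.id AS det_id, p.abrev, p.prenome
   FROM   pess p
   JOIN   det  d ON d.detby = p.id
   WHERE  f_unaccent(p.abrev)   ILIKE f_unaccent('%' || 'vicenti' || '%')
   OR     f_unaccent(p.prenome) ILIKE f_unaccent('%' || 'vicenti' || '%')
   )
SELECT e.id, h.abrev, h.prenome  -- add the remaining columns / joins here
FROM   hits h
LEFT   JOIN esp e ON e.det = h.det_id;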


About your concatenated date named data:

cast(cast(e.ano as varchar(4))||'-'||right('0'||cast(e.mes as varchar(2)),2)||'-'|| right('0'||cast(e.dia as varchar(2)),2) as varchar(10)) as data

Assuming you build from three numeric columns for year, month and day, which are defined NOT NULL, use this instead:

e.ano::text || to_char(e.mes, 'FM"-"00')
|| to_char(e.dia, 'FM"-"00') AS data

About the FM template pattern modifier:

  • Check for integer in string array

But really, you should store the date as data type date to begin with.
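A sketch of such a migration, assuming the three NOT NULL integer columns hold valid dates (make_date() exists since Postgres 9.4):

ALTER TABLE esp ADD COLUMN data date;
UPDATE esp SET data = make_date(ano, mes, dia);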


Also simplified:

format('%s (%s)', p.abrev, p.prenome) AS determinador

Won't make the query faster, but it's much cleaner. See format().


First things last, all the usual advice for performance optimization applies:

  • Keep PostgreSQL from sometimes choosing a bad query plan

If you get all of this right, you should see much faster queries for all patterns.

How to optimize a SQL query that combines INNER JOINs, DISTINCT and WHERE?

The best query depends on missing information.

This should be substantially faster in a typical setup:

SELECT id, foo_option_id, description
FROM   options o
WHERE  EXISTS (
   SELECT FROM discounted_vehicles d
   JOIN   vehicle_options vo USING (vehicle_id)
   WHERE  d.discount_id = 4
   AND    vo.option_id  = o.id
   );

Assuming referential integrity, enforced by FK constraints, we can omit the table vehicle from the query and join from discounted_vehicles to vehicle_options directly.

Also, EXISTS is typically faster if there are many qualifying rows per distinct option.

Ideally, you'd have multicolumn indexes on:

discounted_vehicles(discount_id, vehicle_id)
vehicle_options(vehicle_id, option_id)
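Spelled out as DDL - the index names are placeholders:

CREATE INDEX discounted_vehicles_discount_id_vehicle_id_idx
ON discounted_vehicles (discount_id, vehicle_id);

CREATE INDEX vehicle_options_vehicle_id_option_id_idx
ON vehicle_options (vehicle_id, option_id);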

Index columns in this order. You probably have a PK constraint on vehicle_options providing the 2nd index, but the column order should match. Related:

  • PostgreSQL composite primary key
  • Is a composite index also good for queries on the first field?

Depending on actual data distribution, there may be faster query styles. Related:

  • Optimize GROUP BY query to retrieve latest record per user
  • Select first row in each GROUP BY group?

Changing the join order is typically futile. Postgres reorders joins any way it expects to be fastest. (Exceptions apply.) Related:

  • Sample Query to show Cardinality estimation error in PostgreSQL
  • SQL INNER JOIN over multiple tables equal to WHERE syntax
  • Why does a slight change in the search term slow down the query so much?

Count on join of big tables with conditions is slow

Your query, rewritten and 100 % equivalent:

SELECT count(*)
FROM   product_categories   pc
JOIN   customers            c  USING (organization_id)
JOIN   total_sales          ts ON ts.customer_id = c.id
JOIN   performance_analyses pa ON pa.total_sales_id = ts.id
WHERE  pc.organization_id = 3
AND    c.active  -- boolean can be used directly
AND    c.visible
AND    ts.product_category_id = pc.id
AND    ts.period_id = 193
AND    pa.size > 0;

Another answer advises moving all conditions into join clauses and ordering tables in the FROM list. This may apply for a certain other RDBMS with a comparatively primitive query planner. But while it doesn't hurt Postgres either, it also has no effect on performance for your query - assuming default server configuration. The manual:

Explicit inner join syntax (INNER JOIN, CROSS JOIN, or unadorned JOIN)
is semantically the same as listing the input relations in FROM, so it
does not constrain the join order.

Bold emphasis mine. There is more, read the manual.

The key setting is join_collapse_limit (with default 8). The Postgres query planner will rearrange your 4 tables any way it expects it to be fastest, no matter how you arranged your tables and whether you write conditions as WHERE or JOIN clauses. No difference whatsoever. (The same is not true for some other types of joins that cannot be rearranged freely.)

The important point is that these different join possibilities give
semantically equivalent results but might have hugely different
execution costs. Therefore, the planner will explore all of them to
try to find the most efficient query plan.

Related:

  • Sample Query to show Cardinality estimation error in PostgreSQL
  • A: Slow fulltext search due to wildly inaccurate row estimates

Finally, WHERE id IN (<subquery>) is not generally equivalent to a join. It does not multiply rows on the left side for duplicate matching values on the right side. And columns of the subquery are not visible for the rest of the query. A join can multiply rows with duplicate values and columns are visible.

Your simple subqueries dig up a single unique column in both cases, so there is no effective difference in this case - except that IN (<subquery>) is generally (at least a bit) slower and more verbose. Use joins.
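To illustrate the difference with (simplified) tables from your query:

-- IN: at most one output row per customer; ts columns are not visible
SELECT c.id
FROM   customers c
WHERE  c.id IN (SELECT customer_id FROM total_sales);

-- JOIN: one output row per matching sale; ts columns are visible
SELECT c.id, ts.period_id
FROM   customers c
JOIN   total_sales ts ON ts.customer_id = c.id;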

Your query

Indexes

product_categories has 34 rows. Unless you plan on adding many more, indexes do not help performance for this table. A sequential scan will always be faster. Drop index_product_categories_on_organization_id.

customers has 6,970 rows. Indexes start to make sense. But your query uses 4,988 of them according to the EXPLAIN output. Only an index-only scan on an index much less wide than the table could help a bit. Assuming WHERE active AND visible are constant predicates, I suggest a partial multicolumn index:

CREATE INDEX index_customers_on_organization_id ON customers (organization_id, id)
WHERE active AND visible;

I appended id to allow index-only scans. The column is otherwise useless in the index for this query.

total_sales has 7,104,441 rows. Indexes are very important. I suggest:

CREATE INDEX index_total_sales_on_product_category_customer_id
ON total_sales (period_id, product_category_id, customer_id, id);

Again, aiming for an index-only scan. This is the most important one.

You can delete the completely redundant index index_total_sales_on_product_category_id.

performance_analyses has 1,012,346 rows. Indexes are very important. I would suggest another partial index with the condition size > 0:

CREATE INDEX index_performance_analyses_on_status_id
ON performance_analyses (total_sales_id)
WHERE size > 0;

However:

Rows Removed by Filter: 0

Seems like this condition serves no purpose - do all rows satisfy size > 0 anyway?

After creating these indexes you need to ANALYZE the tables.
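Like this:

ANALYZE customers;
ANALYZE total_sales;
ANALYZE performance_analyses;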

Table statistics

Generally, I see many bad estimates. Postgres underestimates the number of rows returned at almost every step. The nested loops we see would work much better for fewer rows. Unless this is an unlikely coincidence, your table statistics are badly outdated. You need to visit your settings for autovacuum and probably also the per-table settings for your two big tables, performance_analyses and total_sales.
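A sketch of more aggressive per-table autovacuum settings - the values are only illustrative, tune them to your write load:

ALTER TABLE total_sales SET (
  autovacuum_vacuum_scale_factor  = 0.02
, autovacuum_analyze_scale_factor = 0.01
);
-- repeat for performance_analyses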

You already did run VACUUM and ANALYZE, which made the query slower, according to your comment. That doesn't make a lot of sense. I would run VACUUM FULL on these two tables once (if you can afford an exclusive lock). Else try pg_repack.
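For example - each command takes an exclusive lock on its table:

VACUUM FULL ANALYZE performance_analyses;
VACUUM FULL ANALYZE total_sales;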

With all the fishy statistics and bad plans, I would consider running a complete vacuumdb -fz yourdb on your DB. That rewrites all tables and indexes in pristine condition, but it's no good to use on a regular basis. It's also expensive and will lock your DB for an extended period of time!

While you're at it, have a look at the cost settings of your DB as well. Related:

  • Keep PostgreSQL from sometimes choosing a bad query plan
  • Postgres Slow Queries - Autovacuum frequency

