Improving Query Speed: Simple Select in Big Postgres Table

Improving query speed: simple SELECT in big postgres table

Extracting my comments into an answer: the index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks.
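Option (c) can be as simple as physically reordering the table on the column you look up by. A minimal sketch, with a hypothetical index name (any existing btree index on the lookup column works):

CREATE INDEX sometable_int1_idx ON sometable (int1);  -- hypothetical name
CLUSTER sometable USING sometable_int1_idx;           -- rewrites the table in index order; takes an exclusive lock
ANALYZE sometable;
-- CLUSTER is a one-time reorder: rows inserted later are not kept in clustered order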

PostgreSQL 9.2 has a feature called index-only scans that allows it to (usually) answer queries without accessing the table. You can combine this with the btree index property of automatically maintaining order to make this query fast. You mention int1, int2, and two floats:

CREATE INDEX sometable_int1_floats_key ON sometable (int1, float1, float2);
CREATE INDEX sometable_int2_floats_key ON sometable (int2, float1, float2);

SELECT float1,float2 FROM sometable WHERE int1=<value>; -- uses int1 index
SELECT float1,float2 FROM sometable WHERE int2=<value>; -- uses int2 index

Note also that this doesn't magically erase the disk seeks; it just moves them from query time to insert time. It also costs you storage space, since you're duplicating the data. Still, this is probably the trade-off you want.
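To verify that you are actually getting index-only scans (they rely on an up-to-date visibility map), you can inspect the plan. A sketch, assuming the first index above and a placeholder value for int1:

VACUUM sometable;  -- refreshes the visibility map so heap fetches can be skipped

EXPLAIN (ANALYZE, BUFFERS)
SELECT float1, float2 FROM sometable WHERE int1 = 42;  -- 42 is a placeholder value
-- look for "Index Only Scan using sometable_int1_floats_key" and a low "Heap Fetches" count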

PostgreSQL: speed up SELECT query in table with millions of rows

The index won't help.

Two solutions:

  1. You could change the query to:

    WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00'

    Then the index can be used.

  2. Create an index on an expression:

    CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date));

    (or a different time zone) and change the query to

    WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-16'

    The AT TIME ZONE is necessary because otherwise the result of the cast would depend on your current TimeZone setting.

The first solution is simpler, but the second has the advantage that you can add price_ct to the index like this:

CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date), price_ct);

Then you don't need a sort any more, and your query will be as fast as it can theoretically get.
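For illustration, the rewritten query would then look like this (a sketch; the SELECT list and LIMIT are assumptions, since the original query isn't shown here):

SELECT *
FROM all_legs
WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-16'
ORDER BY price_ct
LIMIT 10;  -- the index returns matching rows already in price_ct order, so no Sort node is needed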

Postgres select on a very large table

You are currently removing 855 rows with the filter on col3 and col4 in order to find the 10 rows that pass that filter. So, as I feared, the rows that fail the filter may be rarer than the others, but they are sitting right in the way. The next time you need 10 more rows, they will still be in the way, and the next time as well. Not only are you doing roughly 85 times more work than you need to on every execution, you are also hitting ~85 times more pages. If the same thing happens for every other combination of col1 and col2, then no wonder you keep running out of cache space and IOPS. And there is no reason for it to stop there: far more than 850 such rows can accumulate in the way if nothing ever gets rid of them.

You could use a partial index to avoid visiting those rows each time:

create index on t (col1, col2, col5) where col3='f' and col4>0;
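For the planner to pick the partial index, the query has to repeat (or imply) the index predicate. A sketch, with the ordering and LIMIT assumed from the discussion above:

SELECT *
FROM t
WHERE col1 = $1
  AND col2 = $2
  AND col3 = 'f'   -- must match the partial-index predicate
  AND col4 > 0
ORDER BY col5
LIMIT 10;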

Alternatively, each time col3 turns true or col4 turns 0, you could just delete the row, and (possibly) insert it into some history table if you need to keep some record of it.
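A sketch of that move-to-history approach in a single statement (t_history is a hypothetical archive table with the same columns as t):

WITH dead AS (
    DELETE FROM t
    WHERE col3 = 't' OR col4 = 0   -- rows that no longer pass the filter
    RETURNING *
)
INSERT INTO t_history
SELECT * FROM dead;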

Postgres Large table Optimal index selection for Select Query With In Clause

CREATE INDEX table_name_desc_index ON table_name (user_id, date DESC NULLS LAST);

Then try the following:

SELECT *
FROM table_name
INNER JOIN (VALUES (1), (2), (3), (6), (7), (10)) val(v) ON table_name.status = v
WHERE user_id = $user_id
ORDER BY date DESC
LIMIT 20;

postgres large table select optimization

It's hard to build an index for queries of the form string LIKE pattern, because the wildcards (% and _) can appear anywhere in the pattern.

I can suggest one risky solution:

  1. Slightly redesign the table to make it indexable. Add two more columns, prefix_low and prefix_high, of fixed width (for example char(32), or any length sufficient for the task), plus one smallint column for the prefix length. Fill them with the lowest and highest values matching the prefix, and with the prefix length (see the schema sketch after this list). For example:

    select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low,
           rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high,
           length(rtrim('00436641997142%','%'))        AS prefix_length;

                prefix_low            |            prefix_high            | prefix_length
    ----------------------------------+-----------------------------------+---------------
     00436641997142000000000000000000 | 00436641997142999999999999999999  |            14
  2. Make an index on these values:

    CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
  3. Check the modified queries against the table:

    SELECT prefix, changeprefix, deletelast, outgroup, tariff 
    FROM table
    WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
    AND company = 0
    AND ((current_time BETWEEN timefrom AND timeto) OR (timefrom IS NULL AND timeto IS NULL))
    AND (strpos("Day", cast(to_char(now(), 'ID') AS varchar)) > 0 OR "Day" IS NULL)
    ORDER BY prefix_length DESC
    LIMIT 1

    Check how well it works with the indexes and try to tune it: add or remove an index on prefix_length, add prefix_length to the index used for the BETWEEN, and so on.

  4. Now you need to rewrite the queries sent to the database. Install PgBouncer with the PgBouncer-RR patch. It allows you to rewrite queries on the fly with simple Python code, as in this example:

    import re

    def rewrite_query(username, query):
        q1 = r"""^SELECT [^']*'(?P<id>\d+)%'[^']* ORDER BY (?P<position>\('%' in prefix\) ASC, char_length\(prefix\)) LIMIT """
        if not re.match(q1, query):
            return query  # nothing to do with other queries
        else:
            new_query = ...  # rewrite the query here
            return new_query
  5. Run PgBouncer and connect it to the DB. Issue different queries the way your application does and check how they get rewritten. Because you are dealing with text, you will have to tweak the regexps to match all the required queries and rewrite them properly.

  6. When the proxy is ready and debugged, reconnect your application to PgBouncer.
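For step 1, the schema change and the initial backfill could look roughly like this (a sketch; the table name table and the prefix column are taken from the queries above, everything else is an assumption):

ALTER TABLE "table"
    ADD COLUMN prefix_low    char(32),
    ADD COLUMN prefix_high   char(32),
    ADD COLUMN prefix_length smallint;

UPDATE "table"
SET prefix_low    = rpad(rtrim(prefix, '%'), 32, '0'),
    prefix_high   = rpad(rtrim(prefix, '%'), 32, '9'),
    prefix_length = length(rtrim(prefix, '%'));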

Pro:

  • no changes to the application
  • no changes to the basic structure of the DB

Contra:

  • extra maintenance: you need triggers to keep the new columns in sync with the actual data (see the trigger sketch below)
  • extra tools to support
  • the rewrite uses regexps, so it is closely tied to the particular queries issued by your app; you need to run it for some time to make the rewrite rules robust
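Such a trigger could look roughly like this (a sketch, assuming the columns added in step 1):

CREATE OR REPLACE FUNCTION set_prefix_bounds() RETURNS trigger AS $$
BEGIN
    NEW.prefix_low    := rpad(rtrim(NEW.prefix, '%'), 32, '0');
    NEW.prefix_high   := rpad(rtrim(NEW.prefix, '%'), 32, '9');
    NEW.prefix_length := length(rtrim(NEW.prefix, '%'));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER prefix_bounds_trg
    BEFORE INSERT OR UPDATE OF prefix ON "table"
    FOR EACH ROW EXECUTE PROCEDURE set_prefix_bounds();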

Further development:
hijack the parsed query tree in PostgreSQL itself, see https://wiki.postgresql.org/wiki/Query_Parsing

Optimizing SELECT count(*) on large table

The workers planned is 4 but launched 0, is that normal?

It can happen when too many concurrent transactions compete for a limited number of allowed parallel workers. The manual:

The number of background workers that the planner will consider using
is limited to at most max_parallel_workers_per_gather. The
total number of background workers that can exist at any one time is
limited by both max_worker_processes and
max_parallel_workers. Therefore, it is possible for a
parallel query to run with fewer workers than planned, or even with no
workers at all. The optimal plan may depend on the number of workers
that are available, so this can result in poor query performance. If
this occurrence is frequent, consider increasing
max_worker_processes and max_parallel_workers so that more workers
can be run simultaneously or alternatively reducing
max_parallel_workers_per_gather so that the planner requests fewer
workers.

You can also optimize overall performance to free up resources, or get better hardware (in addition to ramping up max_parallel_workers).
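A sketch of raising those limits with ALTER SYSTEM (the values are only examples; max_worker_processes requires a server restart, the other two take effect on reload):

ALTER SYSTEM SET max_worker_processes = 16;            -- needs a restart
ALTER SYSTEM SET max_parallel_workers = 16;            -- reload is enough
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;  -- reload is enough
SELECT pg_reload_conf();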

What's also troubling:

Heap Fetches: 46261956

For 90993600 rows. That's way too many for comfort. An index-only scan is not supposed to do that many heap fetches.

Both of these symptoms would indicate massive concurrent write access (or long-running transactions hogging resources and keeping autovacuum from doing its job). Look into that, and/or tune the per-table autovacuum settings for the table product to be more aggressive, so that column statistics stay valid and the visibility map can keep up. See:

  • Aggressive Autovacuum on PostgreSQL
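Per-table autovacuum settings for product could be tightened along these lines (a sketch; the numbers are only a starting point to tune):

ALTER TABLE product SET (
    autovacuum_vacuum_scale_factor  = 0.01,  -- vacuum after ~1 % of rows are dead instead of the default 20 %
    autovacuum_analyze_scale_factor = 0.01   -- refresh statistics more often, too
);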

Also, with halfway valid table statistics, a (blazingly fast!) estimate might be good enough? See:

  • Fast way to discover the row count of a table in PostgreSQL
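Such an estimate can be read straight from the catalog (reltuples is only as fresh as the last VACUUM or ANALYZE):

SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE oid = 'product'::regclass;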


Related Topics



Leave a reply



Submit