Improving query speed: simple SELECT in big postgres table
Extracting my comments into an answer: the index lookup here was very fast -- all the time was spent retrieving the actual rows. 23 seconds / 7871 rows = 2.9 milliseconds per row, which is reasonable for retrieving data that's scattered across the disk subsystem. Seeks are slow; you can a) fit your dataset in RAM, b) buy SSDs, or c) organize your data ahead of time to minimize seeks.
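The per-row figure above is simple arithmetic; a quick sanity check in Python, using the numbers from the question:

```python
total_ms = 23 * 1000   # 23 seconds for the whole query
rows = 7871            # rows retrieved
ms_per_row = total_ms / rows
print(round(ms_per_row, 1))  # -> 2.9, i.e. roughly one random seek per row
```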
PostgreSQL 9.2 has a feature called index-only scans that allows it to (usually) answer queries without accessing the table. You can combine this with the btree index property of automatically maintaining order to make this query fast. You mention int1, int2, and two floats:
CREATE INDEX sometable_int1_floats_key ON sometable (int1, float1, float2);
CREATE INDEX sometable_int2_floats_key ON sometable (int2, float1, float2);
SELECT float1,float2 FROM sometable WHERE int1=<value>; -- uses int1 index
SELECT float1,float2 FROM sometable WHERE int2=<value>; -- uses int2 index
Note also that this doesn't magically erase the disk seeks, it just moves them from query time to insert time. It also costs you storage space, since you're duplicating the data. Still, this is probably the trade-off you want.
PostgreSQL: speed up SELECT query in table with millions of rows
The index won't help.
Two solutions:
You could either change the query to:
WHERE dep_dt >= '2017-08-15 00:00:00' AND dep_dt < '2017-08-16 00:00:00'
Then the index can be used.
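The key is the half-open range [start of day, start of next day), which never cuts into the timestamp values themselves. A minimal Python sketch of how such bounds are built (the date is the example value from the question):

```python
from datetime import date, datetime, timedelta

def day_range(d: date):
    """Return half-open [start, end) bounds covering one calendar day."""
    start = datetime.combine(d, datetime.min.time())
    end = start + timedelta(days=1)
    return start, end

start, end = day_range(date(2017, 8, 15))
# use as: WHERE dep_dt >= start AND dep_dt < end
print(start, end)
```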
Or create an index on an expression:
CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date));
(or a different time zone) and change the query to
WHERE (dep_dt AT TIME ZONE 'UTC')::date = '2017-08-16'
The AT TIME ZONE clause is necessary because otherwise the result of the cast would depend on your current TimeZone setting.
The first solution is simpler, but the second has the advantage that you can add price_ct to the index like this:
CREATE INDEX ON all_legs(((dep_dt AT TIME ZONE 'UTC')::date), price_ct);
Then you don't need a sort any more, and your query will be as fast as it can theoretically get.
Postgres select on a very large table
You are currently removing 855 rows using a filter on col3 and col4 in order to find 10 rows that pass that filter. So, as I feared, the things that fail the filter may be rarer than other things, but they are sitting right in the way. And the next time you need 10 more things, they will still be in the way. And the time after that. Not only are you doing about 85 times more work than you need to on every execution, you are hitting ~85 times more pages. If the same thing happens for every other combination of col1 and col2, then no wonder you keep running out of cache space and IOPS. And of course there is no reason for it to stop there: you could have far more than 850 rows accumulate in the way if you have nothing to get rid of them.
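The ~85x figure is just the ratio from the plan; a quick back-of-the-envelope check (numbers taken from the question):

```python
filtered_out = 855   # rows removed by the col3/col4 filter
returned = 10        # rows actually needed
ratio = filtered_out / returned
print(ratio)  # -> 85.5: roughly 85 times more rows (and pages) touched than needed
```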
You could use a partial index to avoid visiting those rows each time:
create index on t (col1, col2, col5) where col3='f' and col4>0;
Alternatively, each time col3 becomes true or col4 becomes 0, you could simply delete the row and (possibly) insert it into a history table if you need to keep a record of it.
Postgres Large table Optimal index selection for Select Query With In Clause
CREATE INDEX table_name_desc_index ON table_name (user_id, date DESC NULLS LAST);
Then try the following:
SELECT *
FROM table_name
INNER JOIN (VALUES (1),(2),(3),(6),(7),(10)) val(v) ON (table_name.status = v)
WHERE user_id = $user_id
ORDER BY date DESC
LIMIT 20;
postgres large table select optimization
It's hard to make an index for queries like string LIKE pattern, because the wildcards (% and _) can appear anywhere.
I can suggest one risky solution:
Slightly redesign the table to make it indexable. Add two more columns prefix_low and prefix_high of fixed width, for example char(32), or any length sufficient for the task. Also add one smallint column for the prefix length. Fill them with the lowest and highest values matching the prefix, and with the prefix length. For example:
SELECT rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low,
       rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high,
       length(rtrim('00436641997142%','%')) AS prefix_length;

            prefix_low            |            prefix_high           | prefix_length
----------------------------------+----------------------------------+---------------
 00436641997142000000000000000000 | 00436641997142999999999999999999 |            14

Make an index with these values:
CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
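The same prefix_low/prefix_high computation can also be done on the application side. A minimal Python sketch mirroring the rpad/rtrim logic above (the 32-character width is the example's assumption, not a requirement):

```python
def like_prefix_bounds(pattern: str, width: int = 32):
    """Derive index-range bounds from a trailing-wildcard LIKE pattern."""
    prefix = pattern.rstrip('%')       # same as rtrim(pattern, '%')
    low = prefix.ljust(width, '0')     # same as rpad(prefix, width, '0')
    high = prefix.ljust(width, '9')    # same as rpad(prefix, width, '9')
    return low, high, len(prefix)

low, high, plen = like_prefix_bounds('00436641997142%')
print(low, high, plen)  # bounds padded to 32 chars, prefix length 14
```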
Check the modified queries against the table:
SELECT prefix, changeprefix, deletelast, outgroup, tariff
FROM table
WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
AND company = 0
AND ((current_time BETWEEN timefrom AND timeto) OR (timefrom IS NULL AND timeto IS NULL))
AND (strpos("Day", cast(to_char(now(), 'ID') AS varchar)) > 0 OR "Day" IS NULL)
ORDER BY prefix_length DESC
LIMIT 1;
Check how well it works with the indexes and try to tune it: add or remove an index on prefix_length, add it to the BETWEEN index, and so on.
Now you need to rewrite the queries sent to the database. Install PgBouncer and the PgBouncer-RR patch. It allows you to rewrite queries on the fly with simple Python code, for example:
import re

def rewrite_query(username, query):
    q1 = r"""^SELECT [^']*'(?P<id>\d+)%'[^']* ORDER BY (?P<position>\('%' in prefix\) ASC, char_length\(prefix\)) LIMIT """
    if not re.match(q1, query):
        return query  # nothing to do with other queries
    else:
        new_query = query  # ... rewrite the query here
        return new_query
Run pgBouncer and connect it to the database. Try issuing the different queries your application uses and check how they get rewritten. Because you are dealing with text, you will have to tweak the regexps to match all the required queries and rewrite them properly.
When the proxy is ready and debugged, reconnect your application to pgBouncer.
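To make the rewrite step concrete, here is a minimal runnable sketch of one such rule. The regex and the rewritten predicate are illustrative assumptions for the schema above (a trailing-wildcard LIKE on prefix turned into the range predicate that the (prefix_low, prefix_high) index can serve), not the exact rule your queries will need:

```python
import re

# Hypothetical rule: match a trailing-wildcard LIKE on the prefix column.
_LIKE = re.compile(r"prefix\s+LIKE\s+'(\d+%)'")

def rewrite_query(username, query):
    m = _LIKE.search(query)
    if not m:
        return query  # leave other queries untouched
    replacement = f"'{m.group(1)}' BETWEEN prefix_low AND prefix_high"
    return query[:m.start()] + replacement + query[m.end():]

print(rewrite_query('app', "SELECT * FROM t WHERE prefix LIKE '0043%'"))
# -> SELECT * FROM t WHERE '0043%' BETWEEN prefix_low AND prefix_high
```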
Pro:
- no changes to application
- no changes to basic structure of DB
Contra:
- extra maintenance - you need triggers to keep the new columns in sync with the actual data
- extra tools to support
- the rewrite uses regexps, so it's closely tied to the particular queries issued by your app. You will need to run it for some time and make the rewrite rules robust.
Further development:
hijack the parsed query tree in PostgreSQL itself: https://wiki.postgresql.org/wiki/Query_Parsing
Optimizing SELECT count(*) on large table
The workers planned is 4 but launched 0, is that normal?
It can happen when too many concurrent transactions compete for a limited number of allowed parallel workers. The manual:
The number of background workers that the planner will consider using is limited to at most max_parallel_workers_per_gather. The total number of background workers that can exist at any one time is limited by both max_worker_processes and max_parallel_workers. Therefore, it is possible for a parallel query to run with fewer workers than planned, or even with no workers at all. The optimal plan may depend on the number of workers that are available, so this can result in poor query performance. If this occurrence is frequent, consider increasing max_worker_processes and max_parallel_workers so that more workers can be run simultaneously, or alternatively reducing max_parallel_workers_per_gather so that the planner requests fewer workers.
You can also optimize overall performance to free up resources, or get better hardware (in addition to ramping up max_parallel_workers).
What's also troubling:
Heap Fetches: 46261956
For 90993600 rows. That's way too many for comfort. An index-only scan is not supposed to do that many heap fetches.
Both of these symptoms would indicate massive concurrent write access (or long-running transactions hogging resources and keeping autovacuum from doing its job). Look into that, and/or tune the per-table autovacuum settings for the table product to be more aggressive, so that column statistics stay valid and the visibility map can keep up. See:
- Aggressive Autovacuum on PostgreSQL
Also, with halfway valid table statistics, a (blazingly fast!) estimate might be good enough? See:
- Fast way to discover the row count of a table in PostgreSQL