Does the Number of Columns Returned Affect the Speed of a Query

Does the number of columns returned affect the speed of a query?

You should avoid SELECT *

  • It leads to confusion when you change the table layout.
  • It selects unneeded columns, so your data packets get larger.
  • Columns can end up with duplicate names, which some applications cannot handle.
  • If all the columns you need are covered by an index, SELECT columns will use only that index, while SELECT * has to visit the table records to fetch the values you don't need. That is also bad for performance.
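The covering-index point can be made visible without a full server. Here is a minimal sketch using SQLite (as a stand-in for whatever DBMS you use; the table `users` and index `idx_users_id_name` are made up for the demo): `EXPLAIN QUERY PLAN` shows that selecting only indexed columns is answered from the index alone, while SELECT * still has to visit the table rows.

```python
# Sketch: with an index on (id, name), selecting only those columns can be
# answered from the index alone; SELECT * cannot. Table/index names are
# hypothetical, chosen for this demo.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table users (id integer, name text, bio text)")
conn.execute("create index idx_users_id_name on users (id, name)")

covered = conn.execute(
    "explain query plan select id, name from users where id = 1"
).fetchall()
full = conn.execute(
    "explain query plan select * from users where id = 1"
).fetchall()

# The plan detail is the last column of each EXPLAIN QUERY PLAN row.
print(covered[0][3])  # ... USING COVERING INDEX idx_users_id_name ...
print(full[0][3])     # ... USING INDEX ... (table rows still visited)
```

The word COVERING in the first plan is the tell: the query never touches the table itself.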

Does number of columns in a table affect the performance of a count(*) query on a table with no indexes?

Yes, the number of columns will - indirectly - influence the performance. The amount of data in those columns will also affect the speed.

Why is that?

Every DBMS stores rows in blocks - typically 8k blocks, but not necessarily; data warehouse systems in particular tend to use larger block sizes. If a table has many columns with a lot of data in them (think varchar columns), fewer rows fit on a single database block.

For a transactional system that supports proper isolation, a count(*) query must visit all rows in the table (and check for each row whether it's currently visible to your transaction). The smallest unit a DBMS reads from the hard disk is a block, so the more rows fit onto a block, the less I/O needs to be done.

If a row takes 100 bytes on average, a single 8k block will contain approximately 80 rows. To count all rows in a table with 80 rows, the database therefore only needs a single I/O operation (in reality it's a bit more, to look up the table itself).

Now if each row needs 1000 bytes, a single block contains approximately 8 rows, which in turn means that to count all 80 rows the DB needs to do 10 I/O operations.

Even if the data is cached, it's still 1 "logical" I/O vs. 10 "logical" I/O operations.
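The block arithmetic above can be sketched in a few lines. This assumes 8k blocks and ignores per-block and per-row overhead, so the numbers are approximate:

```python
# Back-of-the-envelope sketch of the block arithmetic: how many blocks must
# be read to scan a table, given the row count and average row size.
# Assumes 8k blocks and ignores block/row overhead.
import math

BLOCK_SIZE = 8192

def blocks_needed(row_count, row_bytes):
    rows_per_block = BLOCK_SIZE // row_bytes
    return math.ceil(row_count / rows_per_block)

print(blocks_needed(80, 100))   # -> 1 block (about 80 rows fit per block)
print(blocks_needed(80, 1000))  # -> 10 blocks (only 8 rows fit per block)
```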

The above only holds true if no indexes are involved.

This effect can be observed with any DBMS that supports a detailed execution plan. Take this sample script for Postgres:

create table data5 (c1 text, c2 text, c3 text, c4 text, c5 text);
insert into data5
select rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X')
from generate_series(1,100000);

create table data10 (c1 text, c2 text, c3 text, c4 text, c5 text, c6 text, c7 text, c8 text, c9 text, c10 text);
insert into data10
select rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X'),
rpad('X',50,'X')
from generate_series(1,100000);

The above creates two tables with 100,000 rows each: one with 5 columns, the other with 10.

When doing an explain analyze, the following will be returned (Postgres 9.3):

explain (analyze, buffers, verbose)
select count(*)
from data5;

Aggregate  (cost=4192.00..4192.01 rows=1 width=0) (actual time=27.539..27.539 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=2942
  ->  Seq Scan on stuff.data5  (cost=0.00..3942.00 rows=100000 width=0) (actual time=0.005..16.158 rows=100000 loops=1)
        Output: c1, c2, c3, c4, c5
        Buffers: shared hit=2942
Total runtime: 27.595 ms

The line Buffers: shared hit=2942 tells us that Postgres had to look at 2942 blocks in order to read through the entire table.

Now for the table with 10 columns:

explain (analyze, buffers, verbose)
select count(*)
from data10;

Aggregate  (cost=7917.00..7917.01 rows=1 width=0) (actual time=34.964..34.965 rows=1 loops=1)
  Output: count(*)
  Buffers: shared hit=6667
  ->  Seq Scan on stuff.data10  (cost=0.00..7667.00 rows=100000 width=0) (actual time=0.010..22.187 rows=100000 loops=1)
        Output: c1, c2, c3, c4, c5, c6, c7, c8, c9, c10
        Buffers: shared hit=6667
Total runtime: 35.025 ms

We can see that Postgres had to look at 6667 blocks to get the count.
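The same effect can be reproduced with plain SQLite, which ships with Python. This is a sketch, not the Postgres example above: the table name `data`, the row count, and the exact page counts are all specific to this demo, and SQLite uses 4k pages by default rather than 8k blocks. The shape of the result is the same, though: doubling the row width roughly doubles the number of pages the file needs.

```python
# Sketch: build a table of n_cols text columns (each value 50 chars, like
# the rpad('X',50,'X') values above) and report how many pages the
# database occupies. Wider rows -> fewer rows per page -> more pages.
import sqlite3

def pages_for(n_cols, n_rows=10_000):
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"c{i} text" for i in range(1, n_cols + 1))
    conn.execute(f"create table data ({cols})")
    row = tuple("X" * 50 for _ in range(n_cols))
    placeholders = ", ".join("?" for _ in range(n_cols))
    conn.executemany(f"insert into data values ({placeholders})",
                     [row] * n_rows)
    conn.commit()
    return conn.execute("pragma page_count").fetchone()[0]

print(pages_for(5))   # some number of pages
print(pages_for(10))  # roughly twice as many pages
```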

A similar thing can be observed using Oracle's SQL*Plus and the set autotrace statistics option, which also shows the amount of (logical) I/O that was performed.

The statistics for the data5 table would look something like this:

 VALUE | STATISTIC
-------+---------------------------------------
   140 | bytes received via SQL*Net from client
   755 | bytes sent via SQL*Net to client
  2977 | consistent gets
     0 | db block gets
     0 | physical reads
     0 | recursive calls
     0 | redo size
  2977 | session logical reads
     1 | sorts (memory)
     2 | SQL*Net roundtrips to/from client
     1 | rows processed

The "consistent gets" denote the amount of logical I/O.

For the data10 table the output is as follows:

 VALUE | STATISTIC
-------+---------------------------------------
   141 | bytes received via SQL*Net from client
   615 | bytes sent via SQL*Net to client
  7184 | consistent gets
     0 | db block gets
     0 | physical reads
     0 | recursive calls
     0 | redo size
  7184 | session logical reads
     1 | sorts (memory)
     2 | SQL*Net roundtrips to/from client
     1 | rows processed

Again we clearly see the increase in (logical) I/O.

Does the number of columns affect MySQL speed?

Yes, extra data can slow down queries because fewer rows fit into a page, which means more disk accesses to read a given number of rows, and fewer rows can be cached in memory.

The exact factor in slow down is hard to predict. It could be negligible, but if you are near the boundary between being able to cache the entire table in memory or not, a few extra columns could make a big difference to the execution speed. The difference in the time it takes to fetch a row from a cache in memory or from disk is several orders of magnitude.

If you add a covering index the extra columns should have less of an impact as the query can use the relatively narrow index without needing to refer to the wider main table.

Does the speed of the query depend on the number of rows in the table?

There are many factors affecting the speed of a query, one of which is the number of rows.

Others include:

  • index strategy (if you index column "x", you will see better performance than if it's not indexed)
  • server load
  • data caching - once you've executed a query, the data will be added to the data cache, so subsequent reruns will be much quicker because the data comes from memory rather than disk - at least until the data is evicted from the cache
  • execution plan caching - to a lesser extent. Once a query is executed for the first time, the execution plan SQL Server comes up with will be cached for a period of time, for future executions to reuse.
  • server hardware
  • the way you've written the query (often one of the biggest contributors to poor performance!) - e.g. using a cursor instead of a set-based operation

For databases with a large number of rows in tables, partitioning is usually something to consider (SQL Server 2005 onwards has built-in support in Enterprise Edition). It splits the data into smaller units, and generally smaller units = smaller tables = smaller indexes = better performance.

Does increasing count of columns affects 'SELECT' statement?

There are basically two parts to consider.

First, there's how data is laid out on disk. The smallest unit of data read is generally a page, containing one or more rows. Adding columns costs more storage per row, so retrieving a given number of rows will require the storage engine to read more pages, resulting in more disk I/O.

If the columns you select are in an index however, the row data may not even have to be read, as the column's data can be fetched from the index.

Then there's caching. Given a fixed size of memory cache, fewer rows will fit in there if you have more columns, so by adding more columns you'll have less data ready to be served from cache.

Second, there's network traffic, which is affected by the columns requested. A SELECT * will have to package all row data into network packets, while selecting specific columns will only transfer the data of those columns.

More network traffic per row means fewer rows per second downloaded by your application.

So in short: only select the columns you're going to need, and consider including relevant columns in an index where appropriate.


