Is Using Count(*) or Select * a Good Idea

Is using COUNT(*) or SELECT * a good idea?

1. On count(*) vs. count(something else)

SQL is declarative in that you specify what you want. This is different from specifying how to get what you want. That means the database engine is free to realize your query in whatever way it thinks is the most efficient. Many database optimizers rewrites your query to a less costly alternative (if such a plan is available).

Given the following table:

table(
pk not null
,color not null
,nullable null
,unique(pk)
,index(color)
);

...all of the following are functionally equivalent (due to the mechanics of count and nulls):

1) select count(*) from table;
2) select count(1) from table;
3) select count(pk) from table;
4) select count(color) from table;

Regardless of which form you use, the optimizer is free to rewrite the query to another form if it is more efficient. (Again, not all optimizers are sophisticated enough to do this). The unique index(pk) would be smaller (bytes occupied) than the entire table. Therefore it would be more efficient to count the number of index entries rather than scanning through the entire table. In Oracle we have bitmap indexes, which also compress repeating strings. If we had used such an index on the color column, it would probably have been the smallest index to scan. Oracle also supports table compression which in some cases makes the physical table smaller than a composite index.

1. TL;DR;
Your specific dbms will have its own set of tools that enables different rewriting rules and in turn execution plans. That renders the question somewhat useless (unless we talk about a specific release of a specific dbms). I recommend COUNT(*) in all cases because it requires the least cognitive effort to grasp.

2. On select a,b,c vs. select *

There are very few valid uses of SELECT * in code you write and put into production. Imagine a table which contains Bluray movies (yes, the movies is stored as a blob in this table). So you slapped together your awesomesauce abstraction layer and put SELECT * FROM movies where id = ? in the getMovies(movie_id) method. I will refrain myself from explaining why SELECT name FROM movies will be transported across the network just a tad faster. Of course, in most realistic cases it won't have a noticable impact.

One last point on performance is that when all the referenced columns (selected, filtered) in your query exists as an index (called a covering index), the database need not touch the table at all. It can be fully resolved from scanning the index only. By selecting all columns you remove this option from the optimizer.

Another thing about SELECT * which is far more serious than anything, is that it creates an implicit dependency on a specific physical layout of the table. Let me explain. Consider the following tables:

table T1(name, id)
table T2(name, id)

The following statement...

insert into t1 select * from t2;

... will break or produce a different result if any of the following happens:

  • Any of the tables columns are rearranged for example T1(id, name)
  • T1 gets an additional not-null column
  • T2 gets another column

2. TL;DR; When possible, explicitly specify the columns you want (eventually, you'll have to do that anyway). Also, selecting fewer columns are faster than selecting more columns. A possitive side-effect on explicit selects is that it gives greater freedom to the optimizer.

SQL: using count() vs. keeping a separate field

If your table is indexed, which it almost certainly will be, then the performance of select count(*) probably will not be as bad as you might anticipate - even if you have millions of rows.

But, if it does become a concern, then rather than roll your own solution, look into using an indexed view.

Should I COUNT(*) or not?

If the column in question is NOT NULL, both of your queries are equivalent. When group_id contains null values,

select count(*)

will count all rows, whereas

select count(group_id)

will only count the rows where group_id is not null.

Also, some database systems, like MySQL employ an optimization when you ask for count(*) which makes such queries a bit faster than the specific one.

Personally, when just counting, I'm doing count(*) to be on the safe side with the nulls.

Is COUNT faster than pulling the records and counting in code?

If you know you need the data, go ahead and pull it and count it in code. However, if you only need the count, it is significantly faster to pull the count from the database than it is to actually retrieve rows. Also it is standard practice to only pull what you need.

For instance, if you are counting all the rows in a table, most database implementations do not need to look at any rows. Tables know how many rows they have. If the query has filters in the where clause and it can use an index, it again will not need to look at the actual rows' data, just counts the rows from the index.

And all this is not counting the less data transferred.

A rule of thumb about database speeds is go ahead and try it for yourself. General rules are not always a good indicator. For instance, if the table was 10 rows and only a few columns, I might just pull the whole thing anyway on the off chance I needed it, since 2 round trips to the database would outweigh the cost of the query.

Should you always do a COUNT(*) before a SELECT * to determine if there are any rows?

Unless you lock the table/s in question, doing a select count(*) is useless. Consider:

Process 1:

SELECT COUNT(*) FROM T;

Process 2:

INSERT INTO T

Process 1:

...now doing something based on the obsolete count retrieved before...

Of course, locking a table is not a very good idea in a server environment.

COUNT(*) vs. COUNT(1) vs. COUNT(pk): which is better?

Bottom Line

Use either COUNT(field) or COUNT(*), and stick with it consistently, and if your database allows COUNT(tableHere) or COUNT(tableHere.*), use that.

In short, don't use COUNT(1) for anything. It's a one-trick pony, which rarely does what you want, and in those rare cases is equivalent to count(*)

Use count(*) for counting

Use * for all your queries that need to count everything, even for joins, use *

SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

But don't use COUNT(*) for LEFT joins, as that will return 1 even if the subordinate table doesn't match anything from parent table

SELECT boss.boss_id, COUNT(*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

Don't be fooled by those advising that when using * in COUNT, it fetches entire row from your table, saying that * is slow. The * on SELECT COUNT(*) and SELECT * has no bearing to each other, they are entirely different thing, they just share a common token, i.e. *.

An alternate syntax

In fact, if it is not permitted to name a field as same as its table name, RDBMS language designer could give COUNT(tableNameHere) the same semantics as COUNT(*). Example:

For counting rows we could have this:

SELECT COUNT(emp) FROM emp

And they could make it simpler:

SELECT COUNT() FROM emp

And for LEFT JOINs, we could have this:

SELECT boss.boss_id, COUNT(subordinate)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

But they cannot do that (COUNT(tableNameHere)) since SQL standard permits naming a field with the same name as its table name:

CREATE TABLE fruit -- ORM-friendly name
(
fruit_id int NOT NULL,
fruit varchar(50), /* same name as table name,
and let's say, someone forgot to put NOT NULL */
shape varchar(50) NOT NULL,
color varchar(50) NOT NULL
)

Counting with null

And also, it is not a good practice to make a field nullable if its name matches the table name. Say you have values 'Banana', 'Apple', NULL, 'Pears' on fruit field. This will not count all rows, it will only yield 3, not 4

SELECT count(fruit) FROM fruit

Though some RDBMS do that sort of principle (for counting the table's rows, it accepts table name as COUNT's parameter), this will work in Postgresql (if there is no subordinate field in any of the two tables below, i.e. as long as there is no name conflict between field name and table name):

SELECT boss.boss_id, COUNT(subordinate)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

But that could cause confusion later if we will add a subordinate field in the table, as it will count the field(which could be nullable), not the table rows.

So to be on the safe side, use:

SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

count(1): The one-trick pony

In particular to COUNT(1), it is a one-trick pony, it works well only on one table query:

SELECT COUNT(1) FROM tbl

But when you use joins, that trick won't work on multi-table queries without its semantics being confused, and in particular you cannot write:

-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.1)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

So what's the meaning of COUNT(1) here?

SELECT boss.boss_id, COUNT(1)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

Is it this...?

-- counting all the subordinates only
SELECT boss.boss_id, COUNT(subordinate.boss_id)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

Or this...?

-- or is that COUNT(1) will also count 1 for boss regardless if boss has a subordinate
SELECT boss.boss_id, COUNT(*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

By careful thought, you can infer that COUNT(1) is the same as COUNT(*), regardless of type of join. But for LEFT JOINs result, we cannot mold COUNT(1) to work as: COUNT(subordinate.boss_id), COUNT(subordinate.*)

So just use either of the following:

-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.boss_id)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

Works on Postgresql, it's clear that you want to count the cardinality of the set

-- count the subordinates that belongs to boss
SELECT boss.boss_id, COUNT(subordinate.*)
FROM boss
LEFT JOIN subordinate on subordinate.boss_id = boss.boss_id
GROUP BY boss.id

Another way to count the cardinality of the set, very English-like (just don't make a column with a name same as its table name) : http://www.sqlfiddle.com/#!1/98515/7

select boss.boss_name, count(subordinate)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name

You cannot do this: http://www.sqlfiddle.com/#!1/98515/8

select boss.boss_name, count(subordinate.1)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name

You can do this, but this produces wrong result: http://www.sqlfiddle.com/#!1/98515/9

select boss.boss_name, count(1)
from boss
left join subordinate on subordinate.boss_code = boss.boss_code
group by boss.boss_name

Count(*) vs Count(1) - SQL Server

There is no difference.

Reason:

Books on-line says "COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )"

"1" is a non-null expression: so it's the same as COUNT(*).
The optimizer recognizes it for what it is: trivial.

The same as EXISTS (SELECT * ... or EXISTS (SELECT 1 ...

Example:

SELECT COUNT(1) FROM dbo.tab800krows
SELECT COUNT(1),FKID FROM dbo.tab800krows GROUP BY FKID

SELECT COUNT(*) FROM dbo.tab800krows
SELECT COUNT(*),FKID FROM dbo.tab800krows GROUP BY FKID

Same IO, same plan, the works

Edit, Aug 2011

Similar question on DBA.SE.

Edit, Dec 2011

COUNT(*) is mentioned specifically in ANSI-92 (look for "Scalar expressions 125")

Case:

a) If COUNT(*) is specified, then the result is the cardinality of T.

That is, the ANSI standard recognizes it as bleeding obvious what you mean. COUNT(1) has been optimized out by RDBMS vendors because of this superstition. Otherwise it would be evaluated as per ANSI

b) Otherwise, let TX be the single-column table that is the
result of applying the <value expression> to each row of T
and eliminating null values. If one or more null values are
eliminated, then a completion condition is raised: warning-

is SELECT COUNT(*) expensive?

COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause. For example:

mysql> SELECT COUNT(*) FROM student;

This optimization applies only to MyISAM tables only, because an exact row count is stored for this storage engine and can be accessed very quickly.

Source

As you said you use MyISAM and your query is for the whole table, it doesn't matter if its 1 or 100000 rows.

SQL Query: Which one should i use? count(columnname) or count(1)

There can be differences between count(*) and count(column). count(*) is often fastest for reasons discussed here. Basically, with count(column) the database has to check if column is null or not in each row. With count(column) it just returns the total number of rows in the table which is probably has on hand. The exact details may depend on the database and the version of the database.

Short answer: use count(*) or count(1). Hell, forget the count and select userid.

You should also make sure the where clause is performing well and that its using an index. Look into EXPLAIN.



Related Topics



Leave a reply



Submit