Is Count(*) in SQL Server a Constant Time Operation? If Not, Why Not

Is COUNT(*) in SQL Server a constant time operation? If not, why not?

No, COUNT(*) is not a constant time operation. A COUNT(*) must return a count of rows that conform to the current scan predicate (ie. WHERE clause), so that alone would make the return of a metadata property invalid. But even if you have no predicates, the COUNT still has to satisfy the current transaction isolation semantics, ie. return the count of rows visible (eg. committed). So COUNT must, and will, in SQL Server, actually scan and count the rows. Some systems allow return of faster 'estimate' counts.

Also, as a side comment, relying on rows in sys.partitions is unreliable. After all, if this count would be guaranteed accurate then we would not need DBCC UPDATEUSAGE(...) WITH COUNT_ROWS. There are several scenarios that historically would cause this counter to drift apart from reality (mostly minimally logged insert rollbacks), all I know of are fixed, but that still leaves the problems of 1) upgraded tables from earlier versions that had the bugs and 2) other, not yet discovered, bugs.

In addition, why do we need to "load" entire rows (as indicated in the post I linked) just to count them? Shouldn't indexes or PKs etc. be sufficient to count them?

This is not 100% true. There are at least 2 scenarios that do no 'load entire rows':

  • narrow rowstore indexes load just the 'index' row, which may be much smaller
  • columnstore data loads just the relevant column segments

And most of what I say above do not apply for Hekaton tables.

Is count(*) constant time in SQLite, and if not what are alternatives?

SQLite has a special optimization for COUNT(*) without a WHERE clause, where it goes through the table's B-tree pages and counts entries without actually loading the records.
However, this still requires that all the table's data (except overflow pages for large records) is visited, so the runtime is still O(n).

SQLite does not store a separate record count in the database because that would make all changes slower.

is SELECT COUNT(*) expensive?

COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause. For example:

mysql> SELECT COUNT(*) FROM student;

This optimization applies only to MyISAM tables only, because an exact row count is stored for this storage engine and can be accessed very quickly.

Source

As you said you use MyISAM and your query is for the whole table, it doesn't matter if its 1 or 100000 rows.

Count rows in an SQL table in O(1)

In some RDBM's this is O(1) (most notably MySQL), put AFAIK it is generally frowned upon and considered an "ugly performance hack". The reasons is that if you have transactions (which every real RDBM should have), the total number of rows in the table might or might not be equal to the total number you can see from the current transaction. This is why the server needs to check which rows are actually visible to your transaction, making it more O(n) than O(1).

If you want to optimize the process of getting the number of rows and are satisfied with an approximate result, most RDBM's have special "info" tables which hold information about the tables, including the approximate number of rows (again, it is not the exact number of rows because of the transactions).

Fastest way to count exact number of rows in a very large table?

Simple answer:

  • Database vendor independent solution = use the standard = COUNT(*)
  • There are approximate SQL Server solutions but don't use COUNT(*) = out of scope

Notes:

COUNT(1) = COUNT(*) = COUNT(PrimaryKey) just in case

Edit:

SQL Server example (1.4 billion rows, 12 columns)

SELECT COUNT(*) FROM MyBigtable WITH (NOLOCK)
-- NOLOCK here is for me only to let me test for this answer: no more, no less

1 runs, 5:46 minutes, count = 1,401,659,700

--Note, sp_spaceused uses this DMV
SELECT
Total_Rows= SUM(st.row_count)
FROM
sys.dm_db_partition_stats st
WHERE
object_name(object_id) = 'MyBigtable' AND (index_id < 2)

2 runs, both under 1 second, count = 1,401,659,670

The second one has less rows = wrong. Would be the same or more depending on writes (deletes are done out of hours here)

Count(*) vs Count(1) - SQL Server

There is no difference.

Reason:

Books on-line says "COUNT ( { [ [ ALL | DISTINCT ] expression ] | * } )"

"1" is a non-null expression: so it's the same as COUNT(*).
The optimizer recognizes it for what it is: trivial.

The same as EXISTS (SELECT * ... or EXISTS (SELECT 1 ...

Example:

SELECT COUNT(1) FROM dbo.tab800krows
SELECT COUNT(1),FKID FROM dbo.tab800krows GROUP BY FKID

SELECT COUNT(*) FROM dbo.tab800krows
SELECT COUNT(*),FKID FROM dbo.tab800krows GROUP BY FKID

Same IO, same plan, the works

Edit, Aug 2011

Similar question on DBA.SE.

Edit, Dec 2011

COUNT(*) is mentioned specifically in ANSI-92 (look for "Scalar expressions 125")

Case:

a) If COUNT(*) is specified, then the result is the cardinality of T.

That is, the ANSI standard recognizes it as bleeding obvious what you mean. COUNT(1) has been optimized out by RDBMS vendors because of this superstition. Otherwise it would be evaluated as per ANSI

b) Otherwise, let TX be the single-column table that is the
result of applying the <value expression> to each row of T
and eliminating null values. If one or more null values are
eliminated, then a completion condition is raised: warning-

What is the difference between count(0), count(1).. and count(*) in mySQL/SQL?

Nothing really, unless you specify a field in a table or an expression within parantheses instead of constant values or *

Let me give you a detailed answer. Count will give you non-null record number of given field. Say you have a table named A

select 1 from A
select 0 from A
select * from A

will all return same number of records, that is the number of rows in table A. Still the output is different. If there are 3 records in table. With X and Y as field names

select 1 from A will give you

1
1
1

select 0 from A will give you
0
0
0

select * from A will give you ( assume two columns X and Y is in the table )
X Y
-- --
value1 value1
value2 (null)
value3 (null)

So, all three queries return the same number. Unless you use

select count(Y) from A 

since there is only one non-null value you will get 1 as output

The Timecomplexity of Built in SQL Functions such as sum, count, avg

In SQL the math function complexity of aggregates is totally irelevant. The only thing that really matters is the data acceess complexity: what access path is chosen (table scan, index range scan, index seek etc) and how many pages are read. There may be slight differences in the internals of each aggregate, but they all work pretty much the same way (keep running state and compute running aggregation for each input value) and there is absolutely NO aggregate that looks at input twice, so they all O(n) as internal implementation, where 'n' is the number of recordes fed to the aggregate (not necesarily the number of record in the table!).

Some aggregates have internal shortcuts, eg. COUNT(*) may return the count from metadata on some systems, if possible.



Related Topics



Leave a reply



Submit