Fastest way to count exact number of rows in a very large table?
Simple answer:
- Database-vendor-independent solution: use the standard COUNT(*)
- There are approximate SQL Server solutions that don't use COUNT(*), but they are out of scope here
Notes:
COUNT(1) = COUNT(*) = COUNT(PrimaryKey), just in case you were wondering
Edit:
SQL Server example (1.4 billion rows, 12 columns)
SELECT COUNT(*) FROM MyBigtable WITH (NOLOCK)
-- NOLOCK here is for me only to let me test for this answer: no more, no less
1 run, 5:46 minutes, count = 1,401,659,700
-- Note: sp_spaceused uses this DMV
SELECT
    Total_Rows = SUM(st.row_count)
FROM
    sys.dm_db_partition_stats st
WHERE
    object_name(st.object_id) = 'MyBigtable' AND (st.index_id < 2)
2 runs, both under 1 second, count = 1,401,659,670
The second one returned fewer rows = not exact. The true count would be the same or more, depending on writes (deletes are done out of hours here).
MySQL: Fastest way to count number of rows
When you COUNT(*), it takes column indexes into account, so it will give the best result. MySQL with the MyISAM engine actually stores the row count (based on the primary key's column); it doesn't count all rows each time you try to count them.
Using PHP to count rows is not very smart, because you have to send data from MySQL to PHP. Why do it when you can achieve the same on the MySQL side?
If the COUNT(*) is slow, you should run EXPLAIN on the query and check whether indexes are really used, and where they should be added.
The following is not the fastest way, but there is a case where COUNT(*) doesn't really fit: when you start grouping results, you can run into a problem where COUNT doesn't really count all rows.
The solution is SQL_CALC_FOUND_ROWS. This is usually used when you are selecting rows but still need to know the total row count (for example, for paging).
When you select data rows, just append the SQL_CALC_FOUND_ROWS keyword after SELECT:
SELECT SQL_CALC_FOUND_ROWS [needed fields or *] FROM table LIMIT 20 OFFSET 0;
After you have selected needed rows, you can get the count with this single query:
SELECT FOUND_ROWS();
FOUND_ROWS() has to be called immediately after the data-selecting query.
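Worth noting that SQL_CALC_FOUND_ROWS is deprecated as of MySQL 8.0.17; the portable replacement is to run the page query and a separate COUNT(*) with the same WHERE clause. A minimal sketch of that two-query pattern, using Python's sqlite3 for illustration (table, column, and data are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, flag INTEGER)")
conn.executemany("INSERT INTO t (flag) VALUES (?)", [(i % 2,) for i in range(50)])

# One query for the page of rows ...
page = conn.execute(
    "SELECT id FROM t WHERE flag = 1 ORDER BY id LIMIT 10 OFFSET 0").fetchall()
# ... and a second one, with the same WHERE clause, for the total row count
total = conn.execute("SELECT COUNT(*) FROM t WHERE flag = 1").fetchone()[0]
print(len(page), total)  # 10 25
```

The same WHERE clause in both queries is what keeps the page and the total consistent.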
In conclusion, everything actually comes down to how many entries you have and what is in the WHERE clause. You should really pay attention to how indexes are being used when there are lots of rows (tens of thousands, millions, and up).
Faster COUNT(*) with large row count
I can think of two options for such a situation.
First, when the id being compared is part of a composite primary key, then be sure that the id being used is the first key in the primary key. If you have to do the comparison on multiple keys, then use a secondary index for each subkey.
Second, if these are insert-only tables with increasing primary keys, then you can store a cumulative count at each step. This requires a trigger or other logic on insert.
Then you could retrieve the count by looking up the extreme values. That is two lookups rather than an index scan.
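The cumulative-count idea can be sketched with an insert trigger that stamps each new row with a running total; the current count is then just the value on the newest row. A hedged sketch in Python's sqlite3 (table, column, and trigger names are invented; trigger syntax differs across engines):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, running_count INTEGER);

-- stamp each inserted row with a cumulative count
CREATE TRIGGER events_cumulative AFTER INSERT ON events
BEGIN
    UPDATE events
    SET running_count = (SELECT COALESCE(MAX(running_count), 0) + 1 FROM events)
    WHERE id = NEW.id;
END;
""")

conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])

# the count is the cumulative value on the newest row: a lookup, not a scan
count = conn.execute(
    "SELECT running_count FROM events ORDER BY id DESC LIMIT 1").fetchone()[0]
print(count)  # 3
```

This only stays correct for insert-only tables, as the answer says; deletes or out-of-order inserts would break the running total.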
Fast way to discover the row count of a table in PostgreSQL
Counting rows in big tables is known to be slow in PostgreSQL. The MVCC model requires a full count of live rows for a precise number. There are workarounds to speed this up dramatically if the count does not have to be exact, as seems to be the case here.
(Remember that even an "exact" count is potentially dead on arrival under concurrent write load.)
Exact count
Slow for big tables.
With concurrent write operations, it may be outdated the moment you get it.
SELECT count(*) AS exact_count FROM myschema.mytable;
Estimate
Extremely fast:
SELECT reltuples AS estimate FROM pg_class WHERE relname = 'mytable';
Typically, the estimate is very close. How close depends on whether ANALYZE or VACUUM are run enough, where "enough" is defined by the level of write activity on your table.
Safer estimate
The above ignores the possibility of multiple tables with the same name in one database - in different schemas. To account for that:
SELECT c.reltuples::bigint AS estimate
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname = 'mytable'
AND n.nspname = 'myschema';
The cast to bigint formats the real number nicely, especially for big counts.
Better estimate
SELECT reltuples::bigint AS estimate
FROM pg_class
WHERE oid = 'myschema.mytable'::regclass;
Faster, simpler, safer, more elegant. See the manual on Object Identifier Types.
Replace 'myschema.mytable'::regclass with to_regclass('myschema.mytable') in Postgres 9.4+ to get nothing instead of an exception for invalid table names. See:
- How to check if a table exists in a given schema
Better estimate yet (for very little added cost)
We can do what the Postgres planner does. Quoting the Row Estimation Examples in the manual:
These numbers are current as of the last VACUUM or ANALYZE on the table. The planner then fetches the actual current number of pages in the table (this is a cheap operation, not requiring a table scan). If that is different from relpages then reltuples is scaled accordingly to arrive at a current number-of-rows estimate.
Postgres uses estimate_rel_size, defined in src/backend/utils/adt/plancat.c, which also covers the corner case of no data in pg_class because the relation was never vacuumed. We can do something similar in SQL:
Minimal form
SELECT (reltuples / relpages * (pg_relation_size(oid) / 8192))::bigint
FROM pg_class
WHERE oid = 'mytable'::regclass; -- your table here
Safe and explicit
SELECT (CASE WHEN c.reltuples < 0 THEN NULL -- never vacuumed
WHEN c.relpages = 0 THEN float8 '0' -- empty table
ELSE c.reltuples / c.relpages END
* (pg_catalog.pg_relation_size(c.oid)
/ pg_catalog.current_setting('block_size')::int)
)::bigint
FROM pg_catalog.pg_class c
WHERE c.oid = 'myschema.mytable'::regclass; -- schema-qualified table here
Doesn't break with empty tables and tables that have never seen VACUUM or ANALYZE. The manual on pg_class:

If the table has never yet been vacuumed or analyzed, reltuples contains -1 indicating that the row count is unknown.
If this query returns NULL, run ANALYZE or VACUUM for the table and repeat. (Alternatively, you could estimate row width based on column types like Postgres does, but that's tedious and error-prone.)
If this query returns 0, the table seems to be empty. But I would run ANALYZE to make sure. (And maybe check your autovacuum settings.)
Typically, block_size is 8192. current_setting('block_size')::int covers rare exceptions.
Table and schema qualifications make it immune to any search_path and scope.
Either way, the query consistently takes < 0.1 ms for me.
More Web resources:
- The Postgres Wiki FAQ
- The Postgres wiki pages for count estimates and count(*) performance
TABLESAMPLE SYSTEM (n) in Postgres 9.5+
SELECT 100 * count(*) AS estimate FROM mytable TABLESAMPLE SYSTEM (1);
Like @a_horse commented, the added clause for the SELECT command can be useful if statistics in pg_class are not current enough for some reason. For example:
- No autovacuum running.
- Immediately after a large INSERT / UPDATE / DELETE.
- TEMPORARY tables (which are not covered by autovacuum).
This only looks at a random n % (1 in the example) selection of blocks and counts rows in it. A bigger sample increases the cost and reduces the error; your pick. Accuracy depends on more factors:
- Distribution of row size. If a given block happens to hold wider-than-usual rows, the count is lower than usual, etc.
- Dead tuples or a FILLFACTOR occupy space per block. If unevenly distributed across the table, the estimate may be off.
- General rounding errors.
Typically, the estimate from pg_class will be faster and more accurate.
Answer to actual question
First, I need to know the number of rows in that table, if the total
count is greater than some predefined constant,
And whether it ...
... is possible at the moment the count pass my constant value, it will
stop the counting (and not wait to finish the counting to inform the
row count is greater).
Yes. You can use a subquery with LIMIT:
SELECT count(*) FROM (SELECT 1 FROM token LIMIT 500000) t;
Postgres actually stops counting beyond the given limit, so you get an exact and current count for up to n rows (500000 in the example), and n otherwise. Not nearly as fast as the estimate in pg_class, though.
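The bounded-count trick works the same way on other engines. A small demonstration with Python's sqlite3, using a limit of 500 as a stand-in for the 500000 threshold above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO token VALUES (?)", [(i,) for i in range(1000)])

limit = 500  # stand-in for the 500000 threshold in the answer
bounded = conn.execute(
    "SELECT count(*) FROM (SELECT 1 FROM token LIMIT ?) t", (limit,)
).fetchone()[0]
print(bounded)  # 500 -- the table holds more rows than the limit
```

If the table held fewer than `limit` rows, the same query would return the exact count.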
BigQuery / Count the number of rows until a specific row is reached
As pointed out by @Schwern, if you don't have a column giving you an idea of the order of the events, you cannot get the result you expect.
That being said, here is a solution if you have an event_date or event_timestamp column:
WITH temp AS (
  SELECT
    id,
    event,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_date) AS rownum
  FROM
    sample )
SELECT
  id,
  event,
  rownum - COALESCE(LAG(rownum) OVER (PARTITION BY id ORDER BY rownum), 0) - 1 AS count_events
FROM
  temp
WHERE
  event = 'approved'
With the data you provided, it returns the desired output:
The logic behind the query is that the count of 'pending's before an 'approved' is the position of the 'approved' (its row number) minus the position of the previous 'approved', minus 1.
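That logic can be checked on a tiny dataset. A sketch using Python's sqlite3 (window functions need SQLite 3.25+; the sample data is invented: two 'pending's before the first 'approved', one before the second):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER, event TEXT, event_date INTEGER)")
conn.executemany("INSERT INTO sample VALUES (?, ?, ?)", [
    (1, "pending", 1), (1, "pending", 2), (1, "approved", 3),
    (1, "pending", 4), (1, "approved", 5),
])

rows = conn.execute("""
WITH temp AS (
  SELECT id, event,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_date) AS rownum
  FROM sample
)
SELECT id,
       rownum - COALESCE(LAG(rownum) OVER (PARTITION BY id ORDER BY rownum), 0) - 1
FROM temp
WHERE event = 'approved'
ORDER BY rownum
""").fetchall()
print(rows)  # [(1, 2), (1, 1)]
```

Note that the WHERE filter runs before the LAG window, so the lag is taken between consecutive 'approved' rows only.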
How to count a number of unique rows considering that A, B row = B, A row
You can use aggregation:
select (case when col1 < col2 then col1 else col2 end) as least,
       (case when col1 < col2 then col2 else col1 end) as greatest,
       count(*)
from t
group by (case when col1 < col2 then col1 else col2 end),
         (case when col1 < col2 then col2 else col1 end);
Many databases support the least() and greatest() functions, which simplify this logic a bit.
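SQLite's two-argument min() and max() behave like least()/greatest(), which makes for a compact demonstration of the unordered-pair grouping (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col2 TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("A", "B"), ("B", "A"), ("A", "B"), ("C", "D")])

# two-argument min()/max() act as least()/greatest() in SQLite
rows = conn.execute("""
SELECT min(col1, col2) AS lo, max(col1, col2) AS hi, count(*)
FROM t
GROUP BY min(col1, col2), max(col1, col2)
ORDER BY lo, hi
""").fetchall()
print(rows)  # [('A', 'B', 3), ('C', 'D', 1)]
```

('A', 'B') and ('B', 'A') collapse into the same group, which is exactly the "A, B row = B, A row" requirement.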
SQL to return a row where the count of rows in a table joined by a primary key is an exact number
Aggregation can be made to work. We can add another condition to the HAVING clause which rules out the presence of any assetid other than 101 and 102.
SELECT s.outfitid, COUNT(a.outfitid)
FROM savedoutfits s
INNER JOIN accessories a ON s.outfitid = a.outfitid
WHERE s.param = 'a'
GROUP BY s.outfitid
HAVING COUNT(DISTINCT a.assetid) = 2 AND
COUNT(CASE WHEN a.assetid NOT IN (101, 102) THEN 1 END) = 0;
Note that I have also switched from left to inner join, which seems to be the logic you want here.
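Under invented sample data (outfit 1 has exactly assets {101, 102}, outfit 2 is missing one, outfit 3 has an extra), the HAVING pair keeps only the exact match; a sqlite3 sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE savedoutfits (outfitid INTEGER, param TEXT);
CREATE TABLE accessories (outfitid INTEGER, assetid INTEGER);
INSERT INTO savedoutfits VALUES (1, 'a'), (2, 'a'), (3, 'a');
-- outfit 1: exactly {101, 102}; outfit 2: only 101; outfit 3: an extra 103
INSERT INTO accessories VALUES (1, 101), (1, 102),
                               (2, 101),
                               (3, 101), (3, 102), (3, 103);
""")

rows = conn.execute("""
SELECT s.outfitid
FROM savedoutfits s
JOIN accessories a ON s.outfitid = a.outfitid
WHERE s.param = 'a'
GROUP BY s.outfitid
HAVING COUNT(DISTINCT a.assetid) = 2
   AND COUNT(CASE WHEN a.assetid NOT IN (101, 102) THEN 1 END) = 0
""").fetchall()
print(rows)  # [(1,)]
```

The first HAVING condition rules out subsets of {101, 102}; the second rules out supersets.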
Fastest way to count the rows in any database table?
count(column_name) is not inaccurate; it's simply something completely different from count(*).
The SQL standard defines count(column_name) as equivalent to count(*) where column_name IS NOT NULL, so the result is bound to be different if column_name is nullable.
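The difference is easy to see: with one NULL in the column, count(*) and count(column_name) diverge. A sqlite3 sketch with invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])

star = conn.execute("SELECT count(*) FROM t").fetchone()[0]  # counts all rows
col = conn.execute("SELECT count(a) FROM t").fetchone()[0]   # skips NULLs
print(star, col)  # 3 2
```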
In Oracle (and possibly other DBMS as well), count(*) will use an available index on a not-null column to count the rows (e.g. the PK index), so it will be just as fast.
Additionally, there is nothing similar to the rowid in SQL Server or MySQL (in PostgreSQL it would be ctid).
Do use count(*)
. It's the best option to get the row count. Let the DBMS do any optimization in the background if adequate indexes are available.
Edit
A quick demo on how Oracle automatically uses an index if available and how that reduces the amount of work done by the database:
The setup of the test table:
create table foo (id integer not null, c1 varchar(2000), c2 varchar(2000));
insert into foo (id, c1, c2)
select lvl, c1, c1 from
(
select level as lvl, dbms_random.string('A', 2000) as c1
from dual
connect by level < 10000
);
That generates 9999 rows, with each row filling up some space in order to make sure the table has a realistic size.
Now in SQL*Plus I run the following:
SQL> set autotrace traceonly explain statistics;
SQL> select count(*) from foo;
Execution Plan
----------------------------------------------------------
Plan hash value: 1342139204
-------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
-------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 2740 (1)| 00:00:33 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | TABLE ACCESS FULL| FOO | 9999 | 2740 (1)| 00:00:33 |
-------------------------------------------------------------------
Statistics
----------------------------------------------------------
181 recursive calls
0 db block gets
10130 consistent gets
0 physical reads
0 redo size
430 bytes sent via SQL*Net to client
420 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
As you can see, a full table scan is done on the table, which requires 10130 "IO operations" (I know that's not the right term, but for the sake of the demo it should be a good enough explanation for someone who has never seen this before).
Now I create an index on that column and run the count(*) again:
SQL> create index i1 on foo (id);
Index created.
SQL> select count(*) from foo;
Execution Plan
----------------------------------------------------------
Plan hash value: 129980005
----------------------------------------------------------------------
| Id | Operation | Name | Rows | Cost (%CPU)| Time |
----------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | | |
| 2 | INDEX FAST FULL SCAN| I1 | 9999 | 7 (0)| 00:00:01 |
----------------------------------------------------------------------
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
27 consistent gets
21 physical reads
0 redo size
430 bytes sent via SQL*Net to client
420 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
As you can see, Oracle did use the index on the (not null!) column, and the amount of IO went down drastically (from 10130 to 27; not something I'd call "grossly inefficient").
The "physical reads" stem from the fact that the index was just created and was not yet in the cache.
I would expect other DBMS to apply the same optimizations.