Gaps Between Primary Key Id in SQL Table

This is the way sequences work. From the PostgreSQL documentation:

Important: To avoid blocking concurrent transactions that obtain
numbers from the same sequence, a nextval operation is never rolled
back; that is, once a value has been fetched it is considered used and
will not be returned again. This is true even if the surrounding
transaction later aborts, or if the calling query ends up not using
the value.

As pointed out in the comments, there's no harm in having gaps in sequences. If you delete some rows from your table for whatever reason, you create gaps in your primary key values, and you wouldn't normally bother resetting them to make them sequential.
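A quick SQLite illustration of gaps from deletion (AUTOINCREMENT behaves like other engines' identity/sequence mechanisms here: a deleted id is never handed out again):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [("x",)] * 5)

# Deleting rows leaves a hole in the id range...
conn.execute("DELETE FROM t WHERE id IN (2, 3)")
# ...and a new insert does NOT fill it: the engine keeps counting upward.
conn.execute("INSERT INTO t (v) VALUES ('y')")

ids = [row[0] for row in conn.execute("SELECT id FROM t ORDER BY id")]
print(ids)  # [1, 4, 5, 6] -- the gap at 2-3 is permanent
```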

If you insist on creating a gapless sequence, read this article: http://www.varlena.com/GeneralBits/130.php and be prepared for slow inserts.
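The gapless technique in that article boils down to a counter row incremented in the same transaction as the insert. A rough SQLite sketch of the idea (table and column names are mine, not the article's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE counter (name TEXT PRIMARY KEY, value INTEGER);
    INSERT INTO counter VALUES ('invoice', 0);
    CREATE TABLE invoice (id INTEGER PRIMARY KEY, note TEXT);
""")

def insert_gapless(note):
    # The UPDATE locks the counter row, serialising every concurrent
    # insert behind it -- gapless, but slow, exactly as warned above.
    with conn:  # commit on success, roll back (counter bump included) on error
        conn.execute("UPDATE counter SET value = value + 1 WHERE name = 'invoice'")
        (new_id,) = conn.execute(
            "SELECT value FROM counter WHERE name = 'invoice'").fetchone()
        conn.execute("INSERT INTO invoice (id, note) VALUES (?, ?)", (new_id, note))

for note in ("a", "b", "c"):
    insert_gapless(note)

ids = [row[0] for row in conn.execute("SELECT id FROM invoice ORDER BY id")]
print(ids)  # [1, 2, 3] -- and a failed transaction would undo the counter bump too
```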

Do gaps in the Identity (primary key clustered) column affect database performance?

Expanding on my comment to explain a bit further. For clarity the comment was:

No, there will be no impact on performance. Since this is your clustering key, whether the seed is 45,322 or 2,000,001, the record will still be entered into the next free space on the clustered index after record 45,321. The value of an identity column is intended to be meaningless; if it is not, you are probably not using it correctly. After a big delete like that you may end up with some index fragmentation, but the identity seed is completely unrelated to this.

With regard to the fragmentation, on a very simplified example you might have a table with 5 pages, with 100 records per page:

  • Page 1 (IDs 1 - 100)
  • Page 2 (IDs 101 - 200)
  • Page 3 (IDs 201 - 300)
  • Page 4 (IDs 301 - 400)
  • Page 5 (IDs 401 - 500)

Now, if you do your delete, removing all records where the last digit is not 1 and all records where the ID is over 300, you get:

  • Page 1 (IDs 1, 11, 21, 31, 41, 51, 61, 71, 81, 91)
  • Page 2 (IDs 101, 111, 121, 131, 141, 151, 161, 171, 181, 191)
  • Page 3 (IDs 201, 211, 221, 231, 241, 251, 261, 271, 281, 291)
  • Page 4 (Empty)
  • Page 5 (Empty)

When we now insert into this table, whether the next identity is 292 or 501, it doesn't change anything. The pages must remain in the correct order, and the highest surviving ID is 291, so the next record must be inserted after that: on the same page if there is space, otherwise on a new page. In this case there are 90 empty slots on the 3rd page, so the next record can be inserted there. Since both 292 and 501 are higher than 291, the behaviour is the same.

In both cases the issue remains that after the delete you have 3 pages with lots of free space (each only 10% full). You now have only 30 records, which would happily fit on one page, so you could rebuild your index to repack them; then you only need to read a single page to get all your data.

Again, I stress, that this is a very simple example, and I would not suggest rebuilding a clustered index to free up 2 pages!
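The page arithmetic above can be checked with a toy model (page size and delete pattern taken from the example; this models record placement only, not real SQL Server pages):

```python
PAGE_SIZE = 100  # records per page, as in the example above

ids = list(range(1, 501))  # 5 full pages, IDs 1-500
# The delete: keep only IDs ending in 1 that are no more than 300.
kept = [i for i in ids if i % 10 == 1 and i <= 300]

# Fragmented: each surviving record stays on the page it was written to,
# so the free space inside the early pages is not reclaimed.
fragmented_pages = len({(i - 1) // PAGE_SIZE for i in kept})
# Rebuilt: records are repacked densely from the first page onward.
rebuilt_pages = -(-len(kept) // PAGE_SIZE)  # ceiling division

print(len(kept), fragmented_pages, rebuilt_pages)  # 30 3 1
```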

It is also important to stress that this behaviour occurs because the ID column is the clustering key, NOT because it is the primary key; they are not necessarily one and the same. However, if you were clustering on something other than your identity column, it would still make no difference to performance whether or not you reseeded it after the delete. Identity columns exist purely for that: identity. The actual value is irrelevant as long as you can uniquely identify a row.


SAMPLE TEST CODE

-- CREATE TABLE AND FILL WITH 100,000 ROWS
IF OBJECT_ID(N'dbo.DefragTest', 'U') IS NOT NULL
DROP TABLE dbo.DefragTest;

-- NAME THE PK SO THE ALTER INDEX ... REBUILD BELOW CAN REFERENCE IT
CREATE TABLE dbo.DefragTest
(
    ID INT IDENTITY(1, 1) CONSTRAINT PK_DefragTest__ID PRIMARY KEY CLUSTERED,
    Filler CHAR(1) NULL
);
INSERT dbo.DefragTest (Filler)
SELECT TOP 100000 NULL
FROM sys.all_objects AS a, sys.all_objects AS b;

-- CHECK PAGE STATISTICS
SELECT Stage = 'After Initial Insert',
IdentitySeed = IDENT_CURRENT(N'dbo.DefragTest'),
p.rows,
a.total_pages,
a.data_pages,
AvgRecordsPerPage = CAST(p.rows / CAST(a.data_pages AS FLOAT) AS DECIMAL(10, 2))
FROM sys.partitions AS p
LEFT JOIN sys.allocation_units AS a
ON a.container_id = p.partition_id
WHERE p.[object_id] = OBJECT_ID(N'dbo.DefragTest', 'U')
AND p.index_id IN (0, 1); -- CLUSTERED OR HEAP

-- DELETE RECORDS
DELETE dbo.DefragTest
WHERE ID % 10 != 1
OR ID > 50000;

-- CHECK PAGE STATISTICS
SELECT Stage = 'After Delete',
IdentitySeed = IDENT_CURRENT(N'dbo.DefragTest'),
p.rows,
a.total_pages,
a.data_pages,
AvgRecordsPerPage = CAST(p.rows / CAST(a.data_pages AS FLOAT) AS DECIMAL(10, 2))
FROM sys.partitions AS p
LEFT JOIN sys.allocation_units AS a
ON a.container_id = p.partition_id
WHERE p.[object_id] = OBJECT_ID(N'dbo.DefragTest', 'U')
AND p.index_id IN (0, 1); -- CLUSTERED OR HEAP

-- RESEED (OMIT THIS STATEMENT FOR THE "WITHOUT RESEED" RUN)
DBCC CHECKIDENT ('dbo.DefragTest', RESEED, 50000);

-- INSERT ROWS TO SEE EFFECT ON PAGES
INSERT dbo.DefragTest (Filler)
SELECT TOP 10000 NULL
FROM sys.all_objects AS a;

-- CHECK PAGE STATISTICS
SELECT Stage = 'After Second Insert',
IdentitySeed = IDENT_CURRENT(N'dbo.DefragTest'),
p.rows,
a.total_pages,
a.data_pages,
AvgRecordsPerPage = CAST(p.rows / CAST(a.data_pages AS FLOAT) AS DECIMAL(10, 2))
FROM sys.partitions AS p
LEFT JOIN sys.allocation_units AS a
ON a.container_id = p.partition_id
WHERE p.[object_id] = OBJECT_ID(N'dbo.DefragTest', 'U')
AND p.index_id IN (0, 1); -- CLUSTERED OR HEAP

-- CHECK READS REQUIRED FOR FULL TABLE SCAN
SET STATISTICS IO ON;
SELECT COUNT(Filler)
FROM dbo.DefragTest;

-- REBUILD INDEX
ALTER INDEX PK_DefragTest__ID ON dbo.DefragTest REBUILD;

-- CHECK PAGE STATISTICS
SELECT Stage = 'After Index Rebuild',
IdentitySeed = IDENT_CURRENT(N'dbo.DefragTest'),
p.rows,
a.total_pages,
a.data_pages,
AvgRecordsPerPage = CAST(p.rows / CAST(a.data_pages AS FLOAT) AS DECIMAL(10, 2))
FROM sys.partitions AS p
LEFT JOIN sys.allocation_units AS a
ON a.container_id = p.partition_id
WHERE p.[object_id] = OBJECT_ID(N'dbo.DefragTest', 'U')
AND p.index_id IN (0, 1); -- CLUSTERED OR HEAP

-- CHECK READS REQUIRED FOR FULL TABLE SCAN

SELECT COUNT(Filler)
FROM dbo.DefragTest;

SET STATISTICS IO OFF;

Output with Reseed:

Stage                   IdentitySeed    rows      total_pages    data_pages    AvgRecordsPerPage
After Initial Insert    100000          100000    178            174           574.71
After Delete            100000          5000      90             87            57.47
After Second Insert     52624           7624      98             91            83.78
After Index Rebuild     52624           7624      18             14            544.57

Table 'DefragTest'. Scan count 1, logical reads 93 (Count before rebuild)

Table 'DefragTest'. Scan count 1, logical reads 16 (Count after rebuild)

Output without Reseed:

Stage                   IdentitySeed    rows      total_pages    data_pages    AvgRecordsPerPage
After Initial Insert    100000          100000    178            174           574.71
After Delete            100000          5000      90             87            57.47
After Second Insert     102624          7624      98             91            83.78
After Index Rebuild     102624          7624      18             14            544.57

Table 'DefragTest'. Scan count 1, logical reads 93 (Count before rebuild)

Table 'DefragTest'. Scan count 1, logical reads 16 (Count after rebuild)

As you can see, in each case there is no difference in the way the data is stored or read; it is only the value of IDENT_CURRENT() that changes. In both cases rebuilding the clustered index drastically reduces the number of pages, which in turn improves query performance since fewer logical reads are needed to get the same amount of data.

Why are there gaps in my IDENTITY column values?

Do not expect the identities to be consecutive. There are many scenarios that can leave gaps. Consider the identity like an abstract number and do not attach any business meaning to it.

Do gaps in primary key slow down queries?

Normally, I would say this isn't an issue. However, this catches my eye:

For example, there's only 183 rows, but I'm already at 71511 for the id column.

This is a lot of gaps. The usual reasons for gaps are deletions and failed inserts (and in other databases, allocation of blocks of ids for efficiency).

You seem to understand the gaps. You can eliminate them with some sleight of hand on the inserts:

insert into t ( . . . )
select . . .  -- values here
from dual
where not exists (select 1 from t t2 where . . .)  -- duplicate key catch here
on duplicate key update . . .;

You will only need the duplicate key in the event of race conditions -- the where will eliminate most or all of them before the insert. This will probably eliminate all gaps.
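A concrete, runnable version of the pattern (sketched in SQLite; the table, columns, and the insert_if_absent helper are made up for the demo). Because the where not exists filters the duplicate out before the insert is attempted, no auto-increment value is burned:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE)")

def insert_if_absent(name):
    # A duplicate produces zero rows from the SELECT, so the insert is
    # never attempted and no id is allocated for it.
    conn.execute(
        "INSERT INTO t (name) SELECT ? WHERE NOT EXISTS "
        "(SELECT 1 FROM t WHERE name = ?)",
        (name, name),
    )

for name in ("a", "b", "a", "c"):  # "a" is attempted twice
    insert_if_absent(name)

rows = list(conn.execute("SELECT id, name FROM t ORDER BY id"))
print(rows)  # [(1, 'a'), (2, 'b'), (3, 'c')] -- no gap despite the duplicate
```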

Having to use a bigint for a table with 100,000 rows is more than inelegant. Big integers double the size of the key (8 bytes instead of 4) in every index that carries it. They increase the amount of storage needed for a record on each page, and increasing the size of the database increases the I/O overhead.

How can you find ID gaps in a MySQL recordset?

First of all, what advantage are you trying to get by reusing the skipped values? An ordinary INT UNSIGNED will let you count up to 4,294,967,295. With "millions of records" your database would have to grow a thousand times over before running out of valid IDs. (And then using a BIGINT UNSIGNED will bump you up to 18,446,744,073,709,551,615 values.)

Trying to recycle values MySQL has skipped is likely to use up a lot of your time trying to compensate for something that really doesn't bother MySQL in the first place.

With that said, you can find missing IDs with something like:

SELECT id + 1
FROM the_table t1
WHERE NOT EXISTS (SELECT 1 FROM the_table t2 WHERE t2.id = t1.id + 1)
  AND t1.id < (SELECT MAX(id) FROM the_table);

This will find only the first missing number in each gap (e.g., if you have {1, 2, 3, 8, 10} it will find {4, 9}), but it's likely to be efficient, and of course once you've filled in an ID you can always run it again.
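To sanity-check the query on that example set (SQLite here; the guard on MAX(id) keeps the value just past the end of the table from being reported as a gap):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO the_table (id) VALUES (?)",
                 [(i,) for i in (1, 2, 3, 8, 10)])

gaps = [row[0] for row in conn.execute("""
    SELECT id + 1
    FROM the_table t1
    WHERE NOT EXISTS (SELECT 1 FROM the_table t2 WHERE t2.id = t1.id + 1)
      AND t1.id < (SELECT MAX(id) FROM the_table)
    ORDER BY id
""")]
print(gaps)  # [4, 9]
```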

auto increment primary leaving gaps in counting

This is by design and will always happen.

Why?

Let's take 2 overlapping transactions that are doing INSERTs:

  • Transaction 1 does an INSERT, gets the value (let's say 42), does more work
  • Transaction 2 does an INSERT, gets the value 43, does more work

Then

  • Transaction 1 fails. Rolls back. 42 stays unused
  • Transaction 2 completes with 43

If consecutive values were guaranteed, every transaction would have to happen one after the other. Not very scalable.
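The allocation scheme can be sketched with a toy in-process sequence (a simplification, not how a real engine implements it): the lock is held only for the increment, so transactions never wait on each other, and a rolled-back transaction simply leaves its value unused.

```python
import threading

class ToySequence:
    """Mimics a database sequence: values are handed out under a short
    lock and never handed back, even if the caller's work later fails."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def nextval(self):
        with self._lock:  # held only for the increment, not the whole transaction
            value = self._next
            self._next += 1
            return value

seq = ToySequence(start=42)
t1_id = seq.nextval()  # transaction 1 takes 42, keeps working
t2_id = seq.nextval()  # transaction 2 takes 43 without waiting for t1
# Transaction 1 now rolls back: 42 is simply never used -- a gap.
print(t1_id, t2_id, seq.nextval())  # 42 43 44
```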

Also see Do Inserted Records Always Receive Contiguous Identity Values (SQL Server, but the same principle applies).

MySQL finding gaps in column with multiple ID

You can do this with not exists:

select s.*
from sequence s
where not exists (select 1 from sequence s2 where s2.id = s.id and s2.value = s.value + 1) and
exists (select 1 from sequence s2 where s2.id = s.id and s2.value > s.value);

The exists clause is important so you don't report the final value for each id.

EDIT:

Here is a better approach:

select s.value + 1 as startgap,
(select min(s2.value) - 1 from sequence s2 where s2.id = s.id and s2.value > s.value) as endgap
from sequence s
where not exists (select 1 from sequence s2 where s2.id = s.id and s2.value = s.value + 1) and
exists (select 1 from sequence s2 where s2.id = s.id and s2.value > s.value);
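Checked against a small SQLite data set (I've added s.id to the output so gaps from different ids can be told apart):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (id INTEGER, value INTEGER)")
conn.executemany("INSERT INTO sequence (id, value) VALUES (?, ?)",
                 [(1, v) for v in (1, 2, 5, 6, 9)] + [(2, v) for v in (1, 4)])

gaps = list(conn.execute("""
    select s.id,
           s.value + 1 as startgap,
           (select min(s2.value) - 1 from sequence s2
            where s2.id = s.id and s2.value > s.value) as endgap
    from sequence s
    where not exists (select 1 from sequence s2
                      where s2.id = s.id and s2.value = s.value + 1)
      and exists (select 1 from sequence s2
                  where s2.id = s.id and s2.value > s.value)
    order by s.id, startgap
"""))
print(gaps)  # [(1, 3, 4), (1, 7, 8), (2, 2, 3)]
```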

