Int VS Unique-Identifier for Id Field in Database

INT vs Unique-Identifier for ID field in database

GUIDs are problematic as clustered keys because of the high randomness. This issue was addressed by Paul Randal in the last Technet Magazine Q&A column: I'd like to use a GUID as the clustered index key, but the others are arguing that it can lead to performance issues with indexes. Is this true and, if so, can you explain why?

Now bear in mind that the discussion is specifically about clustered indexes. You say you want to use the column as 'ID', that is unclear if you mean it as clustered key or just primary key. Typically the two overlap, so I'll assume you want to use it as clustered index. The reasons why that is a poor choice are explained in the link to the article I mentioned above.

For non clustered indexes GUIDs still have some issues, but not nearly as big as when they are the leftmost clustered key of the table. Again, the randomness of GUIDs introduces page splits and fragmentation, be it at the non-clustered index level only (a much smaller problem).

There are many urban legends surrounding the GUID usage that condemn them based on their size (16 bytes) compared to an int (4 bytes) and promise horrible performance doom if they are used. This is slightly exaggerated. A key of size 16 can be a very peformant key still, on a properly designed data model. While is true that being 4 times as big as a int results in more a lower density non-leaf pages in indexes, this is not a real concern for the vast majority of tables. The b-tree structure is a naturally well balanced tree and the depth of tree traversal is seldom an issue, so seeking a value based on GUID key as opposed to a INT key is similar in performance. A leaf-page traversal (ie. a table scan) does not look at the non-leaf pages, and the impact of GUID size on the page size is typically quite small, as the record itself is significantly larger than the extra 12 bytes introduced by the GUID. So I'd take the hear-say advice based on 'is 16 bytes vs. 4' with a, rather large, grain of salt. Analyze on individual case by case and decide if the size impact makes a real difference: how many other columns are in the table (ie. how much impact has the GUID size on the leaf pages) and how many references are using it (ie. how many other tables will increase because of the fact they need to store a larger foreign key).

I'm calling out all these details in a sort of makeshift defense of GUIDs because they been getting a lot of bad press lately and some is undeserved. They have their merits and are indispensable in any distributed system (the moment you're talking data movement, be it via replication or sync framework or whatever). I've seen bad decisions being made out based on the GUID bad reputation when they were shun without proper consideration. But is true, if you have to use a GUID as clustered key, make sure you address the randomness issue: use sequential guids when possible.

And finally, to answer your question: if you don't have a specific reason to use GUIDs, use INTs.

GUID vs INT IDENTITY

Kimberley Tripp (SQLSkills.com) has an article on using GUID's as primary keys. She advices against it because of the unnecessary overhead.

T-SQL performance penalty of using UNIQUEIDENTIFIER vs. INT

GUID is not the only option for surrogate values to be unique across databases. Two options come to mind immediately:

  1. Create your own hashing. Use a BIGINT and set the identity seed on each database, starting with, say, 1000000000 on database A, 2000000000 on database B, etc. This supports many, many databases.
  2. Have a central sequence generator that just bumps up a central identity column and gives the new id back to be used in the insert. Scale and concurrency might be more of a factor here, but only at the high end I'd suspect.

In SQL Server 2012, you will be able to implement (2) much easier using SEQUENCE.

Is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?

That depends on what you're doing:

  • If speed is the primary concern then a plain old int is probably big enough.
  • If you really will have more than 2 billion (with a B ;) ) records, then use bigint or a sequential guid.
  • If you need to be able to easily synchronize with records created remotely, then Guid is really great.

Update

Some additional (less-obvious) notes on Guids:

  • They can be hard on indexes, and that cuts to the core of database performance
  • You can use sequential guids to get back some of the indexing performance, but give up some of the randomness used in point two.
  • Guids can be hard to debug by hand (where id='xxx-xxx-xxxxx'), but you get some of that back via sequential guids as well (where id='xxx-xxx' + '123').
  • For the same reason, Guids can make ID-based security attacks more difficult- but not impossible. (You can't just type 'http://example.com?userid=xxxx' and expect to get a result for someone else's account).

Uniqueidentifier vs. IDENTITY vs. Material Code --which is the best choice for primary key?

GUID may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as primary and clustering key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

  • GUIDs as PRIMARY KEY and/or clustered key
  • The clustered index debate continues
  • Ever-increasing clustering key - the Clustered Index Debate..........again!
  • Disk space is cheap - that's not the point!

Unless you have a very good reason, I would argue to use a INT IDENTITY for almost every "real" data table as the default for their primary key - it's unique, it's stable (never changes), it's narrow, it's ever increasing - all the good properties that you want to have in a clustering key for fast and reliable performance of your SQL Server tables!

If you have some "natural" key value that also has all those properties, then you might also use that instead of a surrogate key. But two variable-length strings of max. 20 chars each do not meet those requirements in my opinion.

The differences between INT and UUID in MySQL

UUID returns a universal unique identifier (hopefuly also unique if imported to another DB as well).

To quote from MySQL doc (emphasis mine):

A UUID is designed as a number that is globally unique in space and
time
. Two calls to UUID() are expected to generate two different
values, even if these calls are performed on two separate computers
that are not connected to each other.

On the other hand a simply INT primary id key (e.g. AUTO_INCREMENT) will return a unique integer for the specific DB and DB table, but which is not universally unique (so if imported to another DB chances are there will be primary key conflicts).

In terms of performance, there shouldn't be any noticeable difference using auto-increment over UUID. Most posts (including some by the authors of this site), state as such. Of course UUID may take a little more time (and space), but this is not a performance bottleneck for most (if not all) cases. Having a column as Primary Key should make both choices equal wrt to performance. See references below:

  1. To UUID or not to UUID?
  2. Myths, GUID vs Autoincrement
  3. Performance: UUID vs auto-increment in cakephp-mysql
  4. UUID performance in MySQL?
  5. Primary Keys: IDs versus GUIDs (coding horror)

(UUID vs auto-increment performance results, adapted from Myths, GUID vs Autoincrement)

Sample Image

UUID pros / cons (adapted from Primary Keys: IDs versus GUIDs)

GUID Pros

  • Unique across every table, every database, every server
  • Allows easy merging of records from different databases
  • Allows easy distribution of databases across multiple servers
  • You can generate IDs anywhere, instead of having to roundtrip to the database
  • Most replication scenarios require GUID columns anyway

GUID Cons

  • It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if
    you're not careful
  • Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
  • The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of
    clustered indexes.

Note

I would read carefully the mentioned references and decide whether to use UUID or not depending on my use case. That said, in many cases UUIDs would be indeed preferable. For example one can generate UUIDs without using/accessing the database at all, or even use UUIDs which have been pre-computed and/or stored somewhere else. Plus you can easily generalise/update your database schema and/or clustering scheme without having to worry about IDs breaking and causing conflicts.

In terms of possible collisions, for example using v4 UUIDS (random), the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.

Unique identifier (guid) as primary key in database design

A 128-bit GUID (uniqueidentifier) key is of course 4x larger than a 32-bit int key. However, there are a few key advantages:

  • No "IDENTITY INSERT" issue when merging content
  • If you use a COMB value instead of NEWSEQUENTIALID(), you get a "free" INSERT timestamp. You can even SELECT from the primary key based on a date/time range if you want with a few fancy CAST() calls.
  • They are globally unique, which turns out to be pretty handy now and then.
  • Since there's no need to track high-water marks, your BL layer can assign the value rather than SQL Server, thus eliminating the step of SELECT scope_identity() to get the primary key after an insert.
  • If it's even remotely possible that you could have more than 2 billion records, you'll need to use bigint (64 bits) instead of int. Once you do that, uniqueidentifier is only twice as big as a bigint.
  • Using GUIDs makes it safer to expose keys in URLs, etc. without exposing yourself to "guess-the-ID" attacks.
  • Between how SQL Server loads pages from disk and how processors are now mostly 64-bit, just because a number is 128 bits instead of 32 doesn't mean it takes 4x longer to compare. The last test I saw showed that GUIDs are nearly as fast.
  • Index size depends on how many columns are included. Even though the GUIDs themselves are larger, the extra 8 or 12 bytes may be insignificant compared to the other columns in the index.

In the end, squeezing out some small performance advantage by using integers may not be worth losing the advantages of a GUID. Test it empirically and decide for yourself.

Personally, I still use both, depending on the situation, but the deciding factor has never really come down to performance in my case.



Related Topics



Leave a reply



Submit