Guid Primary /Foreign Key Dilemma SQL Server

Guid Primary /Foreign Key dilemma SQL Server

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

  • GUIDs as PRIMARY KEY and/or clustered key
  • The clustered index debate continues
  • Ever-increasing clustering key - the Clustered Index Debate..........again!

So if you really must change your primary keys to GUIDs - try to make sure the primary key isn't the clustering key, and you still have an INT IDENTITY field on the table that is used as the clustering key. Otherwise, your performance is sure to tank and take a severe hit .

What are the best practices for using a GUID as a primary key, specifically regarding performance?

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

  • GUIDs as PRIMARY KEY and/or clustered key
  • The clustered index debate continues
  • Ever-increasing clustering key - the Clustered Index Debate..........again!
  • Disk space is cheap - that's not the point!

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MYINT (INT IDENTITY) as your clustering key - use this:

CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
MyINT INT IDENTITY(1,1) NOT NULL,
.... add more columns as needed ...... )

ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID)

CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT)

Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!

Could using Sequential Guid as primary key in SQL Server lead to low performance for big data?

GUID may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the primary key of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

  1. The primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.

  2. The clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based primary / clustered key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as primary and clustering key:

  • Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
  • 6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: 25 MB vs. 106 MB - and that's just on a single table!

And yes - a larger size of a table or of an index automatically means more data pages that need to be loaded from disk, held in memory, transferred to the client - all negatively impacting your performance. How much impact that has really depends on lots of factors of your database design and your data distribution, so any generalized predictions are next to impossible...

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

  • GUIDs as PRIMARY KEY and/or clustered key
  • The clustered index debate continues
  • Ever-increasing clustering key - the Clustered Index Debate..........again!
  • Disk space is cheap - that's not the point!

Allowing duplicate uniqueidentifiers in SQL Server 2008?

Just because the datatype is UNIQUEIDENTIFIER doesn't imply you cannot have the same value in that column multiple times!

You CAN in fact - unless of course, you've explicitly added a UNIQUE CONSTRAINT or UNIQUE INDEX on that column - this is your choice, but there's nothing applied by default, unless you do something about it yourself.

So you should be able to reference HardwareType from InputType using the UNIQUEIDENTIFIER - even if multiple rows in InputType will reference the same row in HardwareType - no problems at all.

TSQL Copy Dilemma with Uniqueidentifier

Inserting new values using helper variable to hold generated NEWID:

CREATE TABLE #main(main_id UNIQUEIDENTIFIER PRIMARY KEY);

CREATE TABLE #sub(sub_id VARCHAR(100) PRIMARY KEY,
main_id UNIQUEIDENTIFIER FOREIGN KEY REFERENCES #main(main_id)
);

DECLARE @new_guid UNIQUEIDENTIFIER = NEWID();

INSERT INTO #main(main_id)
VALUES (@new_guid);

INSERT INTO #sub(sub_id, main_id)
VALUES ('abc', @new_guid);

SELECT *
FROM #main

SELECT *
FROM #sub;

Copying data from one table to another:

INSERT INTO #main_new(main_id);
SELECT main_id
FROM #main;

INSERT INTO #sub_new(sub_id, main_id)
SELECT sub_id, main_id
FROM #sub;

EDIT:

Example of reinserting data using join and temporary table and then insert specific columns to tables.

SELECT NEWID() as guid, sub_id, [rn] = ROW_NUMBER() OVER(ORDER BY m.main_id)
INTO #temporary
FROM #main m
JOIN #sub s
ON m.main_id = s.main_id

INSERT INTO #main
SELECT guid
FROM #temporary;

INSERT INTO #sub
SELECT CONCAT(sub_id, rn), guid
FROM #temporary;

SELECT *
FROM #main;

SELECT *
FROM #sub;

Int Autoincrementing Primary Key and GUID column

Generally speaking, an auto-incrementing key will not collide; in fact, I'm not sure how to make it happen (since most definitions in SQL include a Unique constraint). You might get duplicate key errors, though. Why are you letting outside vendors dictate what your internal keys are? Accept a completed record from them, and generate the keys yourself. If you're running out of keys, increase the size of the column (to a long). You could generate a new GUID column, yes, but you'd likely still be generating the GUIDs anyways, so what would be the difference?

Often, most tables in a database will have more than one 'unique' key. These are the 'natural' keys - the actual columns that make up a unique set of data; unfortunately, this may end up being the entire width of the table (which is why id columns are used). Also, unless there is some bizarre use case, don't make the server know about the client ids - make the clients know about the server ids, and keep track of it in a separate column (if necessary) on the client. Then can then have their own internal ids that they use before uploading, that have no relation to the keys you give back - and which they never share with the server.

It doesn't really matter what the server id is (int based or a GUID), although I'd recommend sticking with whatever you have. Whenever the client gives you a row to put on the server, give them back the generated id from the server. This prevents clients from attempting to insert already-in-use ids, and related issues (how do you know the id they gave you shouldn't have been generated by some other device?). I would say that clients do not get to dictate internal ids on the server



EDIT:

In light of some quick research into replication (in response to HLGEM), and re-reading the original question; I initially took the question as asking about letting clients dictate the internal primary keys (which is still bad), and/or replacing the int primary keys with the GUID. Although your exact needs may not require it, adding an extra GUID column that the client can populate would work. Some recommendations:

1) Treat unique key violations (accidental or malicious) on the GUID column as a business exception, not a system exception. I'd probably recommend telling the client to re-generate the GUID and re-submit (silently).
2) The client never gets to dictate the acutal primary key column values in the database. In fact, if you use a GUID column, the client may never even need to know about them.

Primary key as INT and global key as GUID for improved performance

I've done that before and it worked successfully: as you point out, using a GUID meant that we avoided conflicts when merging, say, data from one database with another, and the int provided us with efficient joining etc. I would just never use a GUID as a key when you're dealing with OLAP, as that will performance.



Related Topics



Leave a reply



Submit