ID Best Practices for Databases

You are confusing internal database functionality with external search criteria.

Auto-increment surrogate keys are useful for internal application use. Never pass those on to the user. Identifying a business object, whether it is a user or an invoice, is done with unique information about the object, such as SSN, CCN or DOB. Use as much info as necessary to uniquely identify the object.

If you must supply some newly invented ID value to each customer, I highly recommend that it NOT be the field you link all the customer data tables on.

Best practice for choosing ids when getting data from another system

The answer is: it depends. But what should you be thinking about? Here are some considerations.

Is the third-party id some convoluted long string? If so, just for performance reasons, you want your own id, because integers are more efficient for foreign key references.

Is the third-party id some sort of PII (personally identifiable information)? If so, you probably want to mask it, and an internal id is a very handy solution.

Is your definition of the entity 100% tied to the third party? If not, you may find that you want to combine or split existing ids. If that is the case, then an internal id is handy.

Is the third-party id the "canonical" way of referring to the thing? If so you might want to use it. For instance, I think it is fine to use ISO country codes as a country id. They are standard and rarely change.

Will the third-party id be used across multiple databases as "the thing"? If so, you might want to keep it as the key, so you don't have equivalent ids proliferating across multiple databases.

These are just some things to think about. There are no doubt others as well.
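
One way to picture the "internal id plus third-party id" option from the considerations above is the minimal sketch below. It uses Python's built-in sqlite3 module purely for illustration; the table names, column names, and the sample vendor id are hypothetical and not taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    id          INTEGER PRIMARY KEY,   -- internal surrogate key used for joins
    external_id TEXT NOT NULL UNIQUE   -- third-party id, kept only for lookups
);
CREATE TABLE invoice (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    total_cents INTEGER NOT NULL
);
""")

def upsert_customer(external_id: str) -> int:
    """Return the internal id for a third-party id, creating the row if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO customer (external_id) VALUES (?)", (external_id,)
    )
    row = conn.execute(
        "SELECT id FROM customer WHERE external_id = ?", (external_id,)
    ).fetchone()
    return row[0]

# Hypothetical long vendor id: it is stored once, and every other table
# references the compact integer instead.
first = upsert_customer("acct_9f3c1e7a-long-vendor-string")
again = upsert_customer("acct_9f3c1e7a-long-vendor-string")
conn.execute(
    "INSERT INTO invoice (customer_id, total_cents) VALUES (?, ?)", (first, 1250)
)
```

The third-party id appears in exactly one column, so masking it, or later merging or splitting entities, only touches the customer table.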

What is the best practice for creating user ID in the system?

As with most other things IT DEPENDS.

Something that seems like it should be such an easy issue can get quite complex pretty quickly. An INT or even a BIGINT is significantly smaller than a GUID. With only 300k or so records, that's not going to have a huge impact on size, but a random GUID as your PRIMARY KEY (which SQL Server also uses as the clustering key by default) WILL cause significant fragmentation. That will definitely affect your index and could cause issues with lookups if you need max performance.

If high security is a priority, one of the bad things about using sequential ids is that it makes it incredibly easy to enumerate your users when (not if) a bad actor discovers your id scheme. GUIDs fix this, and the chance of two people having the same GUID, even across multiple systems, is very small.

SQL Server (2005 and later) does have newsequentialid(), which helps with some of the issues with newid(). It will significantly reduce fragmentation, but it will still require a lot more storage than an INT. https://learn.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql

Again, this is an issue that you'll have to put some thought into, especially how each option will impact your system. It is ultimately a trade-off you'll have to weigh against your expected workload.

https://news.ycombinator.com/item?id=14523523

https://blogs.msdn.microsoft.com/sqlserverfaq/2010/05/27/guid-vs-int-debate/

And you should always pay attention to Ms. Tripp's advice. https://www.sqlskills.com/blogs/kimberly/guids-as-primary-keys-andor-the-clustering-key/

Though Jeff Atwood has a different opinion. As does Joe Celko. https://blog.codinghorror.com/primary-keys-ids-versus-guids/

There are about a bazillion links on Google that will give you multiple sides of this discussion. It really begins to approach a religious debate. Ultimately, I don't think anyone here can tell you what the "Best Practice" is. You'll have to try different methods in your system and see what works best for what you have.

Best practices on primary key, auto-increment, and UUID in SQL databases

It's a matter of choice, and from my point of view this question invites opinion-based answers. What I always do, even if it's redundant, is create the primary key on an auto-increment column (I call it the technical key). This keeps things consistent within the database, allows the "primary key" to change in case something went wrong at the design phase, and consumes less space when that key is referenced by a foreign key constraint in another table. I also make the candidate key unique and not null.

A technical key is something you don't normally show to end users, unless you decide to. The same goes for other technical columns that you keep only at the database level for whatever purpose you need, like modify date, create date, version, the user who changed the record, and more.

In this case I would go for your second option, but slightly modified:

CREATE TABLE users(
  pk INT NOT NULL AUTO_INCREMENT,  -- technical key, internal use only
  id UUID NOT NULL,                -- candidate key; a native UUID type depends on the engine
  .....
  PRIMARY KEY(pk),
  UNIQUE(id)
);
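
To see how the two columns are used in practice, here is a minimal sketch built on Python's sqlite3 and uuid modules. SQLite has no native UUID type, so the public id is stored as text; the `name` column and the helper functions are assumptions added for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE users (
    pk   INTEGER PRIMARY KEY AUTOINCREMENT,  -- technical key, never leaves the database
    id   TEXT NOT NULL UNIQUE,               -- public UUID handed to clients
    name TEXT NOT NULL                       -- hypothetical payload column
)""")

def create_user(name: str) -> str:
    """Insert a user and return only the public id."""
    public_id = str(uuid.uuid4())
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (public_id, name))
    return public_id

def find_user(public_id: str):
    """Look a user up by public id; returns (pk, name) or None."""
    return conn.execute(
        "SELECT pk, name FROM users WHERE id = ?", (public_id,)
    ).fetchone()

alice_id = create_user("alice")
bob_id = create_user("bob")
```

Callers only ever see the UUID, while joins and foreign keys inside the database can keep using the small integer `pk`.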

What are best practices for handling ids in web services?

The only reasonable way is for the front end to identify persons in the DB somehow.

Matching on the full entity is unreliable and not obvious. To return a hashed ID to the front end, you first need to receive the unhashed ID from the front end, or apply some reversible "hashing" (more like "encrypting") to the IDs, so either way there will be some person identifier.

IMHO it does not matter whether it will be a database ID or some piece of data (encrypted database ID) from which the ID could be extracted. Why do you think that consumers knowing the database ID would be a bad idea? I don't see any problem as long as every person belongs to a single consumer.

If there is a many-to-many relation between persons (objects in the DB) and consumers, then you may "encrypt" (in the broad sense) the object id so that the encryption is consumer-dependent. For example, when communicating with a consumer you can use the ID of the link entry (between object and consumer) in the DB.

If sending IDs to consumers seems like a bad idea to you because a consumer could enumerate all the IDs one by one, you can avoid this problem by using GUIDs instead of auto-incremented integer IDs.

PS: As for your comment, consider using e.g. a GUID as the object ID. The ID is part of the data, not part of the schema, so it will be preserved when migrating between databases. Such an ID won't contain sensitive information either, so it is perfectly safe to reveal it to a consumer (or someone else). If you want to prevent the creation of two different persons with the same SSN, just add a UNIQUE key on your SSN field, but do not use the SSN as part of the ID, as that approach has many serious disadvantages, with the inability to reveal the ID being the least of them.

What's the best practice for primary keys in tables?

I follow a few rules:

Primary keys should be as small as necessary. Prefer a numeric type because numeric types are stored in a much more compact format than character formats. This matters because most primary keys will be foreign keys in another table as well as used in multiple indexes. The smaller your key, the smaller the index, and the fewer pages in the cache you will use.
Primary keys should never change. Updating a primary key should always be out of the question. This is because it is most likely to be used in multiple indexes and used as a foreign key. Updating a single primary key could cause a ripple effect of changes.
Do NOT use the primary key of your problem domain as your logical model primary key. For example, avoid passport number, social security number, or employee contract number, as these "natural keys" can change in real-world situations. Make sure to add UNIQUE constraints for these where necessary to enforce consistency.

On surrogate vs natural key, I refer to the rules above. If the natural key is small and will never change it can be used as a primary key. If the natural key is large or likely to change I use surrogate keys. If there is no primary key I still make a surrogate key because experience shows you will always add tables to your schema and wish you'd put a primary key in place.
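
A small sketch of these rules, using Python's sqlite3 module: the surrogate key carries the relationships, and the natural key is only a UNIQUE column, so it can change without a ripple effect. The person/booking schema and the passport numbers are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (
    person_id       INTEGER PRIMARY KEY,   -- small surrogate key that never changes
    passport_number TEXT NOT NULL UNIQUE   -- natural key, enforced but not referenced
);
CREATE TABLE booking (
    booking_id INTEGER PRIMARY KEY,
    person_id  INTEGER NOT NULL REFERENCES person(person_id)
);
INSERT INTO person (person_id, passport_number) VALUES (1, 'X1234567');
INSERT INTO booking (booking_id, person_id) VALUES (10, 1);
""")

# The passport is renewed: one row changes, and no foreign key needs touching.
conn.execute("UPDATE person SET passport_number = 'Y7654321' WHERE person_id = 1")

def try_duplicate() -> bool:
    """Return True if the UNIQUE constraint rejects a second person with the same passport."""
    try:
        conn.execute(
            "INSERT INTO person (person_id, passport_number) VALUES (2, 'Y7654321')"
        )
        return False
    except sqlite3.IntegrityError:
        return True

duplicate_rejected = try_duplicate()
booking_owner = conn.execute(
    "SELECT person_id FROM booking WHERE booking_id = 10"
).fetchone()[0]
```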

Exposing database IDs - security risk?

There are risks associated with exposing database identifiers. On the other hand, it would be extremely burdensome to design a web application without exposing them at all. Thus, it's important to understand the risks and take care to address them.

The first danger is what OWASP called "insecure direct object references." If someone discovers the id of an entity, and your application lacks sufficient authorization controls to prevent it, they can do things that you didn't intend.

Here are some good rules to follow:

Use role-based security to control access to an operation. How this is done depends on the platform and framework you've chosen, but many support a declarative security model that will automatically redirect browsers to an authentication step when an action requires some authority.
Use programmatic security to control access to an object. This is harder to do at a framework level. More often, it is something you have to write into your code and is therefore more error prone. This check goes beyond role-based checking by ensuring not only that the user has authority for the operation, but also has necessary rights on the specific object being modified. In a role-based system, it's easy to check that only managers can give raises, but beyond that, you need to make sure that the employee belongs to the particular manager's department.
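
The difference between the two checks can be sketched in a few lines of Python. The `User` and `Employee` types, the `Forbidden` exception, and the `give_raise` function are all hypothetical names for illustration; a real application would hook these checks into its own framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class User:
    user_id: int
    role: str
    department: str

@dataclass(frozen=True)
class Employee:
    employee_id: int
    department: str

class Forbidden(Exception):
    """Raised when the caller may not perform the requested action."""

def give_raise(actor: User, employee: Employee, amount: int) -> str:
    # Role-based check: is this kind of user allowed to perform the operation at all?
    if actor.role != "manager":
        raise Forbidden("only managers can give raises")
    # Object-level check: does this manager have rights on this specific employee?
    if actor.department != employee.department:
        raise Forbidden("employee is outside the manager's department")
    return f"raise of {amount} approved for employee {employee.employee_id}"
```

With both checks in place, knowing or guessing an employee id is not enough to act on it.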

There are schemes to hide the real identifier from an end user (e.g., map between the real identifier and a temporary, user-specific identifier on the server), but I would argue that this is a form of security by obscurity. I want to focus on keeping real cryptographic secrets, not trying to conceal application data. In a web context, it also runs counter to widely used REST design, where identifiers commonly show up in URLs to address a resource, which is subject to access control.

Another challenge is prediction or discovery of the identifiers. The easiest way for an attacker to discover an unauthorized object is to guess it from a numbering sequence. The following guidelines can help mitigate that:

Expose only unpredictable identifiers. For the sake of performance, you might use sequence numbers in foreign key relationships inside the database, but any entity you want to reference from the web application should also have an unpredictable surrogate identifier. This is the only one that should ever be exposed to the client. Using random UUIDs for these is a practical solution for assigning these surrogate keys, even though they aren't cryptographically secure.
One place where cryptographically unpredictable identifiers are a necessity, however, is in session IDs or other authentication tokens, where the ID itself authenticates a request. These should be generated by a cryptographic RNG.
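
In Python, the two cases above map onto two standard-library modules. This is a sketch only; the function names are made up, and the token length of 32 bytes is an assumption rather than a recommendation from the original answer.

```python
import secrets
import uuid

def new_public_id() -> str:
    """Random (version 4) UUID to expose as an entity's external identifier."""
    return str(uuid.uuid4())

def new_session_token() -> str:
    """Token from the OS cryptographic RNG, for cases where the value itself authenticates."""
    return secrets.token_urlsafe(32)  # 32 random bytes, URL-safe text encoding
```

The public id still has to be backed by access control on the server; only the session token is treated as a secret.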

What are the best practices for using a GUID as a primary key, specifically regarding performance?

GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.

You really need to keep two issues apart:

the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.

By default, the primary key on a SQL Server table is also used as the clustering key - but it doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate keys - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.

As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
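
A toy model can make the randomness argument concrete. The Python sketch below uses a sorted list as a stand-in for the clustered index and counts how many inserts land somewhere before the current last key, which is roughly where a real index would have to make room inside existing pages. It is not a model of SQL Server's page layout or of its uniqueidentifier sort order, only of the ordering effect.

```python
import bisect
import uuid

def mid_insert_count(keys) -> int:
    """Count inserts that land before the current last key of a sorted list."""
    ordered = []
    count = 0
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        if pos < len(ordered):  # not an append at the end
            count += 1
        ordered.insert(pos, key)
    return count

n = 1000
# Ever-increasing keys always append at the end of the index.
sequential = mid_insert_count(range(1, n + 1))
# Random GUIDs land at arbitrary positions in the existing order.
random_guid = mid_insert_count(uuid.uuid4().bytes for _ in range(n))
```

With ever-increasing keys the count is zero, while with random GUIDs nearly every insert falls in the middle of the existing order.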

Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.

Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.

Quick calculation - using INT vs. GUID as Primary and Clustering Key:

Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)

TOTAL: about 27 MB vs. 107 MB - and that's just on a single table!
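
The arithmetic behind these figures can be checked with a few lines of Python. The sketch assumes 4 bytes per INT and 16 bytes per GUID, counts only the key bytes (no row or page overhead), and uses 1 MB = 1024 * 1024 bytes; it reproduces the 3.8 / 15.26 MB and 22.89 / 91.55 MB figures and sums them.

```python
MB = 1024 * 1024
ROWS = 1_000_000
NONCLUSTERED_INDEXES = 6

def key_overhead_mb(key_bytes: int) -> dict:
    """Storage taken by the clustering key alone, in MB, rounded to 2 decimals."""
    base = ROWS * key_bytes / MB           # key stored once per row in the base table
    indexes = NONCLUSTERED_INDEXES * base  # and once per row in each nonclustered index
    return {
        "base": round(base, 2),
        "indexes": round(indexes, 2),
        "total": round(base + indexes, 2),
    }

int_cost = key_overhead_mb(4)    # INT clustering key
guid_cost = key_overhead_mb(16)  # GUID clustering key
```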

Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.

GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!

PS: of course, if you're dealing with just a few hundred or a few thousand rows - most of these arguments won't really have much of an impact on you. However: if you get into the tens or hundreds of thousands of rows, or you start counting in millions - then those points become very crucial and very important to understand.

Update: if you want to have your PKGUID column as your primary key (but not your clustering key), and another column MyINT (INT IDENTITY) as your clustering key - use this:

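-- Table with a GUID for identity and an INT IDENTITY for physical ordering
CREATE TABLE dbo.MyTable
(PKGUID UNIQUEIDENTIFIER NOT NULL,
 MyINT INT IDENTITY(1,1) NOT NULL,
 .... add more columns as needed ...... );

-- Primary key on the GUID, explicitly NONCLUSTERED
ALTER TABLE dbo.MyTable
ADD CONSTRAINT PK_MyTable
PRIMARY KEY NONCLUSTERED (PKGUID);

-- Clustered index on the small, ever-increasing INT column
CREATE UNIQUE CLUSTERED INDEX CIX_MyTable ON dbo.MyTable(MyINT);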

Basically: you just have to explicitly tell the PRIMARY KEY constraint that it's NONCLUSTERED (otherwise it's created as your clustered index, by default) - and then you create a second index that's defined as CLUSTERED.

This will work - and it's a valid option if you have an existing system that needs to be "re-engineered" for performance. For a new system, if you start from scratch, and you're not in a replication scenario, then I'd always pick ID INT IDENTITY(1,1) as my clustered primary key - much more efficient than anything else!
