How to Choose My Primary Key

How to choose my primary key?

I believe that in practice using a natural key is rarely better than a surrogate key.

The following are the main disadvantages of using a natural key as the primary key:

  • You might have an incorrect key value, or you may simply want to rename a key value. To edit it, you would have to update all the tables that would be using it as a foreign key.

  • It is often difficult to have a truly unique natural key.

  • Natural keys are often strings. An index on an numeric field will be much more compact than one on a string field.

There is no hard rule on what the data type of the primary key should be. A numeric key normally performs better, but you could use a string, especially if the table is not big, and the tables that reference it are not big either.

What's the best practice for primary keys in tables?

I follow a few rules:

  1. Primary keys should be as small as necessary. Prefer a numeric type because numeric types are stored in a much more compact format than character formats. This is because most primary keys will be foreign keys in another table as well as used in multiple indexes. The smaller your key, the smaller the index, the less pages in the cache you will use.
  2. Primary keys should never change. Updating a primary key should always be out of the question. This is because it is most likely to be used in multiple indexes and used as a foreign key. Updating a single primary key could cause of ripple effect of changes.
  3. Do NOT use "your problem primary key" as your logic model primary key. For example passport number, social security number, or employee contract number as these "natural keys" can change in real world situations. Make sure to add UNIQUE constraints for these where necessary to enforce consistency.

On surrogate vs natural key, I refer to the rules above. If the natural key is small and will never change it can be used as a primary key. If the natural key is large or likely to change I use surrogate keys. If there is no primary key I still make a surrogate key because experience shows you will always add tables to your schema and wish you'd put a primary key in place.

What should I consider when selecting a data type for my primary key?

Sorry to do that, but I found that the answers I gave to related questions (you can check this and this) could apply to this one. I reshaped them a little bit...

You will find many posts dealing with this issue, and each choice you'll make has its pros and cons. Arguments for these usually refer to relational database theory and database performance.

On this subject, my point is very simple: surrogate primary keys ALWAYS work, while Natural keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons: field too short, rules change, etc.

To this point, you've guessed here that I am basically a member of the uniqueIdentifier/surrogate primary key team, and even if I appreciate and understand arguments such as the ones presented here, I am still looking for the case where "natural" key is better than surrogate ...

In addition to this, one of the most important but always forgotten arguments in favor of this basic rule is related to code normalization and productivity:

each time I create a table, shall I lose time

  1. identifying its primary key and its physical characteristics (type, size)
  2. remembering these characteristics each time I want to refer to it in my code?
  3. explaining my PK choice to other developers in the team?

My answer is no to all of these questions:

  1. I have no time to lose trying to identify "the best Natural Primary Key" when the surrogate option gives me a bullet-proof solution.
  2. I do not want to remember that the Primary Key of my Table_whatever is a 10 characters long string when I write the code.
  3. I don't want to lose my time negotiating the Natural Key length: "well if You need 10 why don't you take 12 to be on the safe side?". This "on the safe side" argument really annoys me: If you want to stay on the safe side, it means that you are really not far from the unsafe side! Choose surrogate: it's bullet-proof!

So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_MyTable' which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, where a field combination offers a very acceptable Primary Key, I prefer to create this 'id_myManyToManyTable' field being a uniqueIdentifier, just to stick to the rule, and because, finally, it does not hurt.

The major advantage is that you don't have to care anymore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.

And if you still want to have your "Natural Key" somewhere in your table, I advise you to build it following a standard model such as

Tbl_whatever

id_whatever, unique identifier, primary key
code_whatever, whateverTypeYouWant(whateverLengthYouEstimateTheRightOne), indexed
.....

Where id_ is the prefix for primary key, and code_ is used for "natural" indexed field. Some would argue that the code_ field should be set as unique. This is true, and it can be easily managed either through DDL or external code. Note that many "natural" keys are calculated (invoice numbers), so they are already generated through code

I am not sure that my rule is the best one. But it is a very efficient one! If everyone was applying it, we would for example avoid time lost answering to this kind of question!

Best practice to choose primary keys in a relational DB, what is the smartest solution?

Even if a table has a good natural key, it is still generally preferable to assign a surrogate key (usually a numeric auto increment column).

First, as jarlh points out, even countries can and do change their names from time to time, which you can handle easily with a CountryID value.

Also, though, many times a natural key is composed of character data. SQL deals with numbers faster than it deals with characters, so there is a performance boost using numeric ID values.

And it's currently the standard practice in data warehousing, so developers are accustomed to seeing those SK columns.

Best practice? Probably. Standard practice? Definitely. Go with the autoincrements.

The best choice for Person table primary key

As mentioned above, use an auto-increment as your primary key. But I don't believe this is your real question.

Your real question is how to avoid duplicate entries. In theory, there is no way - 2 people could be born on the same day, with the same name, and live in the same household, and not have a social insurance number available for one or the other. (One might be a foreigner visiting the country).

However, the combination of full name, birthdate, address, and telephone number is usually sufficient to avoid duplication. Note that addresses may be entered differently, people may have multiple phone numbers, and people may choose to omit their middle name or use an initial. It depends on how important it is to avoid duplicate entries, and how large is your userbase (and thus the likelihood of a collision).

Of course, if you can get the SSN/SIN then use that to determine uniqueness.

How do you like your primary keys?

If you're going to be doing any syncing between databases with occasionally connected apps, then you should be using GUIDs for your primary keys. It is kind of a pain for debugging, so apart from that case I tend to stick to ints that autoincrement.

Autoincrement ints should be your default, and not using them should be justified.

what to choose for primary key in table?

My suggestion is to create a simple INT datatype primary key and make it AUTO_INCREMENT if you are using MySQL, IDENTITY if you are using MS SQL Server or SEQUENCE if you are using Oracle. The INT datatype for primary key is very good if you need performance.

The natural key should be an Indexed column for better search.

Should each and every table have a primary key?

Short answer: yes.

Long answer:

  • You need your table to be joinable on something
  • If you want your table to be clustered, you need some kind of a primary key.
  • If your table design does not need a primary key, rethink your design: most probably, you are missing something. Why keep identical records?

In MySQL, the InnoDB storage engine always creates a primary key if you didn't specify it explicitly, thus making an extra column you don't have access to.

Note that a primary key can be composite.

If you have a many-to-many link table, you create the primary key on all fields involved in the link. Thus you ensure that you don't have two or more records describing one link.

Besides the logical consistency issues, most RDBMS engines will benefit from including these fields in a unique index.

And since any primary key involves creating a unique index, you should declare it and get both logical consistency and performance.

See this article in my blog for why you should always create a unique index on unique data:

  • Making an index UNIQUE

P.S. There are some very, very special cases where you don't need a primary key.

Mostly they include log tables which don't have any indexes for performance reasons.



Related Topics



Leave a reply



Submit