Composite Primary Key VS Additional "Id" Column

Say that {Author, Title, Edition} uniquely identifies a book. Then the following holds:

  1. It is a superkey -- uniquely identifies a tuple (row).

  2. It is irreducible -- remove any one of the columns and it is no longer a key.

  3. It is a candidate key -- an irreducible superkey is a candidate key.

Now let's consider the ID (integer).

I can reason that the Book table key will show up in a few other tables as a foreign key and also in a few indexes. So, it will take quite a bit of space -- say three columns x 40 characters (or whatever...) -- in each of these tables plus in matching indexes.

In order to make these "other" tables and indexes smaller, I can add a unique-integer-column to the Book table to be used as a key which will be referenced as a foreign key. Say something like:

alter table Book add BookID integer not null identity;

With BookID being (must be) unique too, the Book table now has two candidate keys.

Now I can select the BookID as a primary key.

alter table Book add constraint pk_Book primary key (BookID);

However, the {Author,Title,Edition} must stay a key (unique) in order to prevent something like this:

BookID  Author    Title            Edition
1       C.J.Date  Database Design  1
2       C.J.Date  Database Design  1

To sum it up, adding the BookID -- and choosing it as the primary key -- did not stop {Author, Title, Edition} from being a (candidate) key. It must still have its own unique constraint and usually the matching index.
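To make the point concrete, here is a minimal sketch using Python's built-in sqlite3 module. That choice is an assumption for the sake of a runnable example: the DDL above is SQL Server syntax, and SQLite's INTEGER PRIMARY KEY stands in for the identity column. It inserts the same book twice, once with only the surrogate key and once with the extra unique constraint.

```python
import sqlite3

def try_duplicate(with_unique):
    """Insert the same book twice; return the number of rows stored."""
    conn = sqlite3.connect(":memory:")
    unique = ", UNIQUE (Author, Title, Edition)" if with_unique else ""
    conn.execute(
        "CREATE TABLE Book ("
        " BookID INTEGER PRIMARY KEY,"  # surrogate key, assigned by SQLite
        " Author TEXT NOT NULL,"
        " Title TEXT NOT NULL,"
        " Edition INTEGER NOT NULL"
        + unique + ")"
    )
    row = ("C.J.Date", "Database Design", 1)
    for _ in range(2):
        try:
            conn.execute(
                "INSERT INTO Book (Author, Title, Edition) VALUES (?, ?, ?)", row
            )
        except sqlite3.IntegrityError:
            pass  # the duplicate is rejected only when the constraint exists
    count = conn.execute("SELECT COUNT(*) FROM Book").fetchone()[0]
    conn.close()
    return count

rows_without = try_duplicate(with_unique=False)  # surrogate PK alone
rows_with = try_duplicate(with_unique=True)      # surrogate PK + unique constraint
print(rows_without, rows_with)  # prints: 2 1
```

With the surrogate key alone both rows are accepted; with the unique constraint the second insert is refused.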

Also note that from the design point of view, this decision was made at the "physical level". In general, at the logical level of design, this ID does not exist -- it was introduced while considering column sizes and indexes. So, the physical schema was derived from the logical one. Depending on the DB size, RDBMS and hardware used, none of that size-reasoning may have a measurable effect -- so using {Author, Title, Edition} as a PK may be perfectly good design -- until proven otherwise.

Should I use composite primary keys in this example?

I would suggest going with a single column primary key instead of composite keys. The biggest drawback of a composite key is that you need more than one value/column to identify a row. If your application uses an O/RM (Object/Relational Mapping) layer, then you will have fits mapping these database rows to objects in a programming language. O/RMs are easiest to set up when every table has a single column primary key.

Programming aside, the major drawback of composite keys in general, and especially composite keys requiring this many columns, is that all of this data needs to be specified and copied to child tables in order to set up proper relationships between tables. That wastes space and adds unnecessary complexity too.
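As an illustration of that copying, here is a small sketch with Python's sqlite3 module (an assumption; any engine with composite foreign keys would do). The Review table and its columns are hypothetical and exist only to show that a child of a composite-keyed Book table has to carry all three key columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Book (
    Author  TEXT NOT NULL,
    Title   TEXT NOT NULL,
    Edition INTEGER NOT NULL,
    PRIMARY KEY (Author, Title, Edition)
);
-- Hypothetical child table: every column of the parent key is repeated here.
CREATE TABLE Review (
    ReviewID INTEGER PRIMARY KEY,
    Author   TEXT NOT NULL,
    Title    TEXT NOT NULL,
    Edition  INTEGER NOT NULL,
    Body     TEXT,
    FOREIGN KEY (Author, Title, Edition)
        REFERENCES Book (Author, Title, Edition)
);
INSERT INTO Book VALUES ('C.J.Date', 'Database Design', 1);
INSERT INTO Review (Author, Title, Edition, Body)
    VALUES ('C.J.Date', 'Database Design', 1, 'Clear.');
""")

# A review pointing at an edition that does not exist violates the composite FK.
try:
    conn.execute(
        "INSERT INTO Review (Author, Title, Edition, Body) "
        "VALUES ('C.J.Date', 'Database Design', 2, 'No such edition.')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

review_count = conn.execute("SELECT COUNT(*) FROM Review").fetchone()[0]
print(orphan_rejected, review_count)
```

With a single-column surrogate key, the child table would carry one integer column in place of Author, Title and Edition.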

The biggest headache I've run into with developers is that they assume "uniqueness of data" equates to "identifying a row in the database". This is rarely the case. I've found applications and databases to be much more maintainable and easier to build when you default to single column primary keys, use composite keys only as an exception to the rule, and enforce data uniqueness with unique constraints or indexes on those columns.

Identity column separate from composite primary key

I think most people don't use the graphical designer to do this, as it's the graphical designer that's preventing it, not SQL Server. Try running DDL in a query window:

CREATE UNIQUE INDEX yt_id ON dbo.YourTable(ID);

Now you can reference this column in other tables no problem:

CREATE TABLE dbo.SomeOtherTable
(MatchID INT FOREIGN KEY REFERENCES dbo.YourTable(ID)); -- column name and type assumed

That said, I find the column name ID completely useless. If it's a MatchID, why not call it MatchID everywhere it appears in the schema? Yes it's redundant in the PK table but IMHO consistency throughout the model is more important.

For that matter, why is your table called SoccerMatch? Do you have other kinds of matches? I would think it would be Matches with a unique ID = MatchID. That way if you later have different types of matches you don't have to create a new table for each sport - just add a type column of some sort. If you only ever have soccer, then SoccerMatch is kind of redundant, no?

Also I would suggest that the key and unique index be the other way around. If you're not planning to use the multi-column key for external reference then it is more intuitive, at least to me, to make the PK the thing you do reference in other tables. So I would say:

CREATE TABLE dbo.Matches
(
  MatchID INT IDENTITY(1,1) PRIMARY KEY,
  EventDate DATE, -- Date is also a terrible name and it's reserved
  Opponent <? data type ?> -- / FK reference?
);

ALTER TABLE dbo.Matches ADD
  UNIQUE (EventDate, Opponent);

Composite primary keys versus unique object ID field

Most of the commonly used engines (MS SQL Server, Oracle, DB2, MySQL, etc.) would not experience noticeable issues using a surrogate key system. Some may even experience a performance boost from the use of a surrogate, but performance issues are highly platform-specific.

In general terms, the natural key (and by extension, composite key) versus surrogate key debate has a long history with no likely “right answer” in sight.

The arguments for natural keys (singular or composite) usually include some of the following:

1) They are already available in the data model. Most entities being modeled already include one or more attributes or combinations of attributes that meet the needs of a key for the purposes of creating relations. Adding an additional attribute to each table incorporates an unnecessary redundancy.

2) They eliminate the need for certain joins. For example, if you have customers with customer codes, and invoices with invoice numbers (both of which are "natural" keys), and you want to retrieve all the invoice numbers for a specific customer code, you can simply use "SELECT InvoiceNumber FROM Invoice WHERE CustomerCode = 'XYZ123'". In the classic surrogate key approach, the SQL would look something like this: "SELECT Invoice.InvoiceNumber FROM Invoice INNER JOIN Customer ON Invoice.CustomerID = Customer.CustomerID WHERE Customer.CustomerCode = 'XYZ123'".

3) They contribute to a more universally-applicable approach to data modeling. With natural keys, the same design can be used largely unchanged between different SQL engines. Many surrogate key approaches use specific SQL engine techniques for key generation, thus requiring more specialization of the data model to implement on different platforms.
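Point 2 above can be tried out directly. The sketch below uses Python's sqlite3 module and builds both designs side by side; the N/S table-name suffixes and the invoice numbers are made up for illustration. The two queries are the ones quoted in point 2, with an ORDER BY added so the results compare cleanly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Natural-key design: Invoice carries the customer code itself.
CREATE TABLE CustomerN (CustomerCode TEXT PRIMARY KEY);
CREATE TABLE InvoiceN (
    InvoiceNumber TEXT PRIMARY KEY,
    CustomerCode  TEXT NOT NULL REFERENCES CustomerN (CustomerCode)
);
-- Surrogate-key design: Invoice carries only the integer CustomerID.
CREATE TABLE CustomerS (
    CustomerID   INTEGER PRIMARY KEY,
    CustomerCode TEXT NOT NULL UNIQUE
);
CREATE TABLE InvoiceS (
    InvoiceNumber TEXT PRIMARY KEY,
    CustomerID    INTEGER NOT NULL REFERENCES CustomerS (CustomerID)
);
INSERT INTO CustomerN VALUES ('XYZ123'), ('ABC999');
INSERT INTO InvoiceN VALUES ('INV-1', 'XYZ123'), ('INV-2', 'XYZ123'), ('INV-3', 'ABC999');
INSERT INTO CustomerS VALUES (1, 'XYZ123'), (2, 'ABC999');
INSERT INTO InvoiceS VALUES ('INV-1', 1), ('INV-2', 1), ('INV-3', 2);
""")

# Natural key: the filter column already lives in the Invoice table.
natural = [r[0] for r in conn.execute(
    "SELECT InvoiceNumber FROM InvoiceN "
    "WHERE CustomerCode = 'XYZ123' ORDER BY InvoiceNumber")]

# Surrogate key: the customer code has to be reached through a join.
surrogate = [r[0] for r in conn.execute(
    "SELECT i.InvoiceNumber FROM InvoiceS i "
    "INNER JOIN CustomerS c ON i.CustomerID = c.CustomerID "
    "WHERE c.CustomerCode = 'XYZ123' ORDER BY i.InvoiceNumber")]

print(natural, surrogate)
```

Both queries return the same invoices; the difference is only in how many tables have to be touched to get there.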

Arguments for surrogate keys tend to revolve around issues that are SQL engine specific:

1) They enable easier changes to attributes when business requirements/rules change. This is because they allow the data attributes to be isolated to a single table. This is primarily an issue for SQL engines that do not efficiently implement standard SQL constructs such as DOMAINs. When an attribute is defined by a DOMAIN statement, changes to the attribute can be performed schema-wide using an ALTER DOMAIN statement. Different SQL engines have different performance characteristics for altering a domain, and some SQL engines do not implement DOMAINS at all, so data modelers compensate for these situations by adding surrogate keys to improve the ability to make changes to attributes.

2) They enable easier implementations of concurrency than natural keys. In the natural key case, if two users are concurrently working with the same information set, such as a customer row, and one of the users modifies the natural key value, then an update by the second user will fail because the customer code they are updating no longer exists in the database. In the surrogate key case, the update will process successfully because immutable ID values are used to identify the rows in the database, not mutable customer codes. However, it is not always desirable to allow the second update – if the customer code changed it is possible that the second user should not be allowed to proceed with their change because the actual “identity” of the row has changed – the second user may be updating the wrong row. Neither surrogate keys nor natural keys, by themselves, address this issue. Comprehensive concurrency solutions have to be addressed outside of the implementation of the key.

3) They perform better than natural keys. Performance is most directly affected by the SQL engine. The same database schema implemented on the same hardware using different SQL engines will often have dramatically different performance characteristics, due to the SQL engine's data storage and retrieval mechanisms. Some SQL engines closely approximate flat-file systems, where data is actually stored redundantly when the same attribute, such as a Customer Code, appears in multiple places in the database schema. This redundant storage by the SQL engine can cause performance issues when changes need to be made to the data or schema. Other SQL engines provide a better separation between the data model and the storage/retrieval system, allowing for quicker changes of data and schema.

4) Surrogate keys function better with certain data access libraries and GUI frameworks. Due to the homogeneous nature of most surrogate key designs (example: all relational keys are integers), data access libraries, ORMs, and GUI frameworks can work with the information without needing special knowledge of the data. Natural keys, due to their heterogeneous nature (different data types, size etc.), do not work as well with automated or semi-automated toolkits and libraries. For specialized scenarios, such as embedded SQL databases, designing the database with a specific toolkit in mind may be acceptable. In other scenarios, databases are enterprise information resources, accessed concurrently by multiple platforms, applications, report systems, and devices, and therefore do not function as well when designed with a focus on any particular library or framework. In addition, databases designed to work with specific toolkits become a liability when the next great toolkit is introduced.
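The concurrency scenario in point 2 can be simulated in a few lines. This is a rough sketch with Python's sqlite3 module, run on a single connection, so it only imitates the order of events rather than real concurrent sessions; the customer name and the new code value are made up. Note that in this setup the second user's update does not raise an error in the natural-key case; it simply matches zero rows.

```python
import sqlite3

def second_update_rowcount(use_surrogate):
    """Return how many rows the second user's update touches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Customer ("
        " CustomerID INTEGER PRIMARY KEY,"
        " CustomerCode TEXT NOT NULL UNIQUE,"
        " Name TEXT)"
    )
    conn.execute("INSERT INTO Customer VALUES (1, 'XYZ123', 'Acme')")

    # Both users have read the row at this point.
    # User 1 changes the customer code.
    conn.execute(
        "UPDATE Customer SET CustomerCode = 'XYZ999' WHERE CustomerCode = 'XYZ123'"
    )

    # User 2 saves a name change, locating the row by the key they read earlier.
    if use_surrogate:
        cur = conn.execute("UPDATE Customer SET Name = 'Acme Ltd' WHERE CustomerID = 1")
    else:
        cur = conn.execute(
            "UPDATE Customer SET Name = 'Acme Ltd' WHERE CustomerCode = 'XYZ123'"
        )
    touched = cur.rowcount
    conn.close()
    return touched

natural_rows = second_update_rowcount(use_surrogate=False)
surrogate_rows = second_update_rowcount(use_surrogate=True)
print(natural_rows, surrogate_rows)  # prints: 0 1
```

Whether the surrogate-key outcome (the update goes through) is actually the desired one is exactly the open question raised in point 2.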

I tend to fall on the side of natural keys (obviously), but I am not fanatical about it. Due to the environment I work in, where any given database I help design may be used by a variety of applications, I use natural keys for the majority of the data modeling, and rarely introduce surrogates. However, I don’t go out of my way to try to re-implement existing databases that use surrogates. Surrogate-key systems work just fine – no need to change something that is already functioning well.

Is composite key better than regular ID as key and unique index on several columns?

One benefit is that an InnoDB* table uses the primary key as a clustered index, that is, it stores the table in clustered index order, and this makes primary key lookups more efficient than a secondary key lookup (even in cases when the secondary index is a unique key). If you always look up rows by your secondary unique columns and not by the id pseudokey, then there's an advantage to using those columns as your primary key.

On the other hand, using a bulky set of columns as your primary key increases the size of other secondary indexes, because in InnoDB, each secondary index leaf contains the values of primary keys where that value occurs. If your primary key is three integers, a date, and a varchar(32), then that adds more bulk to the secondary indexes. How much depends on your data types and how many secondary indexes you have.
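For a feel of "how much", here is a back-of-the-envelope calculation in Python. The byte sizes are assumptions based on MySQL's documented storage requirements (INT 4 bytes, DATE 3 bytes, VARCHAR(32) up to 32 bytes plus 1 length byte with a single-byte character set), and the estimate ignores InnoDB record and page overhead, so treat the result as an upper-bound illustration, not a measurement. The row and index counts are hypothetical.

```python
# Assumed per-column storage sizes in bytes (see the note above).
INT_BYTES = 4
DATE_BYTES = 3
VARCHAR32_MAX_BYTES = 32 + 1  # up to 32 data bytes + 1 length byte

# Primary key copy carried by every secondary index entry.
composite_pk_bytes = 3 * INT_BYTES + DATE_BYTES + VARCHAR32_MAX_BYTES
surrogate_pk_bytes = INT_BYTES

def extra_bytes(rows, secondary_indexes):
    """Extra secondary-index bytes caused by the bulky key, at most."""
    per_entry = composite_pk_bytes - surrogate_pk_bytes
    return per_entry * rows * secondary_indexes

# Hypothetical table: one million rows, three secondary indexes.
estimate = extra_bytes(rows=1_000_000, secondary_indexes=3)
print(composite_pk_bytes, surrogate_pk_bytes, estimate)  # prints: 48 4 132000000
```

Under these assumptions the bulky key costs up to roughly 132 MB more across the three secondary indexes than a 4-byte integer key would.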

There are also some cases of unavoidable deadlocks that can occur when you have both an auto-increment primary key and a secondary unique key.

There's also an outstanding bug on unique keys that allows them to get duplicate values in a rare race condition.

* I assume you're using InnoDB, because no one should use MyISAM.

Composite primary key? Or an auto increment primary key with unique composite index?

2 is the best.

It's better than 1 because it has one less index (so less space/cache is taken and the index maintenance is less expensive). A surrogate key (like Id) is generally useless in a junction table.

It may be better than 3 depending on your DBMS: some DBMSes will not allow you to cluster the table on a unique index (just on a primary key) [1]. Even on a DBMS that allows it (like SQL Server), I'd go with 2 for being more direct and having simpler syntax.

[1] And some will not allow you to reference a unique index (just a PRIMARY KEY or UNIQUE constraint) from a FOREIGN KEY, though that's probably not relevant here. If it were, then you'd consider the surrogate key Id.

Composite Primary Keys : Good or Bad?

There is no conclusion that composite primary keys are bad.

The best practice is to have some column or columns that uniquely identify a row. But in some tables a single column is not enough by itself to uniquely identify a row.

SQL (and the relational model) allows a composite primary key. It is a good practice in some cases. Or, another way of looking at it is that it's not a bad practice in all cases.

Some people have the opinion that every table should have an integer column that automatically generates unique values, and that should serve as the primary key. Some people also claim that this primary key column should always be called id. But those are conventions, not necessarily best practices. Conventions have some benefit, because they simplify certain decisions. But conventions are also restrictive.

You may have an order with multiple payments because some people purchase on layaway, or else they have multiple sources of payment (two credit cards, for instance), or two different people want to pay for a share of the order (I frequently go to a restaurant with a friend, and we each pay for our own meal, so the staff process half of the order on each of our credit cards).

I would design the system you describe as follows:

Products  : product_id (PK)

Orders    : order_id (PK)

LineItems : product_id is (FK) to Products
            order_id is (FK) to Orders
            (product_id, order_id) is (PK)

Payments  : order_id (FK)
            payment_id - ordinal for each order_id
            (order_id, payment_id) is (PK)

This is also related to the concept of an identifying relationship. If it's definitional that a payment exists only because an order exists, then make the order part of the primary key.

Note the LineItems table also lacks its own auto-increment, single-column primary key. A many-to-many table is a classic example of a good use of a composite primary key.
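The outline above could be written out roughly as follows. This is a sketch using Python's sqlite3 module (an assumption; the outline names no engine), with only the key columns from the outline and made-up id values. It shows the composite key on LineItems rejecting a duplicate pairing, and payment_id restarting at 1 for each order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Products (product_id INTEGER PRIMARY KEY);
CREATE TABLE Orders   (order_id   INTEGER PRIMARY KEY);
CREATE TABLE LineItems (
    product_id INTEGER NOT NULL REFERENCES Products (product_id),
    order_id   INTEGER NOT NULL REFERENCES Orders (order_id),
    PRIMARY KEY (product_id, order_id)
);
CREATE TABLE Payments (
    order_id   INTEGER NOT NULL REFERENCES Orders (order_id),
    payment_id INTEGER NOT NULL,  -- ordinal within the order: 1, 2, ...
    PRIMARY KEY (order_id, payment_id)
);
INSERT INTO Products VALUES (10);
INSERT INTO Orders VALUES (100), (101);
INSERT INTO LineItems VALUES (10, 100);
-- Order 100 is split across two payments; order 101 has one.
INSERT INTO Payments VALUES (100, 1), (100, 2), (101, 1);
""")

# The same product on the same order a second time violates the composite PK.
try:
    conn.execute("INSERT INTO LineItems VALUES (10, 100)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

payments_for_100 = [r[0] for r in conn.execute(
    "SELECT payment_id FROM Payments WHERE order_id = 100 ORDER BY payment_id")]
print(duplicate_rejected, payments_for_100)
```

Because order_id is part of the Payments key, payment 1 of order 100 and payment 1 of order 101 coexist without any separate surrogate column.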

Should I use a huge composite primary key or just a unique id?

The only time I ever use a composite key is when it consists of two integer fields in a linking table for a many-to-many relationship. Otherwise, use a surrogate key and then put a unique index on the fields you would have put into the composite key. This way you save space in child tables, get the improved speed of an integer join (I would not use a GUID unless I was actually going to use replication), and preserve the uniqueness of the natural key.

Why Composite Primary key when I can use Single Primary key with Unique constraints on composite columns?

You do not need a primary key to enforce uniqueness. You can use a unique constraint or index instead.

I am not a fan of composite primary keys. Here are some reasons:

  • All foreign key references have to include all the keys in the correct order and with matching types. This makes it slightly more cumbersome to define those tables.
  • Because the composite keys are included in all referencing tables, those tables are often larger, which results in worse performance.
  • If you decide that you want to change the type of one of the component keys -- say the length of a string or an int to a numeric -- you have to modify lots and lots of tables.
  • When joining tables, you have to include all the keys. If you miss one . . . well, the code is syntactically correct but the results are wrong.
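The last bullet is easy to reproduce. In this sketch (Python's sqlite3 module; the Loan table is hypothetical), two editions of the same book exist and one loan refers to the second edition. Dropping Edition from the join condition still runs, but matches the loan to both editions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Book (
    Author TEXT NOT NULL, Title TEXT NOT NULL, Edition INTEGER NOT NULL,
    PRIMARY KEY (Author, Title, Edition)
);
-- Hypothetical referencing table.
CREATE TABLE Loan (
    LoanID INTEGER PRIMARY KEY,
    Author TEXT NOT NULL, Title TEXT NOT NULL, Edition INTEGER NOT NULL,
    FOREIGN KEY (Author, Title, Edition)
        REFERENCES Book (Author, Title, Edition)
);
INSERT INTO Book VALUES ('C.J.Date', 'Database Design', 1),
                        ('C.J.Date', 'Database Design', 2);
INSERT INTO Loan (Author, Title, Edition)
    VALUES ('C.J.Date', 'Database Design', 2);
""")

# Correct join: every column of the composite key is compared.
full_join = conn.execute(
    "SELECT COUNT(*) FROM Loan l JOIN Book b "
    "ON b.Author = l.Author AND b.Title = l.Title AND b.Edition = l.Edition"
).fetchone()[0]

# Edition forgotten: still valid SQL, but the loan now matches both editions.
partial_join = conn.execute(
    "SELECT COUNT(*) FROM Loan l JOIN Book b "
    "ON b.Author = l.Author AND b.Title = l.Title"
).fetchone()[0]

print(full_join, partial_join)  # prints: 1 2
```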

There are occasions where composite keys are acceptable, such as tables that have no foreign key references. Even in those cases, I use synthetic keys, but I totally understand the other perspective.

Should I use composite primary keys or not?

I think there's no problem using a composite key.

To me the database is a component in its own right that should be treated the same way we treat code: for instance, we want clean code that communicates its intent clearly, that does one thing and does it well, that doesn't add any unneeded level of complexity, etc.

Same thing with the db: if the PK is composite, this is the reality, so the model should be kept clean and clear. A composite PK is clearer than the mix of auto-increment + constraint. When you see an ID column that does nothing, you need to ask what the real PK is, whether there are any other hidden things that you should be aware of, etc. A clear PK doesn't leave any doubts.

The db is the base of your app; to me we need the most solid base that we can have. On this base we'll build the app (web or not). So I can't see why we should bend the db model to conform to the specifics of one development tool/framework/language. The data is directing the application, not the other way around. What if the ORM changes in the future and becomes obsolete, and a better solution appears that imposes another model? We can't play with the db model to fit this or that framework; the model should stay the same, and it should not depend on what tool we're using to access the data ...

If the db model changes in the future, it should change because functionality changed. If we knew today how this functionality will change, we would be modeling it already. And any future change will be dealt with when the time comes; we can't predict, for instance, the impact on existing data, so one extra column doesn't guarantee that the model will withstand any future change ...

We should design for today's functionality and keep the db model as simple as possible; this way it will be easy to change/evolve in the future.

