Indexes and Multi Column Primary Keys

Indexes and multi column primary keys

I'm not intimately familiar with the internals of indexes in MySQL, but in the two database products I am familiar with (MS SQL Server and Oracle), indexes are balanced-tree (B-tree) structures whose nodes are organized as a sequenced tuple of the columns the index is defined on, in the order they were defined.

So, unless MySQL does it very differently (it probably doesn't), any composite index (one defined on more than one column) is usable by any query that needs to filter or sort by a subset of the index's columns, as long as that subset is compatible with the index: that is, when the query's columns are arranged in the same sequence as the index definition, they must form a prefix of the index's column list, starting at the first column, with no gaps except at the end.

In other words, if you have an index on (a, b, c, d), a query that filters on (a), (a, b), or (a, b, c) can use the index, but a query that needs to filter on (b), (c), or (b, c) cannot.
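
As a concrete sketch (generic SQL; the table and index names are made up for illustration):

CREATE TABLE t (a INT, b INT, c INT, d INT);
CREATE INDEX idx_abcd ON t (a, b, c, d);

-- These filters match a leading prefix of (a, b, c, d), so the index is usable:
SELECT * FROM t WHERE a = 1;
SELECT * FROM t WHERE a = 1 AND b = 2;
SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;

-- These skip the leading column a, so the index is not usable:
SELECT * FROM t WHERE b = 2;
SELECT * FROM t WHERE b = 2 AND c = 3;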

So in your case, if you often need to filter or sort on the column item alone, you need to add another index on that column by itself.
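
A minimal sketch, assuming the column is called item and the table orders (both names hypothetical):

CREATE INDEX idx_item ON orders (item);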

Necessary to create an index on a multi-column primary key in SQL Server?

If you created your primary key as:

CREATE TABLE TBL (
    UserID      INT,           -- column types are not in the original question; assumed for illustration
    SomeTypeID  INT,
    SomeSubType INT,
    Data        VARCHAR(100),
    CONSTRAINT PK PRIMARY KEY (UserID, SomeTypeID, SomeSubType)
);

Then the default index that is being created is a CLUSTERED index.

Usually (though not always), when looking for data you want your queries to use a NON-CLUSTERED index to filter rows: the columns you filter on form the key of the index, and the column you return from those rows (here, Data) is added as an INCLUDED column, like below:

CREATE NONCLUSTERED INDEX ncl_indx 
ON TBL (UserID, SomeTypeID, SomeSubType) INCLUDE (Data);

By doing this, you avoid accessing the table data through the CLUSTERED index.

But, you can specify the type of index that you want your PRIMARY KEY to be, so:

CREATE TABLE TBL (
    UserID      INT,           -- types assumed, as above
    SomeTypeID  INT,
    SomeSubType INT,
    Data        VARCHAR(100),
    CONSTRAINT PK PRIMARY KEY NONCLUSTERED (UserID, SomeTypeID, SomeSubType)
);

But because you want this defined as a PRIMARY KEY, you are not able to use the INCLUDE functionality, so you can't avoid the disk lookup needed to fetch the Data column, which is essentially where you started with the default CLUSTERED index.

However, there is still a way to keep the uniqueness that the PRIMARY KEY gives you and benefit from the INCLUDE functionality, so as to incur fewer disk I/Os.

You can declare your NONCLUSTERED INDEX as UNIQUE, which ensures that the combination of the three columns making up the index key is unique.

CREATE UNIQUE NONCLUSTERED INDEX ncl_indx 
ON TBL (UserID, SomeTypeID, SomeSubType) INCLUDE (Data);

Mind you, if you do all of this, your table becomes a HEAP (it has no clustered index), which is usually not a good thing. If you have thought your table design through and decided that the best clustering key for your CLUSTERED INDEX is (UserID, SomeTypeID, SomeSubType), then it's best to leave everything as you currently have it.

Otherwise, if you have decided on a different clustering key, then you can add this unique nonclustered index, given that you are going to query the table the way you said you will.
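
For reference, a sketch of that arrangement, assuming a hypothetical alternative clustering column named CreatedAt and the same assumed column types:

CREATE TABLE TBL (
    UserID      INT,
    SomeTypeID  INT,
    SomeSubType INT,
    CreatedAt   DATETIME,      -- hypothetical clustering column
    Data        VARCHAR(100),
    CONSTRAINT PK PRIMARY KEY NONCLUSTERED (UserID, SomeTypeID, SomeSubType)
);

CREATE CLUSTERED INDEX cl_indx ON TBL (CreatedAt);

CREATE UNIQUE NONCLUSTERED INDEX ncl_indx
ON TBL (UserID, SomeTypeID, SomeSubType) INCLUDE (Data);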

Does a composite primary key also create an index for each column separately?

In Teradata, if you specify a PRIMARY KEY clause when you create the table, then the table will automatically be created with a UNIQUE PRIMARY INDEX (UPI) on those PK columns. Although Teradata supports keys, it is more of an index-based DBMS.
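
A sketch of what that looks like (hypothetical table; note that Teradata requires PK columns to be NOT NULL):

CREATE TABLE mytable (
    A INTEGER NOT NULL,
    B INTEGER NOT NULL,
    C VARCHAR(10),
    PRIMARY KEY (A, B)  -- implemented as a UNIQUE PRIMARY INDEX (UPI) on (A, B)
);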

In your case, you will have very, very fast reads (UPI access: single AMP, single row) only when you specify all of the fields in your PK. This applies to equality access, as mentioned in the previous comments (thanks, Dieter).

If you access the table on some but not ALL of the PK / UPI columns, then your query won't use the UPI access path. You'd need to define separate indexes or other optimization strategies, depending on your queries.

If you only care about read performance, then it makes sense to create secondary indexes on the separate columns. Just run EXPLAIN on your query to make sure the indexes are actually being used by the Optimizer.

Another option is to ditch the PK specification altogether, especially if you never access the table on that group of columns. If there is one column you access more than the other, specify that one as your PRIMARY INDEX (non-unique) and create a secondary index on the other one. Something like:

CREATE TABLE mytable (
    A INTEGER,
    B INTEGER,
    C VARCHAR(10)
)
PRIMARY INDEX (A) -- Non-unique primary index
;

CREATE INDEX (B) ON mytable; -- Create secondary index
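
As a quick check (a sketch; the literal plan wording varies by Teradata release):

EXPLAIN SELECT * FROM mytable WHERE A = 1;  -- should show access via the primary index on A
EXPLAIN SELECT * FROM mytable WHERE B = 2;  -- should show access via the secondary index on B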

Multi column primary keys?

If you use the former design, with the superfluous identity column, there is no constraint preventing you from inserting two rows with identical UserID and TargetID. You would have to create a UNIQUE constraint over the other two columns, which creates a compound index anyway.
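
A sketch of the two designs, using a hypothetical table named user_target:

-- Surrogate key plus a UNIQUE constraint to rule out duplicate pairs:
CREATE TABLE user_target (
    id       INT PRIMARY KEY,
    UserID   INT NOT NULL,
    TargetID INT NOT NULL,
    CONSTRAINT uq_user_target UNIQUE (UserID, TargetID)
);

-- Composite primary key; the uniqueness comes for free:
CREATE TABLE user_target (
    UserID   INT NOT NULL,
    TargetID INT NOT NULL,
    PRIMARY KEY (UserID, TargetID)
);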

On the other hand, some frameworks (e.g. Rails) insist that every table has a surrogate key named id so the "correct" design may not work. It depends on what code you're writing to use this table design.

Does making a primary key in multiple columns generate indexes for all of them?

No, indexes will not be created for the individual fields.

If you have a composite key on FieldA, FieldB, FieldC and you run

select * from MyTable where FieldA = :a 

or

select * from MyTable where FieldA = :a and FieldB = :b

Then it will use this index (because they are the first two fields in the key).

If you have

select * from MyTable where FieldB = :b and FieldC = :c

then you are using parts of the index but not its leading edge, and the index can only be used less efficiently: through an index skip scan, full index scan, or fast full index scan.

(Thanks to David Aldridge for the correction)
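
To see which access path is actually chosen, you can inspect the plan. An Oracle-flavored sketch (skip scans are Oracle terminology; the bind values are placeholders):

EXPLAIN PLAN FOR
SELECT * FROM MyTable WHERE FieldB = :b AND FieldC = :c;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);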

Update using range over multiple primary key columns doesn't use index

Update

In response to Toni's comment below, I tried using a tuple (row constructor) as I would in Python, and it worked much better than my original suggestion. Based on the EXPLAIN ANALYZE output, the implicit ROW(id1, id2) comparison is compatible with the index backing the PK.

select version();
version
----------------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 10.12 (Ubuntu 10.12-0ubuntu0.18.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit
(1 row)

explain analyze
select * from testidx_array
where (id1, id2) between (8, 150) and (9, 2000);

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on testidx_array (cost=423.81..1263.91 rows=19855 width=40) (actual time=1.772..4.148 rows=11851 loops=1)
Recheck Cond: ((ROW(id1, id2) >= ROW(8, 150)) AND (ROW(id1, id2) <= ROW(9, 2000)))
Heap Blocks: exact=54
-> Bitmap Index Scan on testidx_array_pkey (cost=0.00..418.84 rows=19855 width=0) (actual time=1.722..1.722 rows=11851 loops=1)
Index Cond: ((ROW(id1, id2) >= ROW(8, 150)) AND (ROW(id1, id2) <= ROW(9, 2000)))
Planning time: 0.096 ms
Execution time: 4.867 ms
(7 rows)

Old Answer (Superseded)

You should be able to force the use of the index by specifying the range for your pk1 and then adding an AND for the array[pk1, pk2] condition:

where pk1 between $first_pk1_value and $last_pk1_value
and array[pk1, pk2] between array[$first_pk1_value, $first_pk2_value]
and array[$last_pk1_value, $last_pk2_value]

This worked for me in a test table:

\d testidx_array
Table "public.testidx_array"
 Column   |  Type   | Collation | Nullable | Default
----------+---------+-----------+----------+---------
 id1      | integer |           | not null |
 id2      | integer |           | not null |
 somedata | text    |           |          |
Indexes:
"testidx_array_pkey" PRIMARY KEY, btree (id1, id2)

explain analyze
select * from testidx_array
where array[id1, id2] between array[8,150] and array[9,2000];

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on testidx_array (cost=0.00..1943.00 rows=500 width=40) (actual time=42.011..50.758 rows=11851 loops=1)
Filter: ((ARRAY[id1, id2] >= '{8,150}'::integer[]) AND (ARRAY[id1, id2] <= '{9,2000}'::integer[]))
Rows Removed by Filter: 88149
Planning time: 0.151 ms
Execution time: 51.325 ms
(5 rows)

explain analyze
select * from testidx_array
where id1 between 8 and 9
and array[id1, id2] between array[8,150] and array[9,2000];

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on testidx_array (cost=418.85..1258.91 rows=99 width=40) (actual time=2.278..11.109 rows=11851 loops=1)
Recheck Cond: ((id1 >= 8) AND (id1 <= 9))
Filter: ((ARRAY[id1, id2] >= '{8,150}'::integer[]) AND (ARRAY[id1, id2] <= '{9,2000}'::integer[]))
Rows Removed by Filter: 8149
Heap Blocks: exact=90
-> Bitmap Index Scan on testidx_array_pkey (cost=0.00..418.82 rows=19853 width=0) (actual time=2.138..2.138 rows=20000 loops=1)
Index Cond: ((id1 >= 8) AND (id1 <= 9))
Planning time: 0.289 ms
Execution time: 11.693 ms
(9 rows)

Why use multiple columns as primary keys (composite primary key)

Your understanding is correct.

You would do this in many cases. One example is a relationship like OrderHeader and OrderDetail. The PK in OrderHeader might be OrderNumber. The PK in OrderDetail might be OrderNumber plus LineNumber. Either of those alone would not be unique, but the combination of the two is guaranteed unique.

The alternative is to use a generated (non-intelligent) primary key, in this case for example OrderDetailId. But then you would not always see the relationship as easily. Some folks prefer one way; some prefer the other.
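
A sketch of both designs (column types are assumptions for illustration):

-- Composite primary key:
CREATE TABLE OrderHeader (
    OrderNumber INT PRIMARY KEY,
    OrderDate   DATE
);

CREATE TABLE OrderDetail (
    OrderNumber INT REFERENCES OrderHeader (OrderNumber),
    LineNumber  INT,
    Quantity    INT,
    PRIMARY KEY (OrderNumber, LineNumber)
);

-- Alternative with a generated (non-intelligent) key:
CREATE TABLE OrderDetail (
    OrderDetailId INT PRIMARY KEY,
    OrderNumber   INT REFERENCES OrderHeader (OrderNumber),
    LineNumber    INT,
    Quantity      INT,
    UNIQUE (OrderNumber, LineNumber)  -- still needed to keep lines unique per order
);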

Multi-Column Primary Key or Unique Constraint?

First of all: yes, a compound primary key made up of three columns makes it more annoying to join to this table, since any other table wanting to join to the Country table will also have to carry all three columns to establish the join.
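
For example, any referencing table has to carry all three columns (a hypothetical sketch, assuming the compound-key design):

CREATE TABLE City (
    CityName    VARCHAR(100),
    CountryID   INT,
    CountryCode CHAR(3),
    CountryName VARCHAR(100),
    FOREIGN KEY (CountryID, CountryCode, CountryName)
        REFERENCES Country (ID, Code, Name)
);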

And more importantly - it's NOT the same restriction!

If you have the PK on ID and a unique constraint on both Code and Name separately, then this is NOT valid:

ID     Code   Name
--------------------------
 41    CH     Switzerland
341    CH     Liechtenstein
555    LIE    Liechtenstein

because the Code of CH and the Name of Liechtenstein both appear twice.

But if you have a single PK on all three columns together, then this is valid, because each row has a different tuple of data: (41, CH, Switzerland) is not the same as (341, CH, Liechtenstein), and therefore those two rows are admissible.

If you put the PK on all three columns at once, then the uniqueness only extends to the whole tuple (all three columns) - each column separately can have "duplicates".
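
To make the contrast concrete, a sketch of both options (column types assumed):

-- Separate uniqueness per column:
CREATE TABLE Country (
    ID   INT PRIMARY KEY,
    Code CHAR(3) UNIQUE,
    Name VARCHAR(100) UNIQUE
);

-- Uniqueness only on the whole tuple; individual columns may repeat:
CREATE TABLE Country (
    ID   INT,
    Code CHAR(3),
    Name VARCHAR(100),
    PRIMARY KEY (ID, Code, Name)
);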

So it really boils down to

  • what you really need (which uniqueness)
  • how easy you want to make it to join to this table

