How to Index a Database Column

How does database indexing work?

Why is it needed?

When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

Because a set of records can be physically sorted on only one field, searching on a field that isn't sorted requires a linear search, which needs (N+1)/2 block accesses on average, where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn't contain unique entries), then the entire tablespace must be searched, at N block accesses.

With a sorted field, by contrast, a binary search may be used, which needs about log2 N block accesses. And because the data is sorted, a search on a non-key field can stop as soon as a higher value is found, rather than scanning the rest of the table for duplicates. Thus the performance increase is substantial.
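To make the difference concrete, here is a minimal Python sketch that counts how many entries each strategy examines. The list of integers is only a stand-in for sorted disk blocks, and the function names are made up for illustration.

```python
def linear_probes(sorted_keys, target):
    """Count the entries examined by a front-to-back scan."""
    for probes, key in enumerate(sorted_keys, start=1):
        if key == target:
            return probes
    return len(sorted_keys)


def binary_probes(sorted_keys, target):
    """Count the entries examined by a binary search."""
    lo, hi, probes = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == target:
            return probes
        if sorted_keys[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return probes


keys = list(range(1_000_000))   # stand-in for 1,000,000 sorted blocks
target = 750_000
print(linear_probes(keys, target))   # 750001
print(binary_probes(keys, target))   # never more than 20 for this size
```

The binary search halves the remaining range on every probe, which is where the log2 N figure comes from.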

What is indexing?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.
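As a rough illustration of that structure (not how any particular engine lays it out on disk), the index can be modelled in Python as a sorted list of (field value, pointer) pairs. The sample rows and the `lookup` helper are hypothetical.

```python
from bisect import bisect_left

# Hypothetical table: a row's position in the list plays the role of its disk address.
rows = [
    (1, "Maria", "Lopez"),
    (2, "Adam", "Young"),
    (3, "Zoe", "Baker"),
    (4, "Adam", "Clark"),
]

# Build the index: (field value, pointer) pairs, sorted by field value.
first_name_index = sorted((row[1], pos) for pos, row in enumerate(rows))


def lookup(index, table, value):
    """Binary-search the index, then follow the pointers back to the rows."""
    i = bisect_left(index, (value, -1))   # first entry whose value is >= the target
    matches = []
    while i < len(index) and index[i][0] == value:
        matches.append(table[index[i][1]])
        i += 1                            # stops at the first larger value
    return matches


print(lookup(first_name_index, rows, "Adam"))
# [(2, 'Adam', 'Young'), (4, 'Adam', 'Clark')]
```

The table itself stays in its original order; only the small index is sorted.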

The downside to indexing is that these indices require additional space on disk. Since the indices of a table using the MyISAM engine are stored together in a single file, that file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

How does it work?

Firstly, let’s outline a sample database table schema:


Field name       Data type      Size on disk
id (Primary key) Unsigned INT   4 bytes
firstName        Char(50)       50 bytes
lastName         Char(50)       50 bytes
emailAddress     Char(100)      100 bytes

Note: char was used in place of varchar to allow for an accurate size on disk value.

This sample database contains five million rows and is unindexed. The performance of two queries will now be analyzed: one using id (a sorted key field) and one using firstName (a non-key, unsorted field).

Example 1 - sorted vs unsorted fields

Our sample database has r = 5,000,000 records of a fixed size, giving a record length of R = 204 bytes, and they are stored in a table using the MyISAM engine with the default block size B = 1,024 bytes. The blocking factor of the table is bfr = (B/R) = 1024/204 = 5 records per disk block (rounded down, since a record cannot span blocks here). The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted, requiring about log2 1000000 ≈ 19.93, rounded up to 20 block accesses. Instantly we can see this is a drastic improvement.

The firstName field, however, is neither sorted nor a key field. A binary search is therefore impossible, and because the values are not unique the search must continue to the end of the table, for exactly N = 1,000,000 block accesses. It is this situation that indexing aims to correct.
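The arithmetic for the id field can be reproduced in a few lines of Python. The variable names simply mirror the symbols used in the example.

```python
import math

r = 5_000_000          # records in the table
R = 4 + 50 + 50 + 100  # record length in bytes: 204
B = 1024               # block size in bytes

bfr = B // R                        # whole records per block
N = math.ceil(r / bfr)              # blocks spanned by the table

linear_avg = N / 2                  # key field, linear search
binary = math.ceil(math.log2(N))    # key field, binary search on sorted data

print(bfr, N, linear_avg, binary)
# 5 1000000 500000.0 20
```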

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below:


Field name       Data type      Size on disk
firstName        Char(50)       50 bytes
(record pointer) Special        4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.

Example 2 - indexing

Our sample database has r = 5,000,000 records, an index record length of R = 54 bytes, and the default block size B = 1,024 bytes. The blocking factor of the index is bfr = (B/R) = 1024/54 = 18 records per disk block (rounded down). The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks (rounded up).

Now a search using the firstName field can utilize the index to increase performance. This allows for a binary search of the index, needing about log2 277778 ≈ 18.08, rounded up to 19 block accesses, to find the address of the actual record. Reading that record requires one further block access, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.
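The same calculation for the index, again as a small Python sketch using the symbols from the example:

```python
import math

r = 5_000_000   # index entries, one per table row
R = 50 + 4      # index record: firstName (50 bytes) + pointer (4 bytes)
B = 1024        # block size in bytes

bfr = B // R                              # index records per block
N = math.ceil(r / bfr)                    # blocks spanned by the index
index_probes = math.ceil(math.log2(N))    # binary search over the index
total = index_probes + 1                  # plus one read for the record itself

print(bfr, N, index_probes, total)
# 18 277778 19 20
```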

When should it be used?

Given that creating an index requires additional disk space (277,778 blocks extra from the above example, a ~28% increase), and that too many indices can cause issues arising from the file system's size limits, careful thought must be used to select the correct fields to index.

Since indices are only used to speed up the search for a matching field within the records, it stands to reason that indexing fields used only for output would simply waste disk space, as well as processing time on every insert or delete operation, and thus should be avoided. Also, given the nature of a binary search, the cardinality (uniqueness) of the data is important. On the 5,000,000-row sample table, indexing a field with a cardinality of 2 would only split the data in half, whereas a cardinality of 1,000 would return approximately 5,000 records per value. With very low cardinality the effectiveness is reduced to that of a linear search, and the query optimizer will typically skip the index when it expects a lookup to match a large share of the table (a threshold of roughly 30% of the rows is often quoted), effectively making the index a waste of space.
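A quick way to reason about cardinality is to estimate how many rows a single value matches. The sketch below assumes the 5,000,000-row sample table and evenly distributed values, which real data rarely has; `expected_matches` is a made-up helper.

```python
def expected_matches(total_rows: int, cardinality: int) -> float:
    """Average rows per distinct value, assuming an even distribution."""
    return total_rows / cardinality


total_rows = 5_000_000
for cardinality in (2, 1_000, 5_000_000):
    matches = expected_matches(total_rows, cardinality)
    share = matches / total_rows
    print(f"cardinality {cardinality:>9,}: ~{matches:,.0f} rows per value "
          f"({share:.4%} of the table)")
```

The closer the share gets to the whole table, the less an index lookup saves over simply scanning it.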

How do I index a database column?

CREATE INDEX is not actually part of the SQL-92 standard, but the following form is supported by the majority of RDBMSs that use SQL:

CREATE INDEX [index name] ON [table name] ( [column name] )
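For a runnable version, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in for whatever RDBMS you actually run. The table and index names are hypothetical, and the sqlite_master lookup is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute(
    "CREATE TABLE people ("
    " id INTEGER PRIMARY KEY,"
    " firstName TEXT, lastName TEXT, emailAddress TEXT)"
)

# The statement from above, with hypothetical names filled in.
conn.execute("CREATE INDEX idx_people_firstName ON people (firstName)")

# SQLite-specific: list the indexes that now exist on the table.
index_names = [
    name for (name,) in conn.execute(
        "SELECT name FROM sqlite_master"
        " WHERE type = 'index' AND tbl_name = 'people'"
    )
]
print(index_names)   # ['idx_people_firstName']
```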

How to see indexes for a database or table in MySQL?

To see the indexes for a specific table use SHOW INDEX:

SHOW INDEX FROM yourtable;

To see indexes for all tables within a specific schema you can use the STATISTICS table from INFORMATION_SCHEMA:

SELECT DISTINCT
    TABLE_NAME,
    INDEX_NAME
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = 'your_schema';

Removing the WHERE clause will show you all indexes in all schemas.

How do I know when to index a column, and with what?

Think of an index very roughly like the index in the back of a book. It's a totally separate area from the content of the book, where if you are seeking some specific value, you can go to the index and look it up (indexes are ordered, so finding things there is much quicker than scanning every page of the book).

The index entry has a page number, so you can then quickly go to the page seeking your topic. A database index is very similar; it is an ordered list of the relevant information in your database (the field(s) included in the index), with information for the database to find the records which match.

So... you would create an index when you have information that you need to search on frequently. Normal indexes generally don't help you with 'partial' seeks such as LIKE queries whose pattern starts with a wildcard, but any time you need to get a set of results where field X has certain value(s), they keep the DBMS from needing to 'scan' the whole table, looking for matching values.
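You can see the difference in a query plan. The sketch below uses Python's sqlite3 module as a stand-in (an assumption; your DBMS will have its own EXPLAIN output and its own rules for LIKE), with hypothetical table and index names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    " id INTEGER PRIMARY KEY,"
    " firstName TEXT, lastName TEXT, emailAddress TEXT)"
)
conn.execute("CREATE INDEX idx_people_lastName ON people (lastName)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


# Exact match on the indexed column: the plan names the index.
exact = plan("SELECT * FROM people WHERE lastName = 'Baker'")
# Pattern with a leading wildcard: the plan falls back to a table scan.
suffix = plan("SELECT * FROM people WHERE lastName LIKE '%son'")

print(exact)
print(suffix)
```

The exact wording of the plan text varies between SQLite versions, so look for the index name rather than a fixed string.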

They also help when you need to sort on a column.
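Sorting can be checked the same way. In this sketch (again SQLite via Python as a stand-in, with hypothetical names), the plan for an ORDER BY query is captured before and after the index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT)"
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT firstName, lastName FROM people ORDER BY lastName"

before = plan(query)   # no index yet: SQLite reports a temporary sort step
conn.execute("CREATE INDEX idx_people_lastName ON people (lastName)")
after = plan(query)    # with the index: rows can be read in index order

print(before)
print(after)
```

Without the index the plan includes a "USE TEMP B-TREE FOR ORDER BY" step; with it, that step disappears because the index already holds the values in order.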

Another thing to keep in mind: if the DBMS allows you to create single indexes that have multiple fields, be sure to investigate the effects of doing so, specific to your DBMS. An index that includes multiple fields is likely only to be fully useful if all those fields are being used in a query; in many DBMSs it can still help when the query filters on the leading field(s) of the index, but not when it filters only on a later one. Conversely, having multiple indexes for a single table, with one field per index, may not be of much (or any) help for queries that are filtering/sorting by multiple fields.
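Here is one way to investigate that, sketched with SQLite through Python's sqlite3 module. This shows SQLite's behaviour on an empty table with no statistics; other DBMSs (or SQLite after ANALYZE) may plan differently, and all names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    " id INTEGER PRIMARY KEY,"
    " firstName TEXT, lastName TEXT, emailAddress TEXT)"
)
# One index covering two fields, lastName first.
conn.execute("CREATE INDEX idx_people_last_first ON people (lastName, firstName)")


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


both = plan("SELECT * FROM people WHERE lastName = 'Baker' AND firstName = 'Zoe'")
leading = plan("SELECT * FROM people WHERE lastName = 'Baker'")
trailing = plan("SELECT * FROM people WHERE firstName = 'Zoe'")

print(both)      # uses the index on both fields
print(leading)   # uses the index on its leading field
print(trailing)  # scans the table; the index is not used
```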

You mentioned Full Text indexes and PKs (Primary Keys). These are different than regular indexes, though they often serve similar purposes.

First, note that a Primary Key is usually backed by an index, though what kind of index depends on the DBMS. As an example, an MSSQL PK is a Clustered Index by default; clustered indexes are special in that they are not a separate bit of data stored elsewhere, but the data itself is arranged in the table in order by the Clustered Index. This is why a popular PK is an int value that is auto-generated with sequential, increasing values. So, a Clustered Index sorts the data in the table specifically by the field's value. Compare this to a traditional dictionary; the entries themselves are ordered by the 'key', which is the word being defined.

But in MSSQL (check your DBMS documentation for your information), you can change the Clustered Index to be a different field, if you like. Sometimes this is done on datetime based fields.

Full Text indexes are different kinds of beasts entirely. They use some of the same principles, but what they are doing isn't exactly the same as the normal indexes I am describing. Also: in some DBMSs, LIKE queries do not use the full text index; special query operators are required.

These indexes are different because their intent is not to find/sort on the whole value of the column (a number, a date, a short bit of char data), but instead to find individual words/phrases within the text field(s) being indexed.

They can also often enable searching for similar words, different tenses, common misspellings and the like, and typically ignore noise words. The different way in which they work is why they also may need different operators to use them. (Again, check your local documentation for your DBMS!)

What is an index in SQL?

An index is used to speed up searching in the database. MySQL has some good documentation on the subject (which is relevant for other SQL servers as well):
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html

An index can be used to efficiently find all rows matching some column in your query and then walk through only that subset of the table to find exact matches. If you don't have indexes on any column in the WHERE clause, the SQL server has to walk through the whole table and check every row to see if it matches, which may be a slow operation on big tables.

The index can also be a UNIQUE index, which means that you cannot have duplicate values in that column, or a PRIMARY KEY which in some storage engines defines where in the database file the value is stored.
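The UNIQUE behaviour is easy to try out. This is a minimal sketch using Python's sqlite3 module as a stand-in for your SQL server; the table, index, and email address are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, emailAddress TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_people_email ON people (emailAddress)")

conn.execute("INSERT INTO people (emailAddress) VALUES ('zoe@example.com')")
try:
    # Second insert of the same value violates the UNIQUE index.
    conn.execute("INSERT INTO people (emailAddress) VALUES ('zoe@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
print(duplicate_rejected, row_count)   # True 1
```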

In MySQL you can use EXPLAIN in front of your SELECT statement to see if your query will make use of any index. This is a good start for troubleshooting performance problems. Read more here:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
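The same kind of check can be sketched outside MySQL too. The example below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module, which is an assumption on my part: its output format is not the same as MySQL's EXPLAIN, and the names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT)"
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT * FROM people WHERE firstName = 'Adam'"

without_index = plan(query)   # full table scan
conn.execute("CREATE INDEX idx_people_firstName ON people (firstName)")
with_index = plan(query)      # search using the new index

print(without_index)
print(with_index)
```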

Indexing every column in a table

Indexing a column of any table, whether memory or file system based, will speed up queries that select or sort results based on that column. This is because the index works like a tree structure, and the search distance depends on the depth of the tree, which grows much more slowly than the row count of the table (logarithmically).

Indexing every column does not defeat the purpose of the index, but it will slow down inserts and updates, because those changes will cause an update of every affected index of that table. Also, the indexes take up space on the database server, so that is another drawback to be considered.
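To get a feel for how slowly the depth grows, here is a small Python sketch. The fanout of 100 entries per node is an assumed round number for illustration, not a figure from any particular engine.

```python
def tree_levels(row_count: int, fanout: int) -> int:
    """Levels needed for a tree with `fanout` entries per node to cover row_count rows."""
    levels, capacity = 1, fanout
    while capacity < row_count:
        capacity *= fanout   # each extra level multiplies the reachable rows
        levels += 1
    return levels


for rows in (1_000, 1_000_000, 1_000_000_000):
    print(rows, tree_levels(rows, fanout=100))
# 1000 2
# 1000000 3
# 1000000000 5
```

A millionfold increase in rows adds only three levels to the search path under this assumption.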

Other SO questions to read relating to this question:

Best practices for indexing

What is an index

How many indexes are enough

Decision when to create Index on table column in database?

But an update of a column value won't have any impact on the index value, right?

No. Updating an indexed column will have an impact. The Oracle 11g performance manual states that:

UPDATE statements that modify indexed columns and INSERT and DELETE
statements that modify indexed tables take longer than if there were
no index. Such SQL statements must modify data in indexes and data in
tables. They also create additional undo and redo.

So the bottom line is: when a column is used in a join between two tables, we should consider creating an index on it, but all other columns can be skipped, because indexing them would add the extra cost of updating the index whenever a new value is inserted into the column. Right?

Not just Inserts but any other Data Manipulation Language statement.

Consider this scenario . . . Will index help here?

With regard to this last paragraph, why not build some test cases with representative data volumes so that you can prove or disprove your assumptions about which columns you should index?
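A test case along those lines might look like the following sketch. It uses SQLite through Python's sqlite3 module and synthetic data purely as placeholders; you would swap in your own DBMS, a sample that resembles your real data, and the queries you actually run. The helper names are made up.

```python
import sqlite3
import time


def build_table(row_count: int) -> sqlite3.Connection:
    """Create an in-memory table filled with synthetic names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, firstName TEXT)")
    conn.executemany(
        "INSERT INTO people (firstName) VALUES (?)",
        ((f"name{i % 5000}",) for i in range(row_count)),
    )
    return conn


def time_query(conn, sql, params, repeats=10):
    """Return (result rows, seconds taken for `repeats` runs)."""
    start = time.perf_counter()
    for _ in range(repeats):
        rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start


conn = build_table(100_000)
sql = "SELECT id FROM people WHERE firstName = ? ORDER BY id"

rows_before, secs_before = time_query(conn, sql, ("name42",))
conn.execute("CREATE INDEX idx_people_firstName ON people (firstName)")
rows_after, secs_after = time_query(conn, sql, ("name42",))

# The index must never change the answer, only the time taken.
print(f"no index: {secs_before:.4f}s, with index: {secs_after:.4f}s")
```

Timings depend on the machine and the data, so treat the numbers as a comparison for your own setup rather than a general result. It is worth timing representative inserts and updates the same way to see what each index costs.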