When Should You Consider Indexing Your SQL Tables

When should you consider indexing your SQL tables?

There's no good reason to forego obvious indexes (FKs, etc.) when you're creating the table. It will never noticeably affect performance to have unnecessary indexes on tiny tables, and it's good to take a first cut when your mind is into schema design. Also, some indexes serve to prevent duplicates, which can be useful regardless of table size.
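
For example, a first cut at table creation might look like this (generic SQL; the table and column names are made up for illustration):

CREATE TABLE order_items (
    order_item_id INT PRIMARY KEY,                     -- PK index comes for free
    order_id      INT NOT NULL REFERENCES orders (order_id),
    sku           VARCHAR(20) NOT NULL,
    UNIQUE (order_id, sku)                             -- prevents duplicates regardless of table size
);
CREATE INDEX ix_order_items_order_id ON order_items (order_id);  -- the "obvious" FK index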

I guess the proper answer to your question is that the number of records in the table should have nothing to do with when to create indexes.

How to determine if an Index is required or necessary

I use Jason Strate's index analysis scripts. They tell you how heavily your existing indexes are used, as well as how heavily any missing indexes would have been used. I typically don't add an index unless it would be used by more than 5 or 10% of the queries on a table.

Most importantly, though, it's about making sure the application responds fast enough for the users.

  • Jason Strate's index analysis blog articles

These days, I use sp_BlitzIndex® when performing index analysis.
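
A minimal invocation looks something like this (database, schema, and table names are placeholders; the parameter names come from the First Responder Kit, so check your installed version):

EXEC dbo.sp_BlitzIndex
    @DatabaseName = 'MyDatabase',
    @SchemaName   = 'dbo',
    @TableName    = 'Orders';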

When should I create database indexes?

Many (most?) DBMS use indexes to support unique constraints. Always create indexes to enforce unique constraints; they (the constraints) are crucial to the correct operation of your database.
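
For example (hypothetical names; most DBMSs will build an index behind the scenes to enforce this):

ALTER TABLE employees
    ADD CONSTRAINT uq_employees_email UNIQUE (email);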

Where you have a choice of how to create an index on multiple columns, put the column that will always be referenced in the queries ahead of the other fields, as a rule. This works best if the leading column is also somewhat selective.
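
A sketch, assuming queries always filter on customer_id and only sometimes on order_date (names are made up):

CREATE INDEX ix_orders_customer_date ON orders (customer_id, order_date);
-- usable by queries filtering on customer_id alone, or on customer_id + order_date;
-- generally not usable by queries filtering on order_date alone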

After you have the constraints necessary for uniqueness, consider those needed to enforce referential integrity; these are usually handled by the DBMS too. Again, you cannot afford to have your database in an inconsistent state: it is a logical system, and if it contains contradictions, you can prove anything from it, which is not helpful.

After the uniqueness and referential integrity constraints are dealt with (indexed), then you may or may not benefit from some others. Choose carefully, and add as few extras as possible (zero is a good number). Each index slows down update operations (UPDATE, INSERT, DELETE) and uses storage space. The intention is that it should win its place by speeding up queries. However, don't forget that the optimizer has to consider each index and whether it can be useful in answering the query, so indexes also slow down the optimizer (though you'd probably be hard-pressed to measure that effect).

When you do add indexes, add them on selective columns (not 'sex' containing 'M' and 'F', but maybe 'dob' containing dates of birth between 1900 and 2010, or columns with even more distinct values than that). Consider whether extra columns will help answer more queries. Some DBMS (such as DB2) provide for indexes with extra columns that are not part of the uniqueness constraint but which carry columns that are frequently used in the query. These can allow an index-only scan (one which does not need to access the table data because the needed values are all in the index).
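
As a sketch, such an index might look like this in SQL Server syntax (DB2 and PostgreSQL 11+ have similar INCLUDE clauses; table and column names are made up):

CREATE NONCLUSTERED INDEX ix_employees_dob
    ON employees (dob)
    INCLUDE (lastName, firstName);  -- extra columns ride along so the query never touches the table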

There is much more that could be said, but this covers a lot of the territory.

  • As few indexes as necessary — and preferably no extras.

How to decide when use index on table column

In general, my indexing strategy would be something like this (I'm using SQL Server exclusively for now - adapt to your own database system as needed):

  • pick a good clustering key - not a GUID, not a VARCHAR(250) or something - a good clustering key is narrow, unique, stable, and ever-increasing; something like an INT IDENTITY is perfect. Make this your clustered primary key -> gives you your first index on the table

  • for any column that is being used as a foreign key into another table - add an index. It can either be a single-column index - or it might be a compound index - whatever works best for your case. It's important that the foreign key column be the first column in that index (if you're using a compound index) - otherwise, the benefits for JOINs or for checking referential integrity won't be available to your system (see the sketch below)
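
A minimal T-SQL sketch of these two steps (table and index names are made up, and dbo.Customers is assumed to exist):

CREATE TABLE dbo.Orders (
    OrderID    INT IDENTITY(1,1) NOT NULL,
    CustomerID INT NOT NULL,
    OrderDate  DATETIME NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderID),   -- narrow, unique, stable, ever-increasing
    CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerID)
        REFERENCES dbo.Customers (CustomerID)
);
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID);   -- FK column leads the index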

And that's it for now.

Then: run your system - observe, and measure - establish a baseline. Is the app fast enough? If yes -> you're done - go home and enjoy your spare time.

If not: then start collecting data and indications as to why the app isn't fast enough. Look at things like the DMVs in SQL Server that tell you about the worst-performing queries, or the missing-index DMVs. Analyze those. See what you could improve. Add one index at a time and again: observe, measure, compare to your baseline.
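
For example, the missing-index DMVs can be queried roughly like this (these are standard SQL Server DMVs; treat the ranking expression as a rough heuristic, not gospel):

SELECT  d.statement, d.equality_columns, d.inequality_columns, d.included_columns,
        s.user_seeks, s.avg_total_user_cost, s.avg_user_impact
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats s ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_total_user_cost * s.avg_user_impact DESC;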

If you have improvement -> leave that index in place and this measurement is your new baseline. Rinse and repeat until you (and your users) are happy with the app's performance (and then go home and enjoy your time off).

Over-indexing in SQL Server can be worse than not having any indexes. Don't start out with too many indexes! Only establish a good clustered PK and the foreign key nonclustered indexes - that's all - then observe, measure, optimize & repeat that cycle.

Decision when to create Index on table column in database?

But an update of a column value won't have any impact on the index value. Right?

No. Updating an indexed column will have an impact. The Oracle 11g performance manual states that:

UPDATE statements that modify indexed columns and INSERT and DELETE
statements that modify indexed tables take longer than if there were
no index. Such SQL statements must modify data in indexes and data in
tables. They also create additional undo and redo.


So the bottom line is: when my column is used in a join between two tables, we should consider creating an index on the join column, but all other columns can be skipped, because creating indexes on them would involve the extra cost of updating the index whenever a new value is inserted into the column. Right?

Not just inserts, but any other Data Manipulation Language (DML) statement.

Consider this scenario . . . Will an index help here?

With regards to this last paragraph, why not build some test cases with representative data volumes so that you prove or disprove your assumptions about which columns you should index?

Is it better to create an index before filling a table with data, or after the data is in place?

Creating the index after the data insert is the more efficient way (it is even often recommended to drop the index before a batch import and recreate it after the import).

Synthetic example (PostgreSQL 9.1, slow development machine, one million rows):

CREATE TABLE test1(id serial, x integer);
INSERT INTO test1(id, x) SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 7816.561 ms
CREATE INDEX test1_x ON test1 (x);
-- Time: 4183.614 ms

Insert and then create index - about 12 sec

CREATE TABLE test2(id serial, x integer);
CREATE INDEX test2_x ON test2 (x);
-- Time: 2.315 ms
INSERT INTO test2(id, x) SELECT x.id, x.id*100 FROM generate_series(1,1000000) AS x(id);
-- Time: 25399.460 ms

Create index and then insert - about 25.5 sec (more than two times slower)
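
The drop-before-import pattern mentioned above would therefore look like this (PostgreSQL syntax, reusing the first table):

DROP INDEX test1_x;
-- ... run the batch import here ...
CREATE INDEX test1_x ON test1 (x);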

When does a database table get large enough that an index is beneficial?

For queries involving a small portion of the table's rows, indexes are always beneficial, be it 100 rows or 1,000,000.

See this entry in my blog for examples with plans and performance details:

  • Indexing tiny tables

The queries like this:

SELECT  *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col

will most probably use HASH JOIN. A hash table for the smaller table will be built, and the rows from the larger table will be used to probe the hash table.

To do this, no index is needed.

However, this query:

SELECT  *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
WHERE t1.othercol = @value

will use NESTED LOOPS: the rows from the outer table (table1) will be searched using an index on table1.othercol, and the rows from the inner table (table2) will be searched using an index on table2.col.

If you don't have indexes on these columns, a HASH JOIN will be used, which requires scanning all rows from both tables and some more resources to build a hash table.
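
The two supporting indexes would be something like this (the index names are made up):

CREATE INDEX ix_table1_othercol ON table1 (othercol);  -- drives the outer seek
CREATE INDEX ix_table2_col      ON table2 (col);       -- drives the inner lookups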

Indexes are also useful for queries like this:

SELECT  t2.col
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col

, in which case the engine doesn't need to read table2 itself at all: everything you need for this query can be found in the index, which can be much smaller than the table itself and more efficient to read.

And, of course, if you need your data sorted and have indexes on both table1.col and table2.col, then the following query:

SELECT  *
FROM table1 t1
JOIN table2 t2
ON t2.col = t1.col
ORDER BY
t2.col

will probably use the MERGE JOIN method, which is super fast if both input rowsets are sorted; its output is also sorted, which means the ORDER BY comes for free.

Note that even if you don't have an index, the optimizer may choose to Eager Spool your small table, which means building a temporary index for the duration of the query and dropping it after the query finishes.

If the table is small, this will be very fast, but again, an index won't hurt (for SELECT queries, I mean). If the optimizer doesn't need it, it just won't be used.

Note, though, that creating an index may affect DML performance, but that's another story.

Is there a downside to adding numerous indexes to tables?

Indexes make it faster to search tables, but slower to write to them. Having unused indexes will end up causing unnecessary slowdown.

How does database indexing work?

Why is it needed?

When data is stored on disk-based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

Because a set of records can only be sorted on one field, searching on a field that isn't sorted requires a linear search, which requires (N+1)/2 block accesses on average, where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn't contain unique entries), then the entire tablespace must be searched, at N block accesses.

With a sorted field, on the other hand, a binary search may be used, which takes log2 N block accesses. Also, since the data is sorted, even for a non-key field the rest of the table doesn't need to be searched for duplicate values once a higher value is found. Thus the performance increase is substantial.

What is indexing?

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

The downside to indexing is that these indices require additional space on the disk: with the MyISAM engine they are stored together in one file, which can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

How does it work?

Firstly, let's outline a sample database table schema:

Field name         Data type     Size on disk
id (Primary key)   Unsigned INT  4 bytes
firstName          Char(50)      50 bytes
lastName           Char(50)      50 bytes
emailAddress       Char(100)     100 bytes

Note: char was used in place of varchar to allow for an accurate size on disk value.
This sample database contains five million rows and is unindexed. The performance of several queries will now be analyzed. These are a query using the id (a sorted key field) and one using the firstName (a non-key unsorted field).
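
For concreteness, the schema above could be declared like this (MySQL syntax; the table name people is an assumption, as the original doesn't name it):

CREATE TABLE people (
    id           INT UNSIGNED NOT NULL,
    firstName    CHAR(50)  NOT NULL,
    lastName     CHAR(50)  NOT NULL,
    emailAddress CHAR(100) NOT NULL,
    PRIMARY KEY (id)
) ENGINE = MyISAM;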

Example 1 - sorted vs unsorted fields

Given our sample database of r = 5,000,000 records of a fixed size giving a record length of R = 204 bytes and they are stored in a table using the MyISAM engine which is using the default block size B = 1,024 bytes. The blocking factor of the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The total number of blocks required to hold the table is N = (r/bfr) = 5000000/5 = 1,000,000 blocks.

A linear search on the id field would require an average of N/2 = 500,000 block accesses to find a value, given that the id field is a key field. But since the id field is also sorted, a binary search can be conducted requiring an average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see this is a drastic improvement.

Now the firstName field is neither sorted nor a key field, so a binary search is impossible, nor are the values unique, and thus the table will require searching to the end for an exact N = 1,000,000 block accesses. It is this situation that indexing aims to correct.

Given that an index record contains only the indexed field and a pointer to the original record, it stands to reason that it will be smaller than the multi-field record that it points to. So the index itself requires fewer disk blocks than the original table, which therefore requires fewer block accesses to iterate through. The schema for an index on the firstName field is outlined below;


Field name        Data type  Size on disk
firstName         Char(50)   50 bytes
(record pointer)  Special    4 bytes

Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the size of the table.
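
Creating the index described above is a one-liner (MySQL syntax; the index name is made up):

CREATE INDEX idx_people_firstName ON people (firstName);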

Example 2 - indexing

Given our sample database of r = 5,000,000 records with an index record length of R = 54 bytes and using the default block size B = 1,024 bytes. The blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records per disk block. The total number of blocks required to hold the index is N = (r/bfr) = 5000000/18 = 277,778 blocks.

Now a search using the firstName field can utilize the index to increase performance. This allows for a binary search of the index with an average of log2 277778 = 18.08 = 19 block accesses. The index entry then gives the address of the actual record, which requires one further block access to read, bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000 block accesses required to find a firstName match in the non-indexed table.

When should it be used?

Given that creating an index requires additional disk space (277,778 extra blocks in the above example, a ~28% increase), and that too many indices can cause issues arising from the file system's size limits, careful thought must be given to selecting the correct fields to index.

Since indices are only used to speed up the searching for a matching field within the records, it stands to reason that indexing fields used only for output would simply be a waste of disk space and processing time when doing an insert or delete operation, and thus should be avoided. Also, given the nature of a binary search, the cardinality or uniqueness of the data is important. Indexing on a field with a cardinality of 2 would split the data in half, whereas a cardinality of 1,000 would split it into roughly 1,000 groups of about 5,000 records each (on our five-million-row table). With such a low cardinality the effectiveness is reduced to that of a linear search, and the query optimizer will avoid using the index if the cardinality is less than 30% of the record number, effectively making the index a waste of space.
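
A quick way to sanity-check a candidate field's cardinality before indexing it (MySQL syntax, using the hypothetical people table from above):

SELECT COUNT(DISTINCT firstName) AS cardinality,
       COUNT(DISTINCT firstName) / COUNT(*) AS selectivity
FROM people;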


