Customer.Pk_Name Joining Transactions.Fk_Name VS. Customer.Pk_Id [Serial] Joining Transactions.Fk_Id [Integer]

Is a CLUSTER INDEX desirable when loading a sorted loadfile into a new table?

The script unloads the active and inactive transactions to two different files, each sorted by customer name. It then loads them back into the table, active transactions first, followed by inactive transactions, and finally creates a clustered index on customer name. The problem is that the database now has to go back and re-order the physical rows by customer name when building the clustered index: although each unload file is separately ordered by customer name, the concatenation of the two is not, causing more work for the database. Unless the separate files for active and inactive transactions are needed elsewhere, you might try dumping all the transactions to a single file, ordered by customer name, and re-loading the table from that single file. At that point the data in the table would already be ordered by customer name and the clustered index creation wouldn't have to re-order it.

As to whether or not the clustered index is really needed - a clustered index can be of value if you query on that column, as it should reduce the number of I/Os needed to fetch the data. Usually clustered indexes are created on columns which increase monotonically, so perhaps TRX_NUM would serve well as the column for the clustered index.
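For reference, the two alternatives sketched in Informix SQL (a table can have only one clustered index; the table and column names are assumptions based on the discussion above):

```sql
-- Option A: cluster on the column used for querying (customer name)
CREATE CLUSTER INDEX ix_trx_cust ON transactions (cust_name);

-- Option B: cluster on a monotonically increasing column instead
CREATE CLUSTER INDEX ix_trx_num ON transactions (trx_num);
```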

Share and enjoy.

Do indexes on boolean columns help page caching?

Yes, that reasoning is correct. You can in effect partition the data set into two regions, one hot and one cold. Using a bit is just a special case of this technique. You also could use a date column and cluster on that (of course whether that is feasible or not depends on the schema and data).

Partitioning has a similar effect. Choosing the clustering key is lighter weight and just as good though.

Oftentimes clustering on an auto-incremented number also has good locality because the IDENTITY value correlates with age and age correlates with frequency of usage.
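The hot/cold partitioning idea above can be sketched in T-SQL like this (names are illustrative; the bit column leads the clustering key so cold rows migrate to one physical region and hot rows to the other):

```sql
CREATE TABLE dbo.Orders
(
    Id      INT IDENTITY NOT NULL,
    Deleted BIT NOT NULL DEFAULT 0,
    Payload NVARCHAR(200) NULL,
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (Id)
);

-- Rows with Deleted = 0 cluster together, so the hot working set
-- fits in fewer pages and caches better
CREATE UNIQUE CLUSTERED INDEX CIX_Orders ON dbo.Orders (Deleted, Id);
```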

The same optimization does not apply directly to nonclustered indexes. You can use a boolean prefix for them, too, but you need to provide it in a sargable form:

WHERE SomeNCIndexCol = '1234' AND Deleted IN (0, 1)

SQL Server is not smart enough to figure this out by itself. It cannot "skip" the first index level like Oracle can. So we have to provide seek keys manually. (Connect item: https://connect.microsoft.com/SQLServer/feedback/details/695044)
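In other words (a sketch; the table and SomeNCIndexCol names are placeholders), the nonclustered index leads with the flag, and the query enumerates both flag values so the optimizer can seek each range:

```sql
CREATE INDEX IX_T_Deleted_Col ON dbo.T (Deleted, SomeNCIndexCol);

-- Produces two seek ranges: (0, '1234') and (1, '1234')
SELECT *
FROM dbo.T
WHERE SomeNCIndexCol = '1234'
  AND Deleted IN (0, 1);
```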

A different concern is write performance. Marking a row as deleted (SET Deleted = 1) now requires a physical delete-plus-insert pair for the clustered index, plus one for each nonclustered index. Also, primary key changes are not supported by most ORMs, so you probably should not make this clustering key the primary key.

As a side note, creating an index on a bit column has other use cases as well. If 99% of the values are zero (or one), you can definitely use the index to perform a seek and key lookup on the rare value. You can also use such an index for counting (or grouping on the bit column).
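For example (illustrative names again), a plain index on the bit column can satisfy a skewed seek or an aggregate entirely from the narrow index:

```sql
CREATE INDEX IX_T_Deleted ON dbo.T (Deleted);

-- Seek the rare value (say only 1% of rows have Deleted = 1)
SELECT * FROM dbo.T WHERE Deleted = 1;

-- Count/group straight off the index without touching the base table
SELECT Deleted, COUNT(*) FROM dbo.T GROUP BY Deleted;
```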

Using indexes on memory-optimized tables

From your questions:

  1. It is my understanding that memory-optimized indexes do not fragment.
  2. As an in-memory table, would I reverse the PK (PK2, PK1) and have a second index on PK1?
  3. Is there no reason to drop and recreate the index on PK1?
  4. Does index fragmentation truly go away in a memory-optimized table?

Question 1: yes, memory-optimized indexes don't fragment.

Question 2: no. What you want is a hash index on PK2 and a hash index on PK1. If you want to preserve key uniqueness on PK1 and PK2, you'd need a nonclustered key on (PK1, PK2). Be careful that PK2 doesn't have a lot of repetition.

Question 3: dropping and re-creating an index can't be done on memory-optimized tables.

Question 4: yes, fragmentation goes away with memory-optimized tables.
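Putting the answers together, a sketch of such a table in T-SQL (assumes SQL Server 2014+ with a memory-optimized filegroup; table name and bucket counts are illustrative, and bucket counts should be sized to the expected number of distinct key values):

```sql
CREATE TABLE dbo.Events
(
    PK1 INT NOT NULL,
    PK2 INT NOT NULL,
    -- nonclustered key preserves uniqueness across (PK1, PK2)
    CONSTRAINT PK_Events PRIMARY KEY NONCLUSTERED (PK1, PK2),
    -- point-lookup hash indexes on each column individually
    INDEX IX_PK2 HASH (PK2) WITH (BUCKET_COUNT = 1048576),
    INDEX IX_PK1 HASH (PK1) WITH (BUCKET_COUNT = 1048576)
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```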

Thanks Guy

Clustered index - multi-part vs single-part index and effects of inserts/deletes

Yes, inserting into the middle of an existing table (or its page) could be expensive when you have a less than optimal clustered index. Worst case would be a page split: half the rows on the page would have to be moved elsewhere, and indices (including non-clustered indices on that table) would need to be updated.

You can alleviate that problem by using the right clustered index - one that ideally is:

  • narrow (only a single field, as small as possible)
  • static (never changes)
  • unique (so that SQL Server doesn't need to add 4-byte uniqueifiers to your rows)
  • ever-increasing (like an INT IDENTITY)
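A clustered key meeting all four criteria, sketched in T-SQL (table and column names are made up):

```sql
CREATE TABLE dbo.OrderDetail
(
    Id       INT IDENTITY NOT NULL,   -- narrow, static, ever-increasing
    OrderRef INT NOT NULL,
    Note     VARCHAR(200) NULL,
    CONSTRAINT PK_OrderDetail PRIMARY KEY CLUSTERED (Id)  -- unique by definition
);
```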

You want a narrow key (ideally a single INT) since each and every entry in each and every non-clustered index will also contain the clustering key(s) - you don't want to put lots of columns in your clustering key, nor do you want to put things like VARCHAR(200) there!

With an ever increasing clustered index, you will never see the case of a page split. The only fragmentation you could encounter is from deletes ("swiss cheese" problem).

Check out Kimberly Tripp's excellent blog posts on indexing - most notably:

  • GUIDs as PRIMARY KEYs and/or the clustering key
  • The Clustered Index Debate Continues... - this one actually shows that a good clustered index will speed up all operations - including inserts, delete etc., compared to a heap with no clustered index!
  • Ever-increasing clustering key - the Clustered Index Debate..........again!

Assume there is a table (Junk) and there are two queries that are done on the table; the first query searches by Name and the second query searches by Name and Something. As I'm working on the database I discovered that the table has been created with two indexes, one to support each query, like so:

That's definitely not necessary - if you have one index on (Name, Something), that index can just as well be used if you search and restrict on just WHERE Name = 'abc'. A separate index on just the Name column is not needed and only wastes space (and costs time to keep up to date).
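Sketched against the example (Junk, Name, and Something come from the question; the index names are made up):

```sql
-- Serves WHERE Name = @n AND Something = @s ...
-- ... and also WHERE Name = @n alone, because Name is the leading column
CREATE INDEX IX_Junk_Name_Something ON dbo.Junk (Name, Something);

-- The single-column index on Name alone is redundant and can be dropped
DROP INDEX IX_Junk_Name ON dbo.Junk;  -- hypothetical name of the redundant index
```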

So basically, you only need a single index on (Name, Something). If you have no other indices on this table, you should be able to make this the clustered key; however, since that key won't be ever-increasing and could possibly change, too (right?), this might not be such a great idea.

The other option would be to introduce a surrogate ID INT IDENTITY and cluster on that - with two benefits:

  • it's everything a good clustered key should be, including ever-increasing -> you'll never have any issues with page splits and performance for INSERT operations
  • you still get all the benefits of having a clustering key (see Kim Tripp's blog posts - clustered tables are almost always preferable to heaps)
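Retrofitting that surrogate onto the example table might look like this (a sketch; it assumes dbo.Junk has no existing clustered index):

```sql
-- Add a narrow, static, unique, ever-increasing surrogate ...
ALTER TABLE dbo.Junk ADD Id INT IDENTITY NOT NULL;

-- ... and cluster on it
CREATE UNIQUE CLUSTERED INDEX CIX_Junk ON dbo.Junk (Id);
```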

How can I display and manipulate record arrays?

Informix 4GL could do it; Informix SQL, even with ESQL/C assistance, cannot sensibly do it. I don't know about Progress or Oracle, but it's likely that they can do something similar.

In I4GL, you would pull up the master record information, then using regular DISPLAY statements (not DISPLAY ARRAY) you would display the detail information in the screen rows of the detail section. When the user wanted to choose a row to update, you would go into either a DISPLAY ARRAY or (possibly) an INPUT ARRAY statement.


