SQL Query for Index/Primary Key Ordinal

With a regular table, there is not much you can do in PostgreSQL 9.1. count() results in a table scan, because indexes do not have visibility information. To verify the rows are not deleted in the meantime, PostgreSQL has to visit the table.

If the table is read-only (or rarely updated), you could add a row number to the table. Then a query like:

SELECT rownumber+1
FROM standings
WHERE score < ?
ORDER BY score DESC
LIMIT 1;

With an index:

CREATE INDEX standings_score_idx ON standings (score DESC);

would get the result almost instantly.
However, that's not an option for a table with write load, because every insert, update, or delete would force the row numbers of other rows to be renumbered. So not for you.


The good news: one of the major new features of the upcoming PostgreSQL 9.2 is just right for you: "Covering index" or "index-only scan". I quote the 9.2 release notes here:

Allow queries to retrieve data only from indexes, avoiding heap access
(Robert Haas, Ibrar Ahmed, Heikki Linnakangas, Tom Lane)

This is often called "index-only scans" or "covering indexes". This is
possible for heap pages with exclusively all-visible tuples, as
reported by the visibility map. The visibility map was made crash-safe
as a necessary part of implementing this feature.

This blog post by Robert Haas has more details on how this affects count performance. It helps performance even with a WHERE clause, as in your case.
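
As a rough sketch of the kind of query that could benefit, assuming the standings table and the index on score from above (the value 1000 is only a placeholder score, and the alias ordinal is made up for illustration):

```sql
-- Sketch only: ordinal of a given score, counted from the top.
-- Assumes the standings table and standings_score_idx shown earlier;
-- 1000 is a placeholder value.
SELECT count(*) + 1 AS ordinal
FROM   standings
WHERE  score > 1000;

-- Index-only scans rely on the visibility map, which VACUUM keeps current.
VACUUM ANALYZE standings;

-- Check whether the planner actually chose an index-only scan (9.2 or later).
EXPLAIN ANALYZE
SELECT count(*) + 1 AS ordinal
FROM   standings
WHERE  score > 1000;
```

Whether the planner picks an index-only scan still depends on how many heap pages are marked all-visible, so on a heavily written table the gain may be smaller.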

How do you list the primary key of a SQL Server table?

-- List the column(s) that make up the primary key of a table.
SELECT Col.COLUMN_NAME
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS Tab
JOIN INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE AS Col
  ON Col.CONSTRAINT_NAME = Tab.CONSTRAINT_NAME
 AND Col.TABLE_NAME = Tab.TABLE_NAME
WHERE Tab.CONSTRAINT_TYPE = 'PRIMARY KEY'
  AND Col.TABLE_NAME = '<your table name>';
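
If the position of each column inside the key is also needed, a variant along these lines may help. It is a sketch that assumes the INFORMATION_SCHEMA.KEY_COLUMN_USAGE view, whose ORDINAL_POSITION column gives the column's position within the constraint; the aliases tc and kcu are arbitrary:

```sql
-- Sketch: primary key columns together with their position inside the key.
-- '<your table name>' is the same placeholder as in the query above.
SELECT kcu.COLUMN_NAME,
       kcu.ORDINAL_POSITION   -- position of the column within the key
FROM   INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS tc
JOIN   INFORMATION_SCHEMA.KEY_COLUMN_USAGE  AS kcu
       ON  kcu.CONSTRAINT_SCHEMA = tc.CONSTRAINT_SCHEMA
       AND kcu.CONSTRAINT_NAME   = tc.CONSTRAINT_NAME
WHERE  tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
  AND  tc.TABLE_NAME      = '<your table name>'
ORDER  BY kcu.ORDINAL_POSITION;
```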

Position of primary key in the column list

Within a page, SQL Server rearranges your columns so that all fixed-width columns are stored first and the variable-width columns last. Within both the fixed-width and the variable-width portion of the row, the columns are stored in the order in which they were declared.

When you create a table with a primary key (clustered by default) or create a clustered index, SQL Server internally arranges the rows by the sort order of that key. The columns inside each row still follow the layout described above.

Incoming rows are placed according to that key order, not according to column position.

So the ordinal position of the primary key in the column list does not matter.

Indexed ORDER BY with LIMIT 1

Assuming we are dealing with a big table, a partial index might help:

CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > '2013-09-15 0:0'::timestamp;

As you already found out: descending or ascending hardly matters here. Postgres can scan backwards at almost the same speed (exceptions apply with multi-column indices).

Query to use this index:

SELECT * FROM tbl
WHERE created_at > '2013-09-15 0:0'::timestamp -- matches index
ORDER BY created_at DESC
LIMIT 1;

The point here is to make the index much smaller, so it should be easier to cache and maintain.

  1. You need to pick a timestamp that is guaranteed to be smaller than the most recent one.
  2. You should recreate the index from time to time to cut off old data.
  3. The condition needs to be IMMUTABLE.

So the one-time effect deteriorates over time. The specific problem is the hard coded condition:

WHERE created_at > '2013-09-15 0:0'::timestamp

Automate

You could update the index and your queries manually from time to time. Or you automate it with the help of a function like this one:

CREATE OR REPLACE FUNCTION f_min_ts()
RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$$SELECT '2013-09-15 0:0'::timestamp$$;

Index:

CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > f_min_ts();

Query:

SELECT * FROM tbl
WHERE created_at > f_min_ts()
ORDER BY created_at DESC
LIMIT 1;

Automate recreation with a cron job or some trigger-based event. Your queries can stay the same now. But after changing the function you need to recreate every index that uses it, because an existing index still reflects the old cut-off. Just drop and create each one.
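
One way the periodic refresh could look, as a sketch only. The new cut-off date is a made-up example, and wrapping the steps in a single transaction is an assumption made here so that queries never see the new function together with the old index:

```sql
BEGIN;

-- Move the cut-off forward (hypothetical new value).
CREATE OR REPLACE FUNCTION f_min_ts()
  RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$$SELECT '2013-10-15 0:0'::timestamp$$;

-- Rebuild the partial index so it matches the new cut-off.
DROP INDEX tbl_created_recently_idx;
CREATE INDEX tbl_created_recently_idx ON tbl (created_at DESC)
WHERE created_at > f_min_ts();

COMMIT;
```

Note that DROP INDEX and a plain CREATE INDEX lock the table against writes until the transaction commits, so this is best scheduled for a quiet period.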

First ..

... test whether you are actually hitting the bottleneck with this.

Check whether a simple DROP INDEX ... ; CREATE INDEX ... does the job. If it does, your index might have been bloated, and your autovacuum settings may be off.

Or try VACUUM FULL ANALYZE to get your whole table plus indices in pristine condition and check again.

Other options include the usual general performance tuning and covering indexes, depending on what you actually retrieve from the table.
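
A minimal sketch of such a check, assuming the tbl table, the f_min_ts() function, and the tbl_created_recently_idx index from the examples above:

```sql
-- See which plan is chosen and how long the query really takes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tbl
WHERE  created_at > f_min_ts()
ORDER  BY created_at DESC
LIMIT  1;

-- Compare the index size before and after a rebuild to spot bloat.
SELECT pg_size_pretty(pg_relation_size('tbl_created_recently_idx'));
REINDEX INDEX tbl_created_recently_idx;
SELECT pg_size_pretty(pg_relation_size('tbl_created_recently_idx'));
```

If the size drops sharply after the rebuild, bloat was likely part of the problem. REINDEX blocks writes to the table while it runs.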

How important is the order of columns in indexes?

Look at an index like this:

Cols
  1   2   3
-------------
|   | 1 |   |
| A |---|   |
|   | 2 |   |
|---|---|   |
|   |   |   |
|   | 1 | 9 |
| B |   |   |
|   |---|   |
|   | 2 |   |
|   |---|   |
|   | 3 |   |
|---|---|   |

See how restricting on A first, as your first column, eliminates more results than restricting on your second column first? It's easier if you picture how the index must be traversed: across column 1, then column 2, and so on. Lopping off most of the results in the first pass makes the second step that much faster.

Another case, if you queried on column 3, the optimizer wouldn't even use the index, because it's not helpful at all in narrowing down the result sets. Anytime you're in a query, narrowing down the number of results to deal with before the next step means better performance.

Since the index is also stored this way, there's no backtracking across the index to find the first column when you're querying on it.

In short: No, it's not for show, there are real performance benefits.
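
To make this concrete, here is a small sketch in generic SQL. The table t, its columns, and the index name are hypothetical and only mirror the three columns in the diagram:

```sql
-- Hypothetical table and index, used only to illustrate column order.
CREATE TABLE t (col1 char(1), col2 int, col3 int);
CREATE INDEX t_col1_col2_col3_idx ON t (col1, col2, col3);

-- Restricts the leading columns: the index can be searched directly
-- for the 'A' branch and then for col2 = 2 within it.
SELECT * FROM t WHERE col1 = 'A' AND col2 = 2;

-- Restricts only the last column: the index ordering does not help
-- to narrow the search, so the optimizer will typically not seek on it.
SELECT * FROM t WHERE col3 = 9;
```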

Clustered index in SQL Server : any advantage for the columns to be first in schema?

I won't assume what type of clustered index we are talking about here, so I will try to cover all the basics. I would have to say that, logically, the impact (performance or otherwise) of the ordinal position of the columns within your table in relation to their ordinal position within the clustered index is inconsequential (unless someone out there has something to prove me wrong).

Rowstore

Keep in mind that your table data and rowstore clustered indexes end up becoming separate logical structures. Per Microsoft regarding the clustered rowstore index architecture:

indexes are organized as B-Trees. Each page in an index B-tree is called an index node. The top node of the B-tree is called the root node. The bottom nodes in the index are called the leaf nodes. Any index levels between the root and the leaf nodes are collectively known as intermediate levels. In a clustered index, the leaf nodes contain the data pages of the underlying table. The root and intermediate level nodes contain index pages holding index rows.

So when we are talking about the physical storage of both the clustered index and the table data, we can think of them as separate structures. Consider the image from the same link (not reproduced here), which shows the root, intermediate, and leaf levels of the index.

All three of these levels have at least one thing in common. They are all storing values (more or less) logically sorted by the value of your clustered index. Regardless of the ordinal position of the columns within your table structure, the leaf pages for your table data will be stored logically ordered by the columns/values within your clustered index. This is also true of your intermediate pages, which represent the storage of your clustered index values.

So all of that to say, the ordinal position of your columns within the clustered index is actually what determines how both the intermediate level and leaf pages are logically ordered, so the ordinal position of those columns within your table statement really has no impact to their storage order because of their inclusion in your clustered index.

Columnstore

Regarding clustered columnstore indexes, I would again say that it has no impact, but for a different (and simpler) reason. The columnstore index breaks up the column values into separate logical structures, which have no relation to each other by way of their ordinal position. So regardless of the column's ordinal position within the table, when you query a value from a column you are querying the separate physical structure that represents that column's values (ignoring deltastore for simplicity here). Similarly, when you query multiple columns' values, you are querying each individual logical structure that represents each column's values separately.

This is why you are not even able to specify a column list when creating a clustered columnstore index. The ordinal position of the columns within the columnstore index itself has no impact, so I'd imagine that the ordinal position of those columns within the table itself (or any relationship between the two) also has no impact.
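
To illustrate the syntax point, a minimal sketch with hypothetical table and index names (dbo.sales_fact and cci_sales_fact are made up for this example):

```sql
-- Hypothetical table, only here so the index has something to cover.
CREATE TABLE dbo.sales_fact (
    sale_id   int            NOT NULL,
    sold_at   datetime2      NOT NULL,
    amount    decimal(10, 2) NOT NULL
);

-- No column list: a clustered columnstore index always covers every column.
CREATE CLUSTERED COLUMNSTORE INDEX cci_sales_fact ON dbo.sales_fact;
```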

Heap

Lastly, should anyone else ask, even with tables stored as a heap I would still argue that the ordinal position of columns within the table has no impact on query performance. A heap has no clustered index and its rows are not kept in any guaranteed order, but it is still a rowstore structure.

Per Microsoft:

A rowstore is data that is logically organized as a table with rows and columns, and then physically stored in a row-wise data format. This has been the traditional way to store relational table data such as a heap or clustered B-tree index.

So heap rows use the same row-wise format as a table with a clustered index. The main difference is that each row is located not by a key value but by a non-business value that merely identifies where the row is stored. As described by Microsoft:

If the table is a heap, which means it does not have a clustered index, the row locator is a pointer to the row. The pointer is built from the file identifier (ID), page number, and number of the row on the page. The whole pointer is known as a Row ID (RID).

This RID is not something you would ever normally use as a predicate to a query, which is the main disadvantage (since data is made to be queried, right?). But regardless, the ordinal position of these columns within your table still has no impact to how they are actually logically sorted/stored, so I can't imagine that it could impact your query performance.

Speeding ORDER BY clause with index

Non-clustered indexes can absolutely be used to optimize away a sort. Indexes are essentially B-trees (balanced search trees), which means they contain the values sorted in order.

However, depending on the query, you may be putting SQL Server into a conundrum.

Suppose you have a table with 100 million rows and your query, like the one below, matches 11 million of them. Is it cheaper to use an index on category to select the rows and then sort the results by name, or to read all 100 million rows out of an index pre-sorted by name and then filter out 89 million of them by checking the category?

select ...
from product
where category = ?
order by name;

In theory, SQL Server may be able to use an index on name to read rows in order and use the index on category to filter efficiently. I'm skeptical. I have rarely seen SQL Server use multiple indexes to access the same table in the same query (assuming a single table selection, ignoring joins or recursive CTEs). It would have to check the index 100 million times. Indexes have a high overhead cost per index search, so they are efficient when a single search narrows down the result set by a lot.

Without seeing a schema, statistics, and exact query, it's hard for me to say what makes sense, but I expect I would find SQL Server would use an index for the where clause and sort the results, ignoring an index on the sort column.

An index on the sort column may be used if you are selecting the entire table though. Like select ... from product order by name;

Again, your mileage may vary. This is speculation based on past experience.
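
One option that is sometimes worth testing in this situation is a single composite index that leads with the filter column and then the sort column. This is a sketch under assumptions, not something derived from your schema: the index name and the literal 'books' are hypothetical, and whether the optimizer actually skips the sort depends on your statistics and on the columns you select.

```sql
-- Hypothetical index name; product, category and name come from the query above.
-- Within one category value, the index entries are already ordered by name.
CREATE NONCLUSTERED INDEX IX_product_category_name
    ON product (category, name);

SELECT name
FROM   product
WHERE  category = 'books'   -- placeholder value
ORDER  BY name;
```

Compare the execution plans with and without the index to see whether the Sort operator disappears.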

SQL - Get the index of column that has maximum value

A more general solution (i.e. N columns) to this is to Unpivot the columns into rows, and then a windowing function can be applied to obtain the group wise maximum to each set of column 'rows'. You will however need some kind of key for each row, so that the maximum can be applied in row wise fashion (to allow reassembling the original rows). I've done this by adding a surrogate Guid via newId(). Note this returns the column NAME with the highest value in each row:

WITH MyTableWithRowId AS
(
SELECT newId() AS Id, *
FROM MyTable
),
Unpivoted AS
(
SELECT Ndx, Id, col, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY col DESC) AS Rnk
FROM
MyTableWithRowId tbl
UNPIVOT
(
col for Ndx in(col1, col2, col3)
) p
)
SELECT Ndx
FROM Unpivoted
WHERE Rnk = 1


Edit, re just '1, 2, 3' not the name of the column (col1, col2, col3)

As per @Giorgi's comment, if you really want the (one based) ordinal position of the column in each row, you can join back to metadata views such as INFORMATION_SCHEMA.COLUMNS to look up the ordinal, although this would be a terribly fragile strategy IMO.

WITH MyTableWithRowId AS
(
SELECT newId() AS Id, col1, col2, col3
FROM MyTable
),
TheOrdinalPositionOfColumns AS
(
SELECT COLUMN_NAME, ORDINAL_POSITION
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'MyTable'
),
Unpivoted AS
(
SELECT Ndx, Id, col, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY col DESC) AS Rnk
FROM
MyTableWithRowId tbl
UNPIVOT
(
col for Ndx in(col1, col2, col3)
) p
)
SELECT topoc.ORDINAL_POSITION AS ColumnOrdinalPosition
FROM Unpivoted
JOIN TheOrdinalPositionOfColumns topoc ON Unpivoted.Ndx = topoc.COLUMN_NAME
WHERE Rnk = 1;



