How Does Select Top Works When No Order by Is Specified

how does SELECT TOP works when no order by is specified?

There is no guarantee which two rows you get. It will just be the first two retrieved from the table scan.

The TOP iterator in the execution plan will stop requesting rows once two have been returned.

Likely for a scan of a heap this will be the first two rows in allocation order but this is not guaranteed. For example SQL Server might use the advanced scanning feature which means that your scan will read pages recently read from another concurrent scan.

When no 'Order by' is specified, what order does a query choose for your record set?

If you don't specify an ORDER BY, then there is NO ORDER defined.

The results can be returned in an arbitrary order - and that might change over time, too.

There is no "natural order" or anything like that in a relational database (at least in all that I know of). The only way to get a reliable ordering is by explicitly specifying an ORDER BY clause.

Update: for those who still don't believe me - here's two excellent blog posts that illustrate this point (with code samples!) :

  • Conor Cunningham (Architect on the Core SQL Server Engine team): No Seatbelt - Expecting Order without ORDER BY
  • Alexander Kuznetsov: Without ORDER BY, there is no default sort order (post in the Web Archive)

Unexpected SQL Behaviour with SELECT TOP Query

A TOP without ORDER BY is unpredictable. This is documented by Microsoft. From Microsoft docs:

When you use TOP with the ORDER BY clause, the result set is limited to the first N number of ordered rows. Otherwise, TOP returns the first N number of rows in an undefined order.

...

In a SELECT statement, always use an ORDER BY clause with the TOP clause. Because, it's the only way to predictably indicate which rows are affected by TOP.

See also how does SELECT TOP works when no order by is specified?

Why does SELECT TOP 1 . . . ORDER BY return the second row in the table?

In relational databases tables have no inherent order. The ORDER BY you give is not distinct over all records, in fact it's the same over all records. So the order in which results are returned is still not deterministic and unpredictable. And therefor the top 1 returns an unpredictable row.

You say "Adam's details are first in the table", this is simply not true; records in a table are stored without any order. If you select without an order by or (as in your case) the order by is not deterministic the returned order is arbitrary.

The order of a SQL Select statement without Order By clause

No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be in primary key order or the order they were created, or some other random order. more complex queries, such as select * from foo where bar < 10 may instead be returned in order of a different column, based on an index read, or by the table order, for a table scan. even more elaborate queries, with multipe where conditions, group by clauses, unions, will be in whatever order the planner decides is most efficient to generate.

The order could even change between two identical queries just because of data that has changed between those queries. a "where" clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to perform a subsequent query using a table scan.


To put a finer point on it. RDBMS systems have the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing IO (both to disk as well as over the network to send data to you), minimizing CPU and keeping the size of its working set small (using methods that require minimal temporary storage).

without an ORDER BY clause, you will have not asked exactly for a particular order, and so the RDBMS will give you those rows in some order that (maybe) corresponds with some coincidental aspect of the query, based on whichever algorithm the RDBMS expects to produce the data the fastest.

If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.

Since you actually care about BOTH use ORDER BY and then carefully tune your query and database so that it is efficient.

Why is there no `select last` or `select bottom` in SQL Server like there is `select top`?

You can think of it like this.

SELECT TOP N without ORDER BY returns some N rows, neither first, nor last, just some. Which rows it returns is not defined. You can run the same statement 10 times and get 10 different sets of rows each time.

So, if the server had a syntax SELECT LAST N, then result of this statement without ORDER BY would again be undefined, which is exactly what you get with existing SELECT TOP N without ORDER BY.


You have stressed in your question that you know and understand what I've written below, but I'll still keep it to make it clear for everyone reading this later.

Your first phrase in the question

In SQL-server we have SELECT TOP N ... now in that we can get the
first n rows in ascending order (by default), cool.

is not correct. With SELECT TOP N without ORDER BY you get N "random" rows. Well, not really random, the server doesn't jump randomly from row to row on purpose. It chooses some deterministic way to scan through the table, but there could be many different ways to scan the table and server is free to change the chosen path when it wants. This is what is meant by "undefined".

The server doesn't track the order in which rows were inserted into the table, so again your assumption that results of SELECT TOP N without ORDER BY are determined by the order in which rows were inserted in the table is not correct.


So, the answer to your final question

why no select last/bottom like it's counterpart.

is:

  • without ORDER BY results of SELECT LAST N would be exactly the same as results of SELECT TOP N - undefined.
  • with ORDER BY result of SELECT LAST N ... ORDER BY X ASC is exactly the same as result of SELECT TOP N ... ORDER BY X DESC.

So, there is no point to have two key words that do the same thing.


There is a good point in the Pieter's answer: the word TOP is somewhat misleading. It really means LIMIT result set to some number of rows.

By the way, since SQL Server 2012 they added support for ANSI standard OFFSET:

OFFSET { integer_constant | offset_row_count_expression } { ROW | ROWS }
[
FETCH { FIRST | NEXT } {integer_constant | fetch_row_count_expression } { ROW | ROWS } ONLY
]

Here adding another key word was justified that it is ANSI standard AND it adds important functionality - pagination, which didn't exist before.


I would like to thank @Razort4x here for providing a very good link to MSDN in his question. The "Advanced Scanning" section there has an excellent example of mechanism called "merry-go-round scanning", which demonstrates why the order of the results returned from a SELECT statement cannot be guaranteed without an ORDER BY clause.

This concept is often misunderstood and I've seen many question here on SO that would greatly benefit if they had a quote from that link.


The answer to your question

Why doesn't SQL Server have a SELECT LAST or say SELECT BOTTOM or
something like that, where we don't have to specify the ORDER BY and
then it would give the last record inserted in the table at the time
of executing the query
(again I am not going into details about how
would this result in case of uncommitted reads or phantom reads).

is:

The devil is in the details that you want to omit. To know which record was the "last inserted in the table at the time of executing the query" (and to know this in a somewhat consistent/non-random manner) the server would need to keep track of this information somehow. Even if it is possible in all scenarios of multiple simultaneously running transactions, it is most likely costly from the performance point of view. Not every SELECT would request this information (in fact very few or none at all), but the overhead of tracking this information would always be there.

So, you can think of it like this: by default the server doesn't do anything specific to know/keep track of the order in which the rows were inserted, because it affects performance, but if you need to know that you can use, for example, IDENTITY column. Microsoft could have designed the server engine in such a way that it required an IDENTITY column in every table, but they made it optional, which is good in my opinion. I know better than the server which of my tables need IDENTITY column and which do not.

Summary

I'd like to summarise that you can look at SELECT LAST without ORDER BY in two different ways.

1) When you expect SELECT LAST to behave in line with existing SELECT TOP. In this case result is undefined for both LAST and TOP, i.e. result is effectively the same. In this case it boils down to (not) having another keyword. Language developers (T-SQL language in this case) are always reluctant to add keywords, unless there are good reasons for it. In this case it is clearly avoidable.

2) When you expect SELECT LAST to behave as SELECT LAST INSERTED ROW. Which should, by the way, extend the same expectations to SELECT TOP to behave as SELECT FIRST INSERTED ROW or add new keywords LAST_INSERTED, FIRST_INSERTED to keep existing keyword TOP intact. In this case it boils down to the performance and added overhead of such behaviour. At the moment the server allows you to avoid this performance penalty if you don't need this information. If you do need it IDENTITY is a pretty good solution if you use it carefully.

SELECT TOP N with UNION and ORDER BY

A Union query works thus: execute the queries, then apply the order by clause. So with

SELECT TOP 5 [ID], [Description], [Inactive]
FROM #T1
UNION ALL
SELECT TOP 5 [ID], [Description], [Inactive]
FROM #T2
ORDER BY [Inactive], [Description];

you select five arbitrarily chosen records from #T1 plus five arbitrarily chosen records from #T2 and then you order these. So you need subqueries or with clauses. E.g.:

SELECT * FROM
(
(
SELECT TOP 5 [ID], [Description], [Inactive]
FROM #T1
ORDER BY [Inactive], [Description]
)
UNION ALL
(
SELECT TOP 5 [ID], [Description], [Inactive]
FROM #T2
ORDER BY [Inactive], [Description]
)
) t;

So your workaround is not a workaround at all, but the proper query.



Related Topics



Leave a reply



Submit