Is Order by and Row_Number() Deterministic

Is SQL row_number order guaranteed when a CTE is referenced many times?

No, there is no guarantee that ROW_NUMBER on a non-unique sort list returns the same sequence when a CTE is referenced multiple times. It is very likely to happen, but not guranteed, as the CTE is merely a view.

So always make the sort list unique in such a case, e.g. order by name, id.

how to do global sorting without a unique key in Presto

You are calculating row_number

row_number() over(ORDER BY T.dt)

and ORDER BY column is always the same dt='2021-09-06'. In this case row_number has non-deterministic behavior and can assign the same numbers to different rows from run to run.

The fact that you are always getting the same results in Hive is a coincident, probably you always are running with exactly the same number of splits or even on single mapper, which runs single-threaded and producing results which look like deterministic. Presto may have different parallelism and it affects which rows are passed to the row_number first.

You can try to change something in splits configuration to force more mappers or increase the data size and you will be able to reproduce non-deterministic behavior, many mappers running in parallel on heavy loaded cluster will execute with different speed and different rows will be passed to the row_number.

To have deterministic results, you can add some columns to the ORDER BY which will determine the order of rows. If you have no such columns, then it means that you can have any number of full duplicates.

Even if you do not have unique key, row_number will produce deterministic results if ALL columns are in the order by.

Consider this dataset:

Col1 Col2 Col3
1 1 2
1 1 2
1 1 3
1 1 3

row_number() over(ORDER BY col1) as rn can produce all 4 rows ordered differently each run (let's suppose the dataset is very big one and there are many mappers are running concurrently, some mappers can finish faster, some can fail and restart). Of course, if you have such a small dataset and always processing it in single process, single threaded, the result will be the same, but in general, this is not how databases work.

The same about row_number() over(ORDER BY col1, col2)

But in case of row_number() over(ORDER BY col1, col2, col3) - you will always get the same dataset, guaranteed.

So, the solution is to use as much order by columns as needed to determine the order of rows. In the worst case if you have full duplicates, all columns should be added to the ORDER BY, duplicates will be ordered together and the result will be deterministic.

ROW_NUMBER Without ORDER BY

There is no need to worry about specifying constant in the ORDER BY expression. The following is quoted from the Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions written by Itzik Ben-Gan (it was available for free download from Microsoft free e-books site):

As mentioned, a window order clause is mandatory, and SQL Server
doesn’t allow the ordering to be based on a constant—for example,
ORDER BY NULL. But surprisingly, when passing an expression based on a
subquery that returns a constant—for example, ORDER BY (SELECT
NULL)—SQL Server will accept it. At the same time, the optimizer
un-nests, or expands, the expression and realizes that the ordering is
the same for all rows. Therefore, it removes the ordering requirement
from the input data. Here’s a complete query demonstrating this
technique:

SELECT actid, tranid, val,
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM dbo.Transactions;

Sample Image

Observe in the properties of the Index Scan iterator that the Ordered
property is False, meaning that the iterator is not required to return
the data in index key order


The above means that when you are using constant ordering is not performed. I will strongly recommend to read the book as Itzik Ben-Gan describes in depth how the window functions are working and how to optimize various of cases when they are used.

SQL RANK() versus ROW_NUMBER()

ROW_NUMBER : Returns a unique number for each row starting with 1. For rows that have duplicate values,numbers are arbitarily assigned.

Rank : Assigns a unique number for each row starting with 1,except for rows that have duplicate values,in which case the same ranking is assigned and a gap appears in the sequence for each duplicate ranking.

Oracle SQL: Is ORDER BY non unique field deterministic?

No. SQL in general and Oracle in particular do not guarantee stable sorts. That is, you can run the same query twice and get different ordering -- when the keys have ties.

This is because SQL tables (and result sets) represent unordered sets. Hence, there is no "natural" ordering to fall back on. In general, it is a good idea to include additional keys in the order by to make the sort stable.

EDIT:

I want to add one more thought. Your example is for overall sorting in a query, where the issue is a bit abstract -- that is, any given run of the query is going to look correct. It becomes a bigger issue with window functions. So, it is possible that:

select v.*, row_number() over (order by sortingcol) as col1,
row_number() over (order by sortingcol desc) as col2
from myview v

would yield inconsistent results. Under normal circumstances, we would expect col1 + col2 to be constant. However, with ties, this is often not going to be the case. This affects row_number() and keep. It does not affect rank() or dense_rank() which handle ties consistently.

does result of ORDER BY is always same (SQL Server 2005,2008)?

As well as the undefined order of results mentioned by Panagiotis

The result of ROW_NUMBER is undeterministic in the event of ties. You would need to add a tie breaker of a unique column to the ORDER BY so that the two rows with Number=3 have a deterministic numbering applied.

Assuming Id is unique the following would be deterministic

SELECT I.Id,ROW_NUMBER()OVER(ORDER BY I.Number,I.Id) RN
FROM Item I
ORDER BY RN

SQL: Row number is different when sorting on columns with null values

You have specified an ordering criterion that is not a total order. Example: You order on a column that is always zero. That way the CTE can output a different row order each time.

You are filtering after ordering. That means your filter runs on different rows each time. It might happen to have a lot of matching rows or not.

In general SQL queries are not 100% deterministic thanks to certain constructs. There are more than this one.

Fix: Specify a total order. Use anything to break ties such as ORDER BY X, ID. As a habit I always specify a total order.



Related Topics



Leave a reply



Submit