Is SQL row_number order guaranteed when a CTE is referenced many times?
No, there is no guarantee that ROW_NUMBER
on a non-unique sort list returns the same sequence when a CTE is referenced multiple times. It is very likely to happen, but not guranteed, as the CTE is merely a view.
So always make the sort list unique in such a case, e.g. order by name, id
.
how to do global sorting without a unique key in Presto
You are calculating row_number
row_number() over(ORDER BY T.dt)
and ORDER BY column is always the same dt='2021-09-06'. In this case row_number has non-deterministic behavior and can assign the same numbers to different rows from run to run.
The fact that you are always getting the same results in Hive is a coincident, probably you always are running with exactly the same number of splits or even on single mapper, which runs single-threaded and producing results which look like deterministic. Presto may have different parallelism and it affects which rows are passed to the row_number first.
You can try to change something in splits configuration to force more mappers or increase the data size and you will be able to reproduce non-deterministic behavior, many mappers running in parallel on heavy loaded cluster will execute with different speed and different rows will be passed to the row_number.
To have deterministic results, you can add some columns to the ORDER BY which will determine the order of rows. If you have no such columns, then it means that you can have any number of full duplicates.
Even if you do not have unique key, row_number will produce deterministic results if ALL columns are in the order by.
Consider this dataset:
Col1 Col2 Col3
1 1 2
1 1 2
1 1 3
1 1 3
row_number() over(ORDER BY col1) as rn
can produce all 4 rows ordered differently each run (let's suppose the dataset is very big one and there are many mappers are running concurrently, some mappers can finish faster, some can fail and restart). Of course, if you have such a small dataset and always processing it in single process, single threaded, the result will be the same, but in general, this is not how databases work.
The same about row_number() over(ORDER BY col1, col2)
But in case of row_number() over(ORDER BY col1, col2, col3)
- you will always get the same dataset, guaranteed.
So, the solution is to use as much order by columns as needed to determine the order of rows. In the worst case if you have full duplicates, all columns should be added to the ORDER BY, duplicates will be ordered together and the result will be deterministic.
ROW_NUMBER Without ORDER BY
There is no need to worry about specifying constant in the ORDER BY
expression. The following is quoted from the Microsoft SQL Server 2012 High-Performance T-SQL Using Window Functions written by Itzik Ben-Gan
(it was available for free download from Microsoft free e-books site):
As mentioned, a window order clause is mandatory, and SQL Server
doesn’t allow the ordering to be based on a constant—for example,
ORDER BY NULL. But surprisingly, when passing an expression based on a
subquery that returns a constant—for example, ORDER BY (SELECT
NULL)—SQL Server will accept it. At the same time, the optimizer
un-nests, or expands, the expression and realizes that the ordering is
the same for all rows. Therefore, it removes the ordering requirement
from the input data. Here’s a complete query demonstrating this
technique:
SELECT actid, tranid, val,
ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
FROM dbo.Transactions;
Observe in the properties of the Index Scan iterator that the Ordered
property is False, meaning that the iterator is not required to return
the data in index key order
The above means that when you are using constant ordering is not performed. I will strongly recommend to read the book as Itzik Ben-Gan
describes in depth how the window functions are working and how to optimize various of cases when they are used.
SQL RANK() versus ROW_NUMBER()
ROW_NUMBER : Returns a unique number for each row starting with 1. For rows that have duplicate values,numbers are arbitarily assigned.
Rank : Assigns a unique number for each row starting with 1,except for rows that have duplicate values,in which case the same ranking is assigned and a gap appears in the sequence for each duplicate ranking.
Oracle SQL: Is ORDER BY non unique field deterministic?
No. SQL in general and Oracle in particular do not guarantee stable sorts. That is, you can run the same query twice and get different ordering -- when the keys have ties.
This is because SQL tables (and result sets) represent unordered sets. Hence, there is no "natural" ordering to fall back on. In general, it is a good idea to include additional keys in the order by
to make the sort stable.
EDIT:
I want to add one more thought. Your example is for overall sorting in a query, where the issue is a bit abstract -- that is, any given run of the query is going to look correct. It becomes a bigger issue with window functions. So, it is possible that:
select v.*, row_number() over (order by sortingcol) as col1,
row_number() over (order by sortingcol desc) as col2
from myview v
would yield inconsistent results. Under normal circumstances, we would expect col1 + col2
to be constant. However, with ties, this is often not going to be the case. This affects row_number()
and keep
. It does not affect rank()
or dense_rank()
which handle ties consistently.
does result of ORDER BY is always same (SQL Server 2005,2008)?
As well as the undefined order of results mentioned by Panagiotis
The result of ROW_NUMBER
is undeterministic in the event of ties. You would need to add a tie breaker of a unique column to the ORDER BY
so that the two rows with Number=3
have a deterministic numbering applied.
Assuming Id
is unique the following would be deterministic
SELECT I.Id,ROW_NUMBER()OVER(ORDER BY I.Number,I.Id) RN
FROM Item I
ORDER BY RN
SQL: Row number is different when sorting on columns with null values
You have specified an ordering criterion that is not a total order. Example: You order on a column that is always zero. That way the CTE can output a different row order each time.
You are filtering after ordering. That means your filter runs on different rows each time. It might happen to have a lot of matching rows or not.
In general SQL queries are not 100% deterministic thanks to certain constructs. There are more than this one.
Fix: Specify a total order. Use anything to break ties such as ORDER BY X, ID
. As a habit I always specify a total order.
Related Topics
What Is the Resource Impact from Normalizing a Database
Converting Aggregate Operators from SQL to Relational Algebra
Ms SQL Server Last Inserted Id
Unique Constraint Controlled by a Bit Column
Passing SQL Stored Procedure Entirety of Where Clause
Accessing Column Alias in Postgresql
Trouble Making a Running Sum in Access Query
Want a Stored Procedure for Comma Seperated String Which Is of a Column (Has 20000 Rows ) in a Table
Is It Better to Do an Equi Join in the from Clause or Where Clause
SQL Server: Only Last Entry in Group By
SQL Server Convert Datetime into Another Timezone
Parse a Date from Unformatted Text in SQL
Reverse- Geocoding: How to Determine the City Closest to a (Lat,Lon) with Bigquery SQL
Converting Delimited String to Multiple Values in MySQL