What's the default window frame for window functions
From Spark Gotchas
Default frame specification depends on other aspects of a given window definition:
- if the ORDER BY clause is specified and the function accepts the frame specification, then the frame specification is defined by RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
- otherwise the frame specification is defined by ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
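Although the rule above is about Spark, the same defaults are common across engines, so it is easy to see the difference in any SQL with window-function support. A minimal sketch using Python's built-in sqlite3 (assumes SQLite 3.25+, which shares these frame defaults):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (grp TEXT, x INT);
INSERT INTO t VALUES ('a', 1), ('a', 2), ('a', 3);
""")

rows = conn.execute("""
SELECT x,
       -- ORDER BY present: default frame is
       -- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -> running sum
       SUM(x) OVER (PARTITION BY grp ORDER BY x) AS running_sum,
       -- no ORDER BY: the frame covers the whole partition -> grand total
       SUM(x) OVER (PARTITION BY grp)            AS total_sum
FROM t
ORDER BY x
""").fetchall()
print(rows)  # [(1, 1, 6), (2, 3, 6), (3, 6, 6)]
```

The toy table and column names are illustrative only; the point is that merely adding ORDER BY inside OVER changes the default frame, and with it the result.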
Understanding window function frame with RANGE mode
The documentation states:
- In ROWS mode, the offset must yield a non-null, non-negative integer, and the option means that the frame starts or ends the specified number of rows before or after the current row. [...]
- In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The offset specifies the maximum difference between the value of that column in the current row and its value in preceding or following rows of the frame. The data type of the offset expression varies depending on the data type of the ordering column.
(The emphasis is mine.)
So with ROWS, you will get the 5 rows before and the 6 rows after the current row. With RANGE, you will get those rows where unique1 is no more than 5 less or 6 more than the unique1 of the current row.
In your example, if you consider the first row, ROWS BETWEEN ... AND 6 FOLLOWING would include the third row, but RANGE BETWEEN ... AND 6 FOLLOWING would not, because the difference between 8 (the value of unique1 in the third row) and 0 is greater than 6.
Window functions in SQL: why are they called window functions?
Window functions operate on a range of rows defined by the OVER clause. It is as if you are looking through a window: you can see the sky, the stars, and the moon, but not the whole sky, only the part the window frames.
I came across this beautiful definition of window functions on the Simple-Talk website. I am reproducing it below for clarity.
Original article on Simple-talk on window functions
Window (also, windowing or windowed) functions perform a calculation over a set of rows. I like to think of “looking through the window” at the rows that are being returned and having one last chance to perform a calculation. The window is defined by the OVER clause, which determines if the rows are partitioned into smaller sets and if they are ordered. In fact, if you use a window function you will always use an OVER clause. The OVER clause is also part of the NEXT VALUE FOR syntax required for the sequence object, but, otherwise, it’s used with window functions.

The OVER clause may contain a PARTITION BY option. This breaks the rows into smaller sets. You might think that this is the same as GROUP BY, but it’s not. When grouping, one row per unique group is returned. When using PARTITION BY, all of the detail rows are returned along with the calculations. If you have a window in your home that is divided into panes, each pane is a window. When thinking about window functions, the entire set of results is a partition, but when using PARTITION BY, each partition can also be considered a window. PARTITION BY is supported – and optional – for all windowing functions.

The OVER clause may also contain an ORDER BY option. This is independent of the ORDER BY clause of the query. Some of the functions require ORDER BY, and it’s not supported by the others. When the order of the rows is important when applying the calculation, the ORDER BY is required.

Window functions may be used only in the SELECT and ORDER BY clauses of a query. They are applied after any joining, filtering, or grouping.
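The GROUP BY versus PARTITION BY contrast described above is easy to see side by side. A minimal sketch using SQLite through Python's sqlite3, with an invented orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, amount INT);
INSERT INTO orders VALUES ('a', 10), ('a', 20), ('b', 5);
""")

# GROUP BY collapses the rows: one row per unique group.
grouped = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(grouped)   # [('a', 30), ('b', 5)]

# PARTITION BY keeps every detail row and attaches the group total to each.
windowed = conn.execute("""
    SELECT customer, amount,
           SUM(amount) OVER (PARTITION BY customer) AS group_total
    FROM orders
    ORDER BY customer, amount
""").fetchall()
print(windowed)  # [('a', 10, 30), ('a', 20, 30), ('b', 5, 5)]
```

Same aggregate, same key: grouping returns two rows, windowing returns all three with the calculation alongside.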
Understanding window function better (Snowflake)
It is all about logical order of execution.
Qualify
In the execution order of a query, QUALIFY is therefore evaluated
after window functions are computed. Typically, a SELECT statement’s
clauses are evaluated in the order shown below:
- From
- Where
- Group by
- Having
- Window
- QUALIFY
- Distinct
- Order by
- Limit
Window/analytical functions are evaluated after GROUP BY.
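QUALIFY is Snowflake-specific, but the evaluation-order idea it encodes is portable: filter on a window function's result only after the window step has run. In engines without QUALIFY, the standard workaround is a subquery, sketched here with SQLite via Python's sqlite3 (toy table and names assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (seller TEXT, amount INT);
INSERT INTO sales VALUES ('a', 10), ('a', 30), ('b', 20), ('b', 5);
""")

# Snowflake equivalent:
#   SELECT seller, amount FROM sales
#   QUALIFY ROW_NUMBER() OVER (PARTITION BY seller ORDER BY amount DESC) = 1
# The subquery forces the window function to be computed first,
# then the outer WHERE filters on its result.
rows = conn.execute("""
SELECT seller, amount FROM (
  SELECT seller, amount,
         ROW_NUMBER() OVER (PARTITION BY seller ORDER BY amount DESC) AS rn
  FROM sales
)
WHERE rn = 1
ORDER BY seller
""").fetchall()
print(rows)  # [('a', 30), ('b', 20)]
```

This is exactly why QUALIFY sits after "Window" in the order above: it cannot filter on rn until rn exists.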
Scenario 1: Correct
SELECT
dex.seller,
date_trunc('month',dex.CREATED_AT_DATE) AS date_base,
AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)),
AVG(AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)))
OVER (PARTITION BY dex.seller, date_base)
FROM REP_DATA_SOURCES.DEAL_EXTENDED dex
GROUP BY 1,2;
Scenario 2: Incorrect (analytical average over value that is not part of group by or wrapped with aggregate function)
SELECT
dex.seller,
date_trunc('month',dex.CREATED_AT_DATE) AS date_base,
AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)),
AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)) OVER (PARTITION BY dex.seller,date_base)
FROM REP_DATA_SOURCES.DEAL_EXTENDED dex
GROUP BY 1,2;
If we skip the analytical part of the query, it is incorrect by itself at the GROUP BY level:
SELECT
dex.seller,
date_trunc('month',dex.CREATED_AT_DATE) AS date_base,
AVG(DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)),
DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate)
FROM REP_DATA_SOURCES.DEAL_EXTENDED dex
GROUP BY 1,2;
The expression DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate) is invalid here because it is:
- not part of the GROUP BY, and
- not an aggregated value (no aggregate function around it).
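Scenario 1 works because the window function's argument is itself an aggregate, so it is well-defined on the rows that exist after grouping. The same logic, written with an explicit subquery so it runs on SQLite (a stand-in for the Snowflake tables above, with an invented deals table of made-up lag values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE deals (seller TEXT, month TEXT, lag_days INT);
INSERT INTO deals VALUES
  ('a', '2024-01', 2), ('a', '2024-01', 4),
  ('a', '2024-02', 6), ('b', '2024-01', 10);
""")

# Step 1 (GROUP BY): one row per (seller, month) holding the aggregate.
# Step 2 (window):   a window function over those already-grouped rows.
# This mirrors Scenario 1, where AVG(...) inside AVG(...) OVER (...) is
# legal precisely because the inner AVG collapses each group first.
rows = conn.execute("""
SELECT seller, month, avg_lag,
       AVG(avg_lag) OVER (PARTITION BY seller) AS avg_over_months
FROM (
  SELECT seller, month, AVG(lag_days) AS avg_lag
  FROM deals
  GROUP BY seller, month
)
ORDER BY seller, month
""").fetchall()
print(rows)
# [('a', '2024-01', 3.0, 4.5), ('a', '2024-02', 6.0, 4.5),
#  ('b', '2024-01', 10.0, 10.0)]
```

Scenario 2 fails because there is no such inner aggregate: after grouping, the bare DATEDIFF expression no longer corresponds to any single row.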
Window function acts not as expected when I use Order By (PySpark)
The simple reason is that the default window range/row spec is Window.unboundedPreceding to Window.currentRow, which means that the max is taken from the first row in that partition to the current row, NOT the last row of the partition.
This is a common gotcha. (You can replace .max() with .sum() and see what output you get. It also changes depending on how you order the partition.)
To solve this, you can specify that you want the max of each partition to always be calculated using the full window partition, like so:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

window_spec = (
    Window.partitionBy(df['CATEGORY'])
    .orderBy(df['REVENUE'])
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)
revenue_difference = F.max(df['REVENUE']).over(window_spec)
df.select(
    df['CATEGORY'],
    df['REVENUE'],
    revenue_difference.alias("revenue_difference")
).show()
+----------+-------+------------------+
| CATEGORY|REVENUE|revenue_difference|
+----------+-------+------------------+
| Tablet| 6500| 6500|
| Tablet| 5500| 6500|
| Tablet| 4500| 6500|
| Tablet| 3000| 6500|
| Tablet| 2500| 6500|
| Tablet| 1500| 6500|
|Cell Phone| 6000| 6000|
|Cell Phone| 6000| 6000|
|Cell Phone| 5000| 6000|
|Cell Phone| 3000| 6000|
|Cell Phone| 3000| 6000|
+----------+-------+------------------+
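The gotcha is not Spark-specific; the same default frame kicks in wherever ORDER BY appears inside OVER. A sketch reproducing it with SQLite via Python's sqlite3, using a cut-down version of the sales data above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (category TEXT, revenue INT);
INSERT INTO sales VALUES ('Tablet', 1500), ('Tablet', 4500), ('Tablet', 6500);
""")

rows = conn.execute("""
SELECT revenue,
       -- with ORDER BY and no frame clause, the default frame stops at the
       -- current row, so this is a RUNNING max, not the partition max
       MAX(revenue) OVER (PARTITION BY category ORDER BY revenue) AS running_max,
       -- explicit full-partition frame, the equivalent of the PySpark fix
       MAX(revenue) OVER (PARTITION BY category ORDER BY revenue
                          ROWS BETWEEN UNBOUNDED PRECEDING
                               AND UNBOUNDED FOLLOWING) AS partition_max
FROM sales
ORDER BY revenue
""").fetchall()
print(rows)
# [(1500, 1500, 6500), (4500, 4500, 6500), (6500, 6500, 6500)]
```

Without the explicit frame, the "max" climbs row by row; with it, every row sees the true partition maximum, matching the PySpark output table above.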
Why Last_value need frame_clause to have results like First_Value - Window Function PostgreSQL
Technically, all window definitions are supposed to have a frame (RANGE) clause.
The custom is that if no RANGE is specified, then it is assumed to be RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
With that default RANGE, last_value() is always the CURRENT ROW, hence UNBOUNDED FOLLOWING must be specified to get what you want.
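The asymmetry between first_value() and last_value() under the default frame can be sketched in SQLite through Python's sqlite3 (the same default frame applies; table and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

rows = conn.execute("""
SELECT x,
       FIRST_VALUE(x) OVER w_default AS first_def,
       -- the default frame ends at CURRENT ROW, so LAST_VALUE(x) is just x
       LAST_VALUE(x)  OVER w_default AS last_def,
       -- with UNBOUNDED FOLLOWING, LAST_VALUE sees the real last row
       LAST_VALUE(x)  OVER (ORDER BY x
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                 AND UNBOUNDED FOLLOWING) AS last_full
FROM t
WINDOW w_default AS (ORDER BY x)
ORDER BY x
""").fetchall()
print(rows)  # [(1, 1, 1, 3), (2, 1, 2, 3), (3, 1, 3, 3)]
```

FIRST_VALUE is unaffected because the frame always begins at UNBOUNDED PRECEDING; it is only the frame's end that the default truncates.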
What is window in SQL?
No, it would not be correct. A window is more general than aggregation. I think this will be clear if you look at the definition of the window frame:
over (partition by . . .
order by . . .
range/rows . . .
)
Of course, these are not all needed, but they are part of the definition of a given window.
The PARTITION BY is equivalent to the GROUP BY keys, specifying that a given window only has rows with the same key values. So, it would be quite consistent to say that "GROUP BY processes partitions, returning one row per partition".
Note a key point here: GROUP BY also affects the number of rows in the result set, by returning only one row per partition. Window functions have no such effect on the result set, returning a "summarized" value per row.
In addition, a window is broader than just the PARTITION BY and can represent a set of rows or values relative to the current row. That is where ORDER BY and RANGE/ROWS come in.
SQL Server LAST_VALUE and LEAD
The default window frame for LAST_VALUE is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; as a result, if you want all of the following rows included, you need to use an explicit window frame. This is documented in LAST_VALUE (Transact-SQL) - Using FIRST_VALUE and LAST_VALUE in a computed expression:
The clause "RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING" is required in this example for the non-zero values to be returned in the DifferenceFromLastQuarter column, as shown below. The default range is "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW".
As a result, without ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING you would get the current row with LAST_VALUE, as that is the last row in the default window frame.
LEAD, on the other hand, doesn't use ROWS or RANGE BETWEEN, so it has access to the entire scope of the dataset. Trying to use ROWS BETWEEN will generate an error:
SELECT V.I,
LEAD(V.I) OVER (ORDER BY V.I ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS L
FROM (VALUES(1),(2),(3))V(I);
The function 'LEAD' may not have a window frame.