What's The Default Window Frame for Window Functions

From Spark Gotchas

The default frame specification depends on other aspects of a given window definition:

  • if the ORDER BY clause is specified and the function accepts the frame specification, then the frame specification is defined by RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
  • otherwise the frame specification is defined by ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
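The two rules above can be observed directly; a minimal sketch using SQLite's window functions (the table and values are mine, chosen purely for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    "CREATE TABLE t(k TEXT, v INT);"
    "INSERT INTO t VALUES ('a', 1), ('a', 2), ('a', 3);"
)

# With ORDER BY: the default frame is RANGE BETWEEN UNBOUNDED PRECEDING
# AND CURRENT ROW, so SUM becomes a running total.
with_order = [r[0] for r in con.execute(
    "SELECT SUM(v) OVER (PARTITION BY k ORDER BY v) FROM t ORDER BY v")]

# Without ORDER BY: all rows are peers, so the frame is the whole
# partition (equivalent to UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).
no_order = [r[0] for r in con.execute(
    "SELECT SUM(v) OVER (PARTITION BY k) FROM t")]

print(with_order)  # [1, 3, 6]
print(no_order)    # [6, 6, 6]
```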

Understanding window function frame with RANGE mode

The documentation states:

  • In ROWS mode, the offset must yield a non-null, non-negative integer, and the option means that the frame starts or ends the specified number of rows before or after the current row.

[...]

  • In RANGE mode, these options require that the ORDER BY clause specify exactly one column. The offset specifies the maximum difference between the value of that column in the current row and its value in preceding or following rows of the frame. The data type of the offset expression varies depending on the data type of the ordering column.

So with ROWS, you will get the 5 rows before and the 6 rows after the current row. With RANGE, you will get those rows where unique1 is no more than 5 less or 6 more than the unique1 of the current row.

In your example, if you consider the first row, ROWS BETWEEN ... AND 6 FOLLOWING would include the third row, but RANGE BETWEEN ... AND 6 FOLLOWING would not, because the difference between 8 (the value of unique1 in the third row) and 0 is greater than 6.
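The ROWS-counts-rows versus RANGE-compares-values distinction can be sketched with SQLite (3.28+ is needed for RANGE with a numeric offset; the values below stand in for the unique1 column of the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    "CREATE TABLE t(unique1 INT);"
    "INSERT INTO t VALUES (0), (4), (8), (12);"
)

# ROWS counts physical rows: the frame is the current row plus the
# next two rows, regardless of their values.
rows_counts = [r[0] for r in con.execute(
    "SELECT COUNT(*) OVER (ORDER BY unique1 "
    "ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) FROM t ORDER BY unique1")]

# RANGE compares values: the frame holds rows whose unique1 is at most
# 6 greater than the current row's, so 8 falls outside the frame of the
# row where unique1 = 0 (8 - 0 > 6).
range_counts = [r[0] for r in con.execute(
    "SELECT COUNT(*) OVER (ORDER BY unique1 "
    "RANGE BETWEEN CURRENT ROW AND 6 FOLLOWING) FROM t ORDER BY unique1")]

print(rows_counts)   # [3, 3, 2, 1]
print(range_counts)  # [2, 2, 2, 1]
```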

Window functions in SQL: why are they called window functions?

Window functions operate on a range of rows defined by the OVER clause. It is like looking through a window: you can see the sky, the stars, and the moon, but not the whole sky, only the part the window frames.

I came across this beautiful definition of window functions on the Simple-Talk website; I am reproducing it below for clarity.

Original article on Simple-talk on window functions

Window (also, windowing or windowed) functions perform a calculation
over a set of rows. I like to think of “looking through the window” at
the rows that are being returned and having one last chance to perform
a calculation. The window is defined by the OVER clause which
determines if the rows are partitioned into smaller sets and if they
are ordered. In fact, if you use a window function you will always use
an OVER clause. The OVER clause is also part of the NEXT VALUE FOR
syntax required for the sequence object, but otherwise it's used with
window functions.

The OVER clause may contain a PARTITION BY option. This breaks the
rows into smaller sets. You might think that this is the same as GROUP
BY, but it’s not. When grouping, one row per unique group is returned.
When using PARTITION BY, all of the detail rows are returned along
with the calculations. If you have a window in your home that is
divided into panes, each pane is a window. When thinking about window
functions, the entire set of results is a partition, but when using
PARTITION BY, each partition can also be considered a window.
PARTITION BY is supported – and optional – for all windowing
functions.

The OVER clause may also contain an ORDER BY option. This is
independent of the ORDER BY clause of the query. Some of the functions
require ORDER BY, and it’s not supported by the others. When the order
of the rows is important when applying the calculation, the ORDER BY
is required.

Window functions may be used only in the SELECT and ORDER BY clauses
of a query. They are applied after any joining, filtering, or
grouping.
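The GROUP BY versus PARTITION BY contrast described above can be shown concretely; a small sketch using SQLite (table and values are mine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    "CREATE TABLE sales(seller TEXT, amount INT);"
    "INSERT INTO sales VALUES ('a', 10), ('a', 20), ('b', 5);"
)

# GROUP BY collapses each partition to a single row.
grouped = con.execute(
    "SELECT seller, SUM(amount) FROM sales GROUP BY seller ORDER BY seller"
).fetchall()
print(grouped)  # [('a', 30), ('b', 5)]

# PARTITION BY keeps every detail row and attaches the calculation.
windowed = con.execute(
    "SELECT seller, amount, SUM(amount) OVER (PARTITION BY seller) "
    "FROM sales ORDER BY seller, amount"
).fetchall()
print(windowed)  # [('a', 10, 30), ('a', 20, 30), ('b', 5, 5)]
```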

Understanding window function better (Snowflake)

It is all about the logical order of execution.

Qualify
In the execution order of a query, QUALIFY is therefore evaluated
after window functions are computed. Typically, a SELECT statement’s
clauses are evaluated in the order shown below:

  • From
  • Where
  • Group by
  • Having
  • Window
  • QUALIFY
  • Distinct
  • Order by
  • Limit

Window/analytical functions are evaluated after GROUP BY.

Scenario 1: Correct

SELECT
    dex.seller,
    date_trunc('month', dex.CREATED_AT_DATE) AS date_base,
    AVG(DATEDIFF('days', dex.CREATED_AT_DATE, dex.firstdate)),
    AVG(AVG(DATEDIFF('days', dex.CREATED_AT_DATE, dex.firstdate)))
        OVER (PARTITION BY dex.seller, date_base)
FROM REP_DATA_SOURCES.DEAL_EXTENDED dex
GROUP BY 1, 2;
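Scenario 1 works precisely because GROUP BY runs first and the window average is then taken over the already-aggregated rows. The same logic can be sketched as an equivalent subquery rewrite, runnable in SQLite (the table, column names, and data below are my own stand-ins, not the REP_DATA_SOURCES schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    "CREATE TABLE deals(seller TEXT, month TEXT, days INT);"
    "INSERT INTO deals VALUES ('a', '01', 2), ('a', '01', 4), ('a', '02', 6);"
)

# Inner query: GROUP BY produces one row per (seller, month).
# Outer query: the window AVG runs over those aggregated rows,
# which is what AVG(AVG(...)) OVER (...) expresses in Scenario 1.
rows = con.execute("""
    SELECT seller, month, month_avg,
           AVG(month_avg) OVER (PARTITION BY seller) AS seller_avg
    FROM (SELECT seller, month, AVG(days) AS month_avg
          FROM deals
          GROUP BY seller, month)
    ORDER BY seller, month
""").fetchall()

print(rows)  # [('a', '01', 3.0, 4.5), ('a', '02', 6.0, 4.5)]
```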

Scenario 2: Incorrect (the analytical average is taken over a value that is neither part of the GROUP BY nor wrapped in an aggregate function)

SELECT
    dex.seller,
    date_trunc('month', dex.CREATED_AT_DATE) AS date_base,
    AVG(DATEDIFF('days', dex.CREATED_AT_DATE, dex.firstdate)),
    AVG(DATEDIFF('days', dex.CREATED_AT_DATE, dex.firstdate))
        OVER (PARTITION BY dex.seller, date_base)
FROM REP_DATA_SOURCES.DEAL_EXTENDED dex
GROUP BY 1, 2;

If we skip the analytical part of the query, it is already incorrect at the GROUP BY level:

SELECT
    dex.seller,
    date_trunc('month', dex.CREATED_AT_DATE) AS date_base,
    AVG(DATEDIFF('days', dex.CREATED_AT_DATE, dex.firstdate)),
    DATEDIFF('days', dex.CREATED_AT_DATE, dex.firstdate)
FROM REP_DATA_SOURCES.DEAL_EXTENDED dex
GROUP BY 1, 2;

The expression DATEDIFF('days',dex.CREATED_AT_DATE,dex.firstdate) is:

  • not part of the GROUP BY, and
  • not an aggregated value (no aggregate function around it).

Window function does not act as expected when I use ORDER BY (PySpark)

The simple reason is that the default window frame is Window.unboundedPreceding to Window.currentRow, which means that the max is taken from the first row in the partition to the current row, NOT the last row of the partition.

This is a common gotcha. (You can replace .max() with .sum() and see what output you get; it also changes depending on how you order the partition.)
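The running-max behaviour behind this gotcha can be reproduced outside Spark; a minimal sketch using SQLite's window functions (the category/revenue values are mine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    "CREATE TABLE t(category TEXT, revenue INT);"
    "INSERT INTO t VALUES ('Tablet', 2), ('Tablet', 5), ('Tablet', 3);"
)

# Default frame with ORDER BY: MAX is a running maximum, so on an
# ascending sort it simply echoes the current row's value.
running = [r[0] for r in con.execute(
    "SELECT MAX(revenue) OVER (PARTITION BY category ORDER BY revenue) "
    "FROM t ORDER BY revenue")]

# Full-partition frame: MAX is the partition maximum on every row.
full = [r[0] for r in con.execute(
    "SELECT MAX(revenue) OVER (PARTITION BY category ORDER BY revenue "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) "
    "FROM t ORDER BY revenue")]

print(running)  # [2, 3, 5]
print(full)     # [5, 5, 5]
```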

To solve this, you can specify that you want the max of each partition to always be calculated using the full window partition, like so:

window_spec = (
    Window.partitionBy(df['CATEGORY'])
    .orderBy(df['REVENUE'])
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

revenue_difference = F.max(df['REVENUE']).over(window_spec)

df.select(
    df['CATEGORY'],
    df['REVENUE'],
    revenue_difference.alias("revenue_difference")
).show()
+----------+-------+------------------+
| CATEGORY|REVENUE|revenue_difference|
+----------+-------+------------------+
| Tablet| 6500| 6500|
| Tablet| 5500| 6500|
| Tablet| 4500| 6500|
| Tablet| 3000| 6500|
| Tablet| 2500| 6500|
| Tablet| 1500| 6500|
|Cell Phone| 6000| 6000|
|Cell Phone| 6000| 6000|
|Cell Phone| 5000| 6000|
|Cell Phone| 3000| 6000|
|Cell Phone| 3000| 6000|
+----------+-------+------------------+

Why Last_value need frame_clause to have results like First_Value - Window Function PostgreSQL

Technically, every window definition has a frame.

If no frame is specified (and the window has an ORDER BY), it defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

With that default frame, last_value() always returns the CURRENT ROW, hence UNBOUNDED FOLLOWING must be specified to get what you want.
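The effect is easy to see in any engine with standard window frames; a sketch using SQLite (values are mine):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("CREATE TABLE t(v INT); INSERT INTO t VALUES (1), (2), (3);")

# Default frame ends at CURRENT ROW, so last_value() is the current row.
default_frame = [r[0] for r in con.execute(
    "SELECT LAST_VALUE(v) OVER (ORDER BY v) FROM t ORDER BY v")]

# Extending the frame to UNBOUNDED FOLLOWING yields the true last value.
full_frame = [r[0] for r in con.execute(
    "SELECT LAST_VALUE(v) OVER (ORDER BY v "
    "ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) "
    "FROM t ORDER BY v")]

print(default_frame)  # [1, 2, 3]
print(full_frame)     # [3, 3, 3]
```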

What is window in SQL?

No, it would not be correct. A window is more general than aggregation. I think this will be clear if you look at the definition of the window frame:

over (partition by . . . 
order by . . .
range/rows . . .
)

Of course, these are not all needed, but they are part of the definition of a given window.

The PARTITION BY is equivalent to the GROUP BY keys, specifying that a given window only has rows with the same key values. So, it would be quite consistent to say that "group by processes partitions, returning one row per partition".

Note a key point here: GROUP BY also affects the number of rows in the result set, by returning only one row per partition. Window functions have no such effect on the result set, returning a "summarized" value per row.

In addition, a window is broader than just the PARTITION BY and can represent a set of rows or values relative to the current row. That is where ORDER BY and RANGE/ROW come in.

SQL Server LAST_VALUE and LEAD

The default window frame for LAST_VALUE is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; as a result, if you want all the following rows included, you need to specify the window frame explicitly. This is documented in LAST_VALUE (Transact-SQL) - Using FIRST_VALUE and LAST_VALUE in a computed expression:

The clause "RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING" is required in this example for the non-zero values to be returned in the DifferenceFromLastQuarter column, as shown below. The default range is "RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW".

As a result, without ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING, LAST_VALUE returns the current row, as that is the last row in the default window frame.

LEAD, on the other hand, doesn't use ROWS or RANGE BETWEEN, so it has access to the entire partition. Trying to use ROWS BETWEEN generates an error:

SELECT V.I,
       LEAD(V.I) OVER (ORDER BY V.I ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS L
FROM (VALUES (1), (2), (3)) V(I);

The function 'LEAD' may not have a window frame.
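LEAD's frameless, whole-partition behaviour can be sketched in SQLite, which (unlike SQL Server's hard error) simply ignores any frame on lead()/lag(); the data below is mine:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("CREATE TABLE v(i INT); INSERT INTO v VALUES (1), (2), (3);")

# LEAD looks at the following row across the whole partition;
# it takes no frame of its own.
lead_vals = [r[0] for r in con.execute(
    "SELECT LEAD(i) OVER (ORDER BY i) FROM v ORDER BY i")]

print(lead_vals)  # [2, 3, None]
```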


