Difference between rowsBetween and rangeBetween

Difference between ROWS BETWEEN and RANGE BETWEEN in (Presto) window function OVER clause

  • ROWS literally counts the number of rows before and after the current row that you want to aggregate. So ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ends up with 3 rows: the current row, 1 row before, and 1 row after, regardless of the value of orderdate.
  • RANGE looks at the values of orderdate and decides which rows should be aggregated and which should not. So ORDER BY day RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING would theoretically take all rows with orderdate values of orderdate-1, orderdate and orderdate+1 - this can be more than 3 rows (see more explanations here).

In Presto, ROWS is fully implemented, but RANGE is only partially implemented: you can only use it with CURRENT ROW and UNBOUNDED.

NOTE: Recent versions of Trino (formerly known as Presto SQL) have full
support for RANGE and GROUPS framing. See this blog post for
an explanation of how they work.
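
For example, in a recent Trino version a value-based RANGE frame and a GROUPS frame can be written roughly like this (a sketch only; the orders table, its columns, and the interval literal are assumptions to illustrate the syntax):

-- Value-based frame: every row whose orderdate lies within 1 day of the current row's date
SELECT orderdate,
       array_agg(product) OVER (
           ORDER BY orderdate
           RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND INTERVAL '1' DAY FOLLOWING
       ) AS range_frame,
       -- Group-based frame: one peer group (rows sharing the same orderdate) before and after
       array_agg(product) OVER (
           ORDER BY orderdate
           GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
       ) AS groups_frame
FROM orders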

The best way to see the difference between the two in Presto is to make sure you have duplicate values in the ORDER BY column:

WITH
tt1 (custkey, orderdate, product) AS
( SELECT * FROM ( VALUES ('a','1992-07-10', 3), ('a','1993-08-10', 4), ('a','1994-07-13', 5), ('a','1995-09-13', 5), ('a','1995-09-13', 9), ('a','1997-01-13', 4),
('b','1992-07-10', 6), ('b','1992-07-10', 4), ('b','1994-07-13', 5), ('b','1994-07-13', 9), ('b','1998-11-11', 9) ) )

SELECT *,
array_agg(product) OVER (partition by custkey) c,
array_agg(product) OVER (partition by custkey order by orderdate) c_order,

array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) range_ubub,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rows_ubub,

array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) range_ubc,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) rows_ubc,

array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) range_cub,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) rows_cub,

-- array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING) range22,
-- SYNTAX_ERROR: line 19:65: Window frame RANGE PRECEDING is only supported with UNBOUNDED
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) rows22

from tt1
order by custkey, orderdate, product

You can run it, see the full results, and learn from them.

I'll show only some of the interesting columns here:

custkey  orderdate   product  range_ubc           rows_ubc
a        10/07/1992  3        [3]                 [3]
a        10/08/1993  4        [3, 4]              [3, 4]
a        13/07/1994  5        [3, 4, 5]           [3, 4, 5]
a        13/09/1995  5        [3, 4, 5, 5, 9]     [3, 4, 5, 5]
a        13/09/1995  9        [3, 4, 5, 5, 9]     [3, 4, 5, 5, 9]
a        13/01/1997  4        [3, 4, 5, 5, 9, 4]  [3, 4, 5, 5, 9, 4]
b        10/07/1992  4        [6, 4]              [6, 4]
b        10/07/1992  6        [6, 4]              [6]
b        13/07/1994  5        [6, 4, 5, 9]        [6, 4, 5]
b        13/07/1994  9        [6, 4, 5, 9]        [6, 4, 5, 9]
b        11/11/1998  9        [6, 4, 5, 9, 9]     [6, 4, 5, 9, 9]

If you look at the row with orderdate 13/09/1995 and product 5 (note that 13/09/1995 appears twice for custkey a), you can see that ROWS indeed took all rows from the top down to the current row. But if you look at RANGE, you see it also includes the value from the following row, because that row has exactly the same orderdate and is therefore considered part of the same frame.

Specify default value for rowsBetween and rangeBetween in Spark

Thanks to the answer from @astro_asz, I've come up with the following solution:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{asc, col, lag, sum, when}

val numberRowsBetween = 2
// Window used by lag() to look numberRowsBetween rows back
val window1 = Window.partitionBy("id").orderBy("MONTH")
// Sliding frame: the current row plus the (numberRowsBetween - 1) preceding rows
val window2 = Window.partitionBy("id")
  .orderBy(asc("MONTH"))
  .rowsBetween(-(numberRowsBetween - 1), 0)

// lag(..., numberRowsBetween, 0) falls back to 0 when there is no row that far back
randomDF.withColumn("counter", when(lag(col("number"), numberRowsBetween, 0).over(window1) === 0, 0)
  .otherwise(sum(col("number")).over(window2)))

This solution uses 0 as the default value.

Spark Window Functions - rangeBetween dates

Spark >= 2.3

Since Spark 2.3 it is possible to use interval objects with the SQL API, but DataFrame API support is still a work in progress.

df.createOrReplaceTempView("df")

spark.sql(
"""SELECT *, mean(some_value) OVER (
PARTITION BY id
ORDER BY CAST(start AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean FROM df""").show()

## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+

Spark < 2.3

As far as I know, it is not possible directly in either Spark or Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is to convert to a timestamp and operate on seconds. Assuming the start column contains a date type:

from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("id", "start", "some_value")
df = sc.parallelize([
    row(1, "2015-01-01", 20.0),
    row(1, "2015-01-06", 10.0),
    row(1, "2015-01-07", 25.0),
    row(1, "2015-01-12", 30.0),
    row(2, "2015-01-01", 5.0),
    row(2, "2015-01-03", 30.0),
    row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))

A small helper and window definition:

from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col

# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400

Finally, the query:

w = (Window()
    .partitionBy(col("id"))
    .orderBy(col("start").cast("timestamp").cast("long"))
    .rangeBetween(-days(7), 0))

df.select(col("*"), mean("some_value").over(w).alias("mean")).show()

## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+

Far from pretty but works.


* Hive Language Manual, Types

Understand Spark WindowSpec#rangeBetween

rangeBetween considers the actual values in the ORDER BY column. It checks which values are "in range" (including both the start and end values). In your example, the current row's value is the start of the range and the value one step above it is the end. Since the range is inclusive, all duplicate values are counted as well.

For example, if the start and end values are 1 and 3 respectively, all values in this range (1, 2, 3) will be used in the sum.


This is in contrast to rowsBetween, where only the specified rows are counted. I.e., rowsBetween(Window.currentRow, 1) will only consider the current row and the next row, whether duplicates exist or not.
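
To make this concrete, here is a minimal sketch in Spark SQL (the SQL form of the same frame specifications; the inline values are made up for illustration):

SELECT v,
       -- RANGE: the frame is defined on values; for v = 1 it covers every row whose
       -- value lies between 1 and 2 inclusive, i.e. 1, 2 and 2 -> sum = 5
       sum(v) OVER (ORDER BY v RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) AS sum_range,
       -- ROWS: the frame is defined on row positions; for v = 1 it covers only the
       -- current row and the single next row, i.e. 1 and 2 -> sum = 3
       sum(v) OVER (ORDER BY v ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING) AS sum_rows
FROM VALUES (1), (2), (2), (3) AS t(v)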

Any difference between current row and 0 preceding/following in windowing clause of Oracle analytic functions?

It doesn't really matter which you use. They are two different ways of expressing the windowing, but the optimizer will perform the query the same way. The term "current row" is one that is common to multiple databases with analytic functions, not just Oracle. It's more of a stylistic difference, in the same way that some people prefer count(*) over count(1).
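
For illustration, the frames below all define the same running total (written against the classic emp demo schema; adjust table and column names to your own):

-- All three frames mean: from the start of the partition up to and including the
-- current row, so the three columns return identical results.
SELECT empno,
       SUM(sal) OVER (ORDER BY empno
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)  AS rt_current_row,
       SUM(sal) OVER (ORDER BY empno
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING)  AS rt_0_preceding,
       SUM(sal) OVER (ORDER BY empno
                      ROWS BETWEEN UNBOUNDED PRECEDING AND 0 FOLLOWING)  AS rt_0_following
FROM emp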


