Difference between ROWS BETWEEN and RANGE BETWEEN in (Presto) window function OVER clause
ROWS literally counts the number of rows before and after the current one that you want to aggregate. So ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING ends up with 3 rows: the current row, 1 row before, and 1 row after, regardless of the value of orderdate.
RANGE looks at the values of orderdate and decides which rows should be aggregated and which not. So ORDER BY day RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING would theoretically take all rows whose orderdate is orderdate-1, orderdate, or orderdate+1 - and this can be more than 3 rows.
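The difference can be sketched in a few lines of plain Python (hypothetical data, not Presto itself): ROWS slices by physical position in the sorted order, while RANGE filters by the ORDER BY value.

```python
# (day, value) pairs, sorted by day -- hypothetical data for illustration
rows_sorted = [(1, 'a'), (2, 'b'), (2, 'c'), (3, 'd')]

def rows_frame(i):
    # ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: physical offsets,
    # at most one row before and one row after position i.
    return [v for _, v in rows_sorted[max(i - 1, 0):i + 2]]

def range_frame(i):
    # RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING: value offsets,
    # every row whose day lies within +/-1 of the day at position i.
    day = rows_sorted[i][0]
    return [v for d, v in rows_sorted if day - 1 <= d <= day + 1]

print(rows_frame(1))   # ['a', 'b', 'c'] -- exactly 3 rows
print(range_frame(1))  # ['a', 'b', 'c', 'd'] -- days 1..3, more than 3 rows
```

The second call shows the point of the answer: a value-based frame can pull in more rows than a row-based one whenever several rows share nearby ORDER BY values.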
In Presto, ROWS is fully implemented, but RANGE is only partially implemented: you can use it only with CURRENT ROW and UNBOUNDED.
NOTE: Recent versions of Trino (formerly known as Presto SQL) have full support for RANGE and GROUPS framing. See this blog post for an explanation of how they work.
The best way in Presto to see the difference between the two is to make sure you have duplicate values in the ORDER BY column:
WITH
tt1 (custkey, orderdate, product) AS
( SELECT * FROM ( VALUES ('a','1992-07-10', 3), ('a','1993-08-10', 4), ('a','1994-07-13', 5), ('a','1995-09-13', 5), ('a','1995-09-13', 9), ('a','1997-01-13', 4),
('b','1992-07-10', 6), ('b','1992-07-10', 4), ('b','1994-07-13', 5), ('b','1994-07-13', 9), ('b','1998-11-11', 9) ) )
SELECT *,
array_agg(product) OVER (partition by custkey) c,
array_agg(product) OVER (partition by custkey order by orderdate) c_order,
array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) range_ubub,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rows_ubub,
array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) range_ubc,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) rows_ubc,
array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) range_cub,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) rows_cub,
-- array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING) range22,
-- SYNTAX_ERROR: line 19:65: Window frame RANGE PRECEDING is only supported with UNBOUNDED
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) rows22
from tt1
order by custkey, orderdate, product
You can run this query, see the full results, and learn from them. I'll show only some of the interesting columns here:
custkey orderdate product range_ubc rows_ubc
a 10/07/1992 3 [3] [3]
a 10/08/1993 4 [3, 4] [3, 4]
a 13/07/1994 5 [3, 4, 5] [3, 4, 5]
a 13/09/1995 5 [3, 4, 5, 5, 9] [3, 4, 5, 5]
a 13/09/1995 9 [3, 4, 5, 5, 9] [3, 4, 5, 5, 9]
a 13/01/1997 4 [3, 4, 5, 5, 9, 4] [3, 4, 5, 5, 9, 4]
b 10/07/1992 4 [6, 4] [6, 4]
b 10/07/1992 6 [6, 4] [6]
b 13/07/1994 5 [6, 4, 5, 9] [6, 4, 5]
b 13/07/1994 9 [6, 4, 5, 9] [6, 4, 5, 9]
b 11/11/1998 9 [6, 4, 5, 9, 9] [6, 4, 5, 9, 9]
If you look at the line with orderdate:13/09/1995, product:5 (note: 13/09/1995 appears twice for custkey:a), you can see that ROWS indeed took all rows from the top down to the current line. But if you look at RANGE, you see it also includes the value from the row after, because that row has the exact same orderdate and so is considered part of the same window frame.
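That tie behavior can be sketched in plain Python (a re-creation of custkey a's rows from the table above, not Presto itself): a RANGE frame ending at CURRENT ROW includes all peer rows with the same orderdate, while the corresponding ROWS frame stops at the current physical row.

```python
# custkey 'a' rows as (orderdate, product), already sorted by orderdate
data = [('1992-07-10', 3), ('1993-08-10', 4), ('1994-07-13', 5),
        ('1995-09-13', 5), ('1995-09-13', 9), ('1997-01-13', 4)]

def rows_ubc(i):
    # ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: physical prefix
    return [v for _, v in data[:i + 1]]

def range_ubc(i):
    # RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: all rows whose
    # orderdate <= current orderdate, so peers (ties) are included
    cur = data[i][0]
    return [v for d, v in data if d <= cur]

print(rows_ubc(3))   # [3, 4, 5, 5]
print(range_ubc(3))  # [3, 4, 5, 5, 9] -- the 9 shares orderdate 1995-09-13
```

This reproduces the rows_ubc / range_ubc columns for the tied 13/09/1995 rows.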
Specify default value for rowsBetween and rangeBetween in Spark
Thanks to the answer of @astro_asz, I've come up with the following solution:
val numberRowsBetween = 2
val window1 = Window.partitionBy("id").orderBy("MONTH")
val window2 = Window.partitionBy("id")
.orderBy(asc("MONTH"))
.rowsBetween(-(numberRowsBetween - 1), 0)
randomDF.withColumn("counter", when(lag(col("number"), numberRowsBetween, 0).over(window1) === 0, 0)
  .otherwise(sum(col("number")).over(window2)))
This solution uses 0 as the default value.
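The idea behind that guard can be sketched in plain Python (a simplified analog of the Scala snippet, not Spark itself): a trailing sum over the last N rows that falls back to a default until N rows of history exist. Note that the original lag-based check, which tests whether the value N rows back equals the default 0, would also fire if a genuine 0 appeared in the data; the index-based sketch below avoids that quirk.

```python
# Trailing sum over the last n values, with a default while the
# window is not yet "full" -- hypothetical helper for illustration.
def trailing_sum_with_default(numbers, n, default=0):
    out = []
    for i in range(len(numbers)):
        if i < n - 1:
            out.append(default)            # not enough history yet
        else:
            out.append(sum(numbers[i - n + 1:i + 1]))
    return out

print(trailing_sum_with_default([5, 1, 2, 4], 2))  # [0, 6, 3, 6]
```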
Spark Window Functions - rangeBetween dates
Spark >= 2.3
Since Spark 2.3 it is possible to use interval objects with the SQL API, but the DataFrame API support is still a work in progress.
df.createOrReplaceTempView("df")
spark.sql(
"""SELECT *, mean(some_value) OVER (
PARTITION BY id
ORDER BY CAST(start AS timestamp)
RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
) AS mean FROM df""").show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Spark < 2.3
As far as I know, it is not possible directly in either Spark or Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to timestamp and operating on seconds. Assuming the start column contains a date type:
from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("id", "start", "some_value")
df = sc.parallelize([
row(1, "2015-01-01", 20.0),
row(1, "2015-01-06", 10.0),
row(1, "2015-01-07", 25.0),
row(1, "2015-01-12", 30.0),
row(2, "2015-01-01", 5.0),
row(2, "2015-01-03", 30.0),
row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
A small helper and window definition:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400
Finally query:
w = (Window()
.partitionBy(col("id"))
.orderBy(col("start").cast("timestamp").cast("long"))
.rangeBetween(-days(7), 0))
df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Far from pretty but works.
* Hive Language Manual, Types
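To make the seconds arithmetic concrete, here is the same 7-day frame in plain Python (no Spark required, using the sample data above): a row is in frame if its start, converted to seconds, lies within the last 7 * 86400 seconds of the current row's start, within the same id.

```python
from datetime import date

# (id, start, some_value) -- the same sample data as above
rows = [(1, "2015-01-01", 20.0), (1, "2015-01-06", 10.0),
        (1, "2015-01-07", 25.0), (1, "2015-01-12", 30.0),
        (2, "2015-01-01", 5.0), (2, "2015-01-03", 30.0),
        (2, "2015-02-01", 20.0)]

def ts(s):
    # ordinal day -> seconds; any consistent second scale works here,
    # just like casting a timestamp to long in the Spark version
    return date.fromisoformat(s).toordinal() * 86400

def mean_7d(rid, start):
    # mean over rows of the same id within [start - 7 days, start]
    cur = ts(start)
    vals = [v for i, s, v in rows
            if i == rid and cur - 7 * 86400 <= ts(s) <= cur]
    return sum(vals) / len(vals)

print(mean_7d(1, "2015-01-12"))  # 21.666... (2015-01-01 falls outside 7 days)
print(mean_7d(2, "2015-02-01"))  # 20.0 (no other row within 7 days)
```

The results match the mean column in the output tables above.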
Understand Spark WindowSpec#rangeBetween
rangeBetween
considers the actual values in the column. It checks which values are "in range" (including both the start and end values). In your example, the current row's value is the start and the next value is the end. Since the range is inclusive, all duplicate values will be counted as well.
For example, if the start and end values are 1 and 3 respectively, all values in this range (1, 2, 3) will be used in the sum.
This is in contrast to rowsBetween. For that function, only the specified rows are counted. I.e., rowsBetween(Window.currentRow, 1)
will only consider the current and next rows, whether duplicates exist or not.
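A plain-Python sketch of the two framings on an ordered column with duplicates (hypothetical data, not Spark itself):

```python
# An ordered column with duplicate values
values = [1, 2, 2, 2, 3]

def range_frame(i):
    # rangeBetween(currentRow, 1): value-based, inclusive --
    # everything in [values[i], values[i] + 1]
    return [v for v in values if values[i] <= v <= values[i] + 1]

def rows_frame(i):
    # rowsBetween(currentRow, 1): row-based -- current row and next row only
    return values[i:i + 2]

print(range_frame(0))  # [1, 2, 2, 2] -- all duplicates in range are included
print(rows_frame(0))   # [1, 2]      -- exactly two rows
```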
Any difference between current row and 0 preceding/following in windowing clause of Oracle analytic functions?
It doesn't really matter which you use. They are two different ways of expressing the windowing, but the optimizer will perform the query the same way. The term "current row" is one that is common to multiple databases with analytic functions, not just Oracle. It's more of a stylistic difference, in the same way that some people prefer count(*) over count(1).