Window Functions: Last_Value(Order by ... Asc) Same as Last_Value(Order by ... Desc)

Window Functions: last_value(ORDER BY ... ASC) same as last_value(ORDER BY ... DESC)

After one year I have got the solution:

Take this statement:

SELECT
    id,
    array_accum(value) over (partition BY session_ID ORDER BY id)      AS window_asc,
    first_value(value) over (partition BY session_ID ORDER BY id)      AS first_value_window_asc,
    last_value(value) over (partition BY session_ID ORDER BY id)       AS last_value_window_asc,
    array_accum(value) over (partition BY session_ID ORDER BY id DESC) AS window_desc,
    first_value(value) over (partition BY session_ID ORDER BY id DESC) AS first_value_window_desc,
    last_value(value) over (partition BY session_ID ORDER BY id DESC)  AS last_value_window_desc
FROM
    test
ORDER BY
    id

This gives

id  window_asc     first_value_window_asc  last_value_window_asc  window_desc    first_value_window_desc  last_value_window_desc  
--  -------------  ----------------------  ---------------------  -------------  -----------------------  ----------------------  
0   {100}          100                     100                    {140,120,100}  140                      100                     
1   {100,120}      100                     120                    {140,120}      140                      120                     
2   {100,120,140}  100                     140                    {140}          140                      140                     
3   {900}          900                     900                    {500,800,900}  500                      900                     
4   {900,800}      900                     800                    {500,800}      500                      800                     
5   {900,800,500}  900                     500                    {500}          500                      500

The array_accum shows the used window. There you can see the first and the current last value of the window.

What happenes shows the execution plan:

"Sort  (cost=444.23..449.08 rows=1940 width=12)"
"  Sort Key: id"
"  ->  WindowAgg  (cost=289.78..338.28 rows=1940 width=12)"
"        ->  Sort  (cost=289.78..294.63 rows=1940 width=12)"
"              Sort Key: session_id, id"
"              ->  WindowAgg  (cost=135.34..183.84 rows=1940 width=12)"
"                    ->  Sort  (cost=135.34..140.19 rows=1940 width=12)"
"                          Sort Key: session_id, id"
"                          ->  Seq Scan on test  (cost=0.00..29.40 rows=1940 width=12)"

There you can see: First there is an ORDER BY id for the first three window functions.

This gives (as stated in question)

id  window_asc     first_value_window_asc  last_value_window_asc  
--  -------------  ----------------------  ---------------------  
3   {900}          900                     900                    
4   {900,800}      900                     800                    
5   {900,800,500}  900                     500                    
0   {100}          100                     100                    
1   {100,120}      100                     120                    
2   {100,120,140}  100                     140

Then you can see another sort: ORDER BY id DESC for the next three window functions. This sort gives:

id  window_asc     first_value_window_asc  last_value_window_asc  
--  -------------  ----------------------  ---------------------  
5   {900,800,500}  900                     500                    
4   {900,800}      900                     800                    
3   {900}          900                     900                    
2   {100,120,140}  100                     140                    
1   {100,120}      100                     120                    
0   {100}          100                     100

With this sorting the DESC window function are executed. The array_accum column shows the resulting windows:

id  window_desc    
--  -------------  
5   {500}          
4   {500,800}      
3   {500,800,900}  
2   {140}          
1   {140,120}      
0   {140,120,100}

The resulting (first_value DESC and) last_value DESC is now absolutely identical to the last_value ASC:

id  window_asc     last_value_window_asc  window_desc    last_value_window_desc  
--  -------------  ---------------------  -------------  ----------------------  
5   {900,800,500}  500                    {500}          500                     
4   {900,800}      800                    {500,800}      800                     
3   {900}          900                    {500,800,900}  900                     
2   {100,120,140}  140                    {140}          140                     
1   {100,120}      120                    {140,120}      120                     
0   {100}          100                    {140,120,100}  100

Now it became clear to me why last_value ASC is equal to last_value DESC. It's because the second ORDER of the window functions which gives an inverted window.

(The last sort of the execution plan ist the last ORDER BY of the statement.)

As a little bonus: This query shows a little optimization potential: If you call the DESC windows first and then the ASC ones you do not need the third sort. It is in the right sort at this moment.

FIRST_VALUE working correctly but LAST_VALUE not giving desired results

As mentioned by @JeroenMostert, when you add an ORDER BY to an OVER clause (of an analytical function which takes ROWS/RANGE), the default window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, therefore the last value in the window is always the current row.

What you want is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, so you need to add that explicitly. Unintuitive, but that is how it is.

You should also add this to FIRST_VALUE for performance reasons.

SELECT 
    DISTINCT *,
    ROW_NUMBER() OVER (PARTITION BY dog_guid ORDER BY created_at ASC) AS test_number,
    LAG(created_at) OVER (PARTITION BY dog_guid ORDER BY created_at ASC) AS previous_test_date,
    CAST(DATEDIFF(SECOND, LAG(created_at) OVER (PARTITION BY dog_guid ORDER BY created_at ASC), created_at)/(60*60*24) AS FLOAT) AS date_diff_days,
    CAST(DATEDIFF(SECOND, LAG(created_at) OVER (PARTITION BY dog_guid ORDER BY created_at ASC), created_at)/(60) AS FLOAT) AS date_diff_mins,
    FIRST_VALUE(created_at) OVER (PARTITION BY dog_guid ORDER BY created_at ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS first_test_date,
    LAST_VALUE(created_at) OVER (PARTITION BY dog_guid ORDER BY created_at ASC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_test_date
FROM    
    complete_tests c
WHERE
    dog_guid IS NOT NULL

Strictly speaking, you could change it to FIRST_VALUE and DESC, but this is going to bad for performance, firstly because it will need a second sort, and secondly because the window is still RANGE which requires an on-disk worktable.

Should first_value() over (order by something asc) be the same as last_value () over (order by something desc)?

You forgot about a moving window used for analytical functions. See the difference (used rows between unbounded preceding and unbounded following):

SQL> select random_date, trunc(random_date,'MM') random_month, random_value,
  2  first_value(random_value) over (partition by trunc(random_date,'MM') order by random_date desc rows between unbounded preceding and unbounded following) first_,
  3  last_value(random_value) over (partition by trunc(random_date,'MM') order by random_date rows between unbounded preceding and unbounded following) last_
  4  from
  5  (select to_date(round (dbms_random.value (1, 28))
  6          || '-'
  7          || round (dbms_random.value (02, 03))
  8          || '-'
  9          || round (dbms_random.value (2014, 2014)),
 10          'DD-MM-YYYY') + level - 1 random_date,
 11          round(100*(dbms_random.value)) random_value
 12  from dual
 13  connect by level <= 10) order by 2, 1;

RANDOM_DATE RANDOM_MONTH RANDOM_VALUE     FIRST_      LAST_
----------- ------------ ------------ ---------- ----------
02.02.2014  01.02.2014             93         75         75
09.02.2014  01.02.2014             78         75         75
11.02.2014  01.02.2014             69         75         75
12.02.2014  01.02.2014             13         75         75
21.02.2014  01.02.2014             91         75         75
25.02.2014  01.02.2014             75         75         75
01.03.2014  01.03.2014             54         80         80
15.03.2014  01.03.2014             37         80         80
16.03.2014  01.03.2014             92         80         80
17.03.2014  01.03.2014             80         80         80

10 rows selected

Sorting behavior specification in order by for window function in GBQ

Sorting order should be specified on per column basis with ASC being default, so can be omitted. So, Yes - you should use DESC for each column as in below

SELECT partitionDate,
  createdUTC,
  ROW_NUMBER() OVER(PARTITION BY externalid ORDER BY partitionDate DESC, createdUTC DESC NULLS LAST)

LAST_VALUE() with ASC and FIRST_VALUE with DESC return different results

There's a subtlety on how OVER() functions work when they have an (ORDER BY): They work incrementally.

See this query:

SELECT x, y, 
       FIRST_VALUE(x) OVER(ORDER BY y) first, 
       LAST_VALUE(x) OVER(ORDER BY y DESC) last,
       SUM(x) OVER() plain_sum_over, 
       SUM(x) OVER(ORDER BY y) sum_over_order
FROM (SELECT 1 x, 1 y),(SELECT 2 x, 2 y),(SELECT 3 x, 3 y),(SELECT 4 x, 4 y)

Sample Image

The plain_sum_over and sum_over_order reveal the secret: With an ORDER BY you get incremental results - and that's what you are witnessing in your results.

first_value and last_value for each user id

You need a PARTITION BY clause, which generates the frame per user_id

SELECT DISTINCT on (user_id)
    user_id, 
    first_value(sj.at) OVER (PARTITION BY user_id ORDER BY sj.at ASC),
    last_value(sj.to) OVER (PARTITION BY user_id ORDER BY sj.at DESC)
FROM 
    scheduled_jobs sj
WHERE 
    sj.at BETWEEN CURRENT_DATE + 3 and CURRENT_DATE + 4

Addionally: Please be careful by using last_value. Sometimes it would not work as expected. See here

You should use first_value with DESC ordering instead:

first_value(scheduled_jobs.at) over (partition by user_id order by scheduled_jobs.at DESC)

different data window for FIRST_VALUE and LAST_VALUE

Oh, i get it now.
Both functions have window like UNBOUNDED PRECEDING AND CURRENT ROW but the FIRST_VALUE gets the first row so no matter how many will follow it always shows the first result, and the LAST_VALUE shows always the CURRENT_ROW.
Thank You all!

In hive window, what would happen if the value of CURRENT ROW is smaller than that of UNBOUNDED PRECEDING

It works as designed and according to the standard, the same behavior is in other databases.

It is easier to find specification for Hive and other databases like Oracle than standard document (for free). For example see "Windowing Specifications in HQL" and Oracle "Window Function Frame Specification"

First the partition is ordered, then bounds are calculated and frame between bounds is used. Frame is taken according to the ORDER BY, not always >=bound1 and <=bound2.

For order by DESC Bound1>=Row.value>=Bound2. Frame includes rows from the partition start through the current row, including all peers of the current row (rows equal to the current row according to the ORDER BY clause).

For order by ASC Bound1<=Row.value<=Bound2.

UNBOUNDED PRECEDING:

The bound (bound1) is the first partition row (according to the order).

CURRENT ROW:

For ROWS, the bound (bound2) is the current row. For RANGE, the bound is the peers of the current row (rows with the same value as current row).

Also read this excellent explanation from Sybase:

The sort order of the ORDER BY values is a critical part of the test
for qualifying rows in a value-based frame; the numeric values alone
do not determine exclusion or inclusion

Window Functions: Last_Value(Order by ... Asc) Same as Last_Value(Order by ... Desc)