BigQuery SQL for Sliding Window Aggregate

How about this:

SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
  SELECT changes,
         LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
         LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
         LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
         login,
         week
  FROM (
    SELECT SUM(payload_pull_request_changed_files) changes,
           UTC_USEC_TO_WEEK(created_at, 1) week,
           actor_attributes_login login
    FROM [publicdata:samples.github_timeline]
    WHERE payload_pull_request_changed_files > 0
    GROUP BY week, login
  )
)
HAVING changes28days > 0

For each user it counts how many changes they submitted per week. Then with LAG() we can peek at the previous rows: how many changes they submitted 1, 2, and 3 weeks earlier. Adding those 4 weekly counts gives the number of changes submitted in the last 28 days.

Now you can wrap everything in a new query to filter for users with changes28days > X and count them, as sketched below.
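For instance, a minimal sketch of that wrapper (legacy SQL; the threshold of 100 changes is made up here), counting how many users cleared the bar each week:

SELECT week, COUNT(login) active_users
FROM (
  SELECT changes + changes1 + changes2 + changes3 changes28days, login, week
  FROM (
    SELECT changes,
           LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
           LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
           LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
           login,
           week
    FROM (
      SELECT SUM(payload_pull_request_changed_files) changes,
             UTC_USEC_TO_WEEK(created_at, 1) week,
             actor_attributes_login login
      FROM [publicdata:samples.github_timeline]
      WHERE payload_pull_request_changed_files > 0
      GROUP BY week, login
    )
  )
)
WHERE changes28days > 100
GROUP BY week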

BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)

The BigQuery documentation doesn't do a good job of explaining how expressive its window functions are, because it doesn't specify what expressions can appear after ROWS or RANGE. It actually supports the SQL 2003 standard for window functions, which is documented in various places on the web.

That means you can get the effect you want with a single window function. The bound is 27 because that's how many rows before the current one to include in the sum (28 rows total, counting the current one).

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date
                        ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS spend_28day,
       user,
       date
FROM user_spend;

A RANGE bound can also be extremely useful. If your table were missing dates for some user, then 27 PRECEDING rows would go back more than 27 days, whereas RANGE produces a window based on the date values themselves. In the following query, the date field is a BigQuery TIMESTAMP and the range is specified in microseconds. Whenever you do date math like this in BigQuery, I'd advise testing it thoroughly to make sure it gives you the expected answer.

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date
                        RANGE BETWEEN 27 * 24 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS spend_28day,
       user,
       date
FROM user_spend;
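Note that the query above appears to rely on legacy SQL behavior (a numeric RANGE directly over a TIMESTAMP ORDER BY). In Standard SQL, a RANGE frame with a numeric offset requires a numeric ordering key, so a rough equivalent (a sketch, assuming the same user_spend table) orders by UNIX_MICROS(date):

#standardSQL
SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY UNIX_MICROS(date)
                        RANGE BETWEEN 2332800000000 PRECEDING AND CURRENT ROW) AS spend_28day,  -- 27 days in microseconds
       user,
       date
FROM user_spend;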

Sliding window aggregate in BigQuery with 15-minute aggregation

Below is for BigQuery Standard SQL

Note: you provided a simplified example of your data and the query below follows it, so instead of aggregating every 15 minutes it aggregates every 2 seconds. This makes it easy to test and play with. It can easily be adjusted to 15 minutes by changing SECOND to MINUTE in 3 places and 2 to 15 in 3 places. Also, this example uses the TIME data type for the time field, as in your example, so it is limited to a 24-hour period; most likely your real data has DATETIME or TIMESTAMP values, in which case you will also need to replace all TIME_* functions with the respective DATETIME_* or TIMESTAMP_* functions (see the sketch after the result below).

So, finally - the query is:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIME '00:00:00' time, 31 viewCount UNION ALL
  SELECT TIME '00:00:01', 44 UNION ALL
  SELECT TIME '00:00:02', 78 UNION ALL
  SELECT TIME '00:00:03', 71 UNION ALL
  SELECT TIME '00:00:04', 72 UNION ALL
  SELECT TIME '00:00:05', 73 UNION ALL
  SELECT TIME '00:00:06', 64 UNION ALL
  SELECT TIME '00:00:07', 70
),
period AS (
  SELECT MIN(time) min_time, MAX(time) max_time, TIME_DIFF(MAX(time), MIN(time), SECOND) diff
  FROM `project.dataset.table`
),
checkpoints AS (
  SELECT TIME_ADD(min_time, INTERVAL step SECOND) start_time,
         TIME_ADD(min_time, INTERVAL step + 2 SECOND) end_time
  FROM period, UNNEST(GENERATE_ARRAY(0, diff + 2, 2)) step
)
SELECT start_time time, SUM(viewCount) viewCount
FROM checkpoints c
JOIN `project.dataset.table` t
ON t.time >= c.start_time AND t.time < c.end_time
GROUP BY start_time
ORDER BY start_time

and the result is:

Row  time      viewCount
1    00:00:00  75
2    00:00:02  149
3    00:00:04  145
4    00:00:06  134
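
Following the note above, a 15-minute variant over a TIMESTAMP column might look like this (a sketch; the ts column name is assumed, and the inline sample data is dropped):

#standardSQL
WITH period AS (
  SELECT MIN(ts) min_ts, TIMESTAMP_DIFF(MAX(ts), MIN(ts), MINUTE) diff
  FROM `project.dataset.table`
),
checkpoints AS (
  SELECT TIMESTAMP_ADD(min_ts, INTERVAL step MINUTE) start_time,
         TIMESTAMP_ADD(min_ts, INTERVAL step + 15 MINUTE) end_time
  FROM period, UNNEST(GENERATE_ARRAY(0, diff + 15, 15)) step
)
SELECT start_time time, SUM(viewCount) viewCount
FROM checkpoints c
JOIN `project.dataset.table` t
ON t.ts >= c.start_time AND t.ts < c.end_time
GROUP BY start_time
ORDER BY start_time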

Sliding window aggregate for year-week in BigQuery

Below is for BigQuery Standard SQL

#standardSQL
WITH weeks AS (
  SELECT 100 * year + week year_week
  FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
       UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
), temp AS (
  SELECT i.run_id, w.year_week, d.year_week week2, value
  FROM weeks w
  CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
  LEFT JOIN `project.dataset.table` d
  USING(year_week, run_id)
)
SELECT * FROM (
  SELECT run_id, year_week,
         SUM(value) OVER(win) aggregate_sum
  FROM temp
  WINDOW win AS (
    PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING
  )
)
WHERE aggregate_sum IS NOT NULL

You can test / play with the above using the dummy data from your question, as below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '001' run_id, 201451 year_week, 5 value UNION ALL
  SELECT '001', 201452, 8 UNION ALL
  SELECT '001', 201501, 1 UNION ALL
  SELECT '001', 201505, 5
), weeks AS (
  SELECT 100 * year + week year_week
  FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
       UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
), temp AS (
  SELECT i.run_id, w.year_week, d.year_week week2, value
  FROM weeks w
  CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
  LEFT JOIN `project.dataset.table` d
  USING(year_week, run_id)
)
SELECT * FROM (
  SELECT run_id, year_week,
         SUM(value) OVER(win) aggregate_sum
  FROM temp
  WINDOW win AS (
    PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING
  )
)
WHERE aggregate_sum IS NOT NULL
-- ORDER BY run_id, year_week

with the result (the run_id 003 rows come from the fuller data in the original question, not the four dummy rows above):

Row  run_id  year_week  aggregate_sum
1    001     201447     5
2    001     201448     13
3    001     201449     14
4    001     201450     14
5    001     201451     14
6    001     201452     9
7    001     201501     6
8    001     201502     5
9    001     201503     5
10   001     201504     5
11   001     201505     5
12   003     201348     8
13   003     201349     9
14   003     201350     9
15   003     201351     9
16   003     201352     9
17   003     201401     6
18   003     201402     5
19   003     201403     5
20   003     201404     5
21   003     201405     5

Note: this covers only the five years in the question (2013, 2014, 2015, 2016, and 2017), but it can easily be extended in the weeks CTE, as shown below.
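
For example, the hard-coded year list can be replaced with GENERATE_ARRAY (a sketch; the 2013–2023 span is arbitrary):

#standardSQL
WITH weeks AS (
  SELECT 100 * year + week year_week
  FROM UNNEST(GENERATE_ARRAY(2013, 2023)) year,
       UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
)
SELECT * FROM weeks ORDER BY year_week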

BigQuery: Computing aggregate over window of time for each person

Here is an efficient, succinct way to do it that exploits the ordered structure of the timestamps.

SELECT user,
       MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT user,
         COUNT(*) OVER (PARTITION BY user ORDER BY timestamp
                        RANGE BETWEEN 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS per_hour,
         timestamp
  FROM [dataset_example_in_question_user_timestamps]
)
GROUP BY user
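
The query above is legacy SQL (note the bracketed table name and the microsecond RANGE directly over a TIMESTAMP). A Standard SQL sketch of the same idea, with a hypothetical table name, orders by UNIX_MICROS instead:

#standardSQL
SELECT user,
       MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT user,
         COUNT(*) OVER (PARTITION BY user ORDER BY UNIX_MICROS(timestamp)
                        RANGE BETWEEN 3600000000 PRECEDING AND CURRENT ROW) AS per_hour,  -- 1 hour in microseconds
         timestamp
  FROM `project.dataset.user_timestamps`
)
GROUP BY user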

BigQuery - Aggregate by 2 months window

Consider the approach below:

SELECT FORMAT_DATE('%b %Y',
         DATE(EXTRACT(YEAR FROM CAST(date AS DATE)),
              2 * DIV(EXTRACT(MONTH FROM CAST(date AS DATE)) - 1, 2) + 1, 1)
       ) AS Date,
       Market,
       SUM(Rev) AS Rev
FROM `data_table`
GROUP BY 1, 2

with output like below, labeling each two-month bucket by its starting month (sample image omitted).
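
To see the bucketing concretely, here is a minimal sketch with made-up rows; months 1–2 collapse to January, months 3–4 to March, and so on:

#standardSQL
WITH `data_table` AS (
  SELECT '2023-01-15' date, 'US' Market, 10 Rev UNION ALL
  SELECT '2023-02-20', 'US', 15 UNION ALL
  SELECT '2023-03-05', 'US', 7
)
SELECT FORMAT_DATE('%b %Y',
         DATE(EXTRACT(YEAR FROM CAST(date AS DATE)),
              2 * DIV(EXTRACT(MONTH FROM CAST(date AS DATE)) - 1, 2) + 1, 1)
       ) AS Date,
       Market,
       SUM(Rev) AS Rev
FROM `data_table`
GROUP BY 1, 2
-- returns: Jan 2023 | US | 25, and Mar 2023 | US | 7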

Google BigQuery Standard SQL - Sales Sliding Window

A tumbling window is easier to do; it is just a regular GROUP BY at hourly intervals:

SELECT TIMESTAMP_TRUNC(orderPlacedTimestamp, HOUR),
       SUM(orderTotals.grandTotalNet)
FROM T
GROUP BY 1

For a sliding window, I would first normalize timestamps to 5-minute intervals using the following steps:

  1. TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE) to get to a minute boundary
  2. UNIX_SECONDS to convert to seconds since the epoch
  3. Divide by 60 to get minutes
  4. Divide by 5 to get the 5-minute interval
  5. Floor to an integer with DIV (a plain CAST to INT64 would round halfway cases away from zero and push some minutes into the wrong bucket):

DIV(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE)), 60 * 5)
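
As a quick sanity check of the date math (a sketch over one made-up timestamp), 00:07:30 should land in the bucket that starts at 00:05:00:

#standardSQL
SELECT ts,
       TIMESTAMP_SECONDS(DIV(UNIX_SECONDS(TIMESTAMP_TRUNC(ts, MINUTE)), 60 * 5) * 60 * 5) AS bucket_start
FROM UNNEST([TIMESTAMP '2021-01-01 00:07:30+00']) ts
-- bucket_start: 2021-01-01 00:05:00 UTC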

Now you can use a standard OVER() clause to get a 10-minute window, which means 2 such intervals at a time; to get the window start time, use the FIRST_VALUE analytic function:

SELECT orderId,
       TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min * 5 * 60) OVER(w)) startWindowTime,
       TIMESTAMP_ADD(TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min * 5 * 60) OVER(w)),
                     INTERVAL 10 MINUTE) endWindowTime,
       SUM(grandTotalNet) OVER(w)
FROM (
  SELECT *,
         DIV(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE)), 60 * 5) AS ts_5min
  FROM t
)
WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)

