BigQuery SQL for Sliding Window Aggregate

How about this:

SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
  SELECT changes,
         LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
         LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
         LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
         login,
         week
  FROM (
    SELECT SUM(payload_pull_request_changed_files) changes,
           UTC_USEC_TO_WEEK(created_at, 1) week,
           actor_attributes_login login
    FROM [publicdata:samples.github_timeline]
    WHERE payload_pull_request_changed_files > 0
    GROUP BY week, login
  )
)
HAVING changes28days > 0

For each user it counts how many changes they submitted per week. Then with LAG() we can peek at the previous rows: how many changes they submitted 1, 2, and 3 weeks earlier. Adding those 4 weekly counts gives the number of changes submitted in the last 28 days.

Now you can wrap everything in a new query to filter for users with changes28days > X and count them, as sketched below.
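For instance, a minimal sketch of that wrapper (legacy SQL; the threshold of 100 changes is made up here), counting how many users cleared the bar each week:

SELECT week, COUNT(login) active_users
FROM (
  SELECT changes + changes1 + changes2 + changes3 changes28days, login, week
  FROM (
    SELECT changes,
           LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
           LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
           LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
           login,
           week
    FROM (
      SELECT SUM(payload_pull_request_changed_files) changes,
             UTC_USEC_TO_WEEK(created_at, 1) week,
             actor_attributes_login login
      FROM [publicdata:samples.github_timeline]
      WHERE payload_pull_request_changed_files > 0
      GROUP BY week, login
    )
  )
)
WHERE changes28days > 100
GROUP BY week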

BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)

The BigQuery documentation doesn't do a good job of explaining how expressive its window functions are, because it doesn't specify what expressions can appear after ROWS or RANGE. It actually supports the SQL 2003 standard for window functions, which is documented in various places on the web.

That means you can get the effect you want with a single window function. The bound is 27 because that's how many rows before the current one to include in the sum (28 rows total, counting the current one).

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date
                        ROWS BETWEEN 27 PRECEDING AND CURRENT ROW) AS spend_28day,
       user,
       date
FROM user_spend;

A RANGE bound can also be extremely useful. If your table were missing dates for some user, then 27 PRECEDING rows would go back more than 27 days, whereas RANGE produces a window based on the date values themselves. In the following query, the date field is a BigQuery TIMESTAMP and the range is specified in microseconds. Whenever you do date math like this in BigQuery, I'd advise testing it thoroughly to make sure it gives you the expected answer.

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date
                        RANGE BETWEEN 27 * 24 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS spend_28day,
       user,
       date
FROM user_spend;
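Note that the query above appears to rely on legacy SQL behavior (a numeric RANGE directly over a TIMESTAMP ORDER BY). In Standard SQL, a RANGE frame with a numeric offset requires a numeric ordering key, so a rough equivalent (a sketch, assuming the same user_spend table) orders by UNIX_MICROS(date):

#standardSQL
SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY UNIX_MICROS(date)
                        RANGE BETWEEN 2332800000000 PRECEDING AND CURRENT ROW) AS spend_28day,  -- 27 days in microseconds
       user,
       date
FROM user_spend;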

Sliding window aggregate in BigQuery with 15-minute aggregation

Below is for BigQuery Standard SQL

Note: you provided a simplified example of your data and the query below follows it, so instead of aggregating every 15 minutes it aggregates every 2 seconds. This makes it easy to test and play with. It can easily be adjusted to 15 minutes by changing SECOND to MINUTE in 3 places and 2 to 15 in 3 places. Also, this example uses the TIME data type for the time field, as in your example, so it is limited to a 24-hour period; most likely your real data has DATETIME or TIMESTAMP values, in which case you will also need to replace all TIME_* functions with the respective DATETIME_* or TIMESTAMP_* functions (see the sketch after the result below).

So, finally - the query is:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIME '00:00:00' time, 31 viewCount UNION ALL
  SELECT TIME '00:00:01', 44 UNION ALL
  SELECT TIME '00:00:02', 78 UNION ALL
  SELECT TIME '00:00:03', 71 UNION ALL
  SELECT TIME '00:00:04', 72 UNION ALL
  SELECT TIME '00:00:05', 73 UNION ALL
  SELECT TIME '00:00:06', 64 UNION ALL
  SELECT TIME '00:00:07', 70
),
period AS (
  SELECT MIN(time) min_time, MAX(time) max_time, TIME_DIFF(MAX(time), MIN(time), SECOND) diff
  FROM `project.dataset.table`
),
checkpoints AS (
  SELECT TIME_ADD(min_time, INTERVAL step SECOND) start_time,
         TIME_ADD(min_time, INTERVAL step + 2 SECOND) end_time
  FROM period, UNNEST(GENERATE_ARRAY(0, diff + 2, 2)) step
)
SELECT start_time time, SUM(viewCount) viewCount
FROM checkpoints c
JOIN `project.dataset.table` t
ON t.time >= c.start_time AND t.time < c.end_time
GROUP BY start_time
ORDER BY start_time

and the result is:

Row  time      viewCount
1    00:00:00  75
2    00:00:02  149
3    00:00:04  145
4    00:00:06  134
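
Following the note above, a 15-minute variant over a TIMESTAMP column might look like this (a sketch; the ts column name is assumed, and the inline sample data is dropped):

#standardSQL
WITH period AS (
  SELECT MIN(ts) min_ts, TIMESTAMP_DIFF(MAX(ts), MIN(ts), MINUTE) diff
  FROM `project.dataset.table`
),
checkpoints AS (
  SELECT TIMESTAMP_ADD(min_ts, INTERVAL step MINUTE) start_time,
         TIMESTAMP_ADD(min_ts, INTERVAL step + 15 MINUTE) end_time
  FROM period, UNNEST(GENERATE_ARRAY(0, diff + 15, 15)) step
)
SELECT start_time time, SUM(viewCount) viewCount
FROM checkpoints c
JOIN `project.dataset.table` t
ON t.ts >= c.start_time AND t.ts < c.end_time
GROUP BY start_time
ORDER BY start_time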

Sliding window aggregate for year-week in BigQuery

Below is for BigQuery Standard SQL

#standardSQL
WITH weeks AS (
  SELECT 100 * year + week year_week
  FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
       UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
), temp AS (
  SELECT i.run_id, w.year_week, d.year_week week2, value
  FROM weeks w
  CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
  LEFT JOIN `project.dataset.table` d
  USING(year_week, run_id)
)
SELECT * FROM (
  SELECT run_id, year_week,
         SUM(value) OVER(win) aggregate_sum
  FROM temp
  WINDOW win AS (
    PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING
  )
)
WHERE aggregate_sum IS NOT NULL

You can test / play with the above using the dummy data from your question, as below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT '001' run_id, 201451 year_week, 5 value UNION ALL
  SELECT '001', 201452, 8 UNION ALL
  SELECT '001', 201501, 1 UNION ALL
  SELECT '001', 201505, 5
), weeks AS (
  SELECT 100 * year + week year_week
  FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
       UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
), temp AS (
  SELECT i.run_id, w.year_week, d.year_week week2, value
  FROM weeks w
  CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
  LEFT JOIN `project.dataset.table` d
  USING(year_week, run_id)
)
SELECT * FROM (
  SELECT run_id, year_week,
         SUM(value) OVER(win) aggregate_sum
  FROM temp
  WINDOW win AS (
    PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING
  )
)
WHERE aggregate_sum IS NOT NULL
-- ORDER BY run_id, year_week

with the result (the run_id 003 rows come from the fuller data in the original question, not the four dummy rows above):

Row  run_id  year_week  aggregate_sum
1    001     201447     5
2    001     201448     13
3    001     201449     14
4    001     201450     14
5    001     201451     14
6    001     201452     9
7    001     201501     6
8    001     201502     5
9    001     201503     5
10   001     201504     5
11   001     201505     5
12   003     201348     8
13   003     201349     9
14   003     201350     9
15   003     201351     9
16   003     201352     9
17   003     201401     6
18   003     201402     5
19   003     201403     5
20   003     201404     5
21   003     201405     5

Note: this covers only the five years in the question (2013, 2014, 2015, 2016, and 2017), but it can easily be extended in the weeks CTE, as shown below.
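
For example, the hard-coded year list can be replaced with GENERATE_ARRAY (a sketch; the 2013–2023 span is arbitrary):

#standardSQL
WITH weeks AS (
  SELECT 100 * year + week year_week
  FROM UNNEST(GENERATE_ARRAY(2013, 2023)) year,
       UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
)
SELECT * FROM weeks ORDER BY year_week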

BigQuery: Computing aggregate over window of time for each person

Here is an efficient, succinct way to do it that exploits the ordered structure of the timestamps.

SELECT user,
       MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT user,
         COUNT(*) OVER (PARTITION BY user ORDER BY timestamp
                        RANGE BETWEEN 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS per_hour,
         timestamp
  FROM [dataset_example_in_question_user_timestamps]
)
GROUP BY user
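
The query above is legacy SQL (note the bracketed table name and the microsecond RANGE directly over a TIMESTAMP). A Standard SQL sketch of the same idea, with a hypothetical table name, orders by UNIX_MICROS instead:

#standardSQL
SELECT user,
       MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT user,
         COUNT(*) OVER (PARTITION BY user ORDER BY UNIX_MICROS(timestamp)
                        RANGE BETWEEN 3600000000 PRECEDING AND CURRENT ROW) AS per_hour,  -- 1 hour in microseconds
         timestamp
  FROM `project.dataset.user_timestamps`
)
GROUP BY user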

BigQuery - Aggregate by 2 months window

Consider the approach below:

SELECT FORMAT_DATE('%b %Y',
         DATE(EXTRACT(YEAR FROM CAST(date AS DATE)),
              2 * DIV(EXTRACT(MONTH FROM CAST(date AS DATE)) - 1, 2) + 1, 1)
       ) AS Date,
       Market,
       SUM(Rev) AS Rev
FROM `data_table`
GROUP BY 1, 2

with output like below, labeling each two-month bucket by its starting month (sample image omitted).
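
To see the bucketing concretely, here is a minimal sketch with made-up rows; months 1–2 collapse to January, months 3–4 to March, and so on:

#standardSQL
WITH `data_table` AS (
  SELECT '2023-01-15' date, 'US' Market, 10 Rev UNION ALL
  SELECT '2023-02-20', 'US', 15 UNION ALL
  SELECT '2023-03-05', 'US', 7
)
SELECT FORMAT_DATE('%b %Y',
         DATE(EXTRACT(YEAR FROM CAST(date AS DATE)),
              2 * DIV(EXTRACT(MONTH FROM CAST(date AS DATE)) - 1, 2) + 1, 1)
       ) AS Date,
       Market,
       SUM(Rev) AS Rev
FROM `data_table`
GROUP BY 1, 2
-- returns: Jan 2023 | US | 25, and Mar 2023 | US | 7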

Google BigQuery Standard SQL - Sales Sliding Window

A tumbling window is easier to do; it is just a regular GROUP BY at hourly intervals:

SELECT TIMESTAMP_TRUNC(orderPlacedTimestamp, HOUR),
       SUM(orderTotals.grandTotalNet)
FROM T
GROUP BY 1

For a sliding window, I would first normalize timestamps to 5-minute intervals using the following steps:

  1. TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE) to get to a minute boundary
  2. UNIX_SECONDS to convert to seconds since the epoch
  3. Divide by 60 to get minutes
  4. Divide by 5 to get the 5-minute interval
  5. Floor to an integer with DIV (a plain CAST to INT64 would round halfway cases away from zero and push some minutes into the wrong bucket):

DIV(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE)), 60 * 5)
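
As a quick sanity check of the date math (a sketch over one made-up timestamp), 00:07:30 should land in the bucket that starts at 00:05:00:

#standardSQL
SELECT ts,
       TIMESTAMP_SECONDS(DIV(UNIX_SECONDS(TIMESTAMP_TRUNC(ts, MINUTE)), 60 * 5) * 60 * 5) AS bucket_start
FROM UNNEST([TIMESTAMP '2021-01-01 00:07:30+00']) ts
-- bucket_start: 2021-01-01 00:05:00 UTC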

Now you can use a standard OVER() clause to get a 10-minute window, which means 2 such intervals at a time; to get the window start time, use the FIRST_VALUE analytic function:

SELECT orderId,
       TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min * 5 * 60) OVER(w)) startWindowTime,
       TIMESTAMP_ADD(TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min * 5 * 60) OVER(w)),
                     INTERVAL 10 MINUTE) endWindowTime,
       SUM(grandTotalNet) OVER(w)
FROM (
  SELECT *,
         DIV(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE)), 60 * 5) AS ts_5min
  FROM t
)
WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)

