BigQuery SQL for sliding window aggregate
How about this:
SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
  SELECT changes,
    LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
    LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
    LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
    login,
    week
  FROM (
    SELECT SUM(payload_pull_request_changed_files) changes,
      UTC_USEC_TO_WEEK(created_at, 1) week,
      actor_attributes_login login
    FROM [publicdata:samples.github_timeline]
    WHERE payload_pull_request_changed_files > 0
    GROUP BY week, login
  )
)
HAVING changes28days > 0
For each user it counts how many changes they submitted per week. Then with LAG() we can peek at the previous rows: how many changes they submitted 1, 2, and 3 weeks earlier. Adding those 4 weekly values gives the number of changes submitted over the last 28 days.
Now you can wrap everything in a new query to filter users with changes>X, and count them.
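The same 4-week rolling sum can be sketched in plain Python (the user name and weekly counts here are hypothetical sample data, and weeks are assumed consecutive, which the SQL version implicitly assumes too):

```python
# Rolling 28-day (4-week) sum per user, mirroring changes + LAG(1..3).
# Hypothetical per-user weekly change counts, one entry per consecutive week.
weekly = {
    "alice": [("2014-W01", 3), ("2014-W02", 5), ("2014-W03", 0),
              ("2014-W04", 2), ("2014-W05", 7)],
}

results = []
for user, rows in weekly.items():
    changes = [c for _, c in rows]
    for i, (week, _) in enumerate(rows):
        # The SQL's HAVING drops rows where any LAG is NULL, i.e. the
        # first three weeks of each user, so only emit full windows.
        if i >= 3:
            results.append((user, week, sum(changes[i - 3:i + 1])))
```

For "alice" this emits sums only from the fourth week on, just as the SQL discards rows whose 28-day sum is NULL.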
BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)
The BigQuery documentation doesn't do a good job of explaining the complexity of window functions that the tool supports, because it doesn't specify what expressions can appear after ROWS or RANGE. It actually supports the SQL 2003 standard for window functions, which is documented in other places on the web.
That means you can get the effect you want with a single window function. The bound is 27 PRECEDING because that is how many rows before the current one to include in the sum: 27 preceding rows plus the current row cover 28 days.
SELECT spend,
SUM(spend) OVER (PARTITION BY user ORDER BY date ROWS BETWEEN 27 PRECEDING AND CURRENT ROW),
user,
date
FROM user_spend;
A RANGE bound can also be extremely useful. If your table were missing dates for some user, then 27 PRECEDING rows would reach back more than 27 days, but RANGE builds the window from the date values themselves. In the following query, the date field is a BigQuery TIMESTAMP and the range is specified in microseconds. I'd advise that whenever you do date math like this in BigQuery, you test it thoroughly to make sure it's giving you the expected answer.
SELECT spend,
SUM(spend) OVER (PARTITION BY user ORDER BY date RANGE BETWEEN 27 * 24 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW),
user,
date
FROM user_spend;
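The difference between the two bounds can be sketched in Python over a spend series with a gap (dates and amounts here are hypothetical):

```python
# ROWS BETWEEN 27 PRECEDING vs a date-based RANGE, over per-day spend
# with a long gap between January and March.
from datetime import date, timedelta

spend = {date(2023, 1, 1): 10, date(2023, 1, 2): 20, date(2023, 3, 1): 5}
days = sorted(spend)

# ROWS-style: the previous 27 rows, no matter how far back their dates are.
rows_sum = {d: sum(spend[x] for x in days[max(0, i - 27):i + 1])
            for i, d in enumerate(days)}

# RANGE-style: only rows whose date falls within the preceding 27 days.
range_sum = {d: sum(v for x, v in spend.items()
                    if timedelta(0) <= d - x <= timedelta(days=27))
             for d in days}
```

On March 1 the ROWS variant still sums all three rows (35), while the RANGE variant correctly sees only that day's spend (5).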
Sliding window aggregate in BigQuery: 15-minute aggregation
Below is for BigQuery Standard SQL
Note: you provided a simplified example of your data and the query below follows it - so instead of 15-minute aggregation it uses 2-second aggregation, which makes it easy for you to test / play with. It can easily be adjusted to 15 minutes by changing SECOND to MINUTE in 3 places and 2 to 15 in 3 places. Also, this example uses the TIME data type for the time field, as in your example, so it is limited to just a 24-hour period - most likely your real data has DATETIME or TIMESTAMP. In that case you will also need to replace all TIME_* functions with the respective DATETIME_* or TIMESTAMP_* functions.
So, finally - the query is:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT TIME '00:00:00' time, 31 viewCount UNION ALL
  SELECT TIME '00:00:01', 44 UNION ALL
  SELECT TIME '00:00:02', 78 UNION ALL
  SELECT TIME '00:00:03', 71 UNION ALL
  SELECT TIME '00:00:04', 72 UNION ALL
  SELECT TIME '00:00:05', 73 UNION ALL
  SELECT TIME '00:00:06', 64 UNION ALL
  SELECT TIME '00:00:07', 70
), period AS (
  SELECT MIN(time) min_time, MAX(time) max_time, TIME_DIFF(MAX(time), MIN(time), SECOND) diff
  FROM `project.dataset.table`
), checkpoints AS (
  SELECT TIME_ADD(min_time, INTERVAL step SECOND) start_time,
         TIME_ADD(min_time, INTERVAL step + 2 SECOND) end_time
  FROM period, UNNEST(GENERATE_ARRAY(0, diff + 2, 2)) step
)
SELECT start_time time, SUM(viewCount) viewCount
FROM checkpoints c
JOIN `project.dataset.table` t
ON t.time >= c.start_time AND t.time < c.end_time
GROUP BY start_time
ORDER BY start_time
and the result is:
Row time viewCount
1 00:00:00 75
2 00:00:02 149
3 00:00:04 145
4 00:00:06 134
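The same bucketing can be reproduced in plain Python over the sample rows, which is a quick way to sanity-check the SQL result:

```python
# 2-second bucketing of the sample (second, viewCount) rows, mirroring
# the checkpoints CTE and the range join in the query above.
rows = list(enumerate([31, 44, 78, 71, 72, 73, 64, 70]))  # (second, viewCount)

min_s, max_s = rows[0][0], rows[-1][0]
result = []
for step in range(0, max_s - min_s + 3, 2):   # GENERATE_ARRAY(0, diff + 2, 2)
    start = min_s + step
    bucket = sum(v for s, v in rows if start <= s < start + 2)
    if bucket:                                # empty checkpoints vanish in the JOIN
        result.append(("00:00:%02d" % start, bucket))
```

This yields the same four buckets as the result table: 75, 149, 145, 134.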
Sliding window aggregate for year-week in BigQuery
Below is for BigQuery Standard SQL
#standardSQL
WITH weeks AS (
  SELECT 100 * year + week year_week
  FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
    UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
), temp AS (
  SELECT i.run_id, w.year_week, d.year_week week2, value
  FROM weeks w
  CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
  LEFT JOIN `project.dataset.table` d
  USING(year_week, run_id)
)
SELECT * FROM (
  SELECT run_id, year_week,
    SUM(value) OVER(win) aggregate_sum
  FROM temp
  WINDOW win AS (
    PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING
  )
)
WHERE NOT aggregate_sum IS NULL
You can test / play with the above using the dummy data from your question, as below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT '001' run_id, 201451 year_week, 5 value UNION ALL
  SELECT '001', 201452, 8 UNION ALL
  SELECT '001', 201501, 1 UNION ALL
  SELECT '001', 201505, 5
), weeks AS (
  SELECT 100 * year + week year_week
  FROM UNNEST([2013, 2014, 2015, 2016, 2017]) year,
    UNNEST(GENERATE_ARRAY(1, IF(EXTRACT(ISOWEEK FROM DATE(1 + year, 1, 1)) = 1, 52, 53))) week
), temp AS (
  SELECT i.run_id, w.year_week, d.year_week week2, value
  FROM weeks w
  CROSS JOIN (SELECT DISTINCT run_id FROM `project.dataset.table`) i
  LEFT JOIN `project.dataset.table` d
  USING(year_week, run_id)
)
SELECT * FROM (
  SELECT run_id, year_week,
    SUM(value) OVER(win) aggregate_sum
  FROM temp
  WINDOW win AS (
    PARTITION BY run_id ORDER BY year_week ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING
  )
)
WHERE NOT aggregate_sum IS NULL
-- ORDER BY run_id, year_week
with the result:
Row run_id year_week aggregate_sum
1 001 201447 5
2 001 201448 13
3 001 201449 14
4 001 201450 14
5 001 201451 14
6 001 201452 9
7 001 201501 6
8 001 201502 5
9 001 201503 5
10 001 201504 5
11 001 201505 5
12 003 201348 8
13 003 201349 9
14 003 201350 9
15 003 201351 9
16 003 201352 9
17 003 201401 6
18 003 201402 5
19 003 201403 5
20 003 201404 5
21 003 201405 5
Note: this is written for "I only have 5 years, 2013, 2014, 2015, 2016 and 2017" from the question, but it can easily be extended in the weeks CTE.
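The dense week grid and the forward-looking sum can be sketched in Python over the dummy data above (only 2014-2015 are generated here for brevity; the ISO-week count uses the same test as the IF(EXTRACT(ISOWEEK ...)) expression in the SQL):

```python
# Dense year-week grid plus a CURRENT ROW .. 4 FOLLOWING sum in plain Python.
from datetime import date

def weeks_in(year):
    # 52 ISO weeks if Jan 1 of the following year already lands in week 1,
    # otherwise 53 - the same trick as the IF(EXTRACT(ISOWEEK ...)) in the SQL.
    return 52 if date(year + 1, 1, 1).isocalendar()[1] == 1 else 53

grid = [100 * y + w for y in (2014, 2015) for w in range(1, weeks_in(y) + 1)]
values = {201451: 5, 201452: 8, 201501: 1, 201505: 5}

sums = {}
for i, yw in enumerate(grid):
    s = sum(values.get(x, 0) for x in grid[i:i + 5])
    if s:  # zero here plays the role of NULL in the SQL's final filter
        sums[yw] = s
```

The sums for run_id '001' match the result table: 5 at 201447, 14 at 201449-201451, 6 at 201501, and so on.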
BigQuery: Computing aggregate over window of time for each person
Here is an efficient, succinct way to do it that exploits the ordered structure of timestamps.
SELECT
user,
MAX(per_hour) AS max_event_per_hour
FROM
(
SELECT
user,
COUNT(*) OVER (PARTITION BY user ORDER BY timestamp RANGE BETWEEN 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) as per_hour,
timestamp
FROM
[dataset_example_in_question_user_timestamps]
)
GROUP BY user
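The trailing-hour count behind per_hour can be sketched as a two-pointer pass over sorted epoch-second timestamps (the sample events below are hypothetical):

```python
# Max events within any trailing hour per user, mirroring the
# RANGE 3600s PRECEDING AND CURRENT ROW count. Timestamps are epoch seconds.
events = {
    "u1": [0, 100, 200, 4000, 4100],
    "u2": [0, 3600, 7200],
}

max_per_hour = {}
for user, ts in events.items():
    ts = sorted(ts)
    best, lo = 0, 0
    for hi, t in enumerate(ts):
        while t - ts[lo] > 3600:  # drop events older than one trailing hour
            lo += 1
        best = max(best, hi - lo + 1)
    max_per_hour[user] = best
```

Note the boundary is inclusive (an event exactly one hour earlier still counts), matching RANGE ... PRECEDING AND CURRENT ROW.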
BigQuery - Aggregate by 2 months window
Consider the approach below.
SELECT FORMAT_DATE('%b %Y',
    DATE(EXTRACT(YEAR FROM CAST(date AS DATE)), 2 * DIV(EXTRACT(MONTH FROM CAST(date AS DATE)) - 1, 2) + 1, 1)
  ) AS Date,
  Market,
  SUM(Rev) AS Rev
FROM `data_table`
GROUP BY 1, 2
which buckets revenue into two-month periods labeled by their starting month (Jan, Mar, May, ... of each year).
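The bucketing arithmetic in that query reduces to mapping month m to 2 * DIV(m - 1, 2) + 1, which can be checked in a couple of lines of Python:

```python
# Start month of the two-month bucket containing month m (1..12),
# mirroring 2 * DIV(EXTRACT(MONTH ...) - 1, 2) + 1 in the SQL.
def bucket_start_month(m):
    return 2 * ((m - 1) // 2) + 1

buckets = [bucket_start_month(m) for m in range(1, 13)]
```

Jan/Feb map to 1, Mar/Apr to 3, and so on through Nov/Dec mapping to 11.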
Google BigQuery Standard SQL - Sales Sliding Window
Tumbling window is easier to do - it is just a regular GROUP BY at hourly intervals:
SELECT
TIMESTAMP_TRUNC(orderPlacedTimestamp, HOUR),
SUM(orderTotals.grandTotalNet)
FROM T
GROUP BY 1
For a sliding window, I would first normalize timestamps to 5-minute intervals using the following steps:
- TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE) to get to a minute boundary
- UNIX_SECONDS to convert to seconds since epoch
- Integer-divide by 60 to get minutes, then by 5 to get the 5-minute interval; use DIV, which truncates, rather than CAST(... AS INT64), which rounds and could shift values into the next bucket:
DIV(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE)), 300)
Now you can use a standard OVER() clause to get a 10-minute window, which means 2 such intervals at a time, and to get the window start time, use the FIRST_VALUE analytic function:
SELECT
  orderId,
  TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min * 5 * 60) OVER(w)) startWindowTime,
  TIMESTAMP_ADD(TIMESTAMP_SECONDS(FIRST_VALUE(ts_5min * 5 * 60) OVER(w)),
    INTERVAL 10 MINUTE) endWindowTime,
  SUM(grandTotalNet) OVER(w)
FROM (
  SELECT
    *,
    DIV(UNIX_SECONDS(TIMESTAMP_TRUNC(orderPlacedTimestamp, MINUTE)), 300) AS ts_5min
  FROM t
)
WINDOW w AS (ORDER BY ts_5min RANGE BETWEEN 1 PRECEDING AND CURRENT ROW)
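The bucket index and the two-bucket window can be sketched in Python over hypothetical (epoch_seconds, grandTotalNet) orders:

```python
# 5-minute bucketing plus a two-bucket (10-minute) sliding sum.
from collections import defaultdict

# Hypothetical orders: (epoch seconds, order total).
orders = [(0, 10.0), (290, 5.0), (310, 7.0), (650, 2.0)]

per_bucket = defaultdict(float)
for ts, total in orders:
    per_bucket[ts // 300] += total  # truncating division into 5-minute buckets

windows = {}
for b in sorted(per_bucket):
    # current bucket plus the immediately preceding one = a 10-minute window,
    # matching RANGE BETWEEN 1 PRECEDING AND CURRENT ROW over the bucket index
    windows[b * 300] = per_bucket[b] + per_bucket.get(b - 1, 0.0)
```

Here the first two orders land in bucket 0, so the window starting at second 300 sums its own bucket (7.0) plus the preceding one (15.0).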