BigQuery SQL for 28-day sliding window aggregate (without writing 28 lines of SQL)

The BigQuery documentation doesn't do a good job of explaining the full power of the window functions the tool supports, because it doesn't specify what expressions can appear after ROWS or RANGE. It actually supports the SQL:2003 standard for window functions, which is documented in other places on the web.

That means you can get the effect you want with a single window function. The bound is 27 because that's how many rows before the current one are included in the sum, giving a 28-row window in total.

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date
                        ROWS BETWEEN 27 PRECEDING AND CURRENT ROW),
       user,
       date
FROM user_spend;

A RANGE bound can also be extremely useful. If your table is missing dates for some user, then 27 PRECEDING rows would reach back more than 27 days, but RANGE produces a window based on the date values themselves. In the following query, the date field is a BigQuery TIMESTAMP, so the range is specified in microseconds. Whenever you do date math like this in BigQuery, I'd advise testing it thoroughly to make sure it gives the expected answer.

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date
                        RANGE BETWEEN 27 * 24 * 60 * 60 * 1000000 PRECEDING AND CURRENT ROW),
       user,
       date
FROM user_spend;
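
The two queries above are legacy SQL. In today's Standard SQL, RANGE requires a numeric ordering key, so a rough equivalent is to order by UNIX_DATE. This is a sketch, assuming a table your_dataset.user_spend where date is a DATE column (for a TIMESTAMP you could order by UNIX_MICROS instead); the spend_28_days alias is mine:

#standardSQL
SELECT
  spend,
  SUM(spend) OVER (
    PARTITION BY user
    ORDER BY UNIX_DATE(date)  -- days since epoch; RANGE needs a numeric key
    RANGE BETWEEN 27 PRECEDING AND CURRENT ROW
  ) AS spend_28_days,
  user,
  date
FROM `your_dataset.user_spend`;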

BigQuery SQL for sliding window aggregate

How about this:

SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
  SELECT changes,
         LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
         LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
         LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
         login,
         week
  FROM (
    SELECT SUM(payload_pull_request_changed_files) changes,
           UTC_USEC_TO_WEEK(created_at, 1) week,
           actor_attributes_login login
    FROM [publicdata:samples.github_timeline]
    WHERE payload_pull_request_changed_files > 0
    GROUP BY week, login
  )
)
HAVING changes28days > 0

For each user, this counts how many changes they submitted per week. Then with LAG() we peek at previous rows: how many changes they submitted 1, 2, and 3 weeks earlier. Adding those 4 weekly values gives how many changes were submitted in the last 28 days. (Note that LAG() returns NULL for a user's first three weeks, which makes the sum NULL there; the HAVING clause filters those rows out.)

Now you can wrap everything in a new query to filter for users with changes > X, and count them.
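
For example, here is a sketch of that wrapper with a hypothetical threshold of X = 10; the inner query is the one above, except the timestamp column now gets an explicit week alias so the outer query can group by it:

SELECT week, COUNT(login) num_users
FROM (
  SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week) week
  FROM (
    SELECT changes,
           LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
           LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
           LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
           login,
           week
    FROM (
      SELECT SUM(payload_pull_request_changed_files) changes,
             UTC_USEC_TO_WEEK(created_at, 1) week,
             actor_attributes_login login
      FROM [publicdata:samples.github_timeline]
      WHERE payload_pull_request_changed_files > 0
      GROUP BY week, login
    )
  )
)
WHERE changes28days > 10  -- hypothetical threshold: more than 10 changes in 28 days
GROUP BY week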

BigQuery: Computing aggregate over window of time for each person

Here is an efficient, succinct way to do it that exploits the ordered structure of timestamps.

SELECT
  user,
  MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT
    user,
    COUNT(*) OVER (PARTITION BY user ORDER BY timestamp
                   RANGE BETWEEN 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) AS per_hour,
    timestamp
  FROM [dataset_example_in_question_user_timestamps]
)
GROUP BY user
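
That query is legacy SQL (RANGE over a raw TIMESTAMP in microseconds). A Standard SQL sketch of the same idea, assuming a table your_dataset.user_timestamps with user and timestamp columns:

#standardSQL
SELECT
  user,
  MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT
    user,
    COUNT(*) OVER (
      PARTITION BY user
      ORDER BY UNIX_SECONDS(timestamp)  -- numeric key for RANGE
      RANGE BETWEEN 3599 PRECEDING AND CURRENT ROW  -- inclusive ~1-hour window
    ) AS per_hour,
    timestamp
  FROM `your_dataset.user_timestamps`
)
GROUP BY user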

Window function OVER clause

See the answer by user 'sprocket' at https://stackoverflow.com/a/27574474:

ROWS and RANGE let the window function look at a user-defined window of rows, for example the previous 27 rows to compute a moving aggregate such as this 28-row moving sum:

SELECT spend,
       SUM(spend) OVER (PARTITION BY user ORDER BY date
                        ROWS BETWEEN 27 PRECEDING AND CURRENT ROW),
       user,
       date
FROM user_spend;

BigQuery: is it possible to execute another query inside a UDF?

I need to produce a report that tells me, for each day, the number of unique users in the last 30 days including that day.

The query below should do this:

SELECT
  calendar_day,
  EXACT_COUNT_DISTINCT(userID) AS unique_users
FROM (
  SELECT calendar_day, userID
  FROM YourTable
  CROSS JOIN (
    SELECT DATE(DATE_ADD('2016-08-08', pos - 1, "DAY")) AS calendar_day
    FROM (
      SELECT ROW_NUMBER() OVER() AS pos, *
      FROM (FLATTEN((
        SELECT SPLIT(RPAD('', 1 + DATEDIFF('2016-09-08', '2016-08-08'), '.'), '') AS h
        FROM (SELECT NULL)), h
      ))
    )
  ) AS calendar
  WHERE DATEDIFF(calendar_day, dt) BETWEEN 0 AND 29
)
GROUP BY calendar_day
ORDER BY calendar_day DESC

It assumes YourTable has userID and dt fields, for example:

dt          userID
2016-09-08  1
2016-09-08  2
...

And you can control:

- the reporting date range, by changing 2016-08-08 and 2016-09-08 respectively
- the aggregation window size, by changing the 29 in BETWEEN 0 AND 29
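
The SPLIT/FLATTEN trick above exists only to generate a calendar of days, which legacy SQL has no built-in way to do. In Standard SQL, GENERATE_DATE_ARRAY does this directly, and COUNT(DISTINCT ...) is exact, so EXACT_COUNT_DISTINCT isn't needed. A sketch of the equivalent query, assuming dt is a DATE column in your_dataset.YourTable:

#standardSQL
SELECT
  calendar_day,
  COUNT(DISTINCT userID) AS unique_users
FROM `your_dataset.YourTable`
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(DATE '2016-08-08', DATE '2016-09-08')) AS calendar_day
WHERE DATE_DIFF(calendar_day, dt, DAY) BETWEEN 0 AND 29  -- assumes dt is a DATE
GROUP BY calendar_day
ORDER BY calendar_day DESC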

BigQuery aggregate on a daily basis

Below is for BigQuery Standard SQL:

#standardSQL
WITH dates AS (
  SELECT DISTINCT DATE(UpdatedDate) UpdatedDay
  FROM `project.dataset.test`
),
qualified AS (
  SELECT DATE(UpdatedDate) UpdatedDay, InvNo,
         ARRAY_AGG(amount ORDER BY UpdatedDate DESC LIMIT 1)[SAFE_OFFSET(0)] amount
  FROM `project.dataset.test`
  GROUP BY UpdatedDay, InvNo
)
SELECT UpdatedDay, SUM(amount) amount
FROM (
  SELECT d.UpdatedDay UpdatedDay, InvNo,
         ARRAY_AGG(amount ORDER BY q.UpdatedDay DESC LIMIT 1)[SAFE_OFFSET(0)] amount
  FROM dates d
  JOIN qualified q
    ON q.UpdatedDay <= d.UpdatedDay
  GROUP BY UpdatedDay, InvNo
)
GROUP BY UpdatedDay
-- ORDER BY UpdatedDay

You can test / play with this using the dummy data from your question:

#standardSQL
WITH `project.dataset.test` AS (
  SELECT TIMESTAMP '2017-10-01 01:00:00' UpdatedDate, 'T001' InvNo, 100 amount UNION ALL
  SELECT TIMESTAMP '2017-10-02 01:00:00', 'T002', 200 UNION ALL
  SELECT TIMESTAMP '2017-10-02 02:00:00', 'T002', 300 UNION ALL
  SELECT TIMESTAMP '2017-10-04 01:00:00', 'T003', 400 UNION ALL
  SELECT TIMESTAMP '2017-10-05 01:00:00', 'T002', 500 UNION ALL
  SELECT TIMESTAMP '2017-10-05 02:00:00', 'T003', 500
),
dates AS (
  SELECT DISTINCT DATE(UpdatedDate) UpdatedDay
  FROM `project.dataset.test`
),
qualified AS (
  SELECT DATE(UpdatedDate) UpdatedDay, InvNo,
         ARRAY_AGG(amount ORDER BY UpdatedDate DESC LIMIT 1)[SAFE_OFFSET(0)] amount
  FROM `project.dataset.test`
  GROUP BY UpdatedDay, InvNo
)
SELECT UpdatedDay, SUM(amount) amount
FROM (
  SELECT d.UpdatedDay UpdatedDay, InvNo,
         ARRAY_AGG(amount ORDER BY q.UpdatedDay DESC LIMIT 1)[SAFE_OFFSET(0)] amount
  FROM dates d
  JOIN qualified q
    ON q.UpdatedDay <= d.UpdatedDay
  GROUP BY UpdatedDay, InvNo
)
GROUP BY UpdatedDay
ORDER BY UpdatedDay

The result is as expected:

UpdatedDay   amount
2017-10-01   100
2017-10-02   400
2017-10-04   800
2017-10-05   1100

Moving Distinct Count in BigQuery (SQL syntax)

Consider the approach below:

select store, date,
       (select count(distinct item)
        from t.items item
       ) distinct_items_count
from (
  select store, date, any_value(items) items
  from (
    select store, date,
           array_agg(item_code) over(partition by store order by unix_date(date)
                                     range between 13 preceding and current row) items
    from your_table
  )
  group by store, date
) t
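
A quick way to try this out, in the same spirit as the test harness in the previous answer. This is a sketch with made-up dummy data; it assumes your_table has a store column, a DATE column named date, and an item_code column:

with your_table as (
  select 'A' store, date '2023-01-01' date, 'x' item_code union all
  select 'A', date '2023-01-05', 'y' union all
  select 'A', date '2023-01-20', 'x'
)
select store, date,
       (select count(distinct item)
        from t.items item
       ) distinct_items_count
from (
  select store, date, any_value(items) items
  from (
    select store, date,
           array_agg(item_code) over(partition by store order by unix_date(date)
                                     range between 13 preceding and current row) items
    from your_table
  )
  group by store, date
) t
-- with this dummy data: 2023-01-01 -> 1, 2023-01-05 -> 2 (x and y fall within
-- the same 14-day window), 2023-01-20 -> 1 (x only; the earlier rows aged out)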

