BigQuery: How to Group and Count Rows Within Rolling Timestamp Window

BigQuery: how to group and count rows within rolling timestamp window?

Below is for BigQuery Standard SQL (see Enabling Standard SQL).

I am using ts as the field name (instead of timestamp as in your example) and assume this field is of TIMESTAMP data type.

WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
SELECT
  url, event_id, day, events,
  SUM(events)
    OVER(PARTITION BY url, event_id ORDER BY sec
         RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
    ) AS rolling3daysEvents
FROM dailyAggregations
-- ORDER BY url, event_id, day

The value 259200 is 3 × 24 × 3600 seconds, i.e. a 3-day range, so you can set whatever rolling period you actually need.
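For illustration, the frame size in seconds for any N-day window is just N * 24 * 3600 (the day counts below are only examples):

SELECT days, days * 24 * 3600 AS range_seconds
FROM UNNEST([3, 7, 30]) AS days
-- 3 -> 259200, 7 -> 604800, 30 -> 2592000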

BigQuery: how to perform rolling timestamp window group count that produces row for each day

WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id, sec
),
calendar AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
)
SELECT
  c.day, url, event_id, events,
  SUM(events)
    OVER(PARTITION BY url, event_id ORDER BY sec
         RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
    ) AS rolling4daysEvents
FROM calendar AS c
LEFT JOIN dailyAggregations AS a
ON a.day = c.day
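The calendar CTE is what guarantees a row for each day: GENERATE_DATE_ARRAY produces one element per day, inclusive of both endpoints, and the LEFT JOIN keeps calendar days that have no events. That piece in isolation:

SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-28', '2016-09-01')) AS day
-- 2016-08-28, 2016-08-29, 2016-08-30, 2016-08-31, 2016-09-01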

Counting row entries in BigQuery table grouped on a time interval using SQL

Consider the approach below

select Product,
  timestamp_trunc(Timestamp, minute) Timestamp,
  count(1) `Count`
from `mycompany.engagement.product_orders`
group by 1, 2

If applied to the sample data in your question, the output has one row per product per minute with the corresponding count.
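For instance, a self-contained sketch with made-up sample rows in place of `mycompany.engagement.product_orders`:

with product_orders as (
  select 'P1' as Product, timestamp '2021-12-01 10:04:12' as Timestamp union all
  select 'P1', timestamp '2021-12-01 10:04:55' union all
  select 'P2', timestamp '2021-12-01 10:05:01'
)
select Product,
  timestamp_trunc(Timestamp, minute) Timestamp,
  count(1) `Count`
from product_orders
group by 1, 2
-- P1 | 2021-12-01 10:04:00 UTC | 2
-- P2 | 2021-12-01 10:05:00 UTC | 1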

Group By Timestamp_Trunc including empty rows with '0' count

Try generating an array of the hours needed, cross joining it with all the status codes, and left joining with your results:

with mytable as (
  select timestamp '2021-10-18 19:00:00' as hour, 200 as statusCode, 1234 as averageDurationMs, 25 as count union all
  select '2021-10-18 21:00:00', 500, 4978, 6015 union all
  select '2021-10-18 21:00:00', 404, 4987, 5984 union all
  select '2021-10-18 21:00:00', 200, 5048, 11971 union all
  select '2021-10-18 21:00:00', 401, 4976, 6030
)
select
  myhour,
  allCodes.statusCode,
  IFNULL(mytable.averageDurationMs, 0) as averageDurationMs,
  IFNULL(mytable.count, 0) as count
from
  UNNEST(GENERATE_TIMESTAMP_ARRAY(
    TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL 23 HOUR),
    TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR),
    INTERVAL 1 HOUR)) as myhour
CROSS JOIN
  (SELECT DISTINCT statusCode FROM mytable) as allCodes
LEFT JOIN mytable ON myhour = mytable.hour AND allCodes.statusCode = mytable.statusCode

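The UNNEST(GENERATE_TIMESTAMP_ARRAY(...)) part is what produces all 24 hourly rows, whether or not any data exists for them. In isolation, with fixed endpoints instead of CURRENT_TIMESTAMP() for reproducibility:

SELECT myhour
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
  TIMESTAMP '2021-10-18 00:00:00',
  TIMESTAMP '2021-10-18 03:00:00',
  INTERVAL 1 HOUR)) AS myhour
-- 4 rows: 00:00, 01:00, 02:00, 03:00 (both endpoints inclusive)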

How to take aggregations for a repeating window in BigQuery

Consider the approach below

select id, channel,
  min(t_date) as start_date,
  max(t_date) as end_date,
  count(1) as appearances
from (
  select *, countif(new_group) over (partition by id order by t_date) group_id
  from (
    select *, ifnull(channel != lag(channel) over win, true) new_group
    from temp
    window win as (partition by id order by t_date)
  )
)
group by id, channel, group_id

If applied to the sample data in your question, the output has one row per consecutive run of a channel for each id, with its start date, end date, and number of appearances.
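This is the classic gaps-and-islands technique: lag() flags rows where channel changes within an id (the first row of each id defaults to true via ifnull), and the running countif() turns those flags into a group number for each consecutive run. A runnable sketch with hypothetical rows in place of temp:

with temp as (
  select 1 as id, 'A' as channel, date '2021-01-01' as t_date union all
  select 1, 'A', date '2021-01-02' union all
  select 1, 'B', date '2021-01-03' union all
  select 1, 'A', date '2021-01-04'
)
select id, channel,
  min(t_date) as start_date,
  max(t_date) as end_date,
  count(1) as appearances
from (
  select *, countif(new_group) over (partition by id order by t_date) group_id
  from (
    select *, ifnull(channel != lag(channel) over win, true) new_group
    from temp
    window win as (partition by id order by t_date)
  )
)
group by id, channel, group_id
-- 1 | A | 2021-01-01 | 2021-01-02 | 2
-- 1 | B | 2021-01-03 | 2021-01-03 | 1
-- 1 | A | 2021-01-04 | 2021-01-04 | 1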

Get count of day types between two dates

Consider the approach below

with your_table as (
  select date
  from unnest(generate_date_array("2021-02-13", "2021-03-30")) AS date
)
select *
from your_table
pivot (count(*) for format_date('%a', date) in ('Mon','Tue','Wed','Thu','Fri','Sat','Sun'))

The output is a single row with one count column per weekday; for this date range that is Mon 7, Tue 7, Wed 6, Thu 6, Fri 6, Sat 7, Sun 7.

Or you can simply do

select
  format_date('%a', date) day_of_week,
  count(*) counts
from your_table
group by day_of_week

The output is the same counts, but as one row per day_of_week instead of one column per weekday.

First row for each group

How to get first visit row for each user and resource?

In the query you presented in the question, you should remove DESC from ORDER BY created_at DESC; otherwise it returns the last visit, not the first.

What is the best way to construct such query?

Another option would be to use ROW_NUMBER() as below

SELECT
  user_id,
  endpoint_id,
  created_at
FROM (
  SELECT
    user_id,
    endpoint_id,
    created_at,
    ROW_NUMBER() OVER(PARTITION BY user_id, endpoint_id ORDER BY created_at) AS first_created
  FROM [visits]
)
WHERE first_created = 1

... but this query will not work for a big amount of data

This really depends. A Resources Exceeded error can happen if the size of your user_id, endpoint_id partition is big enough, as ORDER BY requires all rows of a partition to be on the same node.

If this is the case for you, you can use the trick below.

Step 1 - using JOIN

SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
FROM [visits] AS tab1
INNER JOIN (
  SELECT user_id, endpoint_id, MIN(created_at) AS min_time
  FROM [visits]
  GROUP BY user_id, endpoint_id
) AS tab2
ON tab1.user_id = tab2.user_id
  AND tab1.endpoint_id = tab2.endpoint_id
  AND tab1.created_at = tab2.min_time

Step 2 - There is still something to take care of here: if you have duplicate entries for the same user / resource, you still need to extract only one row per partition. See the final query below:

SELECT user_id, endpoint_id, created_at
FROM (
  SELECT user_id, endpoint_id, created_at,
    ROW_NUMBER() OVER (PARTITION BY user_id, endpoint_id) AS rn
  FROM (
    SELECT tab1.user_id AS user_id, tab1.endpoint_id AS endpoint_id, tab1.created_at AS created_at
    FROM [visits] AS tab1
    INNER JOIN (
      SELECT user_id, endpoint_id, MIN(created_at) AS min_time
      FROM [visits]
      GROUP BY user_id, endpoint_id
    ) AS tab2
    ON tab1.user_id = tab2.user_id
      AND tab1.endpoint_id = tab2.endpoint_id
      AND tab1.created_at = tab2.min_time
  )
)
WHERE rn = 1

And of course there is the obvious and simplest case: if those three fields are the ONLY fields in the [visits] table

SELECT user_id, endpoint_id, MIN(created_at) AS created_at 
FROM [visits]
GROUP BY user_id, endpoint_id

BigQuery groupby, add row with count of 0 for groupby variable if no rows found

Below is for BigQuery Standard SQL

#standardSQL
SELECT team_id, game_id, this_zone, IFNULL(my_count, 0) AS my_count
FROM (
  SELECT DISTINCT
    rebounds.team_id,
    rebounds.game_id,
    this_zone
  FROM t1, UNNEST(['zone1', 'zone2', 'zone3', 'zone4']) this_zone
) A
LEFT JOIN (
  SELECT
    rebounds.team_id,
    rebounds.game_id,
    CASE
      WHEN distance < 4 THEN 'zone1'
      WHEN NOT some_bool THEN 'zone2'
      WHEN distance >= 4 AND distance2 < 12 THEN 'zone3'
      WHEN some_bool AND distance >= 4 THEN 'zone4'
    END AS this_zone,
    COUNT(*) AS my_count
  FROM t1
  GROUP BY 1, 2, 3
) B
USING (team_id, game_id, this_zone)

The above is generic enough and does not depend on how complex your logic is: you just generate all expected rows (sub-query A) and LEFT JOIN them with your original query - that's all!
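A minimal self-contained illustration of the pattern (the table and zone values here are made up): every expected combination appears in the output, with 0 where the raw data has no rows.

WITH t1 AS (
  SELECT 'T1' AS team_id, 'zone1' AS zone UNION ALL
  SELECT 'T1', 'zone1' UNION ALL
  SELECT 'T1', 'zone3'
)
SELECT team_id, this_zone, IFNULL(my_count, 0) AS my_count
FROM (
  SELECT DISTINCT team_id, this_zone
  FROM t1, UNNEST(['zone1', 'zone2', 'zone3', 'zone4']) this_zone
) A
LEFT JOIN (
  SELECT team_id, zone AS this_zone, COUNT(*) AS my_count
  FROM t1
  GROUP BY 1, 2
) B
USING (team_id, this_zone)
-- T1 | zone1 | 2
-- T1 | zone2 | 0
-- T1 | zone3 | 1
-- T1 | zone4 | 0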

BigQuery: Computing aggregate over window of time for each person

Here is an efficient, succinct way to do it that exploits the ordered structure of timestamps.

SELECT
  user,
  MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT
    user,
    COUNT(*) OVER (PARTITION BY user ORDER BY timestamp RANGE BETWEEN 60 * 60 * 1000000 PRECEDING AND CURRENT ROW) as per_hour,
    timestamp
  FROM [dataset_example_in_question_user_timestamps]
)
GROUP BY user
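Note this query is legacy SQL (the [bracketed] table name, and the RANGE expressed in microseconds). A hypothetical Standard SQL translation, assuming timestamp is a TIMESTAMP column, orders by UNIX_MICROS so the one-hour frame can still be expressed numerically:

SELECT
  user,
  MAX(per_hour) AS max_event_per_hour
FROM (
  SELECT
    user,
    COUNT(*) OVER (
      PARTITION BY user
      ORDER BY UNIX_MICROS(timestamp)
      -- 3600000000 microseconds = 60 * 60 * 1000000 = 1 hour
      RANGE BETWEEN 3600000000 PRECEDING AND CURRENT ROW) AS per_hour
  FROM `yourDataset.user_timestamps`  -- hypothetical table name
)
GROUP BY user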

better way to do a rolling aggregation in big query?

The first limitation is you need to hardcode the value 259200 (3 days); you're unable to input a calculation such as ((3600 * 24) * 3)

Below uses just 3 days directly:

#standardSQL
WITH weekly_agg AS (
  SELECT
    *,
    DATE_DIFF(event_date, '2000-01-01', DAY) AS day
  FROM `test.window_test`
  ORDER BY event_date
)
SELECT
  country,
  event_date,
  SUM(value) OVER(PARTITION BY country ORDER BY day RANGE BETWEEN 3 PRECEDING AND CURRENT ROW) AS rolling
FROM weekly_agg
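The DATE_DIFF(event_date, '2000-01-01', DAY) expression converts each date to an integer day index (days since an arbitrary anchor date), which is what lets the RANGE frame count in whole days:

SELECT DATE_DIFF(DATE '2000-01-04', DATE '2000-01-01', DAY) AS day
-- 3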

is there a way of using dates instead of range between numbers?

If you use dates instead of a range, that would be different logic (not a rolling aggregation) - something like simple grouping, for example:

SELECT
  country,
  event_date,
  SUM(value)
FROM weekly_agg
WHERE event_date BETWEEN <date1> AND <date2>
GROUP BY country, event_date

but that is most likely not what you want ...


