Query for Count of Distinct Values in a Rolling Date Range

Query for count of distinct values in a rolling date range and country

The reason that country doesn't exist is that we can only SELECT columns from tables that appear in the query's FROM clause. The nested subquery selects from tbl, but tbl is not visible to the main, outer query, which selects only from the generated table g. Since g has only a date column, that is the only column the outer query can reference directly.

Another issue with the query is that the COUNT is not factoring in countries.

For this I'd use a basic join to fetch every row for every date range, then do a COUNT(DISTINCT ...) for every date + country. You can use an INNER JOIN to remove days that have no entries, or a LEFT OUTER JOIN to return rows of {date, NULL, NULL} when there are no entries for that date range. Something like:

SELECT g.date
     , tbl.country
     , COUNT(DISTINCT tbl.email) AS emails
FROM (SELECT generate_series(timestamp '2012-01-01'
                           , timestamp '2012-01-06'
                           , interval '1 day')::date) AS g(date)
INNER JOIN tbl ON tbl.date BETWEEN g.date - 2 AND g.date
GROUP BY 1, 2

How can I count distinct values of certain attributes within a date range?

You need a LEFT JOIN of county to fruit (so that counties that do not meet the conditions still appear, with a count of 0) and then group by county.

Also, the condition about the date should be placed in the ON clause:

SELECT c.county_name,
       COUNT(DISTINCT f.fruit_code) AS total
FROM county c
LEFT JOIN fruit f
  ON f.county_code = c.county_code
 AND f.date BETWEEN '2020-01-01' AND '2021-01-01'
GROUP BY c.county_code;

I assume that county_code is the primary key of county.

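A quick way to see why the date filter belongs in the ON clause is to run the same shape against SQLite from Python (table contents are invented): moving the BETWEEN condition to a WHERE clause would drop the Sussex row entirely, while keeping it in ON preserves the row with a 0 count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE county (county_code INTEGER PRIMARY KEY, county_name TEXT);
CREATE TABLE fruit (county_code INTEGER, fruit_code INTEGER, date TEXT);
INSERT INTO county VALUES (1, 'Kent'), (2, 'Sussex');
INSERT INTO fruit VALUES (1, 10, '2020-05-01'), (1, 11, '2020-06-01'),
                         (2, 10, '2019-01-01');  -- outside the range
""")

rows = conn.execute("""
SELECT c.county_name, COUNT(DISTINCT f.fruit_code) AS total
FROM county c
LEFT JOIN fruit f
  ON f.county_code = c.county_code
 AND f.date BETWEEN '2020-01-01' AND '2021-01-01'  -- filter in ON, not WHERE
GROUP BY c.county_code
ORDER BY c.county_code
""").fetchall()
print(rows)
```

Sussex has no fruit rows inside the range, so its f columns come back NULL and COUNT(DISTINCT f.fruit_code) evaluates to 0.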

Count of unique values in a rolling date range for R

Here's something that works, taking advantage of the new non-equijoins feature of data.table.

dt[dt[ , .(date3 = date, date2 = date - 2, email)],
   on = .(date >= date2, date <= date3),
   allow.cartesian = TRUE
   ][ , .(count = uniqueN(email)),
      by = .(date = date + 2)]
#          date count
# 1: 2011-12-30     3
# 2: 2011-12-31     3
# 3: 2012-01-01     3
# 4: 2012-01-02     3
# 5: 2012-01-03     1
# 6: 2012-01-04     2

To be honest I'm a bit fuzzy on how this is working exactly, but the idea is to join dt to itself on date, matching any date that is between 2 days ago and today. The date = date + 2 cleanup is needed because in a data.table non-equi join the join columns in the result keep x's names but take i's values: date here actually holds date2 (i.e. date - 2), so adding 2 restores the window's end date.


Here's an approach using keys:

setkey(dt, date)

dt[ , .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
uniqueN(email), nomatch = 0L]), by = date]
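Both data.table versions compute, for each date, the number of unique emails seen in the window [date - 2, date]; a plain-Python rendering of that logic, with made-up sample rows:

```python
from datetime import date, timedelta

# (date, email) pairs -- invented sample data
rows = [
    (date(2011, 12, 30), "a"), (date(2011, 12, 30), "b"),
    (date(2011, 12, 31), "c"), (date(2012, 1, 2), "a"),
]
dates = sorted({d for d, _ in rows})
# for each observed date, count distinct emails in the trailing 3-day window
counts = {
    d: len({e for d2, e in rows if d - timedelta(days=2) <= d2 <= d})
    for d in dates
}
print(counts)
```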

BigQuery distinct count in rolling date range, with partition on column

Below is for BigQuery Standard SQL

#standardSQL
SELECT * EXCEPT(ids),
(SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
FROM `project.dataset.table`
WINDOW preceding_days AS (
PARTITION BY email
ORDER BY UNIX_DATE(DATE(day))
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
)
) t

You can test and play with the above using the sample data from your question, as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
SELECT 'user1@gmail.com' email, TIMESTAMP'2020-06-21 16:31:00 UTC' day, '...' other_cols, 0 id UNION ALL
SELECT 'user1@gmail.com', '2020-06-22 14:54:00 UTC', '...', 1 UNION ALL
SELECT 'user1@gmail.com', '2020-06-23 08:23:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-24 13:51:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-25 09:54:00 UTC', '...', 2 UNION ALL
SELECT 'user1@gmail.com', '2020-06-25 12:25:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-26 15:21:00 UTC', '...', 2 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-21 12:23:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-21 16:54:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-22 08:23:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-22 12:13:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-24 09:32:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-25 05:45:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-26 12:32:00 UTC', '...', 2 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-27 19:53:00 UTC', '...', 1
)
SELECT * EXCEPT(ids),
(SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
FROM `project.dataset.table`
WINDOW preceding_days AS (
PARTITION BY email
ORDER BY UNIX_DATE(DATE(day))
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
)
) t
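Note that the window is row-based (the current row plus up to 3 preceding rows within each email partition); the same bookkeeping can be mimicked in plain Python with invented sample ids:

```python
from collections import defaultdict

# (email, id) pairs in day order -- invented sample data
rows = [
    ("user1", 0), ("user1", 1), ("user1", 0), ("user1", 2),
    ("user2", 0), ("user2", 1), ("user2", 1),
]
history = defaultdict(list)   # per-email list of ids, like ARRAY_AGG OVER the partition
out = []
for email, id_ in rows:
    history[email].append(id_)
    window = history[email][-4:]          # current row + 3 preceding, like the ROWS frame
    out.append((email, id_, len(set(window))))   # COUNT(DISTINCT id) over the frame
print(out)
```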

rolling count of distinct users

Using functions from dplyr and tidyr, for the 1-day window case:

have %>% 
group_by(when) %>%
summarise(twoDayCount = n_distinct(user))

For larger windows:

window <- 2
have %>%
rowwise() %>%
mutate(when = list(when + lubridate::days(0:(window - 1)))) %>%
unnest(cols = when) %>%
group_by(when) %>%
summarise(twoDayCount = n_distinct(user))

Note that this method will give you rows for a few later dates (in this case Jan 08), which you might want to remove.

If performance is an issue for larger datasets, here is a much faster (but slightly less elegant) solution:

window <- 2
seq.Date(min(have$when), max(have$when), by = "day") %>%
  purrr::map(function(date) {
    have %>%
      filter(when <= date, when >= date - lubridate::days(window - 1)) %>%
      summarise(userCount = n_distinct(user)) %>%
      mutate(when = date)
  }) %>%
  bind_rows()
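Roughly the same filter-per-date loop in Python/pandas, assuming a frame with `when` and `user` columns like the question's (the data here is invented):

```python
import pandas as pd

window = 2
have = pd.DataFrame({
    "when": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-04"]),
    "user": ["a", "b", "a", "c"],
})
dates = pd.date_range(have["when"].min(), have["when"].max())
# for each calendar day, count distinct users in the trailing `window`-day span
counts = pd.DataFrame({
    "when": dates,
    "userCount": [
        have.loc[have["when"].between(d - pd.Timedelta(days=window - 1), d), "user"].nunique()
        for d in dates
    ],
})
print(counts)
```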

BigQuery for running count of distinct values with a dynamic date-range

Here is a way to achieve your desired results. Note that this query computes the year-month distinct counts in a separate subquery (a CTE), then joins it with the rolling 364-day-interval query.

with year_month_distincts as (
    select
        concat(
            cast(extract(year from order_date) as string),
            '-',
            cast(extract(month from order_date) as string)
        ) as year_month,
        count(distinct customer_id) as ym_distincts
    from customer_table
    group by 1
)

select x.order_date, x.ytd_distincts, y.ym_distincts
from (
    select
        a.order_date,
        (select count(distinct customer_id)
         from customer_table b
         where b.order_date between date_sub(a.order_date, interval 364 day) and a.order_date
        ) as ytd_distincts
    from orders a
    group by 1
) x
join year_month_distincts y on concat(
    cast(extract(year from x.order_date) as string),
    '-',
    cast(extract(month from x.order_date) as string)
) = y.year_month

How to have the rolling distinct count of each day for past three days in Oracle SQL?

What you seem to want is:

select date,
count(distinct customer) over (order by date rows between 2 preceding and current row)
from (select distinct trunc(datetime) as date, customer
from t
) t
group by date;

However, Oracle does not support window frames with count(distinct).

One rather brute force approach is a correlated subquery:

select date,
       (select count(distinct t2.customer)
        from t t2
        where t2.datetime >= t.date - 2
          and t2.datetime < t.date + 1
       ) as running_3
from (select distinct trunc(datetime) as date
      from t
     ) t;

This should have reasonable performance for a small number of dates. As the number of dates increases, the performance will degrade linearly.
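The correlated-subquery shape ports to most engines; here is a SQLite sketch driven from Python, with invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (datetime TEXT, customer TEXT);
INSERT INTO t VALUES
  ('2021-03-01 10:00', 'c1'), ('2021-03-01 11:00', 'c2'),
  ('2021-03-02 09:00', 'c1'), ('2021-03-04 12:00', 'c3'),
  ('2021-03-04 13:00', 'c4');
""")

rows = conn.execute("""
SELECT d.dt,
       (SELECT COUNT(DISTINCT t2.customer)      -- correlated subquery per date
        FROM t t2
        WHERE date(t2.datetime) BETWEEN date(d.dt, '-2 day') AND d.dt
       ) AS running_3
FROM (SELECT DISTINCT date(datetime) AS dt FROM t) d
ORDER BY d.dt
""").fetchall()
print(rows)
```

The subquery is re-evaluated for each distinct date, which is exactly why this approach gets expensive as the number of dates grows.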

MySQL query to retrieve DISTINCT COUNT between moving DATE period

This query should work for you:

SELECT
DATE_FORMAT(d1.Started, '%Y-%m-%d') AS Started,
COUNT(DISTINCT d2.UserID) Users
FROM
(
SELECT
DATE(Started) AS Started
FROM
Session
GROUP BY
DATE(Started)
) d1
INNER JOIN
(
SELECT DISTINCT
DATE(Started) AS Started,
UserID
FROM
Session
) d2
ON d2.Started BETWEEN d1.Started - INTERVAL 7 DAY AND d1.Started
GROUP BY
d1.Started
ORDER BY
d1.Started DESC

Visit http://sqlfiddle.com/#!2/9339c/5 to see this query in action.
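The design point in this answer is deduplicating to distinct (date, user) pairs in d2 before the range join, which keeps the intermediate join result small. The same shape runs in SQLite (Session rows below are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Session (Started TEXT, UserID INTEGER);
INSERT INTO Session VALUES
  ('2021-01-01 08:00', 1), ('2021-01-01 09:00', 1),
  ('2021-01-03 10:00', 2), ('2021-01-08 11:00', 1);
""")

rows = conn.execute("""
SELECT d1.Started, COUNT(DISTINCT d2.UserID) AS Users
FROM (SELECT DISTINCT date(Started) AS Started FROM Session) d1
JOIN (SELECT DISTINCT date(Started) AS Started, UserID FROM Session) d2
  ON d2.Started BETWEEN date(d1.Started, '-7 day') AND d1.Started
GROUP BY d1.Started
ORDER BY d1.Started
""").fetchall()
print(rows)
```

User 1's two sessions on 2021-01-01 collapse to a single (date, user) pair in d2, so the range join never multiplies duplicate sessions.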


