Query for count of distinct values in a rolling date range and country
The reason that country doesn't exist is that we can only SELECT columns from the tables listed in a query's FROM clause. The nested subquery selects from tbl, but tbl is not visible to the main, outer query, which selects only from the generated table g. Since g has only a date column, that is the only column the outer query can reference directly.
Another issue with the query is that the COUNT does not factor in countries. For this I'd use a basic join to pair every row with every date range, then do a count distinct for every date + country. You can use an INNER join to remove days that have no entries, or a LEFT OUTER join to return rows of {date}, nil, nil when there are no entries for that date range. Something like:
SELECT g.date
, tbl.country
, COUNT(DISTINCT(tbl.email))
FROM (SELECT generate_series(timestamp '2012-01-01'
, timestamp '2012-01-06'
, interval '1 day')::date) AS g(date)
INNER JOIN tbl ON (tbl.date BETWEEN g.date - 2 AND g.date)
GROUP BY 1,2
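The same pattern can be sketched in Python's sqlite3, using a recursive CTE in place of generate_series (the table contents and values here are illustrative, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (date TEXT, country TEXT, email TEXT);
INSERT INTO tbl VALUES
  ('2012-01-01', 'US', 'a@x.com'),
  ('2012-01-02', 'US', 'b@x.com'),
  ('2012-01-02', 'US', 'a@x.com'),
  ('2012-01-03', 'DE', 'c@x.com');
""")

rows = conn.execute("""
WITH RECURSIVE g(date) AS (
  SELECT '2012-01-01'
  UNION ALL
  SELECT date(date, '+1 day') FROM g WHERE date < '2012-01-06'
)
SELECT g.date, tbl.country, COUNT(DISTINCT tbl.email)
FROM g
-- 3-day trailing window: each generated date matches entries
-- from two days before it up to the date itself
JOIN tbl ON tbl.date BETWEEN date(g.date, '-2 days') AND g.date
GROUP BY g.date, tbl.country
ORDER BY 1, 2
""").fetchall()

for r in rows:
    print(r)
```

Because of the INNER JOIN, 2012-01-06 (which has no entries in its window) produces no row at all.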
How can I count distinct values of certain attributes within a date range?
You need a LEFT join of county to fruit (so that you get a result with 0 for counties that do not meet the conditions) and group by county. Also, the condition on the date should be placed in the ON clause:
SELECT c.county_name,
COUNT(DISTINCT f.fruit_code) total
FROM county c LEFT JOIN fruit f
ON f.county_code = c.county_code AND date BETWEEN '2020-01-01' AND '2021-01-01'
GROUP BY c.county_code;
I assume that county_code is the primary key of county.
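A runnable sketch of this pattern using Python's sqlite3 (the sample rows are made up for illustration). Note that moving the date condition into a WHERE clause would effectively turn the LEFT JOIN into an inner join and drop the zero rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE county (county_code INTEGER PRIMARY KEY, county_name TEXT);
CREATE TABLE fruit (county_code INTEGER, fruit_code INTEGER, date TEXT);
INSERT INTO county VALUES (1, 'Kent'), (2, 'Surrey');
INSERT INTO fruit VALUES (1, 10, '2020-05-01'), (1, 11, '2020-06-01'),
                         (1, 10, '2020-07-01'), (2, 12, '2019-12-31');
""")

rows = conn.execute("""
SELECT c.county_name, COUNT(DISTINCT f.fruit_code) AS total
FROM county c
LEFT JOIN fruit f
  ON f.county_code = c.county_code
 -- date filter in the ON clause keeps non-matching counties with NULLs
 AND f.date BETWEEN '2020-01-01' AND '2021-01-01'
GROUP BY c.county_code
ORDER BY c.county_code
""").fetchall()

print(rows)
```

Surrey's only fruit row falls outside the range, so it still appears, with a count of 0.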
Count of unique values in a rolling date range for R
Here's something that works, taking advantage of the non-equi joins feature of data.table.
dt[dt[ , .(date3=date, date2 = date - 2, email)],
on = .(date >= date2, date<=date3),
allow.cartesian = TRUE
][ , .(count = uniqueN(email)),
by = .(date = date + 2)]
# date count
# 1: 2011-12-30 3
# 2: 2011-12-31 3
# 3: 2012-01-01 3
# 4: 2012-01-02 3
# 5: 2012-01-03 1
# 6: 2012-01-04 2
To be honest I'm a bit fuzzy on exactly how this works, but the idea is to join dt to itself on date, matching any date that falls between two days ago and today. The cleanup step date = date + 2 is needed because, in a non-equi join, the join column in the result takes the values of the first join bound (here date2 = date - 2), so adding 2 restores the original date.
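The join-and-expand logic can be mirrored in plain Python, treating each event as "visible" to three report dates: its own date and the two that follow (the event data here is illustrative, not the question's):

```python
from datetime import date, timedelta

# toy data: (event_date, email) pairs
events = [
    (date(2011, 12, 30), "a"), (date(2011, 12, 30), "b"), (date(2011, 12, 30), "c"),
    (date(2012, 1, 2), "a"), (date(2012, 1, 3), "a"), (date(2012, 1, 4), "d"),
]

# Mirror the non-equi self-join: an event on day d is counted for
# report dates d, d+1, d+2 (i.e. any report window [r-2, r] containing d).
counts = {}
for d, email in events:
    for offset in range(3):
        counts.setdefault(d + timedelta(days=offset), set()).add(email)

# distinct emails per report date
rolling = {d: len(s) for d, s in sorted(counts.items())}
```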
Here's an approach using keys:
setkey(dt, date)
dt[ , .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
uniqueN(email), nomatch = 0L]), by = date]
BigQuery distinct count in rolling date range, with partition on column
Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(ids),
(SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
FROM `project.dataset.table`
WINDOW preceding_days AS (
PARTITION BY email
ORDER BY UNIX_DATE(DATE(day))
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
)
) t
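The array-then-distinct idea can be emulated in plain Python: keep a per-email list of ids in day order and count distinct values over the last four entries, a row-based window like ROWS BETWEEN 3 PRECEDING AND CURRENT ROW (the sample rows are an illustrative subset, not the full data):

```python
from collections import defaultdict

# rows of (email, day, id)
rows = [
    ("u1", "2020-06-21", 0), ("u1", "2020-06-22", 1),
    ("u1", "2020-06-23", 0), ("u1", "2020-06-24", 0),
    ("u1", "2020-06-25", 2),
]

by_email = defaultdict(list)
result = []
for email, day, id_ in sorted(rows):
    by_email[email].append(id_)
    # current row plus up to 3 preceding rows in the same partition
    window = by_email[email][-4:]
    result.append((email, day, len(set(window))))
```

Note this is a row-count frame, not a true date-range frame, matching the ROWS framing in the query above.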
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'user1@gmail.com' email, TIMESTAMP'2020-06-21 16:31:00 UTC' day, '...' other_cols, 0 id UNION ALL
SELECT 'user1@gmail.com', '2020-06-22 14:54:00 UTC', '...', 1 UNION ALL
SELECT 'user1@gmail.com', '2020-06-23 08:23:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-24 13:51:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-25 09:54:00 UTC', '...', 2 UNION ALL
SELECT 'user1@gmail.com', '2020-06-25 12:25:00 UTC', '...', 0 UNION ALL
SELECT 'user1@gmail.com', '2020-06-26 15:21:00 UTC', '...', 2 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-21 12:23:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-21 16:54:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-22 08:23:00 UTC', '...', 0 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-22 12:13:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-24 09:32:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-25 05:45:00 UTC', '...', 1 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-26 12:32:00 UTC', '...', 2 UNION ALL
SELECT 'user2@hotmail.com', '2020-06-27 19:53:00 UTC', '...', 1
)
SELECT * EXCEPT(ids),
(SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
FROM `project.dataset.table`
WINDOW preceding_days AS (
PARTITION BY email
ORDER BY UNIX_DATE(DATE(day))
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
)
) t
Rolling count of distinct users
Using functions from dplyr and tidyr, for the 1-day window case:
have %>%
group_by(when) %>%
summarise(twoDayCount = n_distinct(user))
For larger windows:
window <- 2
have %>%
rowwise() %>%
mutate(when = list(when + lubridate::days(0:(window - 1)))) %>%
unnest(cols = when) %>%
group_by(when) %>%
summarise(twoDayCount = n_distinct(user))
Note that this method will give you rows for a few later dates (in this case Jan 08), which you might want to remove.
If performance is an issue for larger datasets, here is a much faster (but slightly less elegant) solution:
window <- 2
seq.Date(min(have$when), max(have$when), by = "day") %>%
purrr::map(function(date) {
have %>%
filter(when <= date, when >= date - lubridate::days(window - 1)) %>%
summarise(userCount = n_distinct(user)) %>%
mutate(when = date)
}) %>%
bind_rows()
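The per-date filter approach translates directly to plain Python: for each calendar date in the range, filter the events inside the trailing window and count distinct users (toy data for illustration):

```python
from datetime import date, timedelta

window = 2  # days
have = [(date(2020, 1, 1), "u1"), (date(2020, 1, 1), "u2"),
        (date(2020, 1, 3), "u1")]

dates = sorted({when for when, _ in have})
counts = {}
for i in range((dates[-1] - dates[0]).days + 1):
    d = dates[0] + timedelta(days=i)
    # events falling inside the trailing window [d - (window-1), d]
    users = {u for when, u in have
             if d - timedelta(days=window - 1) <= when <= d}
    counts[d] = len(users)
```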
BigQuery for running count of distinct values with a dynamic date-range
Here is a way to achieve your desired results. Note that this query computes the year-month distinct counts in a separate subquery and joins them to the rolling 364-day-interval query.
with year_month_distincts as (
select
concat(
cast(extract(year from order_date) as string),
'-',
cast(extract(month from order_date) as string)
) as year_month,
count(distinct customer_id) as ym_distincts
from customer_table
group by 1
)
select x.order_date, x.ytd_distincts, y.ym_distincts from (
select
a.order_date,
(select
count(distinct customer_id)
from customer_table b
where b.order_date between date_sub(a.order_date, interval 364 day) and a.order_date
) as ytd_distincts
from orders a
group by 1
) x
join year_month_distincts y on concat(
cast(extract(year from x.order_date) as string),
'-',
cast(extract(month from x.order_date) as string)
) = y.year_month
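The rolling 364-day part of this answer can be sketched with Python's sqlite3, with SQLite's date() modifier standing in for BigQuery's DATE_SUB (the sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (order_date TEXT, customer_id INTEGER);
INSERT INTO customer_table VALUES
  ('2021-01-10', 1), ('2021-06-01', 2), ('2021-06-01', 1),
  ('2022-03-01', 3), ('2022-03-15', 2);
""")

rows = conn.execute("""
SELECT a.order_date,
       -- distinct customers in the trailing 364-day window
       (SELECT COUNT(DISTINCT b.customer_id)
        FROM customer_table b
        WHERE b.order_date BETWEEN date(a.order_date, '-364 days')
                               AND a.order_date) AS ytd_distincts
FROM customer_table a
GROUP BY a.order_date
ORDER BY a.order_date
""").fetchall()
```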
How to have the rolling distinct count of each day for past three days in Oracle SQL?
What you seem to want is:
select dte,
       count(distinct customer) over (order by dte rows between 2 preceding and current row)
from (select distinct trunc(datetime) as dte, customer
      from t
     ) t;
However, Oracle does not support window frames with count(distinct)
.
One rather brute force approach is a correlated subquery:
select dte,
       (select count(distinct t2.customer)
        from t t2
        where t2.datetime >= t.dte - 2
          and t2.datetime < t.dte + 1
       ) as running_3
from (select distinct trunc(datetime) as dte
      from t
     ) t;
This should have reasonable performance for a small number of dates. As the number of dates increases, though, the correlated subquery is re-evaluated once per date, so performance will degrade.
MySQL query to retrieve DISTINCT COUNT between moving DATE period
This query should work for you:
SELECT
DATE_FORMAT(d1.Started, '%Y-%m-%d') AS Started,
COUNT(DISTINCT d2.UserID) Users
FROM
(
SELECT
DATE(Started) AS Started
FROM
Session
GROUP BY
DATE(Started)
) d1
INNER JOIN
(
SELECT DISTINCT
DATE(Started) AS Started,
UserID
FROM
Session
) d2
ON d2.Started BETWEEN d1.Started - INTERVAL 7 DAY AND d1.Started
GROUP BY
d1.Started
ORDER BY
d1.Started DESC
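Here is a sketch of the same two-derived-table pattern in Python's sqlite3, with SQLite's date() modifier in place of MySQL's INTERVAL arithmetic (the sample sessions are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Session (Started TEXT, UserID INTEGER);
INSERT INTO Session VALUES
  ('2021-03-01 10:00:00', 1), ('2021-03-03 09:00:00', 2),
  ('2021-03-03 11:00:00', 1), ('2021-03-12 08:00:00', 3);
""")

rows = conn.execute("""
SELECT d1.Started,
       COUNT(DISTINCT d2.UserID) AS Users
FROM (SELECT DISTINCT date(Started) AS Started FROM Session) d1
-- join each day to the (day, user) pairs of the preceding week
JOIN (SELECT DISTINCT date(Started) AS Started, UserID FROM Session) d2
  ON d2.Started BETWEEN date(d1.Started, '-7 days') AND d1.Started
GROUP BY d1.Started
ORDER BY d1.Started DESC
""").fetchall()
```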
Visit http://sqlfiddle.com/#!2/9339c/5 to see this query in action.