Querying count on daily basis with date constraints over multiple weeks
Here is a long answer how to make your queries short. :)
Table
Building on my table (before you provided table definition with different (odd!) data types:
CREATE TABLE requests (
id int
, accounts_id int -- (id of the user)
, recipient_id int -- (id of the recipient)
, date date -- (date that the request was made in YYYY-MM-DD)
, amount int -- (# of requests by accounts_id for the day)
);
Active Users for given day
The list of "active users" for one given day:
SELECT accounts_id
FROM (
SELECT w.w, r.accounts_id
FROM (
SELECT w
, day - 6 - 7 * w AS w_start
, day - 7 * w AS w_end
FROM (SELECT '2014-10-31'::date - 1 AS day) d -- effective date here
, generate_series(0,3) w
) w
JOIN requests r ON r."date" BETWEEN w_start AND w_end
GROUP BY w.w, r.accounts_id
HAVING sum(r.amount) > 10
) sub
GROUP BY 1
HAVING count(*) = 4;
Step 1
In the innermost subquery w
(for "week") build the bounds of the 4 weeks of interest from a CROSS JOIN
of the given day - 1 with the output of generate_series(0-3)
.
To add / subtract days to / from a date
(not from a timestamp!) just add / subtract integer
numbers. The expression day - 7 * w
subtracts 0-3 times 7 days from the given date, arriving at the end dates for each week (w_end
).
Subrtract another 6 days (not 7!) from each to compute the respective start (w_start
).
Additionally, keep the week number w
(0-3) for the later aggregation.
Step 2
In subquery sub
join rows from requests
to the set of 4 weeks, where the date lies between start and end date. GROUP BY
the week number w
and the accounts_id
.
Only weeks with more than 10 requests total qualify.
Step 3
In the outer SELECT
count the number of weeks each user (accounts_id
) qualified. Must be 4 to qualify as "active user"
Count of active users per day
This is dynamite.
Wrapped in in a simple SQL function to simplify general use, but the query can be used on its own just as well:
CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
RETURNS TABLE (day date, users int) AS
$func$
WITH r AS (
SELECT accounts_id, date, sum(amount)::int AS amount
FROM requests
WHERE date BETWEEN _now - (27 + _days) AND _now - 1
GROUP BY accounts_id, date
)
SELECT date + 1, count(w_ct = 4 OR NULL)::int
FROM (
SELECT accounts_id, date
, count(w_amount > 10 OR NULL)
OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
FROM (
SELECT accounts_id, date, dow
, sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
FROM (SELECT _now - i AS date, i%7 AS dow
FROM generate_series(1, 27 + _days) i) d -- period of interest
CROSS JOIN (
SELECT accounts_id FROM r
GROUP BY 1
HAVING count(*) > 3 AND sum(amount) > 39 -- enough rows & requests
AND max(date) > min(date) + 15) a -- can cover 4 weeks
LEFT JOIN r USING (accounts_id, date)
) sub1
WHERE date > _now - (22 + _days) -- cut off 6 trailing days now - useful?
) sub2
GROUP BY date
ORDER BY date DESC
LIMIT _days
$func$ LANGUAGE sql STABLE;
The function takes any day (_now
), "today" by default, and the number of days (_days
) in the result, 3 by default. Call:
SELECT * FROM f_active_users('2014-10-31', 5);
Or without parameters to use defaults:
SELECT * FROM f_active_users();
The approach is different from the first query.
SQL Fiddle with both queries and variants for your table definition.
Step 0
In the CTE r
pre-aggregate amounts per (accounts_id, date)
for only the period of interest, for better performance. The table is only scanned once, the suggested index (see blow) will kick in here.
Step 1
In the inner subquery d
generate the necessary list of days: 27 + _days
rows, where _days
is the desired number of rows in the output, effectively 28 days or more.
While being at it, compute the day of the week (dow
) to be used for aggregating in step 3. i%7
coincides with weekly intervals, the query works for any interval, though.
In the inner subquery a
generate a unique list of users (accounts_id
) that exist in CTE r
and pass some first superficial tests (sufficient rows spanning sufficient time with sufficient total requests).
Step 2
Generate a Cartesian product from d
and a
with a CROSS JOIN
to have one row for every relevant day for every relevant user. LEFT JOIN
to r
to append the amount of requests (if any). No WHERE
condition, we want every day in the result, even if there are no active users at all.
Compute the total amount for the past week (w_amount
) in the same step using a Window functions with a custom frame. Example:
- How to use a ring data structure in window functions
Step 3
Cut off the last 6 days now; which is optional and may or may not help performance. Test it: WHERE date >= _now - (21 + _days)
Count the weeks where the minimum amount is met (w_ct
) in a similar window function, this time partitioned by dow
additionally to only have same weekdays for the past 4 weeks in the frame (which carry the sum of the respective past week).
The expression count(w_amount > 10 OR NULL)
only counts rows with more than 10 requests. Detailed explanation:
- Compute percents from SUM() in the same SELECT sql query
Step 4
In the outer SELECT
group by date
and count users that passed all 4 weeks (count(w_ct = 4 OR NULL)
). Add 1 to the date to compensate off-by-1, ORDER
and LIMIT
to the requested number of days.
Performance and outlook
The perfect index for both queries would be:
CREATE INDEX foo ON requests (date, accounts_id, amount);
Performance should be good, but get even (much) better with the upcoming Postgres 9.4, due to the new moving aggregate support:
Moving-aggregate support in the Postgres Wiki.
Moving aggregates in the 9.4 manual
Aside: don't call a timestamp
column "date", it's a timestamp
, not a date
. Better yet, never use basic type names like date
or timestamp
as identifier. Ever.
Query to find all timestamps more than a certain interval apart
To start a new session after every gap >= 1 hour:
SELECT user_id, count(*) AS distinct_sessions
FROM (
SELECT user_id
,(lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
ORDER BY request_time)
<= request_time - '1h'::interval) AS step -- start new session
FROM tbl
) sub
WHERE step
GROUP BY user_id
ORDER BY user_id;
Assuming request_time NOT NULL
.
Explain:
In subquery
sub
, check for every row if a new session begins. Using the third parameter oflag()
to provide the default-infinity
, which is lower than any timestamp and therefore always starts a new session for the first row.In the outer query count how many times new sessions started. Eliminate
step = FALSE
and count per user.
Alternative interpretation
If you really wanted to count hours where at least one request happened (I don't think you do, but another answer assumes as much), you would:
SELECT user_id
, count(DISTINCT date_trunc('hour', request_time)) AS hours_with_req
FROM tbl
GROUP BY 1
ORDER BY 1;
count number of records per user between two time stamps
I'm sure you have column with this date information. You can then use:
SELECT
DATE(your_date_column),
COUNT(userid) AS total
FROM tablename
WHERE userid='$userid'
AND your_date_column BETWEEN (NOW() - INTERVAL 1 WEEK) AND NOW()
GROUP BY DATE(your_date_column)
If you only want data of complete days then you've got to tweak the query a little bit, i.e. to exclude the first day of this interval.
If you want the totals of this week only without counts on daily basis you can use
SELECT
userid,
COUNT(userid) AS total
FROM tablename
WHERE userid='$userid'
AND your_date_column BETWEEN (NOW() - INTERVAL 1 WEEK) AND NOW()
GROUP BY userid;
Note
Please have a look the red box for the mysql_ extension and consider moving to mysqli or PDO. It's much better to use prepared statements with placeholders to bind your input values to than concatenation of sql strings.
Determine Whether Two Date Ranges Overlap
(StartA <= EndB) and (EndA >= StartB)
Proof:
Let ConditionA Mean that DateRange A Completely After DateRange B
_ |---- DateRange A ------|
|---Date Range B -----| _
(True if StartA > EndB
)
Let ConditionB Mean that DateRange A is Completely Before DateRange B
|---- DateRange A -----| _
_ |---Date Range B ----|
(True if EndA < StartB
)
Then Overlap exists if Neither A Nor B is true -
(If one range is neither completely after the other,
nor completely before the other,
then they must overlap.)
Now one of De Morgan's laws says that:
Not (A Or B)
<=> Not A And Not B
Which translates to: (StartA <= EndB) and (EndA >= StartB)
NOTE: This includes conditions where the edges overlap exactly. If you wish to exclude that,
change the >=
operators to >
, and <=
to <
NOTE2. Thanks to @Baodad, see this blog, the actual overlap is least of:
{ endA-startA
, endA - startB
, endB-startA
, endB - startB
}
(StartA <= EndB) and (EndA >= StartB)
(StartA <= EndB) and (StartB <= EndA)
NOTE3. Thanks to @tomosius, a shorter version reads:DateRangesOverlap = max(start1, start2) < min(end1, end2)
This is actually a syntactical shortcut for what is a longer implementation, which includes extra checks to verify that the start dates are on or before the endDates. Deriving this from above:
If start and end dates can be out of order, i.e., if it is possible that startA > endA
or startB > endB
, then you also have to check that they are in order, so that means you have to add two additional validity rules:(StartA <= EndB) and (StartB <= EndA) and (StartA <= EndA) and (StartB <= EndB)
or:(StartA <= EndB) and (StartA <= EndA) and (StartB <= EndA) and (StartB <= EndB)
or,(StartA <= Min(EndA, EndB) and (StartB <= Min(EndA, EndB))
or:(Max(StartA, StartB) <= Min(EndA, EndB)
But to implement Min()
and Max()
, you have to code, (using C ternary for terseness),:((StartA > StartB) ? StartA : StartB) <= ((EndA < EndB) ? EndA : EndB)
compute sum of values associated with overlapping date ranges
Here is my attempt to solve this problem:
select y,
sum( hrs_per_week )
from tmp_ranges t
join(
select daterange( x,
lead(x) over (order by x) ) As y
from (
select lower( rng ) As x
from tmp_ranges
union
select upper( rng )
from tmp_ranges
order by x
) y
) y
on t.rng && y.y
group by y
order by y
Demo: http://sqlfiddle.com/#!15/ef6cb/13
The innermost subquery collects all boundary dates into one set using union
, then sorts them.
Then the outer subquery builds new ranges from adjacent dates using lead
function.
In the end, these new ranges are joined to the source table in the main query, aggregated, and sum
is calculated.
EDIT
The order by
clause in the innermost query is redundant and can be skipped, because lead(x) over
caluse orders records by dates, and a resultset from the innermost subquery doesn't have to be sorted.
Find day of week with most hires
This works for me:
select DAY, cnt
from (SELECT TO_CHAR(HIRE_DATE,'DAY') DAY
,count(*) cnt
FROM EMPLOYEES
GROUP BY TO_CHAR(HIRE_DATE,'DAY'))
where cnt = (select max(cnt)
from (SELECT TO_CHAR(HIRE_DATE,'DAY') DAY
,count(*) cnt
FROM EMPLOYEES
GROUP BY TO_CHAR(HIRE_DATE,'DAY')))
Produces following results
DAY | CNT |
---|---|
Tuesday | 3 |
Iterating over timestamp row values within a column and performing a countif in SQL?
Below is for BigQuery Standard SQL
#standardSQL
with dateseries as (
select date
from unnest(GENERATE_TIMESTAMP_ARRAY('2020-06-01', CURRENT_TIMESTAMP(), INTERVAL 1 DAY)) AS date
)
select date, count(1) `count`
from `project.dataset.dates`
join dateseries
on date between first_day and last_day
group by date
Note: first_day and last_day columns in project.dataset.dates
assumed to be of timestamp data type. Otherwise (if they are strings) - use below line
on date between timestamp(first_day) and timestamp(last_day)
Related Topics
How to Get String After Character Oracle
Migrating Oracle Date Columns to Timestamp with Timezone
How to Dynamically Create Columns in SQL Select Statement
What Is The Meaning of Kanatype Sensitive Ks and Width Sensitive
How to Set a Default Value for One Column in SQL Based on Another Column
How to Set Numwidth in The Grid Output of Pl/Sql Developer
Conditional Unique Constraint with Multiple Fields in Oracle Db
How to Sum() Over Column with Reset Condition
Nesting Aggregate Functions - Sql
Making Ssdt Just Generate a SQL Script (And Not Deploy a Database)
Most Efficient Way to Query Multiple Identical Tables in Separate Databases
Postgresql Error: 42P01: Relation "[Table]" Does Not Exist
Oracle Text Escaping with Curly Braces and Wildcards
Select The Rows That Just Inserted