Querying Count on Daily Basis with Date Constraints Over Multiple Weeks

Querying count on daily basis with date constraints over multiple weeks

Here is a long answer how to make your queries short. :)

Table

Building on my table (before you provided table definition with different (odd!) data types:

CREATE TABLE requests (
id int
, accounts_id int -- (id of the user)
, recipient_id int -- (id of the recipient)
, date date -- (date that the request was made in YYYY-MM-DD)
, amount int -- (# of requests by accounts_id for the day)
);

Active Users for given day

The list of "active users" for one given day:

SELECT accounts_id
FROM (
SELECT w.w, r.accounts_id
FROM (
SELECT w
, day - 6 - 7 * w AS w_start
, day - 7 * w AS w_end
FROM (SELECT '2014-10-31'::date - 1 AS day) d -- effective date here
, generate_series(0,3) w
) w
JOIN requests r ON r."date" BETWEEN w_start AND w_end
GROUP BY w.w, r.accounts_id
HAVING sum(r.amount) > 10
) sub
GROUP BY 1
HAVING count(*) = 4;

Step 1

In the innermost subquery w (for "week") build the bounds of the 4 weeks of interest from a CROSS JOIN of the given day - 1 with the output of generate_series(0-3).

To add / subtract days to / from a date (not from a timestamp!) just add / subtract integer numbers. The expression day - 7 * w subtracts 0-3 times 7 days from the given date, arriving at the end dates for each week (w_end).

Subrtract another 6 days (not 7!) from each to compute the respective start (w_start).

Additionally, keep the week number w (0-3) for the later aggregation.

Step 2

In subquery sub join rows from requests to the set of 4 weeks, where the date lies between start and end date. GROUP BY the week number w and the accounts_id.

Only weeks with more than 10 requests total qualify.

Step 3

In the outer SELECT count the number of weeks each user (accounts_id) qualified. Must be 4 to qualify as "active user"

Count of active users per day

This is dynamite.

Wrapped in in a simple SQL function to simplify general use, but the query can be used on its own just as well:

CREATE FUNCTION f_active_users (_now date = now()::date, _days int = 3)
RETURNS TABLE (day date, users int) AS
$func$
WITH r AS (
SELECT accounts_id, date, sum(amount)::int AS amount
FROM requests
WHERE date BETWEEN _now - (27 + _days) AND _now - 1
GROUP BY accounts_id, date
)
SELECT date + 1, count(w_ct = 4 OR NULL)::int
FROM (
SELECT accounts_id, date
, count(w_amount > 10 OR NULL)
OVER (PARTITION BY accounts_id, dow ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 3 FOLLOWING) AS w_ct
FROM (
SELECT accounts_id, date, dow
, sum(amount) OVER (PARTITION BY accounts_id ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND 6 FOLLOWING) AS w_amount
FROM (SELECT _now - i AS date, i%7 AS dow
FROM generate_series(1, 27 + _days) i) d -- period of interest
CROSS JOIN (
SELECT accounts_id FROM r
GROUP BY 1
HAVING count(*) > 3 AND sum(amount) > 39 -- enough rows & requests
AND max(date) > min(date) + 15) a -- can cover 4 weeks
LEFT JOIN r USING (accounts_id, date)
) sub1
WHERE date > _now - (22 + _days) -- cut off 6 trailing days now - useful?
) sub2
GROUP BY date
ORDER BY date DESC
LIMIT _days
$func$ LANGUAGE sql STABLE;

The function takes any day (_now), "today" by default, and the number of days (_days) in the result, 3 by default. Call:

SELECT * FROM f_active_users('2014-10-31', 5);

Or without parameters to use defaults:

SELECT * FROM f_active_users();

The approach is different from the first query.

SQL Fiddle with both queries and variants for your table definition.

Step 0

In the CTE r pre-aggregate amounts per (accounts_id, date) for only the period of interest, for better performance. The table is only scanned once, the suggested index (see blow) will kick in here.

Step 1

In the inner subquery d generate the necessary list of days: 27 + _days rows, where _days is the desired number of rows in the output, effectively 28 days or more.

While being at it, compute the day of the week (dow) to be used for aggregating in step 3. i%7 coincides with weekly intervals, the query works for any interval, though.

In the inner subquery a generate a unique list of users (accounts_id) that exist in CTE r and pass some first superficial tests (sufficient rows spanning sufficient time with sufficient total requests).

Step 2

Generate a Cartesian product from d and a with a CROSS JOIN to have one row for every relevant day for every relevant user. LEFT JOIN to r to append the amount of requests (if any). No WHERE condition, we want every day in the result, even if there are no active users at all.

Compute the total amount for the past week (w_amount) in the same step using a Window functions with a custom frame. Example:

  • How to use a ring data structure in window functions

Step 3

Cut off the last 6 days now; which is optional and may or may not help performance. Test it: WHERE date >= _now - (21 + _days)

Count the weeks where the minimum amount is met (w_ct) in a similar window function, this time partitioned by dow additionally to only have same weekdays for the past 4 weeks in the frame (which carry the sum of the respective past week).
The expression count(w_amount > 10 OR NULL) only counts rows with more than 10 requests. Detailed explanation:

  • Compute percents from SUM() in the same SELECT sql query

Step 4

In the outer SELECT group by date and count users that passed all 4 weeks (count(w_ct = 4 OR NULL)). Add 1 to the date to compensate off-by-1, ORDER and LIMIT to the requested number of days.

Performance and outlook

The perfect index for both queries would be:

CREATE INDEX foo ON requests (date, accounts_id, amount);

Performance should be good, but get even (much) better with the upcoming Postgres 9.4, due to the new moving aggregate support:

Moving-aggregate support in the Postgres Wiki.

Moving aggregates in the 9.4 manual

Aside: don't call a timestamp column "date", it's a timestamp, not a date. Better yet, never use basic type names like date or timestamp as identifier. Ever.

Query to find all timestamps more than a certain interval apart

To start a new session after every gap >= 1 hour:

SELECT user_id, count(*) AS distinct_sessions
FROM (
SELECT user_id
,(lag(request_time, 1, '-infinity') OVER (PARTITION BY user_id
ORDER BY request_time)
<= request_time - '1h'::interval) AS step -- start new session
FROM tbl
) sub
WHERE step
GROUP BY user_id
ORDER BY user_id;

Assuming request_time NOT NULL.

Explain:

  • In subquery sub, check for every row if a new session begins. Using the third parameter of lag() to provide the default -infinity, which is lower than any timestamp and therefore always starts a new session for the first row.

  • In the outer query count how many times new sessions started. Eliminate step = FALSE and count per user.

Alternative interpretation

If you really wanted to count hours where at least one request happened (I don't think you do, but another answer assumes as much), you would:

SELECT user_id
, count(DISTINCT date_trunc('hour', request_time)) AS hours_with_req
FROM tbl
GROUP BY 1
ORDER BY 1;

count number of records per user between two time stamps

I'm sure you have column with this date information. You can then use:

SELECT
DATE(your_date_column),
COUNT(userid) AS total
FROM tablename
WHERE userid='$userid'
AND your_date_column BETWEEN (NOW() - INTERVAL 1 WEEK) AND NOW()
GROUP BY DATE(your_date_column)

If you only want data of complete days then you've got to tweak the query a little bit, i.e. to exclude the first day of this interval.

If you want the totals of this week only without counts on daily basis you can use

SELECT
userid,
COUNT(userid) AS total
FROM tablename
WHERE userid='$userid'
AND your_date_column BETWEEN (NOW() - INTERVAL 1 WEEK) AND NOW()
GROUP BY userid;

Note

Please have a look the red box for the mysql_ extension and consider moving to mysqli or PDO. It's much better to use prepared statements with placeholders to bind your input values to than concatenation of sql strings.

Determine Whether Two Date Ranges Overlap

(StartA <= EndB) and (EndA >= StartB)

Proof:

Let ConditionA Mean that DateRange A Completely After DateRange B

_                        |---- DateRange A ------|
|---Date Range B -----| _

(True if StartA > EndB)

Let ConditionB Mean that DateRange A is Completely Before DateRange B

|---- DateRange A -----|                        _ 
_ |---Date Range B ----|

(True if EndA < StartB)

Then Overlap exists if Neither A Nor B is true -

(If one range is neither completely after the other,

nor completely before the other,
then they must overlap.)

Now one of De Morgan's laws says that:

Not (A Or B) <=> Not A And Not B

Which translates to: (StartA <= EndB) and (EndA >= StartB)


NOTE: This includes conditions where the edges overlap exactly. If you wish to exclude that,

change the >= operators to >, and <= to <


NOTE2. Thanks to @Baodad, see this blog, the actual overlap is least of:

{ endA-startA, endA - startB, endB-startA, endB - startB }

(StartA <= EndB) and (EndA >= StartB)
(StartA <= EndB) and (StartB <= EndA)


NOTE3. Thanks to @tomosius, a shorter version reads:

DateRangesOverlap = max(start1, start2) < min(end1, end2)

This is actually a syntactical shortcut for what is a longer implementation, which includes extra checks to verify that the start dates are on or before the endDates. Deriving this from above:

If start and end dates can be out of order, i.e., if it is possible that startA > endA or startB > endB, then you also have to check that they are in order, so that means you have to add two additional validity rules:

(StartA <= EndB) and (StartB <= EndA) and (StartA <= EndA) and (StartB <= EndB)
or:

(StartA <= EndB) and (StartA <= EndA) and (StartB <= EndA) and (StartB <= EndB)
or,

(StartA <= Min(EndA, EndB) and (StartB <= Min(EndA, EndB))
or:

(Max(StartA, StartB) <= Min(EndA, EndB)

But to implement Min() and Max(), you have to code, (using C ternary for terseness),:

((StartA > StartB) ? StartA : StartB) <= ((EndA < EndB) ? EndA : EndB)

compute sum of values associated with overlapping date ranges

Here is my attempt to solve this problem:

select y,
sum( hrs_per_week )
from tmp_ranges t
join(
select daterange( x,
lead(x) over (order by x) ) As y
from (
select lower( rng ) As x
from tmp_ranges
union
select upper( rng )
from tmp_ranges
order by x
) y
) y
on t.rng && y.y
group by y
order by y

Demo: http://sqlfiddle.com/#!15/ef6cb/13

The innermost subquery collects all boundary dates into one set using union, then sorts them.

Then the outer subquery builds new ranges from adjacent dates using lead function.

In the end, these new ranges are joined to the source table in the main query, aggregated, and sum is calculated.


EDIT

The order by clause in the innermost query is redundant and can be skipped, because lead(x) over caluse orders records by dates, and a resultset from the innermost subquery doesn't have to be sorted.

Find day of week with most hires

This works for me:

select DAY, cnt
from (SELECT TO_CHAR(HIRE_DATE,'DAY') DAY
,count(*) cnt
FROM EMPLOYEES
GROUP BY TO_CHAR(HIRE_DATE,'DAY'))
where cnt = (select max(cnt)
from (SELECT TO_CHAR(HIRE_DATE,'DAY') DAY
,count(*) cnt
FROM EMPLOYEES
GROUP BY TO_CHAR(HIRE_DATE,'DAY')))

Produces following results















DAYCNT
Tuesday3

Iterating over timestamp row values within a column and performing a countif in SQL?

Below is for BigQuery Standard SQL

#standardSQL
with dateseries as (
select date
from unnest(GENERATE_TIMESTAMP_ARRAY('2020-06-01', CURRENT_TIMESTAMP(), INTERVAL 1 DAY)) AS date
)
select date, count(1) `count`
from `project.dataset.dates`
join dateseries
on date between first_day and last_day
group by date

Note: first_day and last_day columns in project.dataset.dates assumed to be of timestamp data type. Otherwise (if they are strings) - use below line

on date between timestamp(first_day) and timestamp(last_day)


Related Topics



Leave a reply



Submit