Group by Data Intervals

Group data in intervals

If you want the intervals to be calendar-based -- i.e. four per hour, starting at 0, 15, 30, and 45 minutes -- then you can use:

select id, min(begin_date), max(begin_date)
from t
group by id, convert(date, begin_date),
         datepart(hour, begin_date), datepart(minute, begin_date) / 15;

Note that begin_date and end_date have the same value, so I just used begin_date in this answer.
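The grouping key can be sketched outside SQL as well; a minimal Python illustration of the same (date, hour, minute / 15) bucketing (the function name is my own, not from the answer):

```python
from datetime import datetime

def quarter_hour_key(ts: datetime):
    """Calendar-aligned 15-minute bucket key: mirrors
    GROUP BY convert(date, ...), datepart(hour, ...), datepart(minute, ...) / 15."""
    return (ts.date(), ts.hour, ts.minute // 15)

# 10:47 falls in the 10:45-10:59 slot (slot index 3)
print(quarter_hour_key(datetime(2023, 5, 1, 10, 47)))
```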

Grouping data based on time interval

You can group by first and then take a cumulative sum to get the participant column the way you want. Make sure the time column is in datetime format, and sort by tablet and time before doing this.

import numpy as np
import pandas as pd

df['time'] = pd.to_datetime(df['time'])
df['time_diff'] = df.groupby('tablet')['time'].diff().dt.total_seconds() / 60
# new participant when a tablet group starts (NaN diff) or the gap exceeds 10 minutes
df['participant'] = np.where(df['time_diff'].isnull() | (df['time_diff'] > 10), 1, 0).cumsum()
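A minimal, self-contained run of this approach (column names tablet/time come from the question; the sample data here is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'tablet': ['T1', 'T1', 'T1', 'T2'],
    'time': ['2023-01-01 10:00', '2023-01-01 10:05',
             '2023-01-01 10:30', '2023-01-01 09:00'],
})
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['tablet', 'time'])
df['time_diff'] = df.groupby('tablet')['time'].diff().dt.total_seconds() / 60
df['participant'] = np.where(df['time_diff'].isnull() | (df['time_diff'] > 10), 1, 0).cumsum()
print(df[['tablet', 'time', 'participant']])
```

The 25-minute gap inside T1 starts participant 2, and the new tablet T2 starts participant 3.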

Group by id and store time differences (intervals) in a list

Use summarise to store the data in a list.

library(dplyr)

d %>%
  group_by(ID) %>%
  summarise(Time_interval = list(as.numeric(na.omit(round(difftime(Time,
            lag(Time), units = 'mins')))))) -> result

result
# A tibble: 2 x 2
# ID Time_interval
# <int> <list>
#1 1 <dbl [3]>
#2 2 <dbl [1]>

result$Time_interval

#[[1]]
#[1] 2 3 80

#[[2]]
#[1] 6

data

d <- structure(list(ID = c(1L, 2L, 1L, 1L, 2L, 1L), Time = structure(c(1581266398, 
1582134325, 1581266545, 1581266734, 1582134665, 1581271525), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -6L), class = "data.frame")
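For comparison, the same per-ID interval lists can be produced in pandas (a sketch using the same data as above, not part of the original answer):

```python
import pandas as pd

d = pd.DataFrame({
    'ID': [1, 2, 1, 1, 2, 1],
    'Time': pd.to_datetime([1581266398, 1582134325, 1581266545,
                            1581266734, 1582134665, 1581271525], unit='s'),
})
# per-ID differences between consecutive timestamps, rounded to whole minutes
result = (d.sort_values(['ID', 'Time'])
            .groupby('ID')['Time']
            .apply(lambda s: (s.diff().dt.total_seconds() / 60)
                               .round().dropna().tolist()))
print(result)
```

This reproduces the R output: [2, 3, 80] for ID 1 and [6] for ID 2.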

How to group data into arrays by intervals

Essentially, the startOf function sets the specified date fields to zero.

To group by 5 minutes, take the current minutes and floor them to the start of their interval: Math.floor(minutes / 5) * 5.

Replace 5 with 15 for 15-minute intervals.

Note that this works with intervals starting from 0:

  • 0-4 = 0
  • 5-9 = 5

and so on

(result) => moment(result['localcol'], 'DD/MM/YYYY').minutes(Math.floor(moment(result['localcol'], 'DD/MM/YYYY').minutes() / 5) * 5);
moment(date, 'DD/MM/YYYY').minutes(minutes);
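The same flooring can be sketched without moment; here is a hypothetical Python helper (names mine, not from the answer) applying the floor-to-interval idea directly to a timestamp:

```python
from datetime import datetime

def floor_minutes(ts: datetime, interval: int = 5) -> datetime:
    """Floor a timestamp's minutes to the start of its interval (0-4 -> 0, 5-9 -> 5, ...)."""
    return ts.replace(minute=(ts.minute // interval) * interval, second=0, microsecond=0)

print(floor_minutes(datetime(2023, 5, 1, 10, 23)))      # 5-minute bucket
print(floor_minutes(datetime(2023, 5, 1, 10, 23), 15))  # 15-minute bucket
```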

Group by data intervals

WITH t AS (
   SELECT ts, (random() * 100)::int AS bandwidth
   FROM   generate_series('2012-09-01'::timestamptz
                        , '2012-09-04'::timestamptz
                        , '1 minute'::interval) ts
   )
SELECT date_trunc('hour', ts) AS hour_stump
     , (extract(minute FROM ts)::int / 15) AS min15_slot
     , count(*) AS rows_in_timeslice  -- optional
     , sum(bandwidth) AS sum_bandwidth
FROM   t
WHERE  ts >= '2012-09-02 00:00:00+02'::timestamptz  -- user's time range
AND    ts <  '2012-09-03 00:00:00+02'::timestamptz  -- careful with borders
GROUP  BY 1, 2
ORDER  BY 1, 2;

The CTE t provides data like your table might hold: one timestamp ts per minute with a bandwidth number. (You don't need that part, you work with your table instead.)

Here is a very similar solution for a very similar question - with a detailed explanation of how this particular aggregation works:

  • date_trunc 5 minute interval in PostgreSQL

Here is a similar solution for a similar question concerning running sums - with a detailed explanation and links for the various functions used:

  • PostgreSQL: running count of rows for a query 'by minute'

Additional question in comment

WITH -- same as above ...

SELECT DISTINCT ON (1, 2)
       date_trunc('hour', ts) AS hour_stump
     , (extract(minute FROM ts)::int / 15) AS min15_slot
     , bandwidth AS bandwidth_sample_at_min15
FROM   t
WHERE  ts >= '2012-09-02 00:00:00+02'::timestamptz
AND    ts <  '2012-09-03 00:00:00+02'::timestamptz
ORDER  BY 1, 2, ts DESC;

This retrieves one un-aggregated sample per 15-minute interval - the last available row in each slot, which is the 15th minute unless that row is missing. The crucial parts are DISTINCT ON and ORDER BY.

More information about the used technique here:

  • Select first row in each GROUP BY group?
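The DISTINCT ON technique (keep the last row per (hour, slot)) can be sketched procedurally; the sample data below is assumed, not from the Postgres answer:

```python
from datetime import datetime, timedelta

# assumed sample data: one (timestamp, bandwidth) row per minute
rows = [(datetime(2012, 9, 2, 0, 0) + timedelta(minutes=m), m) for m in range(40)]

# keep the last sample in each 15-minute slot, like DISTINCT ON ... ORDER BY ts DESC
last_per_slot = {}
for ts, bandwidth in sorted(rows):
    key = (ts.replace(minute=0, second=0, microsecond=0), ts.minute // 15)
    last_per_slot[key] = (ts, bandwidth)  # later rows overwrite earlier ones

for (hour_stump, slot), (ts, bw) in sorted(last_per_slot.items()):
    print(hour_stump, slot, ts.minute, bw)
```

Each slot ends up holding the row at minute 14, 29, 44, ... of its hour, matching the "last available row in the slot" behavior.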

Is there a way to group timestamp data by 30-day intervals starting from the min(date) and add them as columns

If you are using BigQuery, I would recommend:

  • countif() to count a boolean value
  • timestamp_add() to add intervals to timestamps

The exact boundaries are a bit vague, but I would go for:

select pc.url,
       countif(pv.date >= pc.dt_crtd and
               pv.date < timestamp_add(pc.dt_crtd, interval 30 day)
              ) as Interval_00_29,
       countif(pv.date >= timestamp_add(pc.dt_crtd, interval 30 day) and
               pv.date < timestamp_add(pc.dt_crtd, interval 60 day)
              ) as Interval_30_59,
       countif(pv.date >= timestamp_add(pc.dt_crtd, interval 60 day) and
               pv.date < timestamp_add(pc.dt_crtd, interval 90 day)
              ) as Interval_60_89
from page_creation pc join
     page_visits pv
     on pc.link = pv.url
group by pc.url
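The bucket logic amounts to integer-dividing the day offset from the creation date by 30; a small sketch with made-up dates:

```python
from datetime import date

dt_crtd = date(2023, 1, 1)  # page creation date (made up)
visits = [date(2023, 1, 5), date(2023, 2, 10), date(2023, 3, 20)]

buckets = {'Interval_00_29': 0, 'Interval_30_59': 0, 'Interval_60_89': 0}
for v in visits:
    offset = (v - dt_crtd).days
    if 0 <= offset < 90:
        lo = (offset // 30) * 30          # 0, 30, or 60: start of the 30-day bucket
        buckets[f'Interval_{lo:02d}_{lo + 29:02d}'] += 1
print(buckets)
```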

Group data by 15-day intervals

You can use the LAG function with its offset parameter to find the date of the 2nd previous post, then compute the date difference:

WITH questions AS (
    SELECT OwnerUserId
         , CreationDate AS PostDate
         , LAG(CreationDate, 2) OVER (PARTITION BY OwnerUserId ORDER BY CreationDate) AS PrevDate
    FROM Posts
    WHERE OwnerUserId IS NOT NULL        -- exclude community-owned posts
      AND PostTypeId = 1                 -- questions only
      AND CreationDate >= '2018-01-01'   -- posted in 2018
      AND CreationDate < '2019-01-01'
      AND Tags LIKE '%<sql>%'            -- tagged sql
)
SELECT *
FROM questions
WHERE DATEDIFF(DAY, PrevDate, PostDate) <= 14
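The LAG(CreationDate, 2) idea, comparing each post with the one two positions earlier, can be sketched in plain Python (made-up dates for one user):

```python
from datetime import date

# made-up question dates for one user, sorted ascending
dates = [date(2018, 1, 1), date(2018, 1, 5), date(2018, 1, 10), date(2018, 3, 1)]

# a post is flagged if the 2nd-previous post is at most 14 days earlier,
# i.e. three questions fall within a 15-day window
flagged = [d for i, d in enumerate(dates)
           if i >= 2 and (d - dates[i - 2]).days <= 14]
print(flagged)
```

Here only the Jan 10 post is flagged: it is the third question within 15 days (Jan 1 to Jan 10).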

