Aggregate 1-Minute Data into 5-Minute Average Data

Using the dplyr package, and assuming your data is stored in a data frame named df:

library(dplyr)

# DeviceTime must be a date-time (e.g. POSIXct) for cut() to accept "5 min" breaks
df %>%
  group_by(DeviceTime = cut(DeviceTime, breaks = "5 min")) %>%
  summarize(Concentration = mean(Concentration))
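
For reference, the same aggregation sketched in pandas (the column names DeviceTime and Concentration are taken from the snippet above; the sample data is made up):

import pandas as pd

# hypothetical 1-minute readings
df = pd.DataFrame({
    "DeviceTime": pd.date_range("2024-01-01 00:00", periods=6, freq="min"),
    "Concentration": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# resample into 5-minute bins and average each bin
out = df.set_index("DeviceTime").resample("5min")["Concentration"].mean()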

Use T-SQL window functions to retrieve 5-minute averages from 1-minute data

The gist of the problem is rounding datetime values down to a 5-minute boundary, which (assuming the datatype is datetime) can be done using DATEADD(MINUTE, DATEDIFF(MINUTE, 0, time) / 5 * 5, 0). The rest is basic grouping/window functions:

WITH cte AS (
    SELECT clamped_time
         , [Open]
         , [Close]
         , [High]
         , [Low]
         , [Volume]
         -- first/last row within each 5-minute bucket, used to pick Open/Close
         , rn1 = ROW_NUMBER() OVER (PARTITION BY clamped_time ORDER BY [Time])
         , rn2 = ROW_NUMBER() OVER (PARTITION BY clamped_time ORDER BY [Time] DESC)
    FROM t
    CROSS APPLY (
        -- round [Time] down to the nearest 5-minute boundary
        SELECT DATEADD(MINUTE, DATEDIFF(MINUTE, 0, [Time]) / 5 * 5, 0)
    ) AS x(clamped_time)
)
SELECT clamped_time
     , MIN(CASE WHEN rn1 = 1 THEN [Open] END) AS [Open]
     , MIN(CASE WHEN rn2 = 1 THEN [Close] END) AS [Close]
     , MAX([High]) AS [High]
     , MIN([Low]) AS [Low]
     , AVG([Volume]) AS [Volume]
FROM cte
GROUP BY clamped_time

Demo on db<>fiddle
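
The integer-division rounding is easy to sanity-check outside SQL. Here is a small Python sketch of the same arithmetic (pandas' built-in floor is shown only as a cross-check):

import pandas as pd

ts = pd.Timestamp("2021-01-01 10:07:30")

# minutes since a fixed base timestamp, mirroring DATEDIFF(MINUTE, 0, time);
# SQL Server's base date 0 is 1900-01-01, but any base lying on a 5-minute
# boundary gives the same result
minutes = (ts - pd.Timestamp(0)) // pd.Timedelta(minutes=1)

# integer-divide by 5 and multiply back, mirroring ... / 5 * 5
clamped = pd.Timestamp(0) + pd.Timedelta(minutes=minutes // 5 * 5)

print(clamped)           # 2021-01-01 10:05:00
print(ts.floor("5min"))  # same result via the built-in helper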

How to get average value for each hourly increment that is split into 5 minute intervals

You didn't specify the type of interval_time, so I'm assuming a string; you can parse it out with a CASE expression like this:

SELECT interval_date,
       -- map each 5-minute label to the end of its hour; 23:xx wraps to 00:00
       CASE WHEN SUBSTRING(interval_time, 4, 2) = '00' THEN interval_time
            WHEN SUBSTRING(interval_time, 1, 2) = '23' THEN '00:00'
            ELSE FORMAT(CONVERT(int, SUBSTRING(interval_time, 1, 2)) + 1, '00') + ':00'
       END AS interval_time,
       AVG(power)
FROM mytable
WHERE on_status = 'Y'
GROUP BY interval_date,
         CASE WHEN SUBSTRING(interval_time, 4, 2) = '00' THEN interval_time
              WHEN SUBSTRING(interval_time, 1, 2) = '23' THEN '00:00'
              ELSE FORMAT(CONVERT(int, SUBSTRING(interval_time, 1, 2)) + 1, '00') + ':00'
         END

Note that to get your target of 5.17, I had to comment out the on_status = 'Y' filter.

https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=b604e4fe6696465aac75676e69b92a47
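
The bucketing rule is easier to see outside SQL; a minimal Python sketch of the same mapping (the function name hour_end is mine, not from the answer):

def hour_end(t: str) -> str:
    """Map an 'HH:MM' interval label to the end of its hour."""
    if t[3:5] == "00":    # already on the hour: keep it
        return t
    if t[:2] == "23":     # 23:05..23:55 wrap around to midnight
        return "00:00"
    return f"{int(t[:2]) + 1:02d}:00"

print(hour_end("10:05"))  # 11:00
print(hour_end("23:30"))  # 00:00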

Aggregate time series data, get average without NA or 0s in R

The following uses cut to turn column V1 into a grouping variable with 5-minute intervals, then summarizes using a custom function that computes means without NAs or zero values. I have left this function on two code lines to make it more readable, but it could simply be

f <- function(x) mean(x[x != 0], na.rm = TRUE)

The date-time column V1 is first coerced to class "POSIXct".

library(data.table)

f <- function(x) {
  y <- x[x != 0]
  mean(y, na.rm = TRUE)
}

df[, V1 := as.POSIXct(V1)]
df[, V1 := cut(V1, "5 mins")]
df[, lapply(.SD, f), by = V1]
#                     V1      423      470      473 626
# 1: 2018-01-01 00:00:00 25.33333 28.00000 50.83333  31
# 2: 2018-01-01 00:05:00 30.25722 24.00000 49.25000  21
# 3: 2018-01-01 00:10:00 30.42409 21.50000 35.40000  75
# 4: 2018-01-01 00:15:00 26.41851 25.16667 36.83333  48

A one-liner could be

df[, lapply(.SD, f), by = cut(as.POSIXct(V1), "5 mins")]
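
For comparison, the same zero- and NA-excluding mean can be sketched in pandas (nz_mean and the sample data are mine, not from the answer):

import numpy as np
import pandas as pd

def nz_mean(x):
    # drop zeros, then average; pandas' mean() skips NaN by default
    return x[x != 0].mean()

idx = pd.to_datetime(["2018-01-01 00:00", "2018-01-01 00:01", "2018-01-01 00:06"])
df = pd.DataFrame({"v": [0.0, np.nan, 3.0]}, index=idx)

print(df.resample("5min").apply(nz_mean))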

Group DataFrame in 5-minute intervals

You can use df.resample to do aggregation based on a date/time variable. You'll need a datetime index, and you can specify that while reading the CSV file:

df = pd.read_csv("filename.csv", parse_dates = [["DATE", "TIME"]], index_col=0)

This will result in a dataframe with an index where date and time are combined:

df.head()
Out[7]:
                       OPEN    HIGH     LOW   CLOSE  VOLUME
DATE_TIME
1997-02-03 09:04:00  3046.0  3048.5  3046.0  3047.5     505
1997-02-03 09:05:00  3047.0  3048.0  3046.0  3047.0     162
1997-02-03 09:06:00  3047.5  3048.0  3047.0  3047.5      98
1997-02-03 09:07:00  3047.5  3047.5  3047.0  3047.5     228
1997-02-03 09:08:00  3048.0  3048.0  3047.5  3048.0     136

After that, you can use resample to get the sum, mean, etc. of those five-minute intervals.

df.resample("5T").mean()
Out[8]:
OPEN HIGH LOW CLOSE VOLUME
DATE_TIME
1997-02-03 09:00:00 3046.0 3048.5 3046.0 3047.5 505.0
1997-02-03 09:05:00 3047.6 3047.9 3046.8 3047.3 159.6
1997-02-03 09:10:00 3045.6 3045.9 3044.8 3045.0 110.2
1997-02-03 09:15:00 3043.6 3044.0 3042.8 3043.2 69.2
1997-02-03 09:20:00 3044.7 3045.2 3044.5 3045.0 65.8
1997-02-03 09:25:00 3043.8 3044.0 3043.5 3043.7 59.0
1997-02-03 09:30:00 3044.6 3045.0 3044.3 3044.6 56.0
1997-02-03 09:35:00 3044.5 3044.5 3043.5 3044.5 44.0

(T is the alias for minute frequency; the pandas offset-alias documentation lists the other units.)
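
Note that .mean() also averages the OPEN and CLOSE prices within each bin; for proper 5-minute bars you would usually aggregate each column differently. A sketch using the column names above:

bars = df.resample("5T").agg(
    {"OPEN": "first", "HIGH": "max", "LOW": "min", "CLOSE": "last", "VOLUME": "sum"}
)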

Aggregate 10 minute interval data to hourly

This worked perfectly:
df.resample('60T').mean()
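
Equivalently, the hourly alias can be used instead of 60 minutes:

df.resample('H').mean()  # 'H' (or 'h' in newer pandas) is the hourly alias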

aggregate column values at 5 min intervals and create a new dataframe

First make sure the timestamp column is a date-time column. You can skip this line if it is already in this format.

df1$timestamp <- as.POSIXct(df1$timestamp)

xts has some nice functions for working with time series, especially rolling and time-aggregating functions. In this case period.apply can help out.

library(xts)

# create an xts object; be sure to exclude the timestamp column, otherwise you end up with a character matrix
df1_xts <- as.xts(df1[, -4], order.by = df1$timestamp)

# sum per 5 minute intervals
df1_xts_summed <- period.apply(df1_xts, endpoints(df1_xts, on = "minutes", k = 5), colSums)

# count rows per 5 minute interval and add to data
df1_xts_summed$nrows <- period.apply(df1_xts$T1, endpoints(df1_xts, on = "minutes", k = 5), nrow)

df1_xts_summed

                        T1     T2    T3 nrows
2015-11-05 10:24:00  90.31  89.03 26.37     2
2015-11-05 10:29:16 136.24 133.60 39.56     3
2015-11-05 10:32:00  45.96  44.85 13.23     1
2015-11-05 10:37:00  90.88  89.18 26.78     2

If you want it all back into a data.frame:

df_final <- data.frame(timestamp = index(df1_xts_summed), coredata(df1_xts_summed))
df_final

            timestamp     T1     T2    T3 nrows
1 2015-11-05 10:24:00  90.31  89.03 26.37     2
2 2015-11-05 10:29:16 136.24 133.60 39.56     3
3 2015-11-05 10:32:00  45.96  44.85 13.23     1
4 2015-11-05 10:37:00  90.88  89.18 26.78     2

Edit: if you want everything rounded to 5-minute marks, with those as the timestamps, you need to do the following.

The first step is to replace the timestamps with the 5-minute intervals, taking into account the starting minute of the timestamps. For this I use ceiling_date from the lubridate package and add the difference between the first timestamp and the ceiling of the first timestamp. This returns the last value of each interval. (If you want the start of the interval instead, use floor_date.)

df1$timestamp <- lubridate::ceiling_date(df1$timestamp, "5 mins") +
  difftime(lubridate::ceiling_date(first(df1$timestamp), "5 mins"),
           first(df1$timestamp), units = "secs")

Next comes the same xts code as before, which returns the same data, but the timestamp is now the last value of each 5-minute interval.

df1_xts <- as.xts(df1[, -4], order.by = df1$timestamp)
ep <- endpoints(df1_xts, on = "minutes", k = 5)  # define the interval endpoints once
df1_xts_summed <- period.apply(df1_xts, ep, colSums)
df1_xts_summed$nrows <- period.apply(df1_xts$T1, ep, nrow)
df_final <- data.frame(timestamp = index(df1_xts_summed), coredata(df1_xts_summed))
df_final

            timestamp     T1     T2    T3 nrows
1 2015-11-05 10:27:00  90.31  89.03 26.37     2
2 2015-11-05 10:32:00 136.24 133.60 39.56     3
3 2015-11-05 10:37:00  45.96  44.85 13.23     1
4 2015-11-05 10:42:00  90.88  89.18 26.78     2

data:

df1 <- structure(list(T1 = c(45.37, 44.94, 45.32, 45.46, 45.46, 45.96, 
45.52, 45.36), T2 = c(44.48, 44.55, 44.44, 44.51, 44.65, 44.85,
44.56, 44.62), T3 = c(13, 13.37, 13.09, 13.29, 13.18, 13.23,
13.53, 13.25), timestamp = c("2015-11-05 10:23:00", "2015-11-05 10:24:00",
"2015-11-05 10:27:00", "2015-11-05 10:28:00", "2015-11-05 10:29:16",
"2015-11-05 10:32:00", "2015-11-05 10:36:00", "2015-11-05 10:37:00"
)), class = "data.frame", row.names = c(NA, -8L))
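
A rough pandas parallel of the same sum-plus-row-count aggregation, for comparison (a sketch; only the first rows of df1 are reproduced):

import pandas as pd

df1 = pd.DataFrame(
    {"T1": [45.37, 44.94, 45.32], "T2": [44.48, 44.55, 44.44], "T3": [13.0, 13.37, 13.09]},
    index=pd.to_datetime(["2015-11-05 10:23:00", "2015-11-05 10:24:00", "2015-11-05 10:27:00"]),
)

# right-closed, right-labelled bins mimic xts endpoints (interval ends)
grouped = df1.resample("5min", label="right", closed="right")
out = grouped.sum()
out["nrows"] = grouped.size()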

Converting minutes data to hourly data

Use Grouper to aggregate by hours together with the Unique_id column, summing the values:

df['datetime'] = pd.to_datetime(df['datetime'])
df = df.groupby([pd.Grouper(freq='H', key='datetime'), 'Unique_id']).sum().reset_index()
print(df)
             datetime  Unique_id  Value
0 2018-01-28 00:00:00     105714   1285
1 2018-01-28 00:00:00     206714    888
2 2018-01-28 23:00:00     105714    225
3 2018-01-28 23:00:00     206714    410
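
For reference, a self-contained version of the same call (the sample rows are hypothetical, chosen to reproduce two of the sums above):

import pandas as pd

df = pd.DataFrame({
    "datetime": ["2018-01-28 00:10:00", "2018-01-28 00:40:00", "2018-01-28 23:30:00"],
    "Unique_id": [105714, 105714, 206714],
    "Value": [600, 685, 410],
})

df["datetime"] = pd.to_datetime(df["datetime"])
out = df.groupby([pd.Grouper(freq="H", key="datetime"), "Unique_id"]).sum().reset_index()
print(out)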

