Regular analysis over irregular time series
Use align.time to align the index of s to the 3-hour periods you're interested in. Then use period.apply to count the observations in each 3-hour window. Finally, merge the result with an empty xts object that has all the index values you want.
# align index into 3-hour blocks
a <- align.time(s, n=60*60*3)
# find the number of obs in each block
count <- period.apply(a, endpoints(a, "hours", 3), length)
# create an empty xts object with the desired index
e <- xts(,seq(start(a),end(a),by="3 hours"))
# merge the counts with the empty object and fill with zeros
out <- merge(e,count,fill=0)
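The same pattern (counts per fixed window, with empty windows filled as zero) can be sketched in pandas, where resample both bins the data and inserts the empty intervals in one step; the timestamps below are made up for illustration:

```python
import pandas as pd

# Hypothetical irregular observations (timestamps are illustrative only)
idx = pd.to_datetime([
    "2024-01-01 00:15", "2024-01-01 01:40",
    "2024-01-01 07:05", "2024-01-01 07:50", "2024-01-01 08:59",
])
s = pd.Series(1, index=idx)

# Count observations per 3-hour block; right-closed/right-labelled bins
# mirror align.time, which pushes each timestamp to the end of its block.
# Blocks with no observations appear automatically with a count of 0.
counts = s.resample("3h", label="right", closed="right").count()
```

Unlike the xts version, no separate empty index or merge is needed, because resample spans every bin between the first and last observation.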
How do I aggregate and sum irregular time-series data based on a desired regular time interval in R?
Read the data frame into a zoo object and compute the times truncated to the minute, cumsum'ing over the values within each minute. Then generate a sequence of times from one minute before to one minute after the times in the data, remove any times already present in the data, and merge, filling with zeros. If you don't really need the added zero times, stop after computing zcum:
library(zoo)
z <- read.zoo(df, tz = "")
mins <- trunc(time(z), "mins")
zcum <- ave(z, mins, FUN = cumsum)
rng <- range(mins)
tt <- seq(rng[1] - 60, rng[2] + 60, by = "mins")
tt <- tt[ ! format(tt) %in% format(mins) ]
merge(zcum, zoo(, tt), fill = 0)
giving:
2015-02-05 16:27:00 2015-02-05 16:28:38 2015-02-05 16:29:36 2015-02-05 16:29:41
0.00 0.01 0.01 0.02
2015-02-05 16:30:00 2015-02-05 16:31:00
0.01 0.00
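The same approach translates to pandas: floor the timestamps to the minute, cumsum within each minute group, then reindex over the surrounding whole minutes with zero fill. The values below are hypothetical, chosen only to mirror the shape of the zoo example:

```python
import pandas as pd

# Hypothetical values; timestamps echo the zoo example's shape, not its data
idx = pd.to_datetime([
    "2015-02-05 16:27:00", "2015-02-05 16:28:38",
    "2015-02-05 16:29:36", "2015-02-05 16:29:41",
])
z = pd.Series([0.00, 0.01, 0.01, 0.01], index=idx)

# cumsum within each calendar minute, like ave(z, mins, FUN = cumsum)
zcum = z.groupby(z.index.floor("min")).cumsum()

# add whole minutes from one minute before to one minute after,
# filling the newly added times with zero
extra = pd.date_range(idx.min().floor("min") - pd.Timedelta("1min"),
                      idx.max().floor("min") + pd.Timedelta("1min"),
                      freq="min")
out = zcum.reindex(zcum.index.union(extra), fill_value=0.0)
```

reindex over the union of the two indexes plays the role of merge(zcum, zoo(, tt), fill = 0); existing timestamps keep their cumulative values and new ones get 0.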
How to analyse irregular time-series in R
I have analysed such irregular data in the past using an additive model to "decompose" the seasonal and trend components. As this is a regression-based approach you need to model the residuals as a time series process to account for lack of independence in the residuals.
I used the mgcv package for these analyses. Essentially the model fitted is:
require(mgcv)
require(nlme)
mod <- gamm(response ~ s(dayOfYear, bs = "cc") + s(timeOfSampling), data = foo,
correlation = corCAR1(form = ~ timeOfSampling))
This fits a cyclic spline in the day-of-year variable dayOfYear for the seasonal term, while the trend is represented by timeOfSampling, a numeric variable. The residuals are modelled here as a continuous-time AR(1), using the timeOfSampling variable as the time component of the CAR(1). This assumes that the correlation between residuals drops off exponentially with increasing temporal separation.
I have written some blog posts on some of these ideas:
- Smoothing temporally correlated data
- Additive modelling and the HadCRUT3v global mean temperature series
which contain additional R code for you to follow.
Generating regular time series from irregular time series in pandas
Starting with:
ERRORCODE ERRORTEXT SERVICENAME REQTDURATION RESPTDURATION \
10:00:27:000 NaN NaN serviceA 0 1
10:00:27:822 NaN NaN serviceB 0 1
10:01:27:622 -1 'Timeout' serviceA 1 0
10:01:27:323 NaN NaN serviceD 0 1
10:01:27:755 NaN NaN serviceA 0 1
10:02:27:666 -5 'Timeout' serviceA 0 1
10:02:27:111 NaN NaN serviceB 0 1
10:02:27:333 NaN NaN serviceC 0 1
HOSTDURATION
10:00:27:000 4612
10:00:27:822 14994
10:01:27:622 7695
10:01:27:323 2612
10:01:27:755 1612
10:02:27:666 11612
10:02:27:111 111112
10:02:27:333 412
Converting the index to a DatetimeIndex:
df.index = pd.to_datetime(df.index, format='%H:%M:%S:%f')
And then looping over the SERVICENAME groups:
for service, data in df.groupby('SERVICENAME'):
    # pd.TimeGrouper was removed in later pandas; pd.Grouper(freq='min') is the equivalent
    grouped = data.groupby(pd.Grouper(freq='min'))
    service_result = pd.concat(
        [grouped.size(),
         grouped[['REQTDURATION', 'RESPTDURATION', 'HOSTDURATION']].mean()],
        axis=1)
    service_result.columns = ['ERRORCOUNT', 'AVGREQTURATION',
                              'AVGRESPTDURATION', 'AVGHOSTDURATION']
    service_result.index = service_result.index.time
yields:
serviceA
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:00:00 1 0.0 1.0 4612.0
10:01:00 2 0.5 0.5 4653.5
10:02:00 1 0.0 1.0 11612.0
serviceB
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:00:00 1 0 1 14994
10:01:00 0 NaN NaN NaN
10:02:00 1 0 1 111112
serviceC
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:02:00 1 0 1 412
serviceD
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:01:00 1 0 1 2612
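The Python loop can also be avoided by grouping on the service and the minute together; this is a sketch with a small illustrative frame (values are made up), and note that, unlike the per-service loop above, it does not emit NaN rows for minutes in which a service had no traffic:

```python
import pandas as pd

# Illustrative data in the shape of the question's frame
idx = pd.to_datetime(['2024-01-01 10:00:27.000', '2024-01-01 10:00:27.822',
                      '2024-01-01 10:01:27.622', '2024-01-01 10:01:27.755'])
df = pd.DataFrame({
    'SERVICENAME':   ['serviceA', 'serviceB', 'serviceA', 'serviceA'],
    'REQTDURATION':  [0, 0, 1, 0],
    'RESPTDURATION': [1, 1, 0, 1],
    'HOSTDURATION':  [4612, 14994, 7695, 1612],
}, index=idx)

# One pass: group by service AND minute at the same time
g = df.groupby(['SERVICENAME', pd.Grouper(freq='min')])
out = g[['REQTDURATION', 'RESPTDURATION', 'HOSTDURATION']].mean()
out['COUNT'] = g.size()   # observations per (service, minute)
```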
Splitting irregular time series into regular monthly averages - R
Here's a start using data.table:
billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)
require(data.table)
DT = as.data.table(billdata)
First, change the type of the begin and end columns to Date. Unlike with a data.frame, this doesn't copy the entire dataset.
DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]
Then find the time span, find the prevailing bill for each day, and aggregate.
alldays = DT[,seq(min(begin),max(end),by="day")]
setkey(DT, acct, begin)
DT[CJ(unique(acct),alldays),
mean(amount/days,na.rm=TRUE),
by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]
acct month V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11 97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429
I think you'll find the prevailing join logic quite cumbersome in SQL, and slower.
I say it's a start because it's not quite correct. Notice that row 10 is repeated, because account 2243 doesn't stretch into 2010-02, unlike account 2242. To finish it off you could rbind in the last row for each account and use rolltolast instead of roll. Or perhaps create alldays by account rather than across all accounts.
See if speed is acceptable on the above, and we can go from there.
It's likely you will hit a bug in 1.8.2 that has been fixed in 1.8.3 (I'm using v1.8.3). The "internal" error message when combining a join containing missing groups with a group by is fixed, #2162. For example, X[Y,.N,by=NonJoinColumn] where Y contains some rows that don't match to X. This bug could also result in a segfault.
Let me know and we can either work around it, or upgrade to 1.8.3 from R-Forge.
Btw, nice example data. That made it quicker to answer.
Here's the full answer alluded to above. It's a bit tricky, I have to admit, as it combines several features of data.table. This should work in 1.8.2 as it happens, but I've only tested in 1.8.3.
DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
mean(amount/days,na.rm=TRUE),
by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]
acct month V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11 97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
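For comparison, the prevailing ("rolling") join can be sketched in pandas with merge_asof, which likewise rolls the most recent bill forward over a daily grid; with the question's data this reproduces the monthly averages shown above, including the repeated 2010-02 row for account 2243:

```python
import pandas as pd

# The bill data from the question
bills = pd.DataFrame({
    "acct":   [2242, 2242, 2242, 2242, 2243, 2243, 2243],
    "amount": [11349, 12252, 21774, 18293, 27217, 117, 14543],
    "begin":  pd.to_datetime(["2009-10-06", "2009-11-04", "2009-12-04",
                              "2010-01-08", "2009-10-06", "2009-11-04",
                              "2009-12-04"]),
    "end":    pd.to_datetime(["2009-11-04", "2009-12-04", "2010-01-08",
                              "2010-02-05", "2009-11-04", "2009-12-04",
                              "2010-01-08"]),
    "days":   [29, 30, 35, 28, 29, 30, 35],
})
bills["daily"] = bills["amount"] / bills["days"]

# Cross join of accounts x all days, like CJ(unique(acct), alldays)
alldays = pd.date_range(bills["begin"].min(), bills["end"].max(), freq="D")
grid = pd.MultiIndex.from_product(
    [bills["acct"].unique(), alldays], names=["acct", "day"]
).to_frame(index=False)

# Rolling join: each day picks the most recent bill at or before it (roll=TRUE)
rolled = pd.merge_asof(grid.sort_values("day"), bills.sort_values("begin"),
                       left_on="day", right_on="begin", by="acct")

# Average the prevailing daily rate within each calendar month
monthly = rolled.groupby(
    ["acct", rolled["day"].dt.strftime("%Y-%m")])["daily"].mean()
```

As in the data.table version, the grid spans all accounts, so account 2243 also gets a 2010-02 row rolled forward from its last bill.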