Regular analysis over irregular time series
Use align.time to align the index of s to the 3-hour periods you're interested in. Then use period.apply to count the observations in each 3-hour window. Finally, merge the result with an empty xts object that has all the index values you want.
# align index into 3-hour blocks
a <- align.time(s, n=60*60*3)
# find the number of obs in each block
count <- period.apply(a, endpoints(a, "hours", 3), length)
# create an empty xts object with the desired index
e <- xts(,seq(start(a),end(a),by="3 hours"))
# merge the counts with the empty object and fill with zeros
out <- merge(e,count,fill=0)
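The same pattern (counts per fixed window, with empty windows filled as zero) can be sketched in pandas, where resample both bins the data and inserts the empty intervals in one step; the timestamps below are made up for illustration:

```python
import pandas as pd

# Hypothetical irregular observations (timestamps are illustrative only)
idx = pd.to_datetime([
    "2024-01-01 00:15", "2024-01-01 01:40",
    "2024-01-01 07:05", "2024-01-01 07:50", "2024-01-01 08:59",
])
s = pd.Series(1, index=idx)

# Count observations per 3-hour block; right-closed/right-labelled bins
# mirror align.time, which pushes each timestamp to the end of its block.
# Blocks with no observations appear automatically with a count of 0.
counts = s.resample("3h", label="right", closed="right").count()
```

Unlike the xts version, no separate empty index or merge is needed, because resample spans every bin between the first and last observation.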
How do I aggregate and sum irregular time-series data based on a desired regular time interval in R?
Read the data frame into a zoo object and compute the times truncated to the minute, cumsum'ing over the values within each minute. Then generate a sequence of times from one minute before to one minute after the times in the data, remove any times already present in the data, and merge, filling with zeros. If you don't really need the added zero times, stop after computing zcum:
library(zoo)
z <- read.zoo(df, tz = "")
mins <- trunc(time(z), "mins")
zcum <- ave(z, mins, FUN = cumsum)
rng <- range(mins)
tt <- seq(rng[1] - 60, rng[2] + 60, by = "mins")
tt <- tt[ ! format(tt) %in% format(mins) ]
merge(zcum, zoo(, tt), fill = 0)
giving:
2015-02-05 16:27:00 2015-02-05 16:28:38 2015-02-05 16:29:36 2015-02-05 16:29:41
0.00 0.01 0.01 0.02
2015-02-05 16:30:00 2015-02-05 16:31:00
0.01 0.00
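The same approach translates to pandas: floor the timestamps to the minute, cumsum within each minute group, then reindex over the surrounding whole minutes with zero fill. The values below are hypothetical, chosen only to mirror the shape of the zoo example:

```python
import pandas as pd

# Hypothetical values; timestamps echo the zoo example's shape, not its data
idx = pd.to_datetime([
    "2015-02-05 16:27:00", "2015-02-05 16:28:38",
    "2015-02-05 16:29:36", "2015-02-05 16:29:41",
])
z = pd.Series([0.00, 0.01, 0.01, 0.01], index=idx)

# cumsum within each calendar minute, like ave(z, mins, FUN = cumsum)
zcum = z.groupby(z.index.floor("min")).cumsum()

# add whole minutes from one minute before to one minute after,
# filling the newly added times with zero
extra = pd.date_range(idx.min().floor("min") - pd.Timedelta("1min"),
                      idx.max().floor("min") + pd.Timedelta("1min"),
                      freq="min")
out = zcum.reindex(zcum.index.union(extra), fill_value=0.0)
```

reindex over the union of the two indexes plays the role of merge(zcum, zoo(, tt), fill = 0); existing timestamps keep their cumulative values and new ones get 0.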
How to analyse irregular time-series in R
I have analysed such irregular data in the past using an additive model to "decompose" the seasonal and trend components. As this is a regression-based approach you need to model the residuals as a time series process to account for lack of independence in the residuals.
I used the mgcv package for these analyses. Essentially the model fitted is:
require(mgcv)
require(nlme)
mod <- gamm(response ~ s(dayOfYear, bs = "cc") + s(timeOfSampling), data = foo,
correlation = corCAR1(form = ~ timeOfSampling))
This fits a cyclic spline in the day-of-year variable dayOfYear for the seasonal term, while the trend is represented by timeOfSampling, a numeric variable. The residuals are modelled here as a continuous-time AR(1), using the timeOfSampling variable as the time component of the CAR(1). This assumes that the correlation between residuals drops off exponentially with increasing temporal separation.
I have written some blog posts on some of these ideas:
- Smoothing temporally correlated data
- Additive modelling and the HadCRUT3v global mean temperature series
which contain additional R code for you to follow.
Generating regular time series from irregular time series in pandas
Starting with:
ERRORCODE ERRORTEXT SERVICENAME REQTDURATION RESPTDURATION \
10:00:27:000 NaN NaN serviceA 0 1
10:00:27:822 NaN NaN serviceB 0 1
10:01:27:622 -1 'Timeout' serviceA 1 0
10:01:27:323 NaN NaN serviceD 0 1
10:01:27:755 NaN NaN serviceA 0 1
10:02:27:666 -5 'Timeout' serviceA 0 1
10:02:27:111 NaN NaN serviceB 0 1
10:02:27:333 NaN NaN serviceC 0 1
HOSTDURATION
10:00:27:000 4612
10:00:27:822 14994
10:01:27:622 7695
10:01:27:323 2612
10:01:27:755 1612
10:02:27:666 11612
10:02:27:111 111112
10:02:27:333 412
Converting the index to a DatetimeIndex:
df.index = pd.to_datetime(df.index, format='%H:%M:%S:%f')
And then looping over the SERVICENAME groups:
for service, data in df.groupby('SERVICENAME'):
    # pd.TimeGrouper was removed in later pandas; pd.Grouper(freq='min') is the equivalent
    grouped = data.groupby(pd.Grouper(freq='min'))
    service_result = pd.concat(
        [grouped.size(),
         grouped[['REQTDURATION', 'RESPTDURATION', 'HOSTDURATION']].mean()],
        axis=1)
    service_result.columns = ['ERRORCOUNT', 'AVGREQTURATION',
                              'AVGRESPTDURATION', 'AVGHOSTDURATION']
    service_result.index = service_result.index.time
yields:
serviceA
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:00:00 1 0.0 1.0 4612.0
10:01:00 2 0.5 0.5 4653.5
10:02:00 1 0.0 1.0 11612.0
serviceB
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:00:00 1 0 1 14994
10:01:00 0 NaN NaN NaN
10:02:00 1 0 1 111112
serviceC
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:02:00 1 0 1 412
serviceD
ERRORCOUNT AVGREQTURATION AVGRESPTDURATION AVGHOSTDURATION
10:01:00 1 0 1 2612
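The Python loop can also be avoided by grouping on the service and the minute together; this is a sketch with a small illustrative frame (values are made up), and note that, unlike the per-service loop above, it does not emit NaN rows for minutes in which a service had no traffic:

```python
import pandas as pd

# Illustrative data in the shape of the question's frame
idx = pd.to_datetime(['2024-01-01 10:00:27.000', '2024-01-01 10:00:27.822',
                      '2024-01-01 10:01:27.622', '2024-01-01 10:01:27.755'])
df = pd.DataFrame({
    'SERVICENAME':   ['serviceA', 'serviceB', 'serviceA', 'serviceA'],
    'REQTDURATION':  [0, 0, 1, 0],
    'RESPTDURATION': [1, 1, 0, 1],
    'HOSTDURATION':  [4612, 14994, 7695, 1612],
}, index=idx)

# One pass: group by service AND minute at the same time
g = df.groupby(['SERVICENAME', pd.Grouper(freq='min')])
out = g[['REQTDURATION', 'RESPTDURATION', 'HOSTDURATION']].mean()
out['COUNT'] = g.size()   # observations per (service, minute)
```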
Splitting irregular time series into regular monthly averages - R
Here's a start using data.table:
billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)
require(data.table)
DT = as.data.table(billdata)
First, change the type of the begin and end columns to Date. Unlike with a data.frame, this doesn't copy the entire dataset.
DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]
Then find the time span, find the prevailing bill for each day, and aggregate.
alldays = DT[,seq(min(begin),max(end),by="day")]
setkey(DT, acct, begin)
DT[CJ(unique(acct),alldays),
mean(amount/days,na.rm=TRUE),
by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]
acct month V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11 97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429
I think you'll find the prevailing join logic quite cumbersome in SQL, and slower.
I say it's a start because it's not quite correct. Notice that row 10 is repeated, because account 2243 doesn't stretch into 2010-02, unlike account 2242. To finish it off you could rbind in the last row for each account and use rolltolast instead of roll. Or perhaps create alldays by account rather than across all accounts.
See if speed is acceptable on the above, and we can go from there.
It's likely you will hit a bug in 1.8.2 that has been fixed in 1.8.3 (I'm using v1.8.3). The "internal" error message when combining a join containing missing groups with a group by is fixed, #2162. For example, X[Y,.N,by=NonJoinColumn] where Y contains some rows that don't match to X. This bug could also result in a segfault.
Let me know and we can either work around it, or upgrade to 1.8.3 from R-Forge.
Btw, nice example data. That made it quicker to answer.
Here's the full answer alluded to above. It's a bit tricky, I have to admit, as it combines several features of data.table. This should work in 1.8.2 as it happens, but I've only tested in 1.8.3.
DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
mean(amount/days,na.rm=TRUE),
by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]
acct month V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11 97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
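For comparison, the prevailing ("rolling") join can be sketched in pandas with merge_asof, which likewise rolls the most recent bill forward over a daily grid; with the question's data this reproduces the monthly averages shown above, including the repeated 2010-02 row for account 2243:

```python
import pandas as pd

# The bill data from the question
bills = pd.DataFrame({
    "acct":   [2242, 2242, 2242, 2242, 2243, 2243, 2243],
    "amount": [11349, 12252, 21774, 18293, 27217, 117, 14543],
    "begin":  pd.to_datetime(["2009-10-06", "2009-11-04", "2009-12-04",
                              "2010-01-08", "2009-10-06", "2009-11-04",
                              "2009-12-04"]),
    "end":    pd.to_datetime(["2009-11-04", "2009-12-04", "2010-01-08",
                              "2010-02-05", "2009-11-04", "2009-12-04",
                              "2010-01-08"]),
    "days":   [29, 30, 35, 28, 29, 30, 35],
})
bills["daily"] = bills["amount"] / bills["days"]

# Cross join of accounts x all days, like CJ(unique(acct), alldays)
alldays = pd.date_range(bills["begin"].min(), bills["end"].max(), freq="D")
grid = pd.MultiIndex.from_product(
    [bills["acct"].unique(), alldays], names=["acct", "day"]
).to_frame(index=False)

# Rolling join: each day picks the most recent bill at or before it (roll=TRUE)
rolled = pd.merge_asof(grid.sort_values("day"), bills.sort_values("begin"),
                       left_on="day", right_on="begin", by="acct")

# Average the prevailing daily rate within each calendar month
monthly = rolled.groupby(
    ["acct", rolled["day"].dt.strftime("%Y-%m")])["daily"].mean()
```

As in the data.table version, the grid spans all accounts, so account 2243 also gets a 2010-02 row rolled forward from its last bill.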