Regular Analysis Over Irregular Time Series

Regular analysis over irregular time series

Use align.time to round the index of s up into the 3-hour periods you're interested in. Then use period.apply to count the observations in each 3-hour window. Finally, merge the counts with an empty xts object that has all the index values you want, filling the gaps with zeros.

library(xts)

# align the index into 3-hour blocks
a <- align.time(s, n = 60*60*3)
# count the number of observations in each block
count <- period.apply(a, endpoints(a, "hours", 3), length)
# create an empty xts object with the desired regular index
e <- xts(, seq(start(a), end(a), by = "3 hours"))
# merge the counts with the empty object and fill gaps with zeros
out <- merge(e, count, fill = 0)
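
Here s can be any irregular xts series. For a self-contained run, a minimal sketch with made-up data (the timestamps and values are purely illustrative):

library(xts)
set.seed(1)
# 50 observations at random times spread over two days
idx <- as.POSIXct("2023-01-01", tz = "UTC") + sort(runif(50, 0, 2 * 24 * 3600))
s <- xts(rnorm(50), order.by = idx)
# ...run the snippet above, then inspect the regular 3-hour counts
head(out)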

How do I aggregate and sum irregular time-series data based on a desired regular time interval in R?

Read the data frame into a zoo object and compute the times truncated to the minute, cumsum'ing over the values within each minute. Then build a sequence of times from one minute before to one minute after the times in the data, remove any times already present in the data, and merge, filling with zeros. If you don't actually need the added zero times, stop after computing zcum:

library(zoo)

# df is assumed to have date-times in its first column and values in the second
z <- read.zoo(df, tz = "")

# truncate each time to the minute and cumsum the values within each minute
mins <- trunc(time(z), "mins")
zcum <- ave(z, mins, FUN = cumsum)

# pad from one minute before to one minute after, dropping minutes already present
rng <- range(mins)
tt <- seq(rng[1] - 60, rng[2] + 60, by = "mins")
tt <- tt[ ! format(tt) %in% format(mins) ]
merge(zcum, zoo(, tt), fill = 0)

giving:

2015-02-05 16:27:00 2015-02-05 16:28:38 2015-02-05 16:29:36 2015-02-05 16:29:41 
               0.00                0.01                0.01                0.02 
2015-02-05 16:30:00 2015-02-05 16:31:00 
               0.01                0.00 
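
For reference, a df consistent with the output above would look like the one below; the input values are reconstructed from the printed cumulative sums, so treat them as an assumption:

df <- data.frame(
  time  = c("2015-02-05 16:28:38", "2015-02-05 16:29:36",
            "2015-02-05 16:29:41", "2015-02-05 16:30:00"),
  value = c(0.01, 0.01, 0.01, 0.01)
)

read.zoo takes the first column as the index and the remaining column as the values, and tz = "" makes the index POSIXct.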

How to analyse irregular time-series in R

I have analysed such irregular data in the past using an additive model to "decompose" the seasonal and trend components. As this is a regression-based approach, you need to model the residuals as a time-series process to account for their lack of independence.

I used the mgcv package for these analyses. Essentially, the model fitted is:

require(mgcv)
require(nlme)
mod <- gamm(response ~ s(dayOfYear, bs = "cc") + s(timeOfSampling),
            data = foo,
            correlation = corCAR1(form = ~ timeOfSampling))

This fits a cyclic spline in the day-of-the-year variable dayOfYear for the seasonal term, while the trend is represented by timeOfSampling, a numeric variable. The residuals are modelled as a continuous-time AR(1) process, using the timeOfSampling variable as the time component of the CAR(1). This assumes that the correlation between residuals drops off exponentially with increasing temporal separation.
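
Once fitted, the two components of the returned object can be inspected separately; a short sketch, using the names from the model above:

summary(mod$gam)          # EDFs and significance of the seasonal and trend smooths
plot(mod$gam, pages = 1)  # draw both smooths on a single page
summary(mod$lme)          # includes the estimated CAR(1) correlation parameter (Phi)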

I have written blog posts on some of these ideas:

  1. Smoothing temporally correlated data
  2. Additive modelling and the HadCRUT3v global mean temperature series

which contain additional R code for you to follow.

Generating regular time series from irregular time series in pandas

Starting with:

              ERRORCODE  ERRORTEXT SERVICENAME  REQTDURATION  RESPTDURATION  HOSTDURATION
10:00:27:000        NaN        NaN    serviceA             0              1          4612
10:00:27:822        NaN        NaN    serviceB             0              1         14994
10:01:27:622         -1  'Timeout'    serviceA             1              0          7695
10:01:27:323        NaN        NaN    serviceD             0              1          2612
10:01:27:755        NaN        NaN    serviceA             0              1          1612
10:02:27:666         -5  'Timeout'    serviceA             0              1         11612
10:02:27:111        NaN        NaN    serviceB             0              1        111112
10:02:27:333        NaN        NaN    serviceC             0              1           412

Converting the index to a DatetimeIndex:

df.index = pd.to_datetime(df.index, format='%H:%M:%S:%f')

And then looping over SERVICENAME groups:

for service, data in df.groupby('SERVICENAME'):
    # pd.TimeGrouper is deprecated; pd.Grouper is the current equivalent
    grouped = data.groupby(pd.Grouper(freq='Min'))
    service_result = pd.concat(
        [grouped.size(),
         grouped[['REQTDURATION', 'RESPTDURATION', 'HOSTDURATION']].mean()],
        axis=1)
    service_result.columns = ['ERRORCOUNT', 'AVGREQTDURATION',
                              'AVGRESPTDURATION', 'AVGHOSTDURATION']
    service_result.index = service_result.index.time

yields:

serviceA

          ERRORCOUNT  AVGREQTDURATION  AVGRESPTDURATION  AVGHOSTDURATION
10:00:00           1              0.0               1.0           4612.0
10:01:00           2              0.5               0.5           4653.5
10:02:00           1              0.0               1.0          11612.0

serviceB

          ERRORCOUNT  AVGREQTDURATION  AVGRESPTDURATION  AVGHOSTDURATION
10:00:00           1                0                 1            14994
10:01:00           0              NaN               NaN              NaN
10:02:00           1                0                 1           111112

serviceC

          ERRORCOUNT  AVGREQTDURATION  AVGRESPTDURATION  AVGHOSTDURATION
10:02:00           1                0                 1              412

serviceD

          ERRORCOUNT  AVGREQTDURATION  AVGRESPTDURATION  AVGHOSTDURATION
10:01:00           1                0                 1             2612

Splitting irregular time series into regular monthly averages - R

Here's a start using data.table:

billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)

require(data.table)
DT = as.data.table(billdata)

First, change the type of the begin and end columns to Date. Unlike with data.frame, this doesn't copy the entire dataset.

DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]
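
If you want to convince yourself that := really updates by reference, comparing the object's address before and after an update is a quick check (a sketch; the days update is just an illustrative no-op):

a1 <- data.table::address(DT)
DT[, days := as.integer(days)]            # any := update will do
identical(a1, data.table::address(DT))    # TRUE: same object, modified in place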

Then construct the full span of days, find the prevailing bill for each day with a rolling join, and aggregate by month.

alldays = DT[, seq(min(begin), max(end), by = "day")]

setkey(DT, acct, begin)

DT[CJ(unique(acct), alldays),
   mean(amount/days, na.rm = TRUE),
   by = list(acct, month = format(begin, "%Y-%m")), roll = TRUE]

    acct   month        V1
 1: 2242 2009-10 391.34483
 2: 2242 2009-11 406.69448
 3: 2242 2009-12 601.43226
 4: 2242 2010-01 646.27465
 5: 2242 2010-02 653.32143
 6: 2243 2009-10 938.51724
 7: 2243 2009-11  97.36172
 8: 2243 2009-12 375.68065
 9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429

I think you'll find the prevailing join logic quite cumbersome in SQL, and slower.

This is only a start because it's not quite correct: notice that row 10 is repeated, because account 2243 doesn't stretch into 2010-02, unlike account 2242. To finish it off you could rbind in the last row for each account and use rolltolast instead of roll, or create alldays by account rather than across all accounts (the full answer below takes the latter approach).

See if speed is acceptable on the above, and we can go from there.

It's likely you will hit a bug in data.table 1.8.2 that has been fixed in 1.8.3. I'm using v1.8.3.

"Internal" error message when combining join containing missing groups and group by
is fixed, #2162. For example :
X[Y,.N,by=NonJoinColumn]
where Y contains some rows that don't match to X. This bug could also result in a seg
fault.

Let me know and we can either work around it, or upgrade to 1.8.3 from R-Forge.

Btw, nice example data. That made it quicker to answer.


Here's the full answer alluded to above. It's a bit tricky, I have to admit, as it combines several features of data.table. This should work in 1.8.2 as it happens, but I've only tested it in 1.8.3.

DT[ setkey(DT[, seq(begin[1], last(end), by = "day"), by = acct]),
    mean(amount/days, na.rm = TRUE),
    by = list(acct, month = format(begin, "%Y-%m")), roll = TRUE]

   acct   month        V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11  97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
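
To see what the inner expression is doing, the per-account day grid can be built and inspected on its own; a short sketch reusing the same DT:

# one row per account per day, from its first begin to its last end
grid <- DT[, seq(begin[1], last(end), by = "day"), by = acct]
setkey(grid)             # keys on acct and the generated day column (V1)
grid[, .N, by = acct]    # 2242 covers more days than 2243

With roll = TRUE, each day in the grid matches the most recent bill whose begin is on or before that day, so every day picks up its prevailing amount/days before the monthly mean is taken.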

