Basic Lag in R Vector/Dataframe

Another way to deal with this is using the zoo package, which has a lag method that will pad the result with NA:

> library(zoo)
> set.seed(123)
> x <- zoo(sample(1:9, 10, replace = TRUE))
> y <- lag(x, -1, na.pad = TRUE)
> cbind(x, y)
   x  y
1  3 NA
2  8  3
3  4  8
4  8  4
5  9  8
6  1  9
7  5  1
8  9  5
9  5  9
10 5  5

The result is a multivariate zoo object (an enhanced matrix), but it is easily converted to a data.frame via

> data.frame(cbind(x, y))

What's the opposite function to lag for an R vector/dataframe?

How about the 'lead' function from the dplyr package?
Doesn't it do exactly the job of Ahmed's function?

cbind(x, lead(y, 1))

If you want to be able to calculate either positive or negative lags in the same function, I suggest a 'shorter' version of his 'shift' function:

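shift = function(x, lag) {
  # sign(lag)/2 + 1.5 is 1 for a negative lag (lead) and 2 for a positive lag (lag)
  # dplyr:: makes explicit which lead/lag is meant, whatever else is loaded
  switch(sign(lag)/2+1.5, dplyr::lead(x, abs(lag)), dplyr::lag(x, abs(lag)))
}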

It creates two cases, one using lag and the other using lead, and chooses between them according to the sign of your lag (the +1.5 is a trick that maps {-1, +1} to {1, 2}, the positions of the two alternatives in switch).
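To see the index trick in isolation, here is a minimal sketch in base R. The helper name case_index is made up for illustration and is not part of dplyr or of the function above.

```r
# Hypothetical helper: maps the sign of k to a position usable by switch()
case_index <- function(k) sign(k) / 2 + 1.5

case_index(-3)  # 1, so switch() picks its first alternative
case_index(3)   # 2, so switch() picks its second alternative

# switch() with a numeric first argument selects an alternative by position
picked_neg <- switch(case_index(-3), "lead", "lag")
picked_pos <- switch(case_index(3), "lead", "lag")
```

Here picked_neg is "lead" and picked_pos is "lag", which is the same dispatch the function performs on the real dplyr calls.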

How can I automatically create n lags in a timeseries?

If you are looking for efficiency, try data.table's new shift function:

library(data.table) # V >= 1.9.5
n <- 2
setDT(df)[, paste("t", 1:n) := shift(t, 1:n)][]
#    t t 1 t 2
# 1: 1  NA  NA
# 2: 2   1  NA
# 3: 3   2   1
# 4: 4   3   2
# 5: 5   4   3
# 6: 6   5   4

Here you can set any names for your new columns (within paste). You also don't need to bind the result back to the original, because the := operator updates your data set by reference.
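If data.table is not available, the same columns can be sketched in base R. The helper make_lags and the column prefix t_lag are illustrative names, and this version returns a new data frame rather than updating by reference.

```r
# Hypothetical helper: builds lags 1..n of a vector as data frame columns.
# Assumes n is smaller than length(v).
make_lags <- function(v, n) {
  lags <- lapply(seq_len(n), function(k) c(rep(NA, k), head(v, length(v) - k)))
  names(lags) <- paste0("t_lag", seq_len(n))
  as.data.frame(lags)
}

df <- data.frame(t = 1:6)
result <- cbind(df, make_lags(df$t, 2))
result
#   t t_lag1 t_lag2
# 1 1     NA     NA
# 2 2      1     NA
# 3 3      2      1
# 4 4      3      2
# 5 5      4      3
# 6 6      5      4
```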

Lag in R dataframe

Just work through the steps you outlined one at a time and it isn't so bad.

First I'll read in your data by copying it:

df <- read.csv(file('clipboard'))

Then I'll sort to make sure the data frame is ordered by houseid, then personid, then tripid:

# first sort so that it's ordered by Houseid, then Personid, then Tripid:
df <- with(df, df[order(Houseid,Personid,Tripid),])

Then follow the steps you specified:

# take value in TripendTAZ and put it in DestTAZ
df$DestTAZ <- df$TripendTAZ

# Set OrigTAZ = value from previous row
df$OrigTAZ <- c(NA,df$TripendTAZ[-nrow(df)])

# For the first trip of every person in a household (Tripid = 1),
#  OrigTAZ = hometaz. 
df$OrigTAZ[ df$Tripid==1 ] <- df$hometaz[ df$Tripid==1 ]

You'll notice that df is then what you're after.
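For readers without the original data, here are the same steps as a self-contained sketch. The column names follow the code above, but the values are invented purely for illustration.

```r
# Made-up trips: two households, three people
df <- data.frame(
  Houseid    = c(1, 1, 1, 2, 2),
  Personid   = c(1, 1, 2, 1, 1),
  Tripid     = c(1, 2, 1, 1, 2),
  hometaz    = c(10, 10, 10, 20, 20),
  TripendTAZ = c(11, 10, 12, 21, 20)
)

# Sort, copy the destination, then take the origin from the previous row
df <- df[order(df$Houseid, df$Personid, df$Tripid), ]
df$DestTAZ <- df$TripendTAZ
df$OrigTAZ <- c(NA, df$TripendTAZ[-nrow(df)])

# The first trip of each person starts at home
first_trip <- df$Tripid == 1
df$OrigTAZ[first_trip] <- df$hometaz[first_trip]

df$OrigTAZ  # 10 11 10 20 21
```

Overwriting the first trips last is what stops a value from the previous person leaking into the next person's first row.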

How to create lag variables

In base R the function lag() is useful for time series objects. Here you have a dataframe and the situation is somewhat different.

You could try the following, which I admit is not very elegant:

df2$l1pm10 <- sapply(1:nrow(df2), function(x) df2$pm10[x+1])
df2$l1pm102 <- sapply(1:nrow(df2), function(x) df2$pm10[x-1])
#> df2
#   var1     pm10   l1pm10  l1pm102
#1     1 26.95607       NA         
#2     2       NA 32.83869 26.95607
#3     3 32.83869 39.95607       NA
#4     4 39.95607       NA 32.83869
#5     5       NA 40.95607 39.95607
#6     6 40.95607 33.95607       NA
#7     7 33.95607 28.95607 40.95607
#8     8 28.95607 32.34877 33.95607
#9     9 32.34877       NA 28.95607
#10   10       NA       NA 32.34877
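The blank in the first row of l1pm102 appears to come from df2$pm10[0] being a zero-length vector, which stops sapply() from simplifying its result to a numeric vector. A vectorised sketch that avoids this, using a shortened, made-up pm10 vector:

```r
pm10 <- c(26.9, NA, 32.8, 39.9)
n <- length(pm10)

# pm10[0] has length zero, so sapply() cannot simplify and returns a list
lag_sapply <- sapply(seq_len(n), function(i) pm10[i - 1])
is.list(lag_sapply)  # TRUE

# Vectorised versions always return plain numeric vectors of length n
lead1 <- c(pm10[-1], NA)  # next value:     NA 32.8 39.9   NA
lag1  <- c(NA, pm10[-n])  # previous value: NA 26.9   NA 32.8
```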

An alternative is to use the Lag() function (with a capital "L") from the Hmisc package:

library(Hmisc)
df2$l1pm10 <- Lag(df2$pm10, -1)
df2$l1pm102 <- Lag(df2$pm10, +1)
#> df2
#   var1     pm10   l1pm10  l1pm102
#1     1 26.95607       NA       NA
#2     2       NA 32.83869 26.95607
#3     3 32.83869 39.95607       NA
#4     4 39.95607       NA 32.83869
#5     5       NA 40.95607 39.95607
#6     6 40.95607 33.95607       NA
#7     7 33.95607 28.95607 40.95607
#8     8 28.95607 32.34877 33.95607
#9     9 32.34877       NA 28.95607
#10   10       NA       NA 32.34877

Using lag function gives an atomic vector with all zeroes

There might be a conflict with the lag function from other packages, which would explain why this code worked in other scripts but not in this one.

Try stats::lag instead of just lag to make explicit which package you want to use (or dplyr::lag, which seems to work better, for me at least).
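A short sketch of what explicit namespacing looks like, using only base R. The vector is made up, and the last line is just one way to check which lag an unqualified call would currently reach.

```r
x <- 1:5

# stats::lag() on a plain vector keeps the values and only sets a tsp attribute
shifted <- stats::lag(x, 1)
as.vector(shifted)    # 1 2 3 4 5, the values are unchanged
attr(shifted, "tsp")  # 0 4 1

# Name of the package that the unqualified name lag resolves to right now
environmentName(environment(lag))
```

If the last line reports a package other than the one you expect, a later library() call has masked the function you wanted.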

Lagging/Differencing Variables in R

lag() expects a time series. (In R, class "ts" is the basic time-series class, used to represent data sampled at equispaced points in time. For more see ?ts.) So you can either convert x to a time series, as demonstrated here, or use one of the approaches suggested in another answer.

x <- as.ts(1:10)
y <- lag(x,1)
xy <- cbind(x,y)
xy
#Time Series:
#Start = 0 
#End = 10 
#Frequency = 1 
#    x  y
# 0 NA  1
# 1  1  2
# 2  2  3
# 3  3  4
# 4  4  5
# 5  5  6
# 6  6  7
# 7  7  8
# 8  8  9
# 9  9 10
#10 10 NA
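Since the question also mentions differencing, here is a hedged sketch of how a lagged "ts" object can be used for it. The series values are invented. Arithmetic on two "ts" objects is carried out on their common time range.

```r
x <- ts(c(3, 5, 9, 10))

# With k = -1 the value at time t is the value x had at time t - 1
prev <- stats::lag(x, -1)

# Subtraction is aligned on the common times 2, 3, 4
d <- x - prev
as.numeric(d)        # 2 4 1
as.numeric(diff(x))  # 2 4 1, the built-in first difference gives the same
```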

What does lag function in R do?

plain vectors

lag is a generic, which means it can act differently on objects of different classes. We first discuss how it works with a plain vector; later sections cover "ts", "zoo" (and "zooreg") class objects and how they are lagged. As an example, we use this vector of values:

x <- c(11, 12, 13, 14)

tsp

Realize that a time series is a sequence of times and the values at those times. Here we only have the values but not the times, so lag conceptually adds regularly spaced default times of 1, 2, 3, 4 by adding a tsp attribute, which is a triple that encodes the start time, the end time and the frequency (i.e. the reciprocal of the distance between successive times). We can encode the times 1, 2, 3, 4 as the tsp attribute c(1, 4, 1). 1 is the start time. 4 is the end time. The time points are all 1 apart (because the time differences 2-1, 3-2 and 4-3 each equal 1) and 1/1 = 1, so the frequency is 1. A quarterly series whose times are measured in years would have a frequency of 4, since successive quarters are 0.25 apart and 1/0.25 = 4. Similarly, a monthly series measured in years would have a frequency of 12.
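The tsp triple can be inspected directly. A small sketch using base R's hasTsp() and tsp(). The quarterly series and its start year are made up for illustration.

```r
x <- c(11, 12, 13, 14)

# hasTsp() adds the default tsp attribute to an object that lacks one
attr(hasTsp(x), "tsp")  # 1 4 1, i.e. times 1, 2, 3, 4

# Eight quarterly observations starting in the first quarter of 2020
q <- ts(1:8, start = c(2020, 1), frequency = 4)
tsp(q)         # 2020.00 2021.75 4.00
1 / tsp(q)[3]  # 0.25, the spacing between successive quarters in years
```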

lag

lag shifts the times one back. It does not change the values, only the times. Thus lag changes the tsp attribute from c(1, 4, 1) to c(0, 3, 1). The start time is shifted from 1 to 0, the end time is shifted from 4 to 3, and since shifts do not change the frequency, the frequency remains 1.

> lag(x)
[1] 11 12 13 14
attr(,"tsp")
[1] 0 3 1

time

The time function will produce an object whose values are the times of its argument and whose tsp attribute is the same as the tsp attribute of its argument (or the default tsp attribute if none). For example, as we already discussed, the code below shows that the times of the plain vector x given above are 1, 2, 3, 4 and the times for lag(x) are 0, 1, 2, 3.

> time(x)
[1] 1 2 3 4
attr(,"tsp")
[1] 1 4 1
> time(lag(x))
[1] 0 1 2 3
attr(,"tsp")
[1] 0 3 1

ts

Most operations on plain vectors ignore the tsp attribute, so unless you do something with it its existence may be pointless. On the other hand, if the object were a "ts" class object, then the various operations on "ts" objects do pay attention to the tsp attribute. For example, note where these plots start:

# plain vector
plot(x) # plot starts at time = 1
plot(lag(x)) # same, tsp was ignored

# ts object
plot(ts(x)) # plot starts at time = 1
plot(lag(ts(x))) # plot starts at time = 0, tsp was not ignored

zoo

The series above was regularly spaced, i.e. the differences between successive times were the same. To represent irregularly spaced series one can use the "zoo" and "zooreg" classes in the zoo package. A zoo object consists of the values plus an index attribute holding the times. The times are not encoded in a tsp attribute. For example, here we see that the zoo object has times 1, 2, 3, 4 and values 11, 12, 13, 14:

> library(zoo)
>
> str(zoo(x))
‘zoo’ series from 1 to 4
  Data: num [1:4] 11 12 13 14
  Index:  int [1:4] 1 2 3 4

The "zooreg" class is like "zoo" for objects which are regularly spaced except for some times that may be omitted. Internally "zooreg" objects are the same as "zoo" objects except the frequency is also stored. The values and index are the same as for zoo but we now have a frequency as well. Since the successive time points are 1 apart the frequency is 1.

> str(zooreg(x))
‘zooreg’ series from 1 to 4
  Data: num [1:4] 11 12 13 14
  Index:  num [1:4] 1 2 3 4
  Frequency: 1

If one lags a "zoo" object then each time is moved to the prior time and the first time is dropped. Here the times are 1, 2, 3 and the values are 12, 13, 14. Note that the lagged series has a subset of the times of the original series. That is always the case when lagging a zoo series:

> lag(zoo(x))
 1  2  3 
12 13 14

Because "zooreg" objects have a frequency they can be lagged to times that did not exist in the original series. Each time point t is lagged to t - deltat where deltat is 1/frequency. Here 0, 1, 2, 3 are the lagged time points and the values are 11, 12, 13, 14:

> lag(zooreg(x))
 0  1  2  3 
11 12 13 14

dplyr

The dplyr package has a lag function. Unfortunately it acts in the opposite direction of the base R lag function in that lag(x, k) moves each item in the series forward rather than backwards. This may actually be more intuitive but causes a lot of confusion due to the incompatibility with base R. If you use dplyr be very careful that you know whether dplyr is loaded or not.

dplyr's lag is particularly useful when used with data frames since given a vector (such as a column of a data frame) it always returns a vector of the same length. It has a default= argument which itself defaults to NA but can be specified by the user to determine what the empty value(s) at the beginning of the vector are to be filled in with. Negative lags are not allowed but the dplyr lead function can be used.

dplyr::lag(1:5)
## [1] NA  1  2  3  4

dplyr::lag(1:5, 2)
## [1] NA NA  1  2  3

dplyr::lead(1:5)
## [1]  2  3  4  5 NA
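The default= argument mentioned above can be sketched as follows. The fill value 0L is an arbitrary choice for illustration, and the call is guarded so that the snippet is simply skipped when dplyr is not installed.

```r
# Guarded so the sketch does nothing when dplyr is not installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  # Fill the two leading positions with 0 instead of NA
  filled <- dplyr::lag(1:5, n = 2, default = 0L)
  print(filled)
}
```

Writing dplyr::lag in full also sidesteps the question of whether dplyr or stats is supplying lag in the current session.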

lag() and lead() in base-R

You could do something like this, where NAs are combined with a subset of df$a in lag_a, which is then compared with df$a:

lag_a <- c(rep(NA, 1), head(df$a, length(df$a) - 1))
df$groupstart <- df$a != lag_a | is.na(lag_a)

#### OUTPUT ####

  a groupstart
1 a       TRUE
2 a      FALSE
3 a      FALSE
4 b       TRUE
5 b      FALSE

You can generalize this principle in a function:

lead_lag <- function(v, n) {
    if (n > 0) c(rep(NA, n), head(v, length(v) - n))
    else c(tail(v, length(v) - abs(n)), rep(NA, abs(n)))
}

#### OUTPUT ####

lead_lag(df$a, 2)  #[1] NA  NA  "a" "a" "a"
lead_lag(df$a, -2) #[1] "a" "b" "b" NA  NA
lead_lag(df$a, 3)  #[1] NA  NA  NA  "a" "a"
lead_lag(df$a, -4) #[1] "b" NA  NA  NA  NA
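One thing to watch: when abs(n) exceeds length(v), head() and tail() receive a negative count and the result no longer has the same length as v. Below is a sketch that clamps the shift. The name lead_lag_safe is illustrative, and the clamping is my own assumption about the desired behaviour.

```r
# Hypothetical variant of lead_lag that always returns length(v) elements
lead_lag_safe <- function(v, n) {
  k <- min(abs(n), length(v))  # never shift by more than the vector length
  pad <- rep(NA, k)
  if (n > 0) c(pad, head(v, length(v) - k))
  else c(tail(v, length(v) - k), pad)
}

v <- c("a", "a", "a", "b", "b")
lead_lag_safe(v, 2)   # NA  NA  "a" "a" "a"
lead_lag_safe(v, -2)  # "a" "b" "b" NA  NA
lead_lag_safe(v, 7)   # NA  NA  NA  NA  NA
```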

How to create a lag variable within each group?

You could do this within data.table

 library(data.table)
 data[, lag.value:=c(NA, value[-.N]), by=groups]
  data
 #   time groups       value   lag.value
 #1:    1      a  0.02779005          NA
 #2:    2      a  0.88029938  0.02779005
 #3:    3      a -1.69514201  0.88029938
 #4:    1      b -1.27560288          NA
 #5:    2      b -0.65976434 -1.27560288
 #6:    3      b -1.37804943 -0.65976434
 #7:    4      b  0.12041778 -1.37804943
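The same grouped lag can also be sketched in base R with ave(), which applies a function within each level of a grouping variable. The data frame toy and its values are invented, and the rows are assumed to be already ordered by time within each group.

```r
# Made-up data, ordered by time within each group
toy <- data.frame(
  time   = c(1, 2, 3, 1, 2, 3, 4),
  groups = c("a", "a", "a", "b", "b", "b", "b"),
  value  = c(10, 20, 30, 40, 50, 60, 70)
)

# Within each group, prepend NA and drop the last value
toy$lag.value <- ave(toy$value, toy$groups,
                     FUN = function(x) c(NA, x[-length(x)]))

toy$lag.value  # NA 10 20 NA 40 50 60
```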

For multiple columns:

nm1 <- grep("^value", colnames(data), value=TRUE)
nm2 <- paste("lag", nm1, sep=".")
data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]
 data
#    time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132

Update

From data.table v1.9.5 onwards, we can use shift with type set to "lag" or "lead". By default, the type is "lag".

data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1]
#   time groups      value     value1      value2  lag.value lag.value1
#1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA
#2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247
#3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814
#4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA
#5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812
#6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432
#7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406
#    lag.value2
#1:          NA
#2:  1.12493092
#3: -0.04493361
#4:          NA
#5:  0.94383621
#6:  0.82122120
#7:  0.59390132

If you need the reverse, use type = "lead":

nm3 <- paste("lead", nm1, sep=".")

Using the original dataset

  data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1]
  #  time groups      value     value1      value2 lead.value lead.value1
  #1:    1      b -0.6264538  0.7383247  1.12493092  0.1836433   0.5757814
  #2:    2      b  0.1836433  0.5757814 -0.04493361 -0.8356286  -0.3053884
  #3:    3      b -0.8356286 -0.3053884 -0.01619026         NA          NA
  #4:    1      a  1.5952808  1.5117812  0.94383621  0.3295078   0.3898432
  #5:    2      a  0.3295078  0.3898432  0.82122120 -0.8204684  -0.6212406
  #6:    3      a -0.8204684 -0.6212406  0.59390132  0.4874291  -2.2146999
  #7:    4      a  0.4874291 -2.2146999  0.91897737         NA          NA
 #   lead.value2
 #1: -0.04493361
 #2: -0.01619026
 #3:          NA
 #4:  0.82122120
 #5:  0.59390132
 #6:  0.91897737
 #7:          NA

data

 set.seed(1)
 data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))),
             value = rnorm(7), value1=rnorm(7), value2=rnorm(7))
