Create Lagged Variable in Unbalanced Panel Data in R


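Use a function tlag within groups defined by id. It finds the value observed n periods earlier by matching on the time variable rather than on row position, so a missing period yields NA instead of the value from the previous row: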

library(dplyr)
tlag <- function(x, n = 1L, time) { 
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
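
For readers working outside R, the lookup that tlag performs can be sketched in plain Python. This is only an illustration of the idea, not a port of the answer: the name time_lag, its signature, and the sample data are hypothetical, and it assumes time values are unique within a group.

```python
def time_lag(values, times, n=1):
    """For each observation, return the value observed n periods earlier, or None."""
    # Map each time point to its value (assumes times are unique within the group).
    by_time = dict(zip(times, values))
    # Look up time - n; a period that was never observed gives None.
    return [by_time.get(t - n) for t in times]


# Hypothetical single-id panel with a gap between periods 2 and 4.
times = [1, 2, 4, 5]
values = [10.0, 11.0, 12.0, 13.0]
lagged = time_lag(values, times)
print(lagged)  # [None, 10.0, None, 12.0]
```

Period 4 gets None because period 3 was never observed, which is the behaviour that distinguishes a time-based lag from a row-based one.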

How to create two events in unbalanced panel data?

A dplyr solution:

library(dplyr)

df  |>
  group_by(id)  |>
  mutate(
    unemployed = (status == 1) & (lag(status, default = status[1]) == 0),
    inactive = (status == 2) & (lag(status, default = status[1]) != 2),
  )

# A tibble: 18 x 5
# Groups:   id [6]
#       id  wave status unemployed inactive
#    <dbl> <dbl>  <dbl> <lgl>      <lgl>
#  1     1     1      0 FALSE      FALSE
#  2     1     2      0 FALSE      FALSE
#  3     1     5      1 TRUE       FALSE
#  4     2     1      0 FALSE      FALSE
#  5     2     2      0 FALSE      FALSE
#  6     3     1      0 FALSE      FALSE
#  7     3     2      2 FALSE      TRUE
#  8     3     3      1 FALSE      FALSE
#  9     4     1      0 FALSE      FALSE
# 10     4     2      2 FALSE      TRUE
# 11     4     4      0 FALSE      FALSE
# 12     5     1      2 FALSE      FALSE
# 13     5     3      0 FALSE      FALSE
# 14     5     5      1 TRUE       FALSE
# 15     5     6      1 FALSE      FALSE
# 16     6     1      1 FALSE      FALSE
# 17     6     3      2 FALSE      TRUE
# 18     6     5      2 FALSE      FALSE

I have left them as logical rather than numeric variables as I think that is the appropriate data type in this case, but you can change that by wrapping the relevant part in as.numeric(), e.g. unemployed = as.numeric((status == 1) & (lag(status, default = status[1]) == 0)).

I have assumed that:

A person is unemployed only if they transition from being employed.
A person is inactive if they move to being inactive from being employed or unemployed.
A person should not have a transition flag set in the first period, even if they are inactive or unemployed then. That is what default = status[1] is doing: the first status is compared with itself, so neither condition can be TRUE.

Also just for fun here is a data.table solution:

library(data.table)

dt  <- setDT(df)

dt[, 
   `:=` (
    unemployed = (status == 1) & (shift(status, type = "lag", fill = status[1]) == 0), 
    inactive = (status == 2) & (shift(status, type = "lag", fill = status[1]) != 2)
   ), 
   keyby = id
]

This should be faster if your data set is very large.
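
The flag logic itself is independent of dplyr or data.table. A minimal plain-Python sketch, assuming rows arrive as (id, wave, status) tuples sorted by id and wave (the function name and tuple layout are hypothetical), with the first row of each id compared against itself to mirror default = status[1]:

```python
from itertools import groupby


def transition_flags(rows):
    """rows: (id, wave, status) tuples sorted by id, then wave.

    Returns (id, wave, status, unemployed, inactive) tuples.
    """
    out = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        prev = None
        for pid, wave, status in group:
            if prev is None:
                # First row of an id: compare the status with itself,
                # so no transition can be flagged.
                prev = status
            unemployed = status == 1 and prev == 0
            inactive = status == 2 and prev != 2
            out.append((pid, wave, status, unemployed, inactive))
            prev = status
    return out


# Ids 5 and 6 from the output shown above.
rows = [(5, 1, 2), (5, 3, 0), (5, 5, 1), (5, 6, 1),
        (6, 1, 1), (6, 3, 2), (6, 5, 2)]
flags = transition_flags(rows)
for row in flags:
    print(row)
```

Like lag() and shift(), this compares each row with the previous row for that id and ignores gaps between waves.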

Create lagged variables for consecutive time points only using R

You could use ifelse, testing whether diff(time) is equal to 1. If so, write the lag. If not, write an NA.

base %>%
  group_by(id) %>%
  mutate(lag1_x = ifelse(c(0, diff(time)) == 1, lag(x, n = 1, default = NA), NA)) %>% 
  as.data.frame()
#>    id time        x   lag1_x
#> 1   1    1 1.852343       NA
#> 2   1    2 2.710538 1.852343
#> 3   1    3 2.700785 2.710538
#> 4   1    4 2.588489 2.700785
#> 5   1    7 3.252223       NA
#> 6   1    8 2.108079 3.252223
#> 7   1   10 3.435683       NA
#> 8   2    3 1.762462       NA
#> 9   2    4 2.775732 1.762462
#> 10  2    6 3.377396       NA
#> 11  2    9 3.133336       NA
#> 12  2   10 3.804190 3.133336
#> 13  2   11 2.942893 3.804190
#> 14  2   14 3.503608       NA

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:

generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)){
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > - x$error[i], 1, 0)
  }
  x
}

where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do

out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)

and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
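
The split-apply-combine pattern is not specific to R. A rough Python sketch of the same recursion follows. It assumes each row is a dict with the keys id, AgeMonths, xbsmall, error and Outcome, that each person's first Outcome is already set, and that the sample values are purely hypothetical. It iterates over a person's sorted rows rather than over a month range, and it updates the row dicts in place.

```python
from collections import defaultdict


def generate_outcome(person_rows, coef=4.47):
    """Fill xb and Outcome for one person's rows, sorted by AgeMonths."""
    for i in range(1, len(person_rows)):
        row = person_rows[i]
        # Each outcome depends on the previous period's outcome.
        row["xb"] = row["xbsmall"] + coef * person_rows[i - 1]["Outcome"]
        row["Outcome"] = 1 if row["xb"] > -row["error"] else 0
    return person_rows


def simulate(panel):
    """Split the panel by id, apply generate_outcome, and recombine."""
    groups = defaultdict(list)
    for row in panel:
        groups[row["id"]].append(row)
    out = []
    for pid in sorted(groups):
        rows = sorted(groups[pid], key=lambda r: r["AgeMonths"])
        out.extend(generate_outcome(rows))
    return out


# Hypothetical two-person panel; only the first Outcome per id is preset.
panel = [
    {"id": 1, "AgeMonths": 1, "xbsmall": -3.0, "error": 0.0, "Outcome": 1},
    {"id": 1, "AgeMonths": 2, "xbsmall": -3.0, "error": 0.0, "Outcome": None},
    {"id": 1, "AgeMonths": 3, "xbsmall": -5.0, "error": 0.0, "Outcome": None},
    {"id": 1, "AgeMonths": 4, "xbsmall": -3.0, "error": 0.0, "Outcome": None},
    {"id": 2, "AgeMonths": 1, "xbsmall": 1.0, "error": 0.0, "Outcome": 0},
    {"id": 2, "AgeMonths": 2, "xbsmall": 1.0, "error": 0.0, "Outcome": None},
]
result = simulate(panel)
outcomes = [r["Outcome"] for r in result]
print(outcomes)  # [1, 1, 0, 0, 0, 1]
```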

Calculate lagged variable in unbalanced time series data.table

This is copied from a comment by Arun:

ts[, value_lagged := ts[.(time=time-5), value, roll=+Inf, rollends=TRUE, mult="first", on="time"]]

For a lead, just flip the signs:

ts[, value_lead := ts[.(time=time+5), value, roll=-Inf, rollends=TRUE, mult="first", on="time"]]

general lag in time series panel data

You can use ddply: it cuts a data.frame into pieces and transforms each piece.

d <- data.frame( 
  User = rep( LETTERS[1:3], each=10 ),
  Date = seq.Date( Sys.Date(), length=30, by="day" ),
  Value = rep(1:10, 3)
)
library(plyr)
d <- ddply( 
  d, .(User), transform,
  # This assumes that the data is sorted
  Value = c( NA, Value[-length(Value)] ) 
)

unbalanced panel data to long format

Here is a way to pivot longer with the {tidyr} package.

library(dplyr)
library(stringr)
library(tidyr)

dat %>%
  rename("Wave" = "X") %>%
  pivot_longer(-1, names_to = "id", values_to = "val") %>%
  separate(id, c("id", "key"), sep = "(?<=MLC_\\d).") %>%
  pivot_wider(names_from = key, values_from = val) %>%
  mutate(across("id", str_replace, "MLC_", "")) %>%
  arrange(id, Wave)

I'm sure there is a way to do it in one step, but I haven't figured it out yet. Will update this answer if I work it out.

Update

This is neater, pivoting done in one-shot:

dat %>%
  rename("Wave" = "X") %>%
  pivot_longer(-1,
               names_to = c("id", ".value"),
               names_pattern = "MLC_(\\d.*).(c.*)",
               names_transform = list(id = as.integer)) %>%
  arrange(id, Wave)

How to generate lagged variable for unbalanced panel in Pandas dataframe?

Use:

val = df.set_index('date').groupby('id').resample('MS').asfreq()['value']
val  = val.groupby(level=0).shift(2) 
df['lag2val'] = df.set_index(['id', 'date']).index.map(val)

Details:

STEP A: Use DataFrame.groupby on id and use groupby.resample to resample each group at month-start frequency ('MS'). This inserts a NaN row for every month missing within a group, such as 1990-05-01 for id 2.

print(val)
id  date      
1   1990-01-01    1.0
    1990-02-01    2.0
    1990-03-01    3.0
2   1989-12-01    3.0
    1990-01-01    3.0
    1990-02-01    4.0
    1990-03-01    5.5
    1990-04-01    5.0
    1990-05-01    NaN
    1990-06-01    6.0
Name: value, dtype: float64

STEP B: Use Series.groupby on level=0 to group the series val by id and shift it down 2 periods. Because the months are now contiguous within each id, this gives a true 2-month lag.

print(val)
id  date      
1   1990-01-01    NaN
    1990-02-01    NaN
    1990-03-01    1.0
2   1989-12-01    NaN
    1990-01-01    NaN
    1990-02-01    3.0
    1990-03-01    3.0
    1990-04-01    4.0
    1990-05-01    5.5
    1990-06-01    5.0
Name: value, dtype: float64

STEP C: Finally, use set_index along with Index.map to map the new lagged val series back to the original dataframe df.

print(df)
   id       date  value  lag2val
0   1 1990-01-01    1.0      NaN
1   1 1990-02-01    2.0      NaN
2   1 1990-03-01    3.0      1.0
3   2 1989-12-01    3.0      NaN
4   2 1990-01-01    3.0      NaN
5   2 1990-02-01    4.0      3.0
6   2 1990-03-01    5.5      3.0
7   2 1990-04-01    5.0      4.0
8   2 1990-06-01    6.0      5.0
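
As a cross-check that does not rely on resampling, the same column can be produced with a plain dictionary lookup on (id, date). This is a different technique from the answer above, shown only as a sketch: the helper months_back is hypothetical, and dates are assumed to fall on the first of the month.

```python
from datetime import date


def months_back(d, n):
    """Return the first day of the month n months before d."""
    total = d.year * 12 + (d.month - 1) - n
    return date(total // 12, total % 12 + 1, 1)


# The example frame from above as (id, date, value) rows.
rows = [
    (1, date(1990, 1, 1), 1.0),
    (1, date(1990, 2, 1), 2.0),
    (1, date(1990, 3, 1), 3.0),
    (2, date(1989, 12, 1), 3.0),
    (2, date(1990, 1, 1), 3.0),
    (2, date(1990, 2, 1), 4.0),
    (2, date(1990, 3, 1), 5.5),
    (2, date(1990, 4, 1), 5.0),
    (2, date(1990, 6, 1), 6.0),
]

# Index values by (id, date), then look up the date two months earlier.
lookup = {(pid, d): v for pid, d, v in rows}
lag2val = [lookup.get((pid, months_back(d, 2))) for pid, d, _ in rows]
print(lag2val)
# [None, None, 1.0, None, None, 3.0, 3.0, 4.0, 5.0]
```

The result agrees with the lag2val column printed above, with None in place of NaN.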

R: Unbalanced panel, create dummy for unique observations

Using dplyr, you could avoid the loop and try this:

set.seed(123)
df <- data.frame(id = sample(1:10, 20, replace = TRUE),
             happy = sample(c("yes", "no"), 20, replace = TRUE))

library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(dummy = ifelse(length(id)>=2, 1, 0))

> df
# A tibble: 20 x 3
# Groups:   id [10]
      id happy dummy
   <int> <fct> <dbl>
 1     3 no        1
 2     8 no        0
 3     5 no        1
 4     9 no        1
 5    10 no        1
 6     1 no        1
 7     6 no        1
 8     9 no        1
 9     6 yes       1  
10     5 yes       1
11    10 no        1
12     5 no        1
13     7 no        0
14     6 no        1
15     2 yes       0
16     9 yes       1
17     3 no        1
18     1 yes       1
19     4 yes       0
20    10 yes       1

Essentially, this approach divides up df by unique values of id and then creates a column dummy that takes the value 1 if there are two or more occurrences of that id and 0 if not.
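
The counting rule can be checked independently of dplyr. A small plain-Python sketch (the function name is hypothetical), using the id column from the output above:

```python
from collections import Counter


def repeat_dummy(ids):
    """Return 1 for ids that occur at least twice, 0 otherwise."""
    counts = Counter(ids)
    return [1 if counts[i] >= 2 else 0 for i in ids]


# The id column from the output above.
ids = [3, 8, 5, 9, 10, 1, 6, 9, 6, 5, 10, 5, 7, 6, 2, 9, 3, 1, 4, 10]
dummy = repeat_dummy(ids)
print(dummy)
# [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
```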
