Create Lagged Variable in Unbalanced Panel Data in R

Create lagged variable in unbalanced panel data in R

Define a helper function, tlag, that looks up the value observed at time - n, and apply it within groups defined by id:

library(dplyr)

# Lag x by n time units (not n rows); returns NA where time - n is not observed
tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

df %>%
  group_by(id) %>%
  mutate(value_lagged = tlag(value, 1, time = date))
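
To see how tlag treats gaps, here is a minimal sketch on a made-up panel (the data frame and its values are hypothetical, not taken from the question). It uses base R's split/unsplit in place of the grouped mutate so that it runs without dplyr; the result should match what the dplyr pipeline above would give on the same data.

```r
# Same helper as above: value of x at (time - n), NA if that period is missing
tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

# Hypothetical toy panel: id 1 has no observation for date 3
df <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  date  = c(1, 2, 4, 1, 2),
  value = c(10, 20, 40, 5, 6)
)

# Apply tlag separately within each id, then restore the original row order
df$value_lagged <- unsplit(
  lapply(split(df, df$id), function(g) tlag(g$value, 1, time = g$date)),
  df$id
)

df$value_lagged
# NA 10 NA NA  5
```

The row with date 4 gets NA rather than 20, because date 3 is absent for that id. A plain row-based lag would have returned 20 there.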

How to create two events in unbalanced panel data?

A dplyr solution:

library(dplyr)

df |>
  group_by(id) |>
  mutate(
    unemployed = (status == 1) & (lag(status, default = status[1]) == 0),
    inactive = (status == 2) & (lag(status, default = status[1]) != 2)
  )

# A tibble: 18 x 5
# Groups:   id [6]
#       id  wave status unemployed inactive
#    <dbl> <dbl>  <dbl> <lgl>      <lgl>
#  1     1     1      0 FALSE      FALSE
#  2     1     2      0 FALSE      FALSE
#  3     1     5      1 TRUE       FALSE
#  4     2     1      0 FALSE      FALSE
#  5     2     2      0 FALSE      FALSE
#  6     3     1      0 FALSE      FALSE
#  7     3     2      2 FALSE      TRUE
#  8     3     3      1 FALSE      FALSE
#  9     4     1      0 FALSE      FALSE
# 10     4     2      2 FALSE      TRUE
# 11     4     4      0 FALSE      FALSE
# 12     5     1      2 FALSE      FALSE
# 13     5     3      0 FALSE      FALSE
# 14     5     5      1 TRUE       FALSE
# 15     5     6      1 FALSE      FALSE
# 16     6     1      1 FALSE      FALSE
# 17     6     3      2 FALSE      TRUE
# 18     6     5      2 FALSE      FALSE

I have left them as logical rather than numeric variables as I think that is the appropriate data type in this case, but you can change that by wrapping the relevant part in as.numeric(), e.g. unemployed = as.numeric((status == 1) & (lag(status, default = status[1]) == 0)).

I have assumed that:

  1. A person is unemployed only if they transition from being employed.
  2. A person is inactive if they move to being inactive from being employed or unemployed.
  3. A person should not have a transition flag set to TRUE in the first period just because they are inactive or unemployed in that period. That is what default = status[1] is doing: the first observation is compared with itself, so no transition is recorded.
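
The effect of default = status[1] in the first period can be checked on a single person's status vector. This is a small sketch with made-up values (the vector below is hypothetical); it compares the flag computed with and without the default.

```r
library(dplyr)

# Hypothetical status history for one person: starts unemployed (1)
status <- c(1, 1, 0, 1)

# With default = status[1], the first row is compared with itself
with_default <- (status == 1) & (dplyr::lag(status, default = status[1]) == 0)

# Without a default, the first lagged value is NA and so is the flag
without_default <- (status == 1) & (dplyr::lag(status) == 0)

with_default
# FALSE FALSE FALSE  TRUE
without_default
#    NA FALSE FALSE  TRUE
```

With the default, the first period is FALSE rather than NA, and only the move from 0 to 1 in the last period is flagged.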

Also just for fun here is a data.table solution:

library(data.table)

dt <- setDT(df)

dt[,
  `:=` (
    unemployed = (status == 1) & (shift(status, type = "lag", fill = status[1]) == 0),
    inactive = (status == 2) & (shift(status, type = "lag", fill = status[1]) != 2)
  ),
  keyby = id
]

This should be faster if your data set is very large.

Create lagged variables for consecutive time points only using R

You could use ifelse, testing whether diff(time) is equal to 1. If so, write the lag. If not, write an NA.

library(dplyr)

base %>%
  group_by(id) %>%
  # keep the lag only where the previous row is exactly one period earlier
  mutate(lag1_x = ifelse(c(0, diff(time)) == 1, lag(x, n = 1, default = NA), NA)) %>%
  as.data.frame()
#>    id time        x   lag1_x
#> 1   1    1 1.852343       NA
#> 2   1    2 2.710538 1.852343
#> 3   1    3 2.700785 2.710538
#> 4   1    4 2.588489 2.700785
#> 5   1    7 3.252223       NA
#> 6   1    8 2.108079 3.252223
#> 7   1   10 3.435683       NA
#> 8   2    3 1.762462       NA
#> 9   2    4 2.775732 1.762462
#> 10  2    6 3.377396       NA
#> 11  2    9 3.133336       NA
#> 12  2   10 3.804190 3.133336
#> 13  2   11 2.942893 3.804190
#> 14  2   14 3.503608       NA
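
The gap test itself can be inspected in isolation. The sketch below uses base R only and made-up x values for a single id (the time vector mirrors id 1 above); prev_x is a hand-rolled row lag standing in for dplyr's lag().

```r
# One person's observation times, with gaps after 4 and after 8
time <- c(1, 2, 3, 4, 7, 8, 10)
x    <- c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5)  # hypothetical values

# TRUE where the previous row is exactly one period earlier;
# the leading 0 makes the first row FALSE
is_consecutive <- c(0, diff(time)) == 1

# Row-based lag: shift x down by one position
prev_x <- c(NA, head(x, -1))

# Keep the lag only across consecutive periods
lag1_x <- ifelse(is_consecutive, prev_x, NA)
lag1_x
# NA 1.5 2.5 3.5  NA 5.5  NA
```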

R: quickly simulate unbalanced panel with variable that depends on lagged values of itself

One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:

generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)) {
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > -x$error[i], 1, 0)
  }
  x
}

where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do

out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)

and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
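
For readers without the original panel, here is a self-contained sketch of the same split-apply-combine pattern on a tiny made-up data set. The column values are hypothetical, and the loop bound is simplified to the number of rows per person, which assumes each person's rows are sorted by AgeMonths with no missing months.

```r
# Per-person recursion: each Outcome depends on the previous Outcome
generate_outcome <- function(x) {
  for (i in seq_len(nrow(x))[-1]) {
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > -x$error[i], 1, 0)
  }
  x
}

# Hypothetical unbalanced panel: person 1 has 3 months, person 2 has 2
panel <- data.frame(
  id        = c(1, 1, 1, 2, 2),
  AgeMonths = c(10, 11, 12, 20, 21),
  xbsmall   = c(-1, -1, -1, -1, -5),
  error     = 0,
  xb        = NA_real_,
  Outcome   = c(1, NA, NA, 0, NA)  # only the first month is given
)

# Split by person, run the recursion on each piece, stack the results
out <- do.call(rbind, lapply(split(panel, panel$id), generate_outcome))
out$Outcome
# 1 1 1 0 0
```

Person 1 stays at 1 because -1 + 4.47 is positive; person 2 stays at 0 because the previous outcome contributes nothing and -5 is negative.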

Calculate lagged variable in unbalanced time series data.table

This is copied from a comment by Arun:

ts[, value_lagged := ts[.(time=time-5), value, roll=+Inf, rollends=TRUE, mult="first", on="time"]]

For a lead, just flip the signs:

ts[, value_lead := ts[.(time=time+5), value, roll=-Inf, rollends=TRUE, mult="first", on="time"]]

general lag in time series panel data

You can use ddply: it cuts a data.frame into pieces and transforms each piece.

d <- data.frame(
  User = rep(LETTERS[1:3], each = 10),
  Date = seq.Date(Sys.Date(), length.out = 30, by = "day"),
  Value = rep(1:10, 3)
)
library(plyr)
d <- ddply(
  d, .(User), transform,
  # This assumes that the data is sorted
  Value = c(NA, Value[-length(Value)])
)
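
If plyr is not available, the same per-group shift can be sketched with base R's ave(). This swaps in a different function from the one used in the answer above. The dates are fixed here (a hypothetical start date) so the example is reproducible, and the lag is written to a new column, Value_lag, rather than overwriting Value.

```r
# Same shape as the example above, with a fixed start date
d <- data.frame(
  User  = rep(LETTERS[1:3], each = 10),
  Date  = seq.Date(as.Date("2020-01-01"), length.out = 30, by = "day"),
  Value = rep(1:10, 3)
)

# ave() applies FUN within each User and returns a vector in the original row order.
# As with ddply, this assumes rows are already sorted by Date within User.
d$Value_lag <- ave(d$Value, d$User, FUN = function(v) c(NA, v[-length(v)]))

head(d$Value_lag, 3)
# NA  1  2
```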

unbalanced panel data to long format

Here is a way to pivot longer with the {tidyr} package.

library(dplyr)
library(stringr)
library(tidyr)

dat %>%
  rename("Wave" = "X") %>%
  pivot_longer(-1, names_to = "id", values_to = "val") %>%
  separate(id, c("id", "key"), sep = "(?<=MLC_\\d).") %>%
  pivot_wider(names_from = key, values_from = val) %>%
  mutate(across("id", str_replace, "MLC_", "")) %>%
  arrange(id, Wave)

I'm sure there is a way to do it in one step, but I haven't figured it out yet. Will update this answer if I work it out.

Update

This is neater, with the pivoting done in one shot:

dat %>%
  rename("Wave" = "X") %>%
  pivot_longer(-1,
    names_to = c("id", ".value"),
    names_pattern = "MLC_(\\d.*).(c.*)",
    names_transform = list(id = as.integer)) %>%
  arrange(id, Wave)
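
Since the original dat is not shown, here is a sketch of the one-shot pivot on a made-up wide table. The column names MLC_1.ca, MLC_1.cb and so on are assumptions chosen to fit the names_pattern above; the real data may be named differently.

```r
library(tidyr)

# Hypothetical wide data: one row per wave, one column per id/variable pair
dat <- data.frame(
  Wave     = c(1, 2),
  MLC_1.ca = c(10, 11),
  MLC_1.cb = c(20, 21),
  MLC_2.ca = c(30, 31),
  MLC_2.cb = c(40, 41)
)

# First capture group becomes the id column; the second (".value")
# becomes the names of the output value columns
long <- pivot_longer(
  dat, -Wave,
  names_to = c("id", ".value"),
  names_pattern = "MLC_(\\d.*).(c.*)",
  names_transform = list(id = as.integer)
)

names(long)
# "Wave" "id"   "ca"   "cb"
```

Each Wave/id pair ends up on its own row, with ca and cb as ordinary columns.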

How to generate lagged variable for unbalanced panel in Pandas dataframe?

Use:

val = df.set_index('date').groupby('id').resample('MS').asfreq()['value']
val = val.groupby(level=0).shift(2)
df['lag2val'] = df.set_index(['id', 'date']).index.map(val)

Details:

STEP A: Use DataFrame.groupby on id and use groupby.resample to resample the grouped frame using monthly start frequency.

print(val)
id  date
1   1990-01-01    1.0
    1990-02-01    2.0
    1990-03-01    3.0
2   1989-12-01    3.0
    1990-01-01    3.0
    1990-02-01    4.0
    1990-03-01    5.5
    1990-04-01    5.0
    1990-05-01    NaN
    1990-06-01    6.0
Name: value, dtype: float64

STEP B: Use Series.groupby on level=0 to group the series val and shift 2 periods down to create a lagged 2 months val series.

print(val)
id  date
1   1990-01-01    NaN
    1990-02-01    NaN
    1990-03-01    1.0
2   1989-12-01    NaN
    1990-01-01    NaN
    1990-02-01    3.0
    1990-03-01    3.0
    1990-04-01    4.0
    1990-05-01    5.5
    1990-06-01    5.0
Name: value, dtype: float64

STEP C: Finally, use set_index along with Index.map to map the new lagged val series back onto the original dataframe df.

print(df)
   id       date  value  lag2val
0   1 1990-01-01    1.0      NaN
1   1 1990-02-01    2.0      NaN
2   1 1990-03-01    3.0      1.0
3   2 1989-12-01    3.0      NaN
4   2 1990-01-01    3.0      NaN
5   2 1990-02-01    4.0      3.0
6   2 1990-03-01    5.5      3.0
7   2 1990-04-01    5.0      4.0
8   2 1990-06-01    6.0      5.0

R: Unbalanced panel, create dummy for unique observations

Using dplyr, you could avoid the loop and try this:

set.seed(123)
df <- data.frame(id = sample(1:10, 20, replace = TRUE),
                 happy = sample(c("yes", "no"), 20, replace = TRUE))

library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(dummy = ifelse(length(id) >= 2, 1, 0))

> df
# A tibble: 20 x 3
# Groups:   id [10]
      id happy dummy
   <int> <fct> <dbl>
 1     3 no        1
 2     8 no        0
 3     5 no        1
 4     9 no        1
 5    10 no        1
 6     1 no        1
 7     6 no        1
 8     9 no        1
 9     6 yes       1
10     5 yes       1
11    10 no        1
12     5 no        1
13     7 no        0
14     6 no        1
15     2 yes       0
16     9 yes       1
17     3 no        1
18     1 yes       1
19     4 yes       0
20    10 yes       1

Essentially, this approach divides up df by unique values of id and then creates a column dummy that takes the value 1 if there are at least two occurrences of that id and 0 if not.
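
The threshold in length(id) >= 2 is easy to misread, so here is a small base R sketch on made-up ids. It uses ave() to count rows per id instead of the dplyr pipeline above, and the vector of ids is hypothetical.

```r
# Hypothetical ids: 1 appears once, 2 twice, 3 three times
id <- c(1, 2, 2, 3, 3, 3)

# Number of rows sharing each row's id
n_per_id <- ave(id, id, FUN = length)

# Same rule as the dplyr code: 1 when the id occurs at least twice
dummy <- as.numeric(n_per_id >= 2)
dummy
# 0 1 1 1 1 1

# A stricter "more than two" rule would instead need n_per_id > 2
as.numeric(n_per_id > 2)
# 0 0 0 1 1 1
```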


