Create lagged variable in unbalanced panel data in R
Use a function tlag within groups defined by id:
library(dplyr)
tlag <- function(x, n = 1L, time) {
  # Position of the observation dated exactly n periods earlier (NA if there is none)
  index <- match(time - n, time, incomparables = NA)
  x[index]
}
df %>% group_by(id) %>% mutate(value_lagged = tlag(value, 1, time = date))
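To see how tlag treats gaps, here is a minimal base R sketch. The toy data frame and its values are made up for illustration, and the loop over ids is only a stand-in for the grouped mutate above.

```r
# Hypothetical toy panel: id 1 has no observation for 2002
df <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  date  = c(2000, 2001, 2003, 2000, 2001),
  value = c(10, 11, 13, 20, 21)
)

tlag <- function(x, n = 1L, time) {
  index <- match(time - n, time, incomparables = NA)
  x[index]
}

# Apply tlag separately within each id (base R stand-in for group_by + mutate)
df$value_lagged <- NA_real_
for (i in unique(df$id)) {
  rows <- df$id == i
  df$value_lagged[rows] <- tlag(df$value[rows], 1, time = df$date[rows])
}

print(df)
```

The 2003 row of id 1 gets NA rather than the 2001 value, because there is no observation dated exactly one period earlier.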
How to create two events in unbalanced panel data?
A dplyr solution:
library(dplyr)
df |>
  group_by(id) |>
  mutate(
    unemployed = (status == 1) & (lag(status, default = status[1]) == 0),
    inactive = (status == 2) & (lag(status, default = status[1]) != 2),
  )
# A tibble: 18 x 5
# Groups: id [6]
# id wave status unemployed inactive
# <dbl> <dbl> <dbl> <lgl> <lgl>
# 1 1 1 0 FALSE FALSE
# 2 1 2 0 FALSE FALSE
# 3 1 5 1 TRUE FALSE
# 4 2 1 0 FALSE FALSE
# 5 2 2 0 FALSE FALSE
# 6 3 1 0 FALSE FALSE
# 7 3 2 2 FALSE TRUE
# 8 3 3 1 FALSE FALSE
# 9 4 1 0 FALSE FALSE
# 10 4 2 2 FALSE TRUE
# 11 4 4 0 FALSE FALSE
# 12 5 1 2 FALSE FALSE
# 13 5 3 0 FALSE FALSE
# 14 5 5 1 TRUE FALSE
# 15 5 6 1 FALSE FALSE
# 16 6 1 1 FALSE FALSE
# 17 6 3 2 FALSE TRUE
# 18 6 5 2 FALSE FALSE
I have left them as logical rather than numeric variables as I think that is the appropriate data type in this case, but you can change that by wrapping the relevant part in as.numeric(), e.g. unemployed = as.numeric((status == 1) & (lag(status, default = status[1]) == 0)).
I have assumed that:
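- A person is flagged as unemployed only if they transition from being employed (status 0) to unemployed (status 1).
- A person is flagged as inactive if they move to being inactive (status 2) from being employed or unemployed.
- A person should not have the transition flag set to TRUE in the first period, even if they are inactive or unemployed in the first period - that is what default = status[1] is doing: the first observation is compared with itself, so no transition is detected (see rows 12 and 16 of the output).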
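The effect of default = status[1] in the first period can be checked by hand. The sketch below uses base R only, with a manual lag standing in for dplyr's lag(), and takes the status values of id 5 from the output above.

```r
# Status history of id 5 from the output above (waves 1, 3, 5, 6)
status <- c(2, 0, 1, 1)

# Manual lag with the first value as its own predecessor,
# mimicking lag(status, default = status[1])
prev <- c(status[1], head(status, -1))

unemployed <- (status == 1) & (prev == 0)
inactive   <- (status == 2) & (prev != 2)

print(data.frame(status, prev, unemployed, inactive))
```

Because the first observation is compared with itself, inactive is FALSE in wave 1 even though the person starts out inactive.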
Also, just for fun, here is a data.table solution:
library(data.table)
dt <- setDT(df)
dt[,
  `:=` (
    unemployed = (status == 1) & (shift(status, type = "lag", fill = status[1]) == 0),
    inactive = (status == 2) & (shift(status, type = "lag", fill = status[1]) != 2)
  ),
  keyby = id
]
This should be faster if your data set is very large.
Create lagged variables for consecutive time points only using R
You could use ifelse, testing whether diff(time) is equal to 1. If so, write the lag. If not, write an NA.
base %>%
  group_by(id) %>%
  mutate(lag1_x = ifelse(c(0, diff(time)) == 1, lag(x, n = 1, default = NA), NA)) %>%
  as.data.frame()
#> id time x lag1_x
#> 1 1 1 1.852343 NA
#> 2 1 2 2.710538 1.852343
#> 3 1 3 2.700785 2.710538
#> 4 1 4 2.588489 2.700785
#> 5 1 7 3.252223 NA
#> 6 1 8 2.108079 3.252223
#> 7 1 10 3.435683 NA
#> 8 2 3 1.762462 NA
#> 9 2 4 2.775732 1.762462
#> 10 2 6 3.377396 NA
#> 11 2 9 3.133336 NA
#> 12 2 10 3.804190 3.133336
#> 13 2 11 2.942893 3.804190
#> 14 2 14 3.503608 NA
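The diff(time) test can be traced on a single group in base R. The x values below are made up, and the manual lag is a stand-in for dplyr's lag().

```r
# One hypothetical group with gaps in time (same time pattern as id 1 above)
time <- c(1, 2, 3, 4, 7, 8, 10)
x    <- c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5)

# TRUE where the previous row is exactly one period earlier;
# the leading 0 marks the first row as having no predecessor
is_consecutive <- c(0, diff(time)) == 1

# Manual one-row lag (stand-in for dplyr::lag)
prev_x <- c(NA, head(x, -1))

lag1_x <- ifelse(is_consecutive, prev_x, NA)
print(data.frame(time, x, lag1_x))
```

Rows at time 7 and 10 follow a gap, so their lag is NA instead of the value from the preceding row.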
R: quickly simulate unbalanced panel with variable that depends on lagged values of itself
One approach is to use a split-apply-combine method. I take out the for(t in (AgeMonthMin+1):AgeMonthMax) loop and put the contents in a function:
generate_outcome <- function(x) {
  AgeMonthMin <- min(x$AgeMonths, na.rm = TRUE)
  AgeMonthMax <- max(x$AgeMonths, na.rm = TRUE)
  for (i in 2:(AgeMonthMax - AgeMonthMin + 1)) {
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > -x$error[i], 1, 0)
  }
  x
}
where x is a dataframe for one person. This allows us to simplify the panel$id==i & panel$AgeMonths==t construct. Now we can just do
out <- lapply(split(panel, panel$id), generate_outcome)
out <- do.call(rbind, out)
and all.equal(panel$Outcome, out$Outcome) returns TRUE. Computing 100 persons took 1.8 seconds using this method, compared to 1.5 minutes in the original code.
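For a self-contained run of the split-apply-combine pattern, the sketch below builds a small made-up panel. The column names follow the answer, but the data, the initial Outcome of 0 in each person's first month, and the use of row positions instead of the age-month range are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical toy panel: 3 persons, 4 consecutive months each, sorted by id and month
panel <- data.frame(
  id        = rep(1:3, each = 4),
  AgeMonths = rep(1:4, times = 3),
  xbsmall   = rnorm(12),
  error     = rnorm(12),
  xb        = NA_real_,
  Outcome   = 0          # assumed starting state in each person's first month
)

# Fill in xb and Outcome row by row for ONE person's data frame
generate_outcome <- function(x) {
  for (i in seq_len(nrow(x))[-1]) {   # rows 2..n; skipped if the person has one row
    x$xb[i] <- x$xbsmall[i] + 4.47 * x$Outcome[i - 1]
    x$Outcome[i] <- ifelse(x$xb[i] > -x$error[i], 1, 0)
  }
  x
}

# Split by person, apply the function, and stack the pieces back together
out <- do.call(rbind, lapply(split(panel, panel$id), generate_outcome))
print(out)
```

Iterating over seq_len(nrow(x))[-1] instead of 2:(AgeMonthMax - AgeMonthMin + 1) avoids a backwards loop when a person has a single row, at the cost of assuming rows are already sorted by month.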
Calculate lagged variable in unbalanced time series data.table
This is copied from a comment by Arun:
ts[, value_lagged := ts[.(time=time-5), value, roll=+Inf, rollends=TRUE, mult="first", on="time"]]
For a lead, you can just change the signs:
ts[, value_lead := ts[.(time=time+5), value, roll=-Inf, rollends=TRUE, mult="first", on="time"]]
general lag in time series panel data
You can use ddply: it cuts a data.frame into pieces and transforms each piece.
d <- data.frame(
  User = rep( LETTERS[1:3], each=10 ),
  Date = seq.Date( Sys.Date(), length=30, by="day" ),
  Value = rep(1:10, 3)
)
library(plyr)
d <- ddply(
  d, .(User), transform,
  # This assumes that the data is sorted
  Value = c( NA, Value[-length(Value)] )
)
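The same cut-into-pieces idea can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. This is an alternative to ddply rather than the method used in the answer; the smaller toy data frame and the Value_lag column name are made up for illustration.

```r
# Hypothetical toy data: 3 users, 4 sorted observations each
d <- data.frame(
  User  = rep(LETTERS[1:3], each = 4),
  Value = rep(1:4, 3)
)

# Within each User, shift Value down by one row and pad with NA.
# Like the ddply version, this assumes the rows are sorted within each user.
d$Value_lag <- ave(d$Value, d$User, FUN = function(v) c(NA, v[-length(v)]))

print(d)
```

Writing to a new column keeps the original Value intact, whereas the ddply call above overwrites it.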
unbalanced panel data to long format
Here is a way to pivot longer with the {tidyr} package.
library(dplyr)
library(stringr)
library(tidyr)
dat %>%
  rename("Wave" = "X") %>%
  pivot_longer(-1, names_to = "id", values_to = "val") %>%
  separate(id, c("id", "key"), sep = "(?<=MLC_\\d).") %>%
  pivot_wider(names_from = key, values_from = val) %>%
  # Strip the "MLC_" prefix (passing extra arguments through across() is deprecated in recent dplyr)
  mutate(id = str_replace(id, "MLC_", "")) %>%
  arrange(id, Wave)
I'm sure there is a way to do it in one step, but I haven't figured it out yet. Will update this answer if I work it out.
Update: this is neater, with the pivoting done in one shot:
dat %>%
  rename("Wave" = "X") %>%
  pivot_longer(-1,
               names_to = c("id", ".value"),
               names_pattern = "MLC_(\\d.*).(c.*)",
               names_transform = list(id = as.integer)) %>%
  arrange(id, Wave)
How to generate lagged variable for unbalanced panel in Pandas dataframe?
Use:
val = df.set_index('date').groupby('id').resample('MS').asfreq()['value']
val = val.groupby(level=0).shift(2)
df['lag2val'] = df.set_index(['id', 'date']).index.map(val)
Details:
STEP A: Use DataFrame.groupby on id and use groupby.resample to resample the grouped frame using month-start frequency. Months missing from a group appear as NaN rows.
print(val)
id date
1 1990-01-01 1.0
1990-02-01 2.0
1990-03-01 3.0
2 1989-12-01 3.0
1990-01-01 3.0
1990-02-01 4.0
1990-03-01 5.5
1990-04-01 5.0
1990-05-01 NaN
1990-06-01 6.0
Name: value, dtype: float64
STEP B: Use Series.groupby on level=0 to group the series val by id and shift it down 2 periods, creating a val series lagged by 2 months.
print(val)
id date
1 1990-01-01 NaN
1990-02-01 NaN
1990-03-01 1.0
2 1989-12-01 NaN
1990-01-01 NaN
1990-02-01 3.0
1990-03-01 3.0
1990-04-01 4.0
1990-05-01 5.5
1990-06-01 5.0
Name: value, dtype: float64
STEP C: Finally, use set_index along with Index.map to map the new lagged val series onto the original dataframe df.
print(df)
id date value lag2val
0 1 1990-01-01 1.0 NaN
1 1 1990-02-01 2.0 NaN
2 1 1990-03-01 3.0 1.0
3 2 1989-12-01 3.0 NaN
4 2 1990-01-01 3.0 NaN
5 2 1990-02-01 4.0 3.0
6 2 1990-03-01 5.5 3.0
7 2 1990-04-01 5.0 4.0
8 2 1990-06-01 6.0 5.0
R: Unbalanced panel, create dummy for unique observations
Using dplyr, you could avoid the loop and try this:
set.seed(123)
df <- data.frame(id = sample(1:10, 20, replace = TRUE),
happy = sample(c("yes", "no"), 20, replace = TRUE))
library(dplyr)
df <- df %>%
  group_by(id) %>%
  mutate(dummy = ifelse(length(id)>=2, 1, 0))
> df
# A tibble: 20 x 3
# Groups: id [10]
id happy dummy
<int> <fct> <dbl>
1 3 no 1
2 8 no 0
3 5 no 1
4 9 no 1
5 10 no 1
6 1 no 1
7 6 no 1
8 9 no 1
9 6 yes 1
10 5 yes 1
11 10 no 1
12 5 no 1
13 7 no 0
14 6 no 1
15 2 yes 0
16 9 yes 1
17 3 no 1
18 1 yes 1
19 4 yes 0
20 10 yes 1
Essentially, this approach divides up df by unique values of id and then creates a column dummy that takes the value 1 if there are two or more occurrences of that id and 0 if not.
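The counting logic can also be sketched in base R with ave(), which returns the per-group count in the original row order. The id vector below is made up for illustration, and this is an alternative to the dplyr pipeline rather than part of the answer.

```r
# Hypothetical ids: 8 appears once, 3 and 5 twice, 9 three times
id <- c(3, 8, 5, 9, 5, 3, 9, 9)

# Number of rows sharing each row's id
counts <- ave(seq_along(id), id, FUN = length)

# 1 if the id occurs at least twice, 0 if it is unique
dummy <- ifelse(counts >= 2, 1, 0)

print(data.frame(id, counts, dummy))
```

Only the row with id 8 gets dummy 0, since it is the only id that occurs once.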