Last Observation Carried Forward in a Data Frame

Last Observation Carried Forward In a data frame?

This already exists:

library(zoo)
na.locf(data.frame(rep("a",4), 1:4,1:4, c(1,NA,NA,NA)))

Last observation carried forward and last observation carried backward in R

This should work :

library(tidyverse)
df <- data.frame(ID=c(1,1,1,1,1,2,2,2,3,3), values=c(NA, NA, 12, 13, NA, 5, NA, NA, NA, 1))
df2 <- data.frame(ID=c(1,1,1,1,1,2,2,2,3,3), values=c(12, 12, 12, 13, 13, 5, 5, 5, 1, 1))

df <- df %>%
group_by(ID) %>%
fill(values, .direction = "downup") %>%
fill(values, .direction = "updown")

Last observation carried forward (LOCF) for all NA in R

You could use fill from tidyr.

tidyr::fill(df, everything())

# Value1 Value2 Value3 Group
#1 100 56 82 Group1
#2 150 56 97 Group2
#3 150 66 97 Group3
#4 120 66 97 Group4

Or with zoo::na.locf

df[] <- lapply(df, zoo::na.locf)

data

df <- structure(list(Value1 = c(100L, 150L, NA, 120L), Value2 = c(56L, 
NA, 66L, NA), Value3 = c(82L, 97L, NA, NA), Group = structure(1:4,
.Label = c("Group1", "Group2", "Group3", "Group4"), class = "factor")),
class = "data.frame", row.names = c(NA, -4L))

Carrying forward last observation with a limit, by group, using data.table

Note that normally when dealing with longer stretches of NA's one either fills them all or none of them and na.locf already handles that using the maxgap argument which only fills gaps that are no longer than specified. The idea is that interpolation is only reliable over short stretches so you should not interpolate at all over longer ones. Nevertheless, the following shows how to implement the scheme in the question but consider whether you should change strategy and use maxgap instead.

1) Compute na.locf using na.locf0 giving locf and also create a grouping variable for stretches of NA's and non-NAs, g. Then for each run of NA's take the first two elements of the na.locf column and fill the remainder with NA's from value. This does not overwrite df so it can be used in a pipeline without side effects.

library(data.table)
library(zoo)

df[, .(values, locf = na.locf0(values), g = rleid(is.na(values))), by = country][
, .(values = c(head(locf, 2), tail(values, -2))), by = .(country, g)][
, .(country, values)]

giving:

    country values
1: USA 2
2: USA 1
3: USA 1
4: USA 1
5: USA NA
6: FR 2
7: FR 1
8: FR 2
9: FR 2
10: FR 2

2) A slightly modified formulation that still uses the same basic idea is the following. It also does not overwrite.

library(data.table)
library(zoo)

# like na.locf0 but only specifies vector, x, and limit to fill, k
na.locf2 <- function(x, k) {
nalocf <- na.locf0(x)
f <- function(ix) c(head(nalocf[ix], k), tail(x[ix], -k))
unlist(tapply(seq_along(x), rleid(is.na(x)), f))
}
df[, .(values = na.locf2(values, 2)), by = country]

giving:

    country values
1: USA 2
2: USA 1
3: USA 1
4: USA 1
5: USA NA
6: FR 2
7: FR 1
8: FR 2
9: FR 2
10: FR 2

Last observation carried forward conditional on value and colums

This is the use of accumulate2 ie

df%>%
group_by(ID)%>%
mutate(d = unlist(accumulate2(var,cond[-1],function(z,x,y) if(y) min(z,x,na.rm=TRUE) else max(z,x,na.rm=TRUE))))
# A tibble: 15 x 5
# Groups: ID [3]
ID date cond var d
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 1 1
2 1 2 0 NA 1
3 1 3 0 2 2
4 1 4 1 0 0
5 1 5 0 NA 0
6 2 1 0 NA NA
7 2 2 0 3 3
8 2 3 0 NA 3
9 2 4 1 0 0
10 2 5 0 NA 0
11 3 1 0 2 2
12 3 2 0 NA 2
13 3 3 0 1 2
14 3 4 0 NA 2
15 3 5 0 NA 2

Last observation carried forward in a data frame

Sorry for writing an answer (not enough reputation to just comment)

But what prevents you from doing as @ProcrastinatusMaximus said?
(you can include the zoo call in your loop)

Would look like this:

for (i in badcols)
{
df[,i] <- zoo::na.locf(df[,i])
}

I am not sure if zoo is faster than your implementation. You would have to try this out. You could also check spacetime::na.locf, imputeTS::na.locf to see which of the existing locf implementations is the fastest.

Carry last observation forward (na.locf?) - but only when another column value changes

You can use fill:

df %>% 
mutate(sum = rowSums(!is.na(select(., DK0004612884:DK0002025170))),
across(DK0004612884:DK0002025170, ~ case_when(sum > 0 ~ replace_na(.x, 0)))) %>%
fill(DK0004612884:DK0002025170)

output

         Date DK0004612884 DK0002025170     ym sum
1 2021-01-29 14.48 0.58 2021-1 2
2 2021-02-01 14.48 0.58 2021-2 0
3 2021-02-02 14.48 0.58 2021-2 0
4 2021-02-03 14.48 0.58 2021-2 0
5 2021-02-04 14.48 0.58 2021-2 0
6 2021-02-05 14.48 0.58 2021-2 0
7 2021-02-08 14.48 0.58 2021-2 0
8 2021-02-09 14.48 0.58 2021-2 0
9 2021-02-10 14.48 0.58 2021-2 0
10 2021-02-11 14.48 0.58 2021-2 0
11 2021-02-12 14.48 0.58 2021-2 0
12 2021-02-15 14.48 0.58 2021-2 0
13 2021-02-16 14.48 0.58 2021-2 0
14 2021-02-17 14.48 0.58 2021-2 0
15 2021-02-18 14.48 0.58 2021-2 0
16 2021-02-19 14.48 0.58 2021-2 0
17 2021-02-22 14.48 0.58 2021-2 0
18 2021-02-23 14.48 0.58 2021-2 0
19 2021-02-24 14.48 0.58 2021-2 0
20 2021-02-25 14.48 0.58 2021-2 0
21 2021-02-26 14.15 0.00 2021-2 1
22 2021-03-01 14.15 0.00 2021-3 0
23 2021-03-02 14.15 0.00 2021-3 0
24 2021-03-03 14.15 0.00 2021-3 0
25 2021-03-04 14.15 0.00 2021-3 0
26 2021-03-05 14.15 0.00 2021-3 0
27 2021-03-08 14.15 0.00 2021-3 0
28 2021-03-09 14.15 0.00 2021-3 0
29 2021-03-10 14.15 0.00 2021-3 0
30 2021-03-11 14.15 0.00 2021-3 0
31 2021-03-12 14.15 0.00 2021-3 0

R Replace Intermittent NA Values With Last Observation Carried Forward (NA.LOCF)

We can use na.locf(..., fromLast = TRUE) to identify the trailing NA values and use pmax on them with the baseline. We'll demonstrate on the examples from your question in a nice all-together format:

# consolidate example data
dd = data.frame(
example = rep(1:3, each = 10),
visit = rep(1:10, 3),
value = c(34,NA,NA,15,16,19,NA,12,23,31,
34,22,18,15,16,19,NA,NA,NA,NA,
34,NA,NA,42,16,19,NA,38,NA,NA),
goal = c(34,34,34,15,16,19,19,12,23,31,
34,22,18,15,16,19,34,34,34,34,
34,34,34,42,16,19,19,38,38,38)
)

library(dplyr)
dd = dd %>% group_by(example) %>%
mutate(to_fill = !is.na(zoo::na.locf(value, fromLast = TRUE, na.rm = FALSE)),
result = if_else(to_fill,
zoo::na.locf(value, na.rm = FALSE),
pmax(first(value), zoo::na.locf(value, na.rm = FALSE))),
)

all(dd$goal == dd$result)
# [1] TRUE

As you can see, the result matches the goal column perfectly.



Related Topics



Leave a reply



Submit