R Ts with Missing Values

Creating a ts time series with missing values from a data frame

Instead of using the left_join an easier option is complete, convert it to a tsibble object which is now compatible with the forecast package functions

library(tidyverse)
library(tsibble)
time_data %>%
complete(date = seq(min(date), max(date), by = "1 month"),
fill = list(value = NA)) %>%
as_tsibble(index = date)

# A tsibble: 94 x 2 [1D]
# date value
# <date> <dbl>
# 1 2010-02-01 1.02
# 2 2010-03-01 NA
# 3 2010-04-01 NA
# 4 2010-05-01 1.75
# 5 2010-06-01 NA
# 6 2010-07-01 NA
# 7 2010-08-01 -0.233
# 8 2010-09-01 NA
# 9 2010-10-01 NA
#10 2010-11-01 -0.987
# ... with 84 more rows

As mentioned above, it is compatible withe forecast functions

library(fable)
time_data %>%
complete(date = seq(min(date), max(date), by = "1 month"),
fill = list(value = 0)) %>%
as_tsibble(index = date) %>%
ETS(value) %>%
forecast %>%
autoplot

NOTE: Here, the missing values are imputed as 0.

Sample Image

It can be imputed with the previous non-NA value with fill

time_data %>% 
complete(date = seq(min(date), max(date), by = "1 month")) %>%
fill(value) %>%
as_tsibble(index = date) %>%
ETS(value) %>%
forecast %>%
autoplot

data

n_dates <- 3

R: ts() with NA data

Any idea how to deal with the NA, if that is the reason?

You need to use na.action = na.omit, i.e., dropping NA when doing computation.

plot(stl(ts.dat, s.window=12, na.action = na.omit))

stl

The na.pass will simply assume NA as normal observation. But it will still produce error as stl() later calls compiled code and can not recognize NA.

Any chance to take the first date from dat as the start?

Have a look at the examples at the bottom of ?ts:

 ## Using July 1954 as start date:
gnp <- ts(cumsum(1 + round(rnorm(100), 2)),
start = c(1954, 7), frequency = 12)

To start from July 1954, put start = c(1954, 7).

Fill missing values in time series using previous day data - R

Edit

Thanks to @G. Grothendieck to mention that na.locf0 has maxgap argument which can handle the 5-day condition directly.

data[-1] <- lapply(data[-1], zoo::na.locf0, maxgap = 5)
data

Earlier Answer

You can write a function with rle and zoo::na.locf0 to replace NA only if the length of consecutive NA is less than equal to 5. Apply this function for multiple columns with lapply.

conditionally_replace_na <- function(x) {
ifelse(with(rle(is.na(x)), rep(lengths, lengths)) <= 5 & is.na(x),
zoo::na.locf0(x), x)
}

data[-1] <- lapply(data[-1], conditionally_replace_na)
data

# Date time_series_1 time_series_2 time_series_3
#1 01-01-2019 NA 10 8
#2 02-01-2019 5 10 10
#3 03-01-2019 10 10 20
#4 04-01-2019 20 6 40
#5 05-01-2019 30 6 40
#6 06-01-2019 30 8 40
#7 07-01-2019 7 NA 40
#8 08-01-2019 5 NA 40
39 09-01-2019 NA NA 5
#10 10-01-2019 NA NA 5
#11 11-01-2019 NA NA 7
#12 12-01-2019 NA NA 10
#13 13-01-2019 NA NA 11
#14 14-01-2019 NA NA 12
#15 15-01-2019 NA NA 12
#16 16-01-2019 NA NA 9
#17 17-01-2019 NA NA 10
#18 18-01-2019 NA NA 10
#19 19-01-2019 5 NA 11
#20 20-01-2019 5 NA 11
#21 21-01-2019 5 NA 11
#22 22-01-2019 6 NA 11

Function can also be applied with dplyr::across

library(dplyr)
data %>% mutate(across(starts_with('time_series'), conditionally_replace_na))

Handling missing values in time series

It is always better to have a specific example showing specific expected output so that there is little space for ambiguity and assumption. However, I have created a dummy data based on my understanding and tried to solve it accordingly.

If I have understood you correctly, you have time series data with data point every second but sometimes there are some seconds missing which you want to fill it with mean of that column.

We can achieve this using complete by generating a sequence for every second between the min and max Time_Stamp and fill the missing values by the mean in the respective column. ID looks like an unique identifier for each row so filled it with row_number().

library(dplyr)
library(tidyr)

df %>%
complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
mutate_at(vars(A:C), ~replace(., is.na(.), mean(., na.rm = TRUE))) %>%
mutate(ID = row_number())

# A tibble: 11 x 5
# Time_Stamp ID A B C
# <dttm> <int> <dbl> <dbl> <dbl>
# 1 2018-02-02 07:45:00 1 123 567 434
# 2 2018-02-02 07:45:01 2 234 100 110
# 3 2018-02-02 07:45:02 3 234 100 110
# 4 2018-02-02 07:45:03 4 176. 772. 744.
# 5 2018-02-02 07:45:04 5 176. 772. 744.
# 6 2018-02-02 07:45:05 6 176. 772. 744.
# 7 2018-02-02 07:45:06 7 176. 772. 744.
# 8 2018-02-02 07:45:07 8 176. 772. 744.
# 9 2018-02-02 07:45:08 9 176. 772. 744.
#10 2018-02-02 07:45:09 10 176. 772. 744.
#11 2018-02-02 07:45:10 11 112 2323 2323

If you check the column means for last 3 columns, you can see those value are accurately replaced.

colMeans(df[3:5])
# A B C
#175.75 772.50 744.25

data

df <- structure(list(ID = 1:4, Time_Stamp = structure(c(1517557500, 
1517557501, 1517557502, 1517557510), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), A = c(123L, 234L, 234L, 112L), B = c(567L,
100L, 100L, 2323L), C = c(434L, 110L, 110L, 2323L)), class = "data.frame",
row.names = c(NA, -4L))

which looks like

df

# ID Time_Stamp A B C
#1 1 2018-02-02 07:45:00 123 567 434
#2 2 2018-02-02 07:45:01 234 100 110
#3 3 2018-02-02 07:45:02 234 100 110
#4 4 2018-02-02 07:45:10 112 2323 2323

Handling missing values in time series replacing with previous values

The following code works perfectly

 df1<- df %>%
complete(Timestamp = seq(min(Timestamp), max(Timestamp), by = "sec")) %>%
fill(everything()) %>%
mutate(ID = row_number())

It adds missing data with the previous or last value before the missing data time is started.

Complete missing values in time series using previous day data - using R

If I understand correctly, the trick is that you want to fill downward except for the bottommost NAs. And the problem with tidyr's fill is that it goes all the way down.

This isn't a fully-tidyverse solution, but for this data:

library(dplyr)
library(tidyr)
data <- tribble(
~Date, ~time_series_1, ~time_series_2,
as.Date("2019-01-01"), NA, 10,
as.Date("2019-02-01"), 5, NA,
as.Date("2019-03-01"), 10, NA,
as.Date("2019-04-01"), 20, 6,
as.Date("2019-05-01"), 30, NA,
as.Date("2019-06-01"), NA, 8,
as.Date("2019-07-01"), 7, NA,
as.Date("2019-08-01"), 5, NA,
as.Date("2019-09-01"), NA, NA,
as.Date("2019-10-01"), NA, NA
)

You can determine the ending date for each time series separately:

LastTS1Date <- with( data, max(Date[!is.na(time_series_1)])) 
LastTS2Date <- with( data, max(Date[!is.na(time_series_2)]))

And then use baseR filter syntax to only change the part of the data frame that goes up to those dates:

data[data$Date <= LastTS1Date,] <-
data[data$Date <= LastTS1Date,] %>% fill(time_series_1)

data[data$Date <= LastTS2Date,] <-
data[data$Date <= LastTS2Date,] %>% fill(time_series_2)

Time series missing value imputation: How to use maxgap inside na_kalman?

However I want to avoid the replacement of any NA series longer than 1 and set maxgap equals 1

For me running your code with maxgap = 1 does exactly this.

In the input series you can see multiple NA gaps. Mostly only 1 single NA and only one series of 3 consecutive NAs.

Sample Image

After applying na_kalman with maxgap = 1 all single NAs get imputed as expected. The longer gap with 3 consecutive NAs is not changed.

Sample Image



Related Topics



Leave a reply



Submit