Creating a ts time series with missing values from a data frame
Instead of using the left_join
an easier option is complete
, convert it to a tsibble
object which is now compatible with the forecast
package functions
library(tidyverse)
library(tsibble)
time_data %>%
complete(date = seq(min(date), max(date), by = "1 month"),
fill = list(value = NA)) %>%
as_tsibble(index = date)
# A tsibble: 94 x 2 [1D]
# date value
# <date> <dbl>
# 1 2010-02-01 1.02
# 2 2010-03-01 NA
# 3 2010-04-01 NA
# 4 2010-05-01 1.75
# 5 2010-06-01 NA
# 6 2010-07-01 NA
# 7 2010-08-01 -0.233
# 8 2010-09-01 NA
# 9 2010-10-01 NA
#10 2010-11-01 -0.987
# ... with 84 more rows
As mentioned above, it is compatible withe forecast
functions
library(fable)
time_data %>%
complete(date = seq(min(date), max(date), by = "1 month"),
fill = list(value = 0)) %>%
as_tsibble(index = date) %>%
ETS(value) %>%
forecast %>%
autoplot
NOTE: Here, the missing values are imputed as 0.
It can be imputed with the previous non-NA value with fill
time_data %>%
complete(date = seq(min(date), max(date), by = "1 month")) %>%
fill(value) %>%
as_tsibble(index = date) %>%
ETS(value) %>%
forecast %>%
autoplot
data
n_dates <- 3
R: ts() with NA data
Any idea how to deal with the NA, if that is the reason?
You need to use na.action = na.omit
, i.e., dropping NA
when doing computation.
plot(stl(ts.dat, s.window=12, na.action = na.omit))
The na.pass
will simply assume NA
as normal observation. But it will still produce error as stl()
later calls compiled code and can not recognize NA
.
Any chance to take the first date from
dat
as the start?
Have a look at the examples at the bottom of ?ts
:
## Using July 1954 as start date:
gnp <- ts(cumsum(1 + round(rnorm(100), 2)),
start = c(1954, 7), frequency = 12)
To start from July 1954, put start = c(1954, 7)
.
Fill missing values in time series using previous day data - R
Edit
Thanks to @G. Grothendieck to mention that na.locf0
has maxgap
argument which can handle the 5-day condition directly.
data[-1] <- lapply(data[-1], zoo::na.locf0, maxgap = 5)
data
Earlier Answer
You can write a function with rle
and zoo::na.locf0
to replace NA
only if the length of consecutive NA
is less than equal to 5. Apply this function for multiple columns with lapply
.
conditionally_replace_na <- function(x) {
ifelse(with(rle(is.na(x)), rep(lengths, lengths)) <= 5 & is.na(x),
zoo::na.locf0(x), x)
}
data[-1] <- lapply(data[-1], conditionally_replace_na)
data
# Date time_series_1 time_series_2 time_series_3
#1 01-01-2019 NA 10 8
#2 02-01-2019 5 10 10
#3 03-01-2019 10 10 20
#4 04-01-2019 20 6 40
#5 05-01-2019 30 6 40
#6 06-01-2019 30 8 40
#7 07-01-2019 7 NA 40
#8 08-01-2019 5 NA 40
39 09-01-2019 NA NA 5
#10 10-01-2019 NA NA 5
#11 11-01-2019 NA NA 7
#12 12-01-2019 NA NA 10
#13 13-01-2019 NA NA 11
#14 14-01-2019 NA NA 12
#15 15-01-2019 NA NA 12
#16 16-01-2019 NA NA 9
#17 17-01-2019 NA NA 10
#18 18-01-2019 NA NA 10
#19 19-01-2019 5 NA 11
#20 20-01-2019 5 NA 11
#21 21-01-2019 5 NA 11
#22 22-01-2019 6 NA 11
Function can also be applied with dplyr::across
library(dplyr)
data %>% mutate(across(starts_with('time_series'), conditionally_replace_na))
Handling missing values in time series
It is always better to have a specific example showing specific expected output so that there is little space for ambiguity and assumption. However, I have created a dummy data based on my understanding and tried to solve it accordingly.
If I have understood you correctly, you have time series data with data point every second but sometimes there are some seconds missing which you want to fill it with mean
of that column.
We can achieve this using complete
by generating a sequence for every second between the min
and max
Time_Stamp
and fill the missing values by the mean
in the respective column. ID
looks like an unique identifier for each row so filled it with row_number()
.
library(dplyr)
library(tidyr)
df %>%
complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
mutate_at(vars(A:C), ~replace(., is.na(.), mean(., na.rm = TRUE))) %>%
mutate(ID = row_number())
# A tibble: 11 x 5
# Time_Stamp ID A B C
# <dttm> <int> <dbl> <dbl> <dbl>
# 1 2018-02-02 07:45:00 1 123 567 434
# 2 2018-02-02 07:45:01 2 234 100 110
# 3 2018-02-02 07:45:02 3 234 100 110
# 4 2018-02-02 07:45:03 4 176. 772. 744.
# 5 2018-02-02 07:45:04 5 176. 772. 744.
# 6 2018-02-02 07:45:05 6 176. 772. 744.
# 7 2018-02-02 07:45:06 7 176. 772. 744.
# 8 2018-02-02 07:45:07 8 176. 772. 744.
# 9 2018-02-02 07:45:08 9 176. 772. 744.
#10 2018-02-02 07:45:09 10 176. 772. 744.
#11 2018-02-02 07:45:10 11 112 2323 2323
If you check the column means for last 3 columns, you can see those value are accurately replaced.
colMeans(df[3:5])
# A B C
#175.75 772.50 744.25
data
df <- structure(list(ID = 1:4, Time_Stamp = structure(c(1517557500,
1517557501, 1517557502, 1517557510), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), A = c(123L, 234L, 234L, 112L), B = c(567L,
100L, 100L, 2323L), C = c(434L, 110L, 110L, 2323L)), class = "data.frame",
row.names = c(NA, -4L))
which looks like
df
# ID Time_Stamp A B C
#1 1 2018-02-02 07:45:00 123 567 434
#2 2 2018-02-02 07:45:01 234 100 110
#3 3 2018-02-02 07:45:02 234 100 110
#4 4 2018-02-02 07:45:10 112 2323 2323
Handling missing values in time series replacing with previous values
The following code works perfectly
df1<- df %>%
complete(Timestamp = seq(min(Timestamp), max(Timestamp), by = "sec")) %>%
fill(everything()) %>%
mutate(ID = row_number())
It adds missing data with the previous or last value before the missing data time is started.
Complete missing values in time series using previous day data - using R
If I understand correctly, the trick is that you want to fill downward except for the bottommost NAs. And the problem with tidyr
's fill
is that it goes all the way down.
This isn't a fully-tidyverse solution, but for this data:
library(dplyr)
library(tidyr)
data <- tribble(
~Date, ~time_series_1, ~time_series_2,
as.Date("2019-01-01"), NA, 10,
as.Date("2019-02-01"), 5, NA,
as.Date("2019-03-01"), 10, NA,
as.Date("2019-04-01"), 20, 6,
as.Date("2019-05-01"), 30, NA,
as.Date("2019-06-01"), NA, 8,
as.Date("2019-07-01"), 7, NA,
as.Date("2019-08-01"), 5, NA,
as.Date("2019-09-01"), NA, NA,
as.Date("2019-10-01"), NA, NA
)
You can determine the ending date for each time series separately:
LastTS1Date <- with( data, max(Date[!is.na(time_series_1)]))
LastTS2Date <- with( data, max(Date[!is.na(time_series_2)]))
And then use baseR filter syntax to only change the part of the data frame that goes up to those dates:
data[data$Date <= LastTS1Date,] <-
data[data$Date <= LastTS1Date,] %>% fill(time_series_1)
data[data$Date <= LastTS2Date,] <-
data[data$Date <= LastTS2Date,] %>% fill(time_series_2)
Time series missing value imputation: How to use maxgap inside na_kalman?
However I want to avoid the replacement of any NA series longer than 1 and set maxgap equals 1
For me running your code with maxgap = 1 does exactly this.
In the input series you can see multiple NA gaps. Mostly only 1 single NA and only one series of 3 consecutive NAs.
After applying na_kalman
with maxgap = 1
all single NAs get imputed as expected. The longer gap with 3 consecutive NAs is not changed.
Related Topics
How to Add My Outlook Email Signature to the Com Object Using Rdcomclient
Convert Time Object to Categorical (Morning, Afternoon, Evening, Night) Variable in R
Tooltip or Popover in Shiny Datatables for Row Names
How to Melt R Data.Frame and Plot Group by Bar Plot
Changing Class and Mode from Character to Numeric
R Plots: How to Draw a Border, Shadow or Buffer Around Text Labels
Positioning Shiny Widgets Beside Their Headers
How to Load a Matlab Struct into a R Data Frame
Adding Slight Curve (Or Bend) in Ggplot Geom_Path to Make Path Easier to Read
Trouble Installing and Loading Rjava on MAC El Capitan
How to Write Special Characters in Rmarkdown Latex Documents
Generate Rows Between Two Dates into a Data Frame in R
Transposition of a Tibble Using Pivot_Longer() and Pivot_Wider (Tidyverse)
Replace Nas in One Variable with Values from Another Variable