Handle Continuous Missing Values in Time-Series Data

The zoo package has several functions for dealing with NA values. One of the following functions might suit your needs:

  • na.locf: Last observation carried forward (LOCF). Using the parameter fromLast = TRUE gives next observation carried backward (NOCB) instead.
  • na.aggregate: Replaces NAs with an aggregated value. The default aggregation function is the mean, but you can specify other functions as well; see ?na.aggregate for more info.
  • na.approx: NAs are replaced with linearly interpolated values.

You can compare the outcomes to see what these functions do:

library(zoo)
df$V.loc <- na.locf(df$V2)      # last observation carried forward
df$V.agg <- na.aggregate(df$V2) # replace NAs with the column mean
df$V.app <- na.approx(df$V2)    # linear interpolation

This results in:

> df
V1 V2 V.loc V.agg V.app
1 2015-04-26 23:00:00 5704.27389 5704.27389 5704.27389 5704.27389
2 2015-04-27 00:00:00 4470.30868 4470.30868 4470.30868 4470.30868
3 2015-04-27 01:00:00 4552.57242 4552.57242 4552.57242 4552.57242
4 2015-04-27 02:00:00 4570.22250 4570.22250 4570.22250 4570.22250
5 2015-04-27 03:00:00 NA 4570.22250 5454.64894 6602.01119
6 2015-04-27 04:00:00 NA 4570.22250 5454.64894 8633.79987
7 2015-04-27 05:00:00 NA 4570.22250 5454.64894 10665.58856
8 2015-04-27 06:00:00 12697.37724 12697.37724 12697.37724 12697.37724
9 2015-04-27 07:00:00 5538.71119 5538.71119 5538.71119 5538.71119
10 2015-04-27 08:00:00 81.95061 81.95061 81.95061 81.95061
11 2015-04-27 09:00:00 8550.65817 8550.65817 8550.65817 8550.65817
12 2015-04-27 10:00:00 2925.76573 2925.76573 2925.76573 2925.76573

Used data:

df <- structure(list(V1 = structure(c(1430082000, 1430085600, 1430089200, 1430092800, 1430096400, 1430100000, 1430103600, 1430107200, 1430110800, 1430114400, 1430118000, 1430121600), class = c("POSIXct", "POSIXt"), tzone = ""), V2 = c(5704.27388916016, 4470.30868326823, 4552.57241617839, 4570.22250032826, NA, NA, NA, 12697.3772408622, 5538.71119009654, 81.950606473287, 8550.65816895301, 2925.76573206584)), .Names = c("V1", "V2"), row.names = c(NA, -12L), class = "data.frame")

Addition:

The imputeTS and forecast packages also provide time-series-specific functions for dealing with NAs, including some more advanced methods.

For example:

 library("imputeTS")

# Moving Average Imputation
na_ma(df$V2)

# Imputation via Kalman Smoothing on structural time series models
na_kalman(df$V2)

# Just interpolation but with some nice options (linear, spline,stine)
na_interpolation(df$V2)

or

library("forecast")

# Interpolation via seasonal decomposition
na.interp(df$V2)

Handling missing values in time series

It is always better to include a specific example with the expected output, so that there is little room for ambiguity and assumption. However, I have created dummy data based on my understanding and tried to solve it accordingly.

If I have understood you correctly, you have time-series data with a data point every second, but some seconds are missing, and you want to fill those gaps with the mean of the respective column.

We can achieve this using complete to generate a sequence for every second between the minimum and maximum Time_Stamp, and then fill the missing values with the mean of each column. ID looks like a unique identifier for each row, so it is filled with row_number().

library(dplyr)
library(tidyr)

df %>%
  complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
  mutate_at(vars(A:C), ~replace(., is.na(.), mean(., na.rm = TRUE))) %>%
  mutate(ID = row_number())

# A tibble: 11 x 5
# Time_Stamp ID A B C
# <dttm> <int> <dbl> <dbl> <dbl>
# 1 2018-02-02 07:45:00 1 123 567 434
# 2 2018-02-02 07:45:01 2 234 100 110
# 3 2018-02-02 07:45:02 3 234 100 110
# 4 2018-02-02 07:45:03 4 176. 772. 744.
# 5 2018-02-02 07:45:04 5 176. 772. 744.
# 6 2018-02-02 07:45:05 6 176. 772. 744.
# 7 2018-02-02 07:45:06 7 176. 772. 744.
# 8 2018-02-02 07:45:07 8 176. 772. 744.
# 9 2018-02-02 07:45:08 9 176. 772. 744.
#10 2018-02-02 07:45:09 10 176. 772. 744.
#11 2018-02-02 07:45:10 11 112 2323 2323

If you check the column means of the last three columns, you can see that those values were replaced accurately.

colMeans(df[3:5])
# A B C
#175.75 772.50 744.25

Used data:

df <- structure(list(ID = 1:4, Time_Stamp = structure(c(1517557500, 
1517557501, 1517557502, 1517557510), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), A = c(123L, 234L, 234L, 112L), B = c(567L,
100L, 100L, 2323L), C = c(434L, 110L, 110L, 2323L)), class = "data.frame",
row.names = c(NA, -4L))

which looks like

df

# ID Time_Stamp A B C
#1 1 2018-02-02 07:45:00 123 567 434
#2 2 2018-02-02 07:45:01 234 100 110
#3 3 2018-02-02 07:45:02 234 100 110
#4 4 2018-02-02 07:45:10 112 2323 2323

How to deal with consecutive missing values of a stock price in a time series using Python?

Depending on where your data come from, a missing value at a given time may mean that at that particular timestamp an order was executed for one of the two stocks but not for the other. There is in fact no reason for two different stocks to trade at exactly the same time: dormant stocks with little liquidity can go a long time without being traded, while others are more active. Moreover, given that the data are precise down to the microsecond, it is no surprise that trades on both stocks do not necessarily happen at the exact same microsecond. In such cases it is safe to assume that the price of the stock is that of its last recorded transaction, and to update the missing values accordingly. Assuming you are using pandas, you can harmonize the two series by applying the fillna method; just make sure to sort your data frame by time beforehand:

df.sort_values('Time', inplace=True)
df['Series1'].fillna(method='ffill', inplace=True)
df['Series2'].fillna(method='ffill', inplace=True)
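
For reference, here is a minimal self-contained sketch of the same idea with made-up tick data (the timestamps and prices below are hypothetical). It uses .ffill(), which is equivalent to fillna(method='ffill') and avoids the deprecation warning in recent pandas versions:

import pandas as pd
import numpy as np

# Hypothetical tick data: the two stocks trade at different microseconds,
# so each row has a price for one series and NaN for the other.
df = pd.DataFrame({
    'Time': pd.to_datetime([
        '2021-01-04 09:30:00.000001', '2021-01-04 09:30:00.000003',
        '2021-01-04 09:30:00.000002', '2021-01-04 09:30:00.000005',
    ]),
    'Series1': [100.0, np.nan, 100.5, np.nan],
    'Series2': [np.nan, 50.2, np.nan, 50.4],
})

# Sort by time first, then carry the last recorded transaction forward.
df = df.sort_values('Time')
df['Series1'] = df['Series1'].ffill()
df['Series2'] = df['Series2'].ffill()
print(df)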

Handling missing values in time series replacing with previous values

The following code works perfectly

df1 <- df %>%
  complete(Timestamp = seq(min(Timestamp), max(Timestamp), by = "sec")) %>%
  fill(everything()) %>%
  mutate(ID = row_number())

It adds the missing rows and fills them with the last observed value before each gap.

Fill NaN values in continuous time-series data where some timeframes are missing

Create an empty dataframe with rng as index:

skeleton = pd.DataFrame(index=rng)

Convert the original dates to numpy.datetime64 so that they are compatible with the time range:

df['datetime_ns'] = df['datetime'].astype(numpy.datetime64)

Perform an outer join of the frames on index and datetime_ns:

new_df = df.merge(skeleton, left_on='datetime_ns',right_index=True,how='outer')

Sort the new dataframe, if necessary:

new_df.sort_values('datetime_ns', inplace=True)
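
Putting the steps together, a minimal end-to-end sketch could look like the following; rng, the 'datetime' column and the 'value' column are placeholders for whatever your actual range and frame contain:

import pandas as pd

# Hypothetical inputs: rng is the complete, regular time range to cover;
# df holds the observed (gappy) data with a 'datetime' column.
rng = pd.date_range('2021-01-01 00:00', periods=6, freq='h')
df = pd.DataFrame({
    'datetime': pd.to_datetime(['2021-01-01 00:00', '2021-01-01 02:00',
                                '2021-01-01 05:00']),
    'value': [1.0, 2.0, 3.0],
})

# 1. Empty skeleton indexed by the full range.
skeleton = pd.DataFrame(index=rng)

# 2. Make the observed timestamps comparable with the skeleton's index.
df['datetime_ns'] = df['datetime'].astype('datetime64[ns]')

# 3. Outer join: timestamps present only in the skeleton get NaN values.
new_df = df.merge(skeleton, left_on='datetime_ns', right_index=True, how='outer')

# 4. Restore chronological order.
new_df = new_df.sort_values('datetime_ns')
print(new_df)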

How to fill missing observations in time series data

Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:

df['date'] = pd.to_datetime(df['date'])

df = df.set_index('date').asfreq('d').rolling('7D').mean()

If you need values for the whole year, use:

df['date'] = pd.to_datetime(df['date'])

idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
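
For context, here is a small self-contained sketch of both variants; the sample frame and the 'date' and 'value' column names are made up for illustration:

import pandas as pd

# Hypothetical daily data with gaps.
df = pd.DataFrame({
    'date': ['2020-01-01', '2020-01-02', '2020-01-05', '2020-01-09'],
    'value': [1.0, 2.0, 4.0, 8.0],
})
df['date'] = pd.to_datetime(df['date'])

# Daily frequency: missing days become NaN rows, then a trailing 7-day
# window mean is taken (NaN values are skipped within each window).
daily = df.set_index('date').asfreq('d').rolling('7D').mean()

# Full calendar year instead: reindex against an explicit date range.
idx = pd.date_range('2020-01-01', '2020-12-31')
full_year = df.set_index('date').reindex(idx).rolling('7D').mean()

print(daily)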

Counting consecutive NaN values in a pandas time series

Use groupby and agg:

mask = df.Valeurs.isna()
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)

Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
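
If you want to try the pattern on its own, here is a sketch with hypothetical data (the 'Valeurs' column name mirrors the one above, but the timestamps and values are made up):

import pandas as pd
import numpy as np

# Hypothetical series sampled every 4 hours, with NaN runs of length 1, 2 and 3.
idx = pd.date_range('2018-01-01', periods=12, freq='4h')
df = pd.DataFrame({'Valeurs': [1, np.nan, 3, np.nan, np.nan, 6,
                               7, np.nan, np.nan, np.nan, 11, 12]}, index=idx)

# Each run of consecutive NaNs shares one value of (~mask).cumsum(),
# so grouping the NaN timestamps by it yields one row per gap.
mask = df.Valeurs.isna()
d = (df.index.to_series()[mask]
       .groupby((~mask).cumsum()[mask])
       .agg(['first', 'size'])
       .rename(columns={'first': 'Start_Date', 'size': 'num of contig null'})
       .reset_index(drop=True))
print(d)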

