Handle Continuous Missing values in time-series data
The zoo package has several functions for dealing with NA values. One of the following functions might suit your needs:
na.locf: Last observation carried forward. Using the parameter fromLast = TRUE corresponds to next observation carried backward (NOCB).
na.aggregate: Replace the NAs with some aggregated value. The default aggregation function is the mean, but you can specify other functions as well. See ?na.aggregate for more info.
na.approx: NAs are replaced with linearly interpolated values.
You can compare the outcomes to see what these functions do:
library(zoo)
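# na.rm = FALSE keeps leading/trailing NAs so the result always has nrow(df) values
df$V.loc <- na.locf(df$V2, na.rm = FALSE)    # last observation carried forward
df$V.agg <- na.aggregate(df$V2)              # mean of the observed values
df$V.app <- na.approx(df$V2, na.rm = FALSE)  # linear interpolation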
This results in:
> df
V1 V2 V.loc V.agg V.app
1 2015-04-26 23:00:00 5704.27389 5704.27389 5704.27389 5704.27389
2 2015-04-27 00:00:00 4470.30868 4470.30868 4470.30868 4470.30868
3 2015-04-27 01:00:00 4552.57242 4552.57242 4552.57242 4552.57242
4 2015-04-27 02:00:00 4570.22250 4570.22250 4570.22250 4570.22250
5 2015-04-27 03:00:00 NA 4570.22250 5454.64894 6602.01119
6 2015-04-27 04:00:00 NA 4570.22250 5454.64894 8633.79987
7 2015-04-27 05:00:00 NA 4570.22250 5454.64894 10665.58856
8 2015-04-27 06:00:00 12697.37724 12697.37724 12697.37724 12697.37724
9 2015-04-27 07:00:00 5538.71119 5538.71119 5538.71119 5538.71119
10 2015-04-27 08:00:00 81.95061 81.95061 81.95061 81.95061
11 2015-04-27 09:00:00 8550.65817 8550.65817 8550.65817 8550.65817
12 2015-04-27 10:00:00 2925.76573 2925.76573 2925.76573 2925.76573
Used data:
df <- structure(list(V1 = structure(c(1430082000, 1430085600, 1430089200, 1430092800, 1430096400, 1430100000, 1430103600, 1430107200, 1430110800, 1430114400, 1430118000, 1430121600), class = c("POSIXct", "POSIXt"), tzone = ""), V2 = c(5704.27388916016, 4470.30868326823, 4552.57241617839, 4570.22250032826, NA, NA, NA, 12697.3772408622, 5538.71119009654, 81.950606473287, 8550.65816895301, 2925.76573206584)), .Names = c("V1", "V2"), row.names = c(NA, -12L), class = "data.frame")
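For readers working in Python, the three strategies above have rough pandas counterparts. The following is a minimal sketch and not part of the original answer: it reuses the V2 values shown above (rounded to five decimals), and the column names V_loc, V_agg and V_app are chosen only to mirror the R output.

```python
import numpy as np
import pandas as pd

# Same V2 values as in the R data frame above (rounded to 5 decimals).
v2 = pd.Series([5704.27389, 4470.30868, 4552.57242, 4570.22250,
                np.nan, np.nan, np.nan,
                12697.37724, 5538.71119, 81.95061, 8550.65817, 2925.76573])

comparison = pd.DataFrame({
    "V2": v2,
    "V_loc": v2.ffill(),                       # last observation carried forward
    "V_agg": v2.fillna(v2.mean()),             # mean of the observed values
    "V_app": v2.interpolate(method="linear"),  # linear interpolation across the gap
})
print(comparison.round(5))
```

Rows 4 to 6 (zero-based) should reproduce the filled values in the R output: the carried-forward 4570.22250, the mean 5454.64894, and the interpolated ramp towards 12697.37724.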
Addition:
There are also additional time series functions for dealing with NAs in the imputeTS and forecast packages, including some more advanced ones.
For example:
library("imputeTS")
# Moving Average Imputation
na_ma(df$V2)
# Imputation via Kalman Smoothing on structural time series models
na_kalman(df$V2)
# Just interpolation but with some nice options (linear, spline,stine)
na_interpolation(df$V2)
or
library("forecast")
# Linear interpolation; for seasonal series it is applied after a seasonal decomposition
na.interp(df$V2)
Handling missing values in time series
It is always better to provide a specific example with the expected output, so that there is little room for ambiguity and assumptions. That said, I have created dummy data based on my understanding of the problem and solved it accordingly.
If I have understood you correctly, you have time series data with one data point per second, but some seconds are missing, and you want to fill those rows with the mean of each column.
We can achieve this with complete, by generating a sequence for every second between the min and max Time_Stamp, and then filling the missing values with the mean of the respective column. ID looks like a unique identifier for each row, so it is filled with row_number().
library(dplyr)
library(tidyr)
df %>%
  complete(Time_Stamp = seq(min(Time_Stamp), max(Time_Stamp), by = "sec")) %>%
  mutate_at(vars(A:C), ~replace(., is.na(.), mean(., na.rm = TRUE))) %>%
  mutate(ID = row_number())
# A tibble: 11 x 5
# Time_Stamp ID A B C
# <dttm> <int> <dbl> <dbl> <dbl>
# 1 2018-02-02 07:45:00 1 123 567 434
# 2 2018-02-02 07:45:01 2 234 100 110
# 3 2018-02-02 07:45:02 3 234 100 110
# 4 2018-02-02 07:45:03 4 176. 772. 744.
# 5 2018-02-02 07:45:04 5 176. 772. 744.
# 6 2018-02-02 07:45:05 6 176. 772. 744.
# 7 2018-02-02 07:45:06 7 176. 772. 744.
# 8 2018-02-02 07:45:07 8 176. 772. 744.
# 9 2018-02-02 07:45:08 9 176. 772. 744.
#10 2018-02-02 07:45:09 10 176. 772. 744.
#11 2018-02-02 07:45:10 11 112 2323 2323
If you check the means of the last three columns of the original data, you can see that the missing values were replaced with exactly those means.
colMeans(df[3:5])
# A B C
#175.75 772.50 744.25
data
df <- structure(list(ID = 1:4, Time_Stamp = structure(c(1517557500,
1517557501, 1517557502, 1517557510), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), A = c(123L, 234L, 234L, 112L), B = c(567L,
100L, 100L, 2323L), C = c(434L, 110L, 110L, 2323L)), class = "data.frame",
row.names = c(NA, -4L))
which looks like
df
# ID Time_Stamp A B C
#1 1 2018-02-02 07:45:00 123 567 434
#2 2 2018-02-02 07:45:01 234 100 110
#3 3 2018-02-02 07:45:02 234 100 110
#4 4 2018-02-02 07:45:10 112 2323 2323
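A rough pandas equivalent of the complete-then-fill-with-mean approach, for readers not using R. This is only a sketch under the same assumptions as the R answer (one row per second is wanted, and gaps take the column mean); the variable name full is illustrative.

```python
import pandas as pd

# Same dummy data as the R example above.
df = pd.DataFrame({
    "Time_Stamp": pd.to_datetime(["2018-02-02 07:45:00", "2018-02-02 07:45:01",
                                  "2018-02-02 07:45:02", "2018-02-02 07:45:10"]),
    "A": [123, 234, 234, 112],
    "B": [567, 100, 100, 2323],
    "C": [434, 110, 110, 2323],
})

# One row per second between the first and last timestamp; new rows are NaN.
full = df.set_index("Time_Stamp").asfreq("s")

# Replace each column's NaN with the mean of that column's observed values.
full = full.fillna(full.mean())

# Rebuild a running ID, as the R code does with row_number().
full = full.reset_index()
full.insert(0, "ID", range(1, len(full) + 1))
print(full)
```

The filled rows should carry 175.75, 772.50 and 744.25, matching the column means of the original data.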
How to deal with consecutive missing values of stock price in a time series using python?
Depending on where your data come from, a missing value at a given timestamp may simply mean that an order was executed for one of the two stocks but not for the other. There is no reason for two different stocks to trade at exactly the same time: dormant stocks with no liquidity can go a long time without being traded, while others are more active. Moreover, given that the timestamps are precise to the microsecond, it is no surprise that trades in the two stocks do not necessarily happen at the exact same microsecond.
In this case, it is reasonable to assume that the price of the stock is that of the last recorded transaction, and to update the missing values accordingly. Assuming you are using pandas, you can harmonize the two series with a forward fill. Just make sure to sort your data frame by time beforehand:
# Sort by time first, so that "previous" really means the previous trade.
df = df.sort_values('Time')
# Carry the last traded price forward in both series.
df[['Series1', 'Series2']] = df[['Series1', 'Series2']].ffill()
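To make the effect concrete, here is a small sketch on made-up tick data. The timestamps and prices are hypothetical; only the column names Time, Series1 and Series2 come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical ticks: each row is a trade in only one of the two stocks,
# and the rows arrive out of order.
ticks = pd.DataFrame({
    "Time": pd.to_datetime([
        "2020-01-02 09:30:00.000300",
        "2020-01-02 09:30:00.000100",
        "2020-01-02 09:30:00.000200",
        "2020-01-02 09:30:00.000400",
    ]),
    "Series1": [np.nan, 10.0, np.nan, 10.5],
    "Series2": [20.2, np.nan, 20.0, np.nan],
})

# Sort chronologically, then carry the last traded price forward.
ticks = ticks.sort_values("Time").reset_index(drop=True)
ticks[["Series1", "Series2"]] = ticks[["Series1", "Series2"]].ffill()
print(ticks)
```

Note that the first row of Series2 stays NaN: a forward fill has nothing to carry before the first trade, so leading gaps need a separate decision (drop them, or back-fill if that is acceptable for the analysis).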
Handling missing values in time series replacing with previous values
The following code works:
library(dplyr)
library(tidyr)
df1 <- df %>%
  complete(Timestamp = seq(min(Timestamp), max(Timestamp), by = "sec")) %>%
  fill(everything()) %>%
  mutate(ID = row_number())
It adds a row for every missing second and fills it with the last values observed before the gap started.
Fill NaN value to continuous time series data where some timeframe were missing
Create an empty dataframe with rng as index:
skeleton = pd.DataFrame(index=rng)
Convert the original dates to datetime64 so that they are compatible with rng:
df['datetime_ns'] = pd.to_datetime(df['datetime'])
Perform an outer join of the frames on index and datetime_ns:
new_df = df.merge(skeleton, left_on='datetime_ns',right_index=True,how='outer')
Sort the new dataframe, if necessary:
new_df.sort_values('datetime_ns', inplace=True)
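As an alternative to the outer join, the same NaN-padded frame can usually be obtained by reindexing on the full range. The sketch below swaps in that technique and uses made-up hourly data; rng here is an assumed stand-in for the range defined in the question, and the value column is hypothetical.

```python
import pandas as pd

# Hypothetical input: hourly readings with the 02:00 and 03:00 rows missing.
df = pd.DataFrame({
    "datetime": ["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 04:00"],
    "value": [1.0, 2.0, 5.0],
})

# Assumed full range of expected timestamps.
rng = pd.date_range("2021-01-01 00:00", "2021-01-01 04:00", freq="h")

# Parse the dates, index by them, and reindex onto the full range.
# Timestamps absent from the data become rows of NaN.
df["datetime_ns"] = pd.to_datetime(df["datetime"])
filled = df.set_index("datetime_ns")[["value"]].reindex(rng)
print(filled)
```

Reindexing requires the existing timestamps to be unique; if the data can contain duplicate timestamps, the merge-based approach above is the safer choice.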
How to fill missing observations in time series data
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('d').rolling('7D').mean()
If you need a row for every day of the year, use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
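A small sketch of what asfreq followed by a time-based rolling mean does, on made-up data (the value column and the three dates are hypothetical):

```python
import pandas as pd

# Hypothetical data with 2020-01-03 and 2020-01-04 missing.
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-05"],
    "value": [10.0, 20.0, 50.0],
})
df["date"] = pd.to_datetime(df["date"])

# asfreq inserts the missing days as NaN rows.
daily = df.set_index("date").asfreq("D")

# Trailing 7-day window; NaN rows are ignored when averaging.
smoothed = daily.rolling("7D").mean()
print(smoothed)
```

Keep in mind that this replaces every value with its trailing 7-day mean, not only the missing ones. If the observed values should be kept as they are, something like daily.fillna(smoothed) fills only the gaps.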
Counting continuous nan values in pandas Time series
Use groupby and agg:
mask = df.Valeurs.isna()
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
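The grouping key is the least obvious part: (~mask).cumsum() increases by one at every non-null row, so all nulls inside one gap share the same counter value. A self-contained sketch on made-up data (only the column name Valeurs is taken from the answer above):

```python
import numpy as np
import pandas as pd

# Hypothetical 4-hourly series with three runs of NaN.
idx = pd.date_range("2018-01-01", periods=8, freq="4h")
df = pd.DataFrame(
    {"Valeurs": [1.0, np.nan, 2.0, np.nan, np.nan, 3.0, 4.0, np.nan]},
    index=idx,
)

mask = df["Valeurs"].isna()

# Counter of non-null rows seen so far, kept only at the null positions.
# Nulls belonging to the same gap get the same value.
run_id = (~mask).cumsum()[mask]

runs = (
    df.index.to_series()[mask]
    .groupby(run_id)
    .agg(["first", "size"])
    .rename(columns={"first": "Start_Date", "size": "num of contig null"})
    .reset_index(drop=True)
)
print(runs)
```

For this input the runs start at 04:00, 12:00 and 04:00 on the following day, with lengths 1, 2 and 1.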