Fill NA in a Time Series Only to a Limited Number

Fill NA in a time series only to a limited number

Here's another way:

l <- cumsum(! is.na(x))
c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
# [1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20

Edit: my previous answer required that x contain no duplicates. The current answer does not.
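The one-liner is dense, so here is a step-by-step sketch of the same index-lookup idea in plain Python. The function name fill_limited, the limit parameter, and the sample input are assumptions made for illustration; the input was chosen so that the result matches the output shown above.

```python
from itertools import accumulate

def fill_limited(x, limit=3):
    """Forward-fill None entries, filling at most `limit` per gap (illustrative helper)."""
    # l[i] counts the observed values up to and including position i,
    # the analogue of cumsum(!is.na(x)); 0 means nothing observed yet.
    l = list(accumulate(int(v is not None) for v in x))
    # Lookup table: slot 0 means "missing", slot k holds the k-th observed value.
    lookup = [None] + [v for v in x if v is not None]
    out = []
    pos, prev = 0, None
    for g in l:
        # pos is the position inside the current group (1 = the observed value itself).
        pos = pos + 1 if g == prev else 1
        prev = g
        # Anything beyond the observed value plus `limit` fills points at slot 0.
        out.append(lookup[g if pos <= limit + 1 else 0])
    return out

# Assumed input, reconstructed from the output shown above.
x = [1, None, None, None, 5, None, None, None, None, None,
     11, 12, None, None, None, None, None, None, 19, 20]
print(fill_limited(x))
```

As in the R version, the group counter doubles as an index into the vector of observed values, so no value is ever compared with another and duplicates are harmless.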

benchmarks

library(zoo)             # provides coredata()
library(microbenchmark)

x <- rep(x, length.out = 1e4)

# Index-lookup approach from above
plourde <- function(x) {
  l <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[replace(l, ave(l, l, FUN = seq_along) > 4, 0) + 1]
}

# Split-and-fill approach
agstudy <- function(x) {
  unlist(sapply(split(coredata(x), cumsum(!is.na(x))),
                function(sx) {
                  if (length(sx) > 3)
                    sx[2:4] <- rep(sx[1], 3)
                  else
                    sx <- rep(sx[1], length(sx))
                  sx
                }))
}

microbenchmark(plourde(x), agstudy(x))
# Unit: milliseconds
# expr min lq median uq max neval
# plourde(x) 5.30 5.591 6.409 6.774 57.13 100
# agstudy(x) 16.04 16.249 16.454 17.516 20.64 100

Fill NAs in a time series for specific number of dates only

rollapplyr in zoo accepts a vector of widths, one per observation. If each width is set to the number of points between the current point and the point one month earlier (i.e. the current position minus the position a month ago), rollapplyr applies na.locf0 to the trailing month only.

library(lubridate)
library(xts)

tt <- time(tempY)
w <- seq_along(tt) - findInterval(tt %m-% months(1), tt)
Fill <- function(x) {
  last <- tail(x, 1)
  if (is.na(last)) tail(na.locf0(x), 1) else last
}

rollapplyr(as.zoo(tempY), w, Fill)
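The rule behind the rolling window can also be written as an explicit loop. The following is only a sketch in plain Python: the name fill_within and the max_age parameter are hypothetical, the sample data are invented, and a fixed 30-day span stands in for the calendar month used above.

```python
from datetime import date, timedelta

def fill_within(times, values, max_age=timedelta(days=30)):
    """Carry the last observation forward only while it is younger than max_age."""
    out = []
    last_t = last_v = None
    for t, v in zip(times, values):
        if v is not None:
            # A real observation: remember when it was seen.
            last_t, last_v = t, v
            out.append(v)
        elif last_t is not None and t - last_t < max_age:
            # Missing, but the last observation is still recent enough.
            out.append(last_v)
        else:
            # Missing and nothing recent to carry forward.
            out.append(None)
    return out

# Invented sample data for illustration.
times = [date(2020, 1, 1), date(2020, 1, 15), date(2020, 2, 20), date(2020, 2, 25)]
values = [1.0, None, None, 2.0]
print(fill_within(times, values))
```

The strict comparison mirrors the window above, which starts just after the point one month back.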

How to fill down values with limit in R?

One option is vec_fill_missing() from the vctrs package, whose max_fill argument caps the number of consecutive missing values that get filled:

library(tidyverse)
library(vctrs)

df <- read.table(text = "DATE ELE.CN
2009-06-30 00:00:00 115942928608
2009-06-28 00:00:00 115942928608
2009-06-27 00:00:00 115942928608
2009-06-26 00:00:00 115942928608
2009-06-24 00:00:00 NA
2009-06-23 00:00:00 NA
2009-06-21 00:00:00 NA
2009-06-20 00:00:00 NA
2009-06-19 00:00:00 NA
2009-06-17 00:00:00 NA", header = TRUE)
df
#> DATE ELE.CN
#> 2009-06-30 00:00:00 115942928608
#> 2009-06-28 00:00:00 115942928608
#> 2009-06-27 00:00:00 115942928608
#> 2009-06-26 00:00:00 115942928608
#> 2009-06-24 00:00:00 NA
#> 2009-06-23 00:00:00 NA
#> 2009-06-21 00:00:00 NA
#> 2009-06-20 00:00:00 NA
#> 2009-06-19 00:00:00 NA
#> 2009-06-17 00:00:00 NA

df %>%
  mutate(ELE.CN = vec_fill_missing(ELE.CN, max_fill = 3))
#> DATE ELE.CN
#> 2009-06-30 00:00:00 115942928608
#> 2009-06-28 00:00:00 115942928608
#> 2009-06-27 00:00:00 115942928608
#> 2009-06-26 00:00:00 115942928608
#> 2009-06-24 00:00:00 115942928608
#> 2009-06-23 00:00:00 115942928608
#> 2009-06-21 00:00:00 115942928608
#> 2009-06-20 00:00:00 NA
#> 2009-06-19 00:00:00 NA
#> 2009-06-17 00:00:00 NA

Created on 2022-07-14 by the reprex package (v2.0.1)
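For readers working in pandas rather than R, the same cap is available through the limit argument of Series.ffill. A minimal sketch, assuming a Series that holds the ELE.CN column from the example above (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Four observed values followed by six missing ones, as in the example above.
ele_cn = pd.Series([115942928608.0] * 4 + [np.nan] * 6)

# Fill at most three consecutive missing values with the last observation.
filled = ele_cn.ffill(limit=3)
print(filled)
```

Only the first three rows of the gap receive a value; the remaining three stay NaN.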

How to fillna limited by date in a groupby

You can group the DataFrame by the Security and ID columns plus a pd.Grouper on the day column with a 60-day frequency, then call ffill so that values are forward-filled only within each 60-day bin:

g = pd.Grouper(key='day', freq='60d')
df.assign(**df.groupby(["Security","ID", g]).ffill())


         day TradeID Security     ID BSType    Price   Quantity
0 2020-01-01      01   GOOGLE  ID001      B  105.901  1000000.0
1 2020-01-03      02   GOOGLE  ID001      S  106.969  -300000.0
2 2020-01-04      03    APPLE  ID001      B      NaN        NaN
3 2020-01-05      04   GOOGLE  ID001      B  107.037  7500000.0
4 2020-01-06      05   GOOGLE  ID001      B  107.038   100000.0
5 2020-01-07      06   GOOGLE  ID001      S  107.136  -100000.0
6 2020-01-08      07   GOOGLE  ID001      S  107.136  -100000.0
7 2020-01-08      08   GOOGLE  ID001      S  107.250 -7800000.0
8 2020-06-09      09   GOOGLE  ID001      B        NaN        NaN
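A Grouper with a fixed frequency bins the dates into fixed windows, so a gap is filled only if it lands in the same bin as the last observation. If the 60 days should instead be counted from the last valid observation, one possible variation is to forward-fill per group and then mask values that are too old. This is a sketch under assumptions: the small frame below is invented, the rows are assumed to be sorted by day within each Security, and the names max_age, last_seen and Price_filled are illustrative.

```python
import numpy as np
import pandas as pd

# Invented sample data, sorted by day within each Security.
df = pd.DataFrame({
    "day": pd.to_datetime(["2020-01-01", "2020-01-20", "2020-06-09",
                           "2020-01-05", "2020-01-06", "2020-01-07"]),
    "Security": ["A", "A", "A", "B", "B", "B"],
    "Price": [100.0, np.nan, np.nan, np.nan, 50.0, np.nan],
})

max_age = pd.Timedelta(days=60)

# Date of the most recent non-missing Price within each Security (NaT if none yet).
last_seen = df["day"].where(df["Price"].notna()).groupby(df["Security"]).ffill()
age = df["day"] - last_seen

# Forward-fill per Security, then keep only values that are at most max_age old.
df["Price_filled"] = df.groupby("Security")["Price"].ffill().where(age <= max_age)
print(df)
```

Comparisons against NaT evaluate to False, so rows that precede any observation stay NaN.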

Backward replacement of NAs in time series only to a limited number of observations


Hacky, but why not just flip your column?

Code

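# Using your result as the basis (na.locf() comes from zoo)
dt$Value <- rev(dt$Value)            # flip so that "backward" becomes "forward"
r <- rle(is.na(dt$Value))            # run lengths of NA / non-NA stretches
dt$backward <- na.locf(dt$Value, fromLast = FALSE, na.rm = FALSE)
# Set back to NA every fill that lies more than 3 positions into its NA run
is.na(dt$backward) <- sequence(r$lengths) > 3 & rep(r$values, r$lengths)
dt$Value <- rev(dt$Value)            # flip both columns back
dt$backward <- rev(dt$backward)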

Result

> dt
        Value   forward  backward
 1:        NA        NA        NA
 2:        NA        NA        NA
 3:        NA        NA        NA
 4:        NA        NA        NA
 5:        NA        NA        NA
 6:        NA        NA        NA
 7:        NA        NA 0.1359223
 8:        NA        NA 0.1359223
 9:        NA        NA 0.1359223
10: 0.1359223 0.1359223 0.1359223
11:        NA 0.1359223        NA
12:        NA 0.1359223 0.0000000
13:        NA 0.1359223 0.0000000
14:        NA        NA 0.0000000
15: 0.0000000 0.0000000 0.0000000
16: 0.0000000 0.0000000 0.0000000
17: 0.0000000 0.0000000 0.0000000
18: 0.0000000 0.0000000 0.0000000
19: 0.0000000 0.0000000 0.0000000
20:        NA 0.0000000        NA
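The flip trick does not depend on R. As a rough illustration in plain Python, a limited backward fill can be built from a limited forward fill by reversing the input and the result. The helper names ffill_limited and bfill_limited are hypothetical, and the input mirrors the Value column shown above.

```python
def ffill_limited(values, limit):
    """Forward-fill None entries, using each observation at most `limit` times."""
    out, last, used = [], None, 0
    for v in values:
        if v is not None:
            last, used = v, 0          # new observation resets the counter
            out.append(v)
        elif last is not None and used < limit:
            used += 1                  # still within the allowed number of fills
            out.append(last)
        else:
            out.append(None)
    return out

def bfill_limited(values, limit):
    """Backward fill = reverse, forward fill, reverse again."""
    return ffill_limited(values[::-1], limit)[::-1]

# Same pattern as the Value column above.
value = [None] * 9 + [0.1359223] + [None] * 4 + [0.0] * 5 + [None]
backward = bfill_limited(value, 3)
print(backward)
```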

How to fill nan values from a specific date range in a python time series?

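If I understand correctly, you can use boolean indexing on the month of the index to assign the mean of the April and June values to every May row: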

# if needed, convert to datetime
#df.index = pd.to_datetime(df.index)

df.loc[df.index.month==5, 'min'] = df.loc[df.index.month.isin([4,6]), 'min'].mean()

Or, if some May values are already present and should be kept:

mask = df.index.month == 5
df.loc[mask, 'min'] = (
    df.loc[mask, 'min']
      .fillna(df.loc[df.index.month.isin([4, 6]), 'min'].mean())
)

Output:

                       min
date
2013-04-01 12:00:00 16.00
2013-04-02 12:00:00 16.00
2013-05-22 12:00:00 15.88
2013-05-23 12:00:00 15.88
2013-05-24 12:00:00 15.88
2013-05-27 12:00:00 15.88
2013-05-28 12:00:00 15.88
2013-05-29 12:00:00 15.88
2013-05-30 12:00:00 15.88
2013-05-31 12:00:00 15.88
2013-06-03 12:00:00 NaN
2013-06-04 12:00:00 NaN
2013-06-05 12:00:00 NaN
2013-06-06 12:00:00 NaN
2013-06-07 12:00:00 NaN
2013-06-10 12:00:00 NaN
2013-06-11 12:00:00 NaN
2013-06-12 12:00:00 NaN
2013-06-13 12:00:00 NaN
2013-06-14 12:00:00 NaN
2013-06-17 12:00:00 NaN
2013-06-18 12:00:00 NaN
2013-06-19 12:00:00 15.80
2013-06-20 12:00:00 15.80
2013-06-21 12:00:00 15.80

Missing values in Time Series in python

Consider interpolate, which is available on both Series and DataFrame. This example fills gaps of any size with a straight line between the neighboring observations:

import numpy as np
import pandas as pd

# Float values so that NaN can be assigned without changing the dtype
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=10, freq='H'),
                   'value': np.arange(10, dtype=float)})
df.loc[2:3, 'value'] = np.nan
df.loc[6, 'value'] = np.nan
df
                 date  value
0 2013-01-01 00:00:00    0.0
1 2013-01-01 01:00:00    1.0
2 2013-01-01 02:00:00    NaN
3 2013-01-01 03:00:00    NaN
4 2013-01-01 04:00:00    4.0
5 2013-01-01 05:00:00    5.0
6 2013-01-01 06:00:00    NaN
7 2013-01-01 07:00:00    7.0
8 2013-01-01 08:00:00    8.0
9 2013-01-01 09:00:00    9.0

# Assign the result back instead of using inplace=True on a column selection
df['value'] = df['value'].interpolate(method='linear')
df
                 date  value
0 2013-01-01 00:00:00    0.0
1 2013-01-01 01:00:00    1.0
2 2013-01-01 02:00:00    2.0
3 2013-01-01 03:00:00    3.0
4 2013-01-01 04:00:00    4.0
5 2013-01-01 05:00:00    5.0
6 2013-01-01 06:00:00    6.0
7 2013-01-01 07:00:00    7.0
8 2013-01-01 08:00:00    8.0
9 2013-01-01 09:00:00    9.0
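If long gaps should not be bridged completely, interpolate also takes a limit argument that caps how many consecutive NaN values are filled in each gap. A small sketch on an invented Series (the names s and capped are illustrative):

```python
import numpy as np
import pandas as pd

# Invented data: one gap of two values and one gap of a single value.
s = pd.Series([0.0, 1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0])

# Interpolate linearly, but fill at most one consecutive NaN per gap.
capped = s.interpolate(method="linear", limit=1)
print(capped)
```

With the default forward direction, the first NaN of the two-value gap is interpolated and the second is left missing.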

Replace NA with previous value with limit on number of consecutive NA

Here's an option using na.locf and rle, where n is the maximum number of consecutive NAs to fill:

library(zoo)
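r <- rle(is.na(a))                # run lengths of NA / non-NA stretches
a <- na.locf(a, na.rm = FALSE)    # fill everything; keep leading NAs so lengths still match
is.na(a) <- sequence(r$lengths) > n & rep(r$values, r$lengths)  # undo fills beyond n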
a
# [1] 1 1 1 NA NA NA 2 2 1 1 1 NA

So here I first computed the run lengths of a (NA and non-NA stretches count as separate runs), then filled all NAs using na.locf, and finally set back to NA those elements that were originally NA and sit more than n positions into their run.
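The run-length idea can be sketched in plain Python with itertools.groupby, which plays the role of rle here. The function name ffill_by_runs is hypothetical, and the input is an assumption reconstructed so that n = 2 gives the output shown above.

```python
from itertools import groupby

def ffill_by_runs(a, n):
    """Fill each run of None with the preceding value, for at most n entries per run."""
    out, last = [], None
    # groupby splits the sequence into alternating runs of missing / observed values.
    for is_missing, run in groupby(a, key=lambda v: v is None):
        run = list(run)
        if not is_missing:
            out.extend(run)
            last = run[-1]
        else:
            # Fill the first k entries of the run; a leading run has nothing to copy.
            k = min(n, len(run)) if last is not None else 0
            out.extend([last] * k + [None] * (len(run) - k))
    return out

# Assumed input, reconstructed from the output shown above.
a = [1, None, None, None, None, None, 2, None, 1, None, None, None]
print(ffill_by_runs(a, 2))
```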


