Fill NA in a time series only to a limited number
Here's another way:
l <- cumsum(! is.na(x))
c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
# [1] 1 1 1 1 5 5 5 5 NA NA 11 12 12 12 12 NA NA NA 19 20
Edit: my previous answer required that x have no duplicates. The current answer does not.
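For readers who want to see the logic spelled out, here is a minimal Python sketch of the same idea: carry the last observed value forward, but only over a capped number of consecutive missing entries. The function name `fill_forward_limited`, the use of `None` as the missing marker, and the sample vector are assumptions made for illustration; they are not taken from the question.

```python
def fill_forward_limited(values, max_fill):
    """Carry the last observed value forward over at most max_fill missing entries."""
    filled = []
    last_seen = None   # most recent non-missing value (None until one is seen)
    gap_length = 0     # consecutive missing entries since last_seen
    for value in values:
        if value is not None:
            last_seen, gap_length = value, 0
            filled.append(value)
        else:
            gap_length += 1
            # Fill only while the gap is still within the cap.
            filled.append(last_seen if gap_length <= max_fill else None)
    return filled


# Hypothetical input: None marks a missing observation.
x = [1, None, None, None, 5, None, None, None, None, None, 11, 12]
result = fill_forward_limited(x, max_fill=3)
print(result)
# [1, 1, 1, 1, 5, 5, 5, 5, None, None, 11, 12]
```

The cap of 3 corresponds to the `> 4` test in the R one-liner, where position 1 within each group is the observed value itself.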
benchmarks
library(zoo)             # provides coredata(), used in agstudy()
library(microbenchmark)
x <- rep(x, length.out=1e4)
plourde <- function(x) {
l <- cumsum(! is.na(x))
c(NA, x[! is.na(x)])[replace(l, ave(l, l, FUN=seq_along) > 4, 0) + 1]
}
agstudy <- function(x) {
unlist(sapply(split(coredata(x),cumsum(!is.na(x))),
function(sx){
if(length(sx)>3)
sx[2:4] <- rep(sx[1],3)
else sx <- rep(sx[1],length(sx))
sx
}))
}
microbenchmark(plourde(x), agstudy(x))
# Unit: milliseconds
# expr min lq median uq max neval
# plourde(x) 5.30 5.591 6.409 6.774 57.13 100
# agstudy(x) 16.04 16.249 16.454 17.516 20.64 100
Fill NAs in a time series for specific number of dates only
rollapplyr in zoo accepts a vector of widths. We can set each element to the number of points between the current point and the point a month earlier (i.e. the current position minus the position a month ago), so that rollapplyr applies na.locf0 only to the last month of data.
library(lubridate)
library(xts)
tt <- time(tempY)
w <- seq_along(tt) - findInterval(tt %m-% months(1), tt)
Fill <- function(x) {
last <- tail(x, 1)
if (is.na(last)) tail(na.locf0(x), 1) else last
}
rollapplyr(as.zoo(tempY), w, Fill)
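The same "only look back a limited time" rule can be sketched in plain Python. This is an approximation under stated assumptions: it uses a fixed 30-day window rather than a calendar month, and the function name `fill_forward_within` and the sample dates are made up for illustration.

```python
from datetime import date, timedelta


def fill_forward_within(dates, values, window):
    """Fill a missing value from the last observation only if that
    observation is at most `window` older than the current date."""
    filled = []
    last_date, last_value = None, None
    for current_date, value in zip(dates, values):
        if value is not None:
            last_date, last_value = current_date, value
            filled.append(value)
        elif last_date is not None and current_date - last_date <= window:
            filled.append(last_value)
        else:
            filled.append(None)  # last observation is too old, or none exists yet
    return filled


# Hypothetical series: dates must be sorted in ascending order.
dates = [date(2020, 1, 1), date(2020, 1, 15), date(2020, 1, 30),
         date(2020, 2, 20), date(2020, 3, 1)]
values = [10.0, None, None, None, 7.0]
filled = fill_forward_within(dates, values, window=timedelta(days=30))
print(filled)
# [10.0, 10.0, 10.0, None, 7.0]
```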
How to fill down values with limit in R?
One potential solution is to use vec_fill_missing() from the vctrs package, which has a max_fill argument:
library(tidyverse)
library(vctrs)
df <- read.table(text = "DATE ELE.CN
2009-06-30 00:00:00 115942928608
2009-06-28 00:00:00 115942928608
2009-06-27 00:00:00 115942928608
2009-06-26 00:00:00 115942928608
2009-06-24 00:00:00 NA
2009-06-23 00:00:00 NA
2009-06-21 00:00:00 NA
2009-06-20 00:00:00 NA
2009-06-19 00:00:00 NA
2009-06-17 00:00:00 NA", header = TRUE)
df
#> DATE ELE.CN
#> 2009-06-30 00:00:00 115942928608
#> 2009-06-28 00:00:00 115942928608
#> 2009-06-27 00:00:00 115942928608
#> 2009-06-26 00:00:00 115942928608
#> 2009-06-24 00:00:00 NA
#> 2009-06-23 00:00:00 NA
#> 2009-06-21 00:00:00 NA
#> 2009-06-20 00:00:00 NA
#> 2009-06-19 00:00:00 NA
#> 2009-06-17 00:00:00 NA
df %>%
mutate(ELE.CN = vec_fill_missing(ELE.CN, max_fill = 3))
#> DATE ELE.CN
#> 2009-06-30 00:00:00 115942928608
#> 2009-06-28 00:00:00 115942928608
#> 2009-06-27 00:00:00 115942928608
#> 2009-06-26 00:00:00 115942928608
#> 2009-06-24 00:00:00 115942928608
#> 2009-06-23 00:00:00 115942928608
#> 2009-06-21 00:00:00 115942928608
#> 2009-06-20 00:00:00 NA
#> 2009-06-19 00:00:00 NA
#> 2009-06-17 00:00:00 NA
Created on 2022-07-14 by the reprex package (v2.0.1)
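If you are working in Python instead, pandas has a comparable cap: `Series.ffill()` takes a `limit` argument. A minimal sketch, assuming a series shaped like the ELE.CN column above (four observed values followed by six missing ones):

```python
import pandas as pd

# Four observed values followed by six missing ones, as in the example data.
s = pd.Series([115942928608.0] * 4 + [float("nan")] * 6)

# Forward fill, but over at most 3 consecutive missing values.
filled = s.ffill(limit=3)
print(filled.notna().sum())
# 7
```

As with max_fill = 3, the first three missing rows are filled and the remaining three stay missing.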
How to fillna limited by date in a groupby
You can group the dataframe on the columns Security and ID, along with an additional grouper for the column day with its frequency set to 60 days, then use ffill to forward fill the values within each 60-day bin:
g = pd.Grouper(key='day', freq='60d')
df.assign(**df.groupby(["Security","ID", g]).ffill())
day TradeID Security ID BSType Price Quantity
0 2020-01-01 01 GOOGLE ID001 B 105.901 1000000.0
1 2020-01-03 02 GOOGLE ID001 S 106.969 -300000.0
2 2020-01-04 03 APPLE ID001 B NaN NaN
3 2020-01-05 04 GOOGLE ID001 B 107.037 7500000.0
4 2020-01-06 05 GOOGLE ID001 B 107.038 100000.0
5 2020-01-07 06 GOOGLE ID001 S 107.136 -100000.0
6 2020-01-08 07 GOOGLE ID001 S 107.136 -100000.0
7 2020-01-08 08 GOOGLE ID001 S 107.250 -7800000.0
8 2020-06-09 09 GOOGLE ID001 B NaN NaN
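Note that the grouper cuts the timeline into fixed 60-day bins, so a value is never carried across a bin boundary even if the previous observation is only a few days old. If the intent is "fill only when the last observation in the group is at most 60 days old", one possible sketch is below. The column names mirror the sample, the data is made up, and the frame is assumed to be sorted by day within each group.

```python
import pandas as pd

# Hypothetical data; rows are sorted by day within each (Security, ID) group.
df = pd.DataFrame({
    "day": pd.to_datetime(["2020-01-01", "2020-01-20", "2020-02-10", "2020-06-09"]),
    "Security": ["GOOGLE"] * 4,
    "ID": ["ID001"] * 4,
    "Price": [105.9, None, None, None],
})
keys = ["Security", "ID"]

# Date of the most recent non-missing Price, carried forward within each group.
df["last_seen"] = df["day"].where(df["Price"].notna())
df["last_seen"] = df.groupby(keys)["last_seen"].ffill()

# Unlimited forward fill per group, then keep only fills that are recent enough.
candidate = df.groupby(keys)["Price"].ffill()
recent = (df["day"] - df["last_seen"]) <= pd.Timedelta(days=60)
df["Price"] = candidate.where(recent)
df = df.drop(columns="last_seen")
print(df)
```

Here the rows on 2020-01-20 and 2020-02-10 are filled (19 and 40 days after the last observation), while 2020-06-09 stays NaN.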
Backward replacement of NAs in time series only to a limited number of observations
Hacky, but why not just flip your column?
Code
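library(zoo)  # na.locf()
# Using your result as basis
dt$Value <- rev(dt$Value)    # flip, so a forward fill acts as a backward fill
r <- rle(is.na(dt$Value))    # run lengths of the NA / non-NA stretches
dt$backward <- na.locf(dt$Value, fromLast = FALSE, na.rm = FALSE)
# turn fills beyond the third position of each NA run back into NA
is.na(dt$backward) <- sequence(r$lengths) > 3 & rep(r$values, r$lengths)
dt$Value <- rev(dt$Value)    # restore the original order
dt$backward <- rev(dt$backward)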
Result
> dt
Value forward backward
1: NA NA NA
2: NA NA NA
3: NA NA NA
4: NA NA NA
5: NA NA NA
6: NA NA NA
7: NA NA 0.1359223
8: NA NA 0.1359223
9: NA NA 0.1359223
10: 0.1359223 0.1359223 0.1359223
11: NA 0.1359223 NA
12: NA 0.1359223 0.0000000
13: NA 0.1359223 0.0000000
14: NA NA 0.0000000
15: 0.0000000 0.0000000 0.0000000
16: 0.0000000 0.0000000 0.0000000
17: 0.0000000 0.0000000 0.0000000
18: 0.0000000 0.0000000 0.0000000
19: 0.0000000 0.0000000 0.0000000
20: NA 0.0000000 NA
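The flip trick is language-independent. As a rough Python sketch (the function names and the sample list are assumptions for illustration, with `None` as the missing marker): reverse the sequence, run a capped forward fill, and reverse the result back.

```python
def fill_forward_limited(values, max_fill):
    """Forward fill over at most max_fill consecutive missing (None) entries."""
    filled, last_seen, gap_length = [], None, 0
    for value in values:
        if value is not None:
            last_seen, gap_length = value, 0
        else:
            gap_length += 1
        filled.append(last_seen if gap_length <= max_fill else None)
    return filled


def fill_backward_limited(values, max_fill):
    """Backward fill by flipping, forward filling with a cap, and flipping back."""
    return fill_forward_limited(values[::-1], max_fill)[::-1]


# Hypothetical input: only the 3 missing entries just before 0.5 get filled.
values = [None, None, None, None, 0.5, None]
backward = fill_backward_limited(values, max_fill=3)
print(backward)
# [None, 0.5, 0.5, 0.5, 0.5, None]
```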
How to fill nan values from a specific date range in a python time series?
IIUC, you can use simple indexing:
# if needed, convert to datetime
#df.index = pd.to_datetime(df.index)
df.loc[df.index.month==5, 'min'] = df.loc[df.index.month.isin([4,6]), 'min'].mean()
Or, if the 5th month also has non-NaN values that you want to keep:
mask = df.index.month==5
df.loc[mask, 'min'] = (df.loc[mask, 'min']
.fillna(df.loc[df.index.month.isin([4,6]), 'min'].mean())
)
output:
min
date
2013-04-01 12:00:00 16.00
2013-04-02 12:00:00 16.00
2013-05-22 12:00:00 15.88
2013-05-23 12:00:00 15.88
2013-05-24 12:00:00 15.88
2013-05-27 12:00:00 15.88
2013-05-28 12:00:00 15.88
2013-05-29 12:00:00 15.88
2013-05-30 12:00:00 15.88
2013-05-31 12:00:00 15.88
2013-06-03 12:00:00 NaN
2013-06-04 12:00:00 NaN
2013-06-05 12:00:00 NaN
2013-06-06 12:00:00 NaN
2013-06-07 12:00:00 NaN
2013-06-10 12:00:00 NaN
2013-06-11 12:00:00 NaN
2013-06-12 12:00:00 NaN
2013-06-13 12:00:00 NaN
2013-06-14 12:00:00 NaN
2013-06-17 12:00:00 NaN
2013-06-18 12:00:00 NaN
2013-06-19 12:00:00 15.80
2013-06-20 12:00:00 15.80
2013-06-21 12:00:00 15.80
Missing values in Time Series in python
Consider interpolate (available on both Series and DataFrame). This example shows how to fill gaps of any size with a straight line:
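import numpy as np
import pandas as pd

# on pandas >= 2.2, prefer the lowercase alias freq='h'
# float values, so that NaN can be assigned without changing the column dtype
df = pd.DataFrame({'date': pd.date_range(start='2013-01-01', periods=10, freq='H'), 'value': np.arange(10, dtype=float)})
df.loc[2:3, 'value'] = np.nan
df.loc[6, 'value'] = np.nan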
df
date value
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 1.0
2 2013-01-01 02:00:00 NaN
3 2013-01-01 03:00:00 NaN
4 2013-01-01 04:00:00 4.0
5 2013-01-01 05:00:00 5.0
6 2013-01-01 06:00:00 NaN
7 2013-01-01 07:00:00 7.0
8 2013-01-01 08:00:00 8.0
9 2013-01-01 09:00:00 9.0
# assign the result back; inplace=True on a selected column may not update df
df['value'] = df['value'].interpolate(method='linear')
df
date value
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 1.0
2 2013-01-01 02:00:00 2.0
3 2013-01-01 03:00:00 3.0
4 2013-01-01 04:00:00 4.0
5 2013-01-01 05:00:00 5.0
6 2013-01-01 06:00:00 6.0
7 2013-01-01 07:00:00 7.0
8 2013-01-01 08:00:00 8.0
9 2013-01-01 09:00:00 9.0
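If gaps should only be filled up to a certain length, `interpolate` also accepts a `limit` argument that caps the number of consecutive NaNs it fills. A small sketch on a made-up series with the same gap pattern as above:

```python
import numpy as np
import pandas as pd

# Made-up series: a gap of two NaNs, then a gap of one NaN.
s = pd.Series([0.0, 1.0, np.nan, np.nan, 4.0, 5.0, np.nan, 7.0])

# Fill at most one consecutive NaN per gap, working forward.
capped = s.interpolate(method="linear", limit=1)
print(capped.tolist())
# [0.0, 1.0, 2.0, nan, 4.0, 5.0, 6.0, 7.0]
```

The single-NaN gap is filled completely, while the two-NaN gap only gets its first value.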
Replace NA with previous value with limit on number of consecutive NA
Here's an option using na.locf and rle:
library(zoo)
r <- rle(is.na(a))
a <- na.locf(a, na.rm = FALSE)  # na.rm = FALSE keeps leading NAs, so the length matches r
is.na(a) <- sequence(r$lengths) > n & rep(r$values, r$lengths)
a
# [1] 1 1 1 NA NA NA 2 2 1 1 1 NA
So here I first computed the run lengths of the elements in a (including the NA entries), then replaced all NAs using na.locf, and finally turned elements back into NA where their position within a run was greater than n and the run consisted of NAs.