How to Replace NA (Missing Values) in a Data Frame with Neighbouring Values

How to replace NA (missing values) in a data frame with neighbouring values

Properly formatted, your data looks like this:

862 2006-05-19 6.241603 5.774208 
863 2006-05-20 NA       NA 
864 2006-05-21 NA       NA 
865 2006-05-22 6.383929 5.906426 
866 2006-05-23 6.782068 6.268758 
867 2006-05-24 6.534616 6.013767 
868 2006-05-25 6.370312 5.856366 
869 2006-05-26 6.225175 5.781617 
870 2006-05-27 NA       NA

and is of a time-series nature. I would therefore load it into an object of class zoo (from the zoo package), which lets you choose among several strategies, shown below. Which one you pick depends on the nature of your data and application. In general, the field of 'figuring missing data out' is called data imputation, and there is a rather large literature on it.

R> library(zoo)
R> x <- zoo(X[,3:4], order.by=as.Date(X[,2]))
R> x
               x     y
2006-05-19 6.242 5.774
2006-05-20    NA    NA
2006-05-21    NA    NA
2006-05-22 6.384 5.906
2006-05-23 6.782 6.269
2006-05-24 6.535 6.014
2006-05-25 6.370 5.856
2006-05-26 6.225 5.782
2006-05-27    NA    NA
R> na.locf(x)  # last observation carried forward
               x     y
2006-05-19 6.242 5.774
2006-05-20 6.242 5.774
2006-05-21 6.242 5.774
2006-05-22 6.384 5.906
2006-05-23 6.782 6.269
2006-05-24 6.535 6.014
2006-05-25 6.370 5.856
2006-05-26 6.225 5.782
2006-05-27 6.225 5.782
R> na.approx(x)  # linear interpolation between before/after values; the trailing NA row is dropped
               x     y
2006-05-19 6.242 5.774
2006-05-20 6.289 5.818
2006-05-21 6.336 5.862
2006-05-22 6.384 5.906
2006-05-23 6.782 6.269
2006-05-24 6.535 6.014
2006-05-25 6.370 5.856
2006-05-26 6.225 5.782
R> na.spline(x)   # cubic spline fit; also extrapolates the trailing NA row
               x     y
2006-05-19 6.242 5.774
2006-05-20 5.585 5.159
2006-05-21 5.797 5.358
2006-05-22 6.384 5.906
2006-05-23 6.782 6.269
2006-05-24 6.535 6.014
2006-05-25 6.370 5.856
2006-05-26 6.225 5.782
2006-05-27 5.973 5.716
R>
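
To see what the first two strategies do without any package, here is a minimal base R sketch on the x column from the example. The helper names locf and interp are made up for illustration and are not part of zoo; interp keeps the trailing NA, which is roughly what na.approx(x, na.rm = FALSE) would do.

```r
# x column from the example above; NA marks the missing days
x <- c(6.241603, NA, NA, 6.383929, 6.782068, 6.534616, 6.370312, 6.225175, NA)

# Last observation carried forward: for each position, find the index of the
# most recent non-NA value (0 if there is none yet) and look it up.
locf <- function(v) {
  idx <- cummax(seq_along(v) * !is.na(v))
  idx[idx == 0] <- NA          # leading NAs have nothing to carry forward
  v[idx]
}

# Linear interpolation between the observed points; positions outside the
# observed range (such as the trailing NA) stay NA.
interp <- function(v) {
  ok <- !is.na(v)
  approx(which(ok), v[ok], xout = seq_along(v))$y
}

round(locf(x), 3)
round(interp(x), 3)
```

The zoo functions remain the better choice for real data, since they work on the date index rather than on row positions.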

Replace NA with the nearest value based on another variable, while keeping NA for observations that have no non-missing neighbour

One option is to use case_when from the tidyverse. For each row where x is NA: if the previous row is closer in year and its x is not NA, take x from the previous row; if the next row is closer in year and its x is not NA, take x from the next row. If the previous row's x is also NA, fall back to the next row, and if the next row's x is also NA, fall back to the previous row. Rows where x is not NA are left unchanged.

library(tidyverse)

dat %>%
  mutate(x = case_when(is.na(x) & !is.na(lag(x)) & year - lag(year) < lead(year) - year ~ lag(x),
                       is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
                       is.na(x) & is.na(lag(x)) ~ lead(x),
                       is.na(x) & is.na(lead(x)) ~ lag(x),
                       TRUE ~ x))
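
The same nearest-neighbour rule can be sketched in base R. This is only an illustration under assumptions that are mine, not the answer's: the helper name fill_nearest is hypothetical, the rows are already sorted by year, year itself has no NA, and ties in year distance go to the previous row.

```r
fill_nearest <- function(x, year) {
  n <- length(x)
  prev_x <- c(NA, x[-n])                 # value in the row above
  next_x <- c(x[-1], NA)                 # value in the row below
  prev_gap <- year - c(NA, year[-n])     # distance in years to the row above
  next_gap <- c(year[-1], NA) - year     # distance in years to the row below

  # A neighbour without a value is treated as infinitely far away.
  prev_gap[is.na(prev_x)] <- Inf
  next_gap[is.na(next_x)] <- Inf

  use_prev <- is.na(x) & is.finite(prev_gap) & prev_gap <= next_gap
  use_next <- is.na(x) & is.finite(next_gap) & next_gap < prev_gap

  x[use_prev] <- prev_x[use_prev]
  x[use_next] <- next_x[use_next]
  x                                      # rows with no usable neighbour stay NA
}

# Hypothetical toy data
fill_nearest(c(1, NA, NA, 4, NA), c(2000, 2001, 2005, 2006, 2010))
fill_nearest(c(NA, NA, NA, 5), 1:4)
```

Because both neighbour vectors are taken from the original x, a filled value is never propagated further, so an NA whose two neighbours are both NA stays NA.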

Replacing NA values with values from neighbouring rows

You can use complete together with group_by:

library(dplyr)
library(tidyr)

df %>%
  group_by(investor) %>%
  complete(dealyear = min(dealyear):max(dealyear), 
           fill = list(dealcounts = 0)) %>%
  ungroup()

#  investor dealyear dealcounts strategy region
#  <chr>       <int>      <dbl> <chr>    <chr> 
#1 123IM        2002          5 buyout   europe
#2 123IM        2003          5 buyout   europe
#3 123IM        2004          0 NA       NA    
#4 123IM        2005          5 buyout   europe
#5 123IM        2006          5 buyout   europe

If you also want to replace the NA values in the strategy and region columns, you can use fill:

df %>%
  group_by(investor) %>%
  complete(dealyear = min(dealyear):max(dealyear), 
           fill = list(dealcounts = 0)) %>%
  fill(strategy, region) %>%
  ungroup()

#  investor dealyear dealcounts strategy region
#  <chr>       <int>      <dbl> <chr>    <chr> 
#1 123IM        2002          5 buyout   europe
#2 123IM        2003          5 buyout   europe
#3 123IM        2004          0 buyout   europe
#4 123IM        2005          5 buyout   europe
#5 123IM        2006          5 buyout   europe
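
The input df is not shown above. As a rough illustration of what the complete step does, the base R sketch below builds a hypothetical single-investor input with the year 2004 missing and adds the missing year with merge. The data values are assumptions chosen to resemble the printed output, and the sketch handles one investor only, whereas group_by plus complete handles each investor separately.

```r
# Hypothetical input: one investor, no row for 2004
df <- data.frame(investor   = "123IM",
                 dealyear   = c(2002L, 2003L, 2005L, 2006L),
                 dealcounts = 5,
                 strategy   = "buyout",
                 region     = "europe",
                 stringsAsFactors = FALSE)

# Every year between the first and last deal year
full <- data.frame(investor = "123IM",
                   dealyear = seq(min(df$dealyear), max(df$dealyear)),
                   stringsAsFactors = FALSE)

# A left join on investor and dealyear adds the missing year with NA columns
out <- merge(full, df, all.x = TRUE)

# Equivalent of fill = list(dealcounts = 0)
out$dealcounts[is.na(out$dealcounts)] <- 0
out
```

At this point strategy and region are still NA for 2004, which is the gap that the fill step closes.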

Replace a value NA with the value from another column in R

Perhaps the answer that is easiest to read and understand in plain R is to use ifelse. Borrowing Richard's data frame, we could do:

df <- structure(list(A = c(56L, NA, NA, 67L, NA),
                     B = c(75L, 45L, 77L, 41L, 65L),
                     Year = c(1921L, 1921L, 1922L, 1923L, 1923L)),
                .Names = c("A", "B", "Year"),
                class = "data.frame", row.names = c(NA, -5L))
df$A <- ifelse(is.na(df$A), df$B, df$A)
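
One caveat worth knowing: ifelse drops attributes, so a Date or factor column would come back as plain numbers. An alternative sketch that only touches the NA positions is shown below; the variable name missing_a is just for illustration.

```r
df <- data.frame(A = c(56L, NA, NA, 67L, NA),
                 B = c(75L, 45L, 77L, 41L, 65L))

# Overwrite only the NA positions of A with the values of B in the same rows,
# so the class and attributes of A are kept.
missing_a <- is.na(df$A)
df$A[missing_a] <- df$B[missing_a]
df$A
```

For the integer columns in this example both approaches give the same values.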

Replace NA in a POSIXct series by adjacent values

I believe this solves your problem:

library(tidyr)

# 1 where begin (or end) is NA, 0 otherwise
na_inds_begin <- as.numeric(is.na(df$begin))
na_inds_end <- as.numeric(is.na(df$end))

# a positive step marks the start of a run of NAs in begin;
# a negative step marks the last NA of a run in end
na_diffs_lead <- c(0, diff(na_inds_begin))
na_diffs_lag <- c(diff(na_inds_end), 0)

first_nas <- na_inds_begin == 1 & na_diffs_lead > 0
first_nas[1] <- na_inds_begin[1] == 1  # begin starts with an NA

last_nas <- na_inds_end == 1 & na_diffs_lag < 0
last_nas[length(last_nas)] <- na_inds_end[length(na_inds_end)] == 1  # end finishes with an NA

# seed each run with the row's own date_time
df$begin[first_nas] <- df$date_time[first_nas]
df$end[last_nas] <- df$date_time[last_nas]

df <-
  df %>%
  fill(begin, .direction = "down") %>%
  fill(end, .direction = "up")

First, we find the first NA in each run of NAs in begin, and the last NA in each run of NAs in end. We also need to handle the cases where the first element of begin or the last element of end is NA. Then we replace only those elements with the date_time value from the same row. Finally, we fill the rest of each run downward for begin and upward for end.

This is the result:

> df
# A tibble: 5 x 4
  individ_id  date_time           begin               end                
  <chr>       <dttm>              <dttm>              <dttm>             
1 NOS_4214433 2017-11-22 09:01:49 2017-11-21 11:54:59 2017-11-22 09:07:27
2 NOS_4214433 2017-11-22 09:06:49 2017-11-21 11:54:59 2017-11-22 09:07:27
3 NOS_4214433 2017-11-22 09:11:49 2017-11-22 09:11:49 2017-11-22 09:16:49
4 NOS_4214433 2017-11-22 09:16:49 2017-11-22 09:11:49 2017-11-22 09:16:49
5 NOS_4214433 2018-01-24 12:12:18 2018-01-24 12:08:28 2018-01-25 09:33:10

Edit: I updated the example code to be robust to the case where begin and end have different NA indices or the first/last elements are NA.
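
The run-boundary detection used in this answer can be seen in isolation on a plain logical vector. The sketch below is an illustrative base R variant with made-up names (is_na, first_of_run, last_of_run), not the answer's exact code; it compares each element with its neighbour instead of using diff.

```r
# TRUE where a value is missing (hypothetical pattern with two runs of NAs)
is_na <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
n <- length(is_na)

# First NA of a run: missing here, but not missing in the row above.
# Padding with FALSE makes a leading NA count as the start of a run.
first_of_run <- is_na & !c(FALSE, is_na[-n])

# Last NA of a run: missing here, but not missing in the row below.
# Padding with FALSE makes a trailing NA count as the end of a run.
last_of_run <- is_na & !c(is_na[-1], FALSE)

which(first_of_run)
which(last_of_run)
```

Padding the shifted vector takes care of the first and last elements, which the answer handles with explicit assignments.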

Replace missing values (NA) with most recent non-NA by group

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))
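
If zoo is not available, the ave approach also works with a small hand-written helper. The following is a minimal sketch, assuming the same input data as the other examples; locf is a hypothetical stand-in for na.locf0 and only handles a single plain vector.

```r
# Same data as the input used for the other examples
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100, NA, 120, NA,  NA, NA, NA, 30, NA,  NA, 44, NA, NA, NA)
)

# Carry the last non-NA value forward; leading NAs are left as NA
locf <- function(v) {
  idx <- cummax(seq_along(v) * !is.na(v))
  idx[idx == 0] <- NA
  v[idx]
}

# ave applies locf within each houseID, so values never leak between houses
df$price <- ave(df$price, df$houseID, FUN = locf)
df
```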

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo

This solution uses zoo alone. It returns a wide rather than a long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
  2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L, 
  1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L, 
  1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 
  30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year", 
  "price"), class = "data.frame", row.names = c(NA, -15L))

REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.

Replace NA in column with value in adjacent column

It didn't work because STATUS was a factor. When you assign factor values into a numeric column, R stores the factor's underlying integer codes rather than its labels. Converting STATUS to character first gives the result you're after, and UNIT then becomes a character vector:

TEST$UNIT[is.na(TEST$UNIT)] <- as.character(TEST$STATUS[is.na(TEST$UNIT)])

##        UNIT   STATUS TERMINATED      START       STOP
## 1    ACTIVE   ACTIVE 1999-07-06 2007-04-23 2008-12-05
## 2  INACTIVE INACTIVE 2008-12-05 2008-12-06 4712-12-31
## 3       200   ACTIVE 2000-08-18 2004-06-01 2007-01-31
## 4       200   ACTIVE 2000-08-18 2007-02-01 2008-04-18
## 5       200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
## 6       200   ACTIVE 2008-08-18 2010-11-29 2010-12-29
## 7       200 INACTIVE 2008-08-18 2010-12-30 4712-12-31
## 8       300   ACTIVE 2006-09-19 2007-10-29 2008-02-04
## 9       300   ACTIVE 2006-09-19 2008-02-05 2008-06-29
## 10      300   ACTIVE 2006-09-19 2008-06-30 2009-02-06
## 11      300 INACTIVE 1999-03-15 2009-02-07 4712-12-31
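
The pitfall is easy to reproduce on a small scale. The sketch below uses hypothetical vectors unit and status in place of the TEST columns.

```r
status <- factor(c("ACTIVE", "INACTIVE", "ACTIVE"))
unit   <- c(NA, NA, 200)
gaps   <- is.na(unit)

# Assigning the factor directly stores its integer codes (1 and 2),
# not the labels.
wrong <- unit
wrong[gaps] <- status[gaps]
wrong

# Converting to character first stores the labels; the whole vector
# becomes character, so 200 turns into "200".
right <- unit
right[gaps] <- as.character(status[gaps])
right
```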
