How to Replace NA (Missing Values) in a Data Frame with Neighbouring Values

How to replace NA (missing values) in a data frame with neighbouring values

Properly formatted, your data looks like this:

862 2006-05-19 6.241603 5.774208 
863 2006-05-20 NA NA
864 2006-05-21 NA NA
865 2006-05-22 6.383929 5.906426
866 2006-05-23 6.782068 6.268758
867 2006-05-24 6.534616 6.013767
868 2006-05-25 6.370312 5.856366
869 2006-05-26 6.225175 5.781617
870 2006-05-27 NA NA

and is of a time-series nature, so I would load it into an object of class zoo (from the zoo package), which lets you choose among several strategies (see below). Which one you pick depends on the nature of your data and your application. In general, the field of 'figuring missing data out' is called data imputation, and there is a rather large literature on it.

R> library(zoo)
R> x <- zoo(X[,3:4], order.by=as.Date(X[,2]))
R> x
x y
2006-05-19 6.242 5.774
2006-05-20 NA NA
2006-05-21 NA NA
2006-05-22 6.384 5.906
2006-05-23 6.782 6.269
2006-05-24 6.535 6.014
2006-05-25 6.370 5.856
2006-05-26 6.225 5.782
2006-05-27 NA NA
R> na.locf(x) # last observation carried forward
x y
2006-05-19 6.242 5.774
2006-05-20 6.242 5.774
2006-05-21 6.242 5.774
2006-05-22 6.384 5.906
2006-05-23 6.782 6.269
2006-05-24 6.535 6.014
2006-05-25 6.370 5.856
2006-05-26 6.225 5.782
2006-05-27 6.225 5.782
R> na.approx(x) # linear interpolation between before/after values; the trailing NA cannot be interpolated and is dropped
x y
2006-05-19 6.242 5.774
2006-05-20 6.289 5.818
2006-05-21 6.336 5.862
2006-05-22 6.384 5.906
2006-05-23 6.782 6.269
2006-05-24 6.535 6.014
2006-05-25 6.370 5.856
2006-05-26 6.225 5.782
R> na.spline(x) # spline fit ...
x y
2006-05-19 6.242 5.774
2006-05-20 5.585 5.159
2006-05-21 5.797 5.358
2006-05-22 6.384 5.906
2006-05-23 6.782 6.269
2006-05-24 6.535 6.014
2006-05-25 6.370 5.856
2006-05-26 6.225 5.782
2006-05-27 5.973 5.716
R>
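The defaults shown above are not the only choices. As a minimal sketch, assuming a hypothetical data frame X laid out like the listing at the top (the column names id, date, x and y are made up for illustration), the following shows how the trailing NA can be kept or filled rather than dropped, and how values can be carried backward instead of forward:

```r
library(zoo)

# Hypothetical input mirroring the listing above; column names are assumptions.
X <- data.frame(
  id   = 862:870,
  date = as.character(seq(as.Date("2006-05-19"), by = "day", length.out = 9)),
  x    = c(6.241603, NA, NA, 6.383929, 6.782068, 6.534616, 6.370312, 6.225175, NA),
  y    = c(5.774208, NA, NA, 5.906426, 6.268758, 6.013767, 5.856366, 5.781617, NA)
)
z <- zoo(as.matrix(X[, 3:4]), order.by = as.Date(X[, 2]))

# Next observation carried backward; na.rm = FALSE keeps rows that stay NA.
filled_back <- na.locf(z, fromLast = TRUE, na.rm = FALSE)

# Linear interpolation that keeps the trailing NA instead of dropping the row.
approx_keep <- na.approx(z, na.rm = FALSE)

# Linear interpolation that extends the last known value to the trailing NA
# (rule = 2 is passed on to stats::approx).
approx_ext <- na.approx(z, rule = 2)
```

Which variant is appropriate depends on whether extrapolating past the last observation makes sense for the data at hand.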

Replace NA with the nearest value based on another variable, while keeping NA for observations that have no non-missing neighbour

One option is to use case_when from dplyr (part of the tidyverse). If the previous row is closer in year and is not NA, return x from that row; if the next row is closer and is not NA, return x from that row instead. If the previous row is NA, fall back to the next row, and if the next row is NA, fall back to the previous row. If a row is not NA, just return x.

library(tidyverse)

dat %>%
  mutate(x = case_when(
    is.na(x) & !is.na(lag(x)) & year - lag(year) < lead(year) - year ~ lag(x),
    is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
    is.na(x) & is.na(lag(x)) ~ lead(x),
    is.na(x) & is.na(lead(x)) ~ lag(x),
    TRUE ~ x
  ))

Output

   year  x
1 2000 1
2 2001 2
3 2002 3
4 2003 3
5 2005 5
6 2006 5
7 2007 NA
8 2008 9
9 2009 9
10 2010 10
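The input is not shown in this answer. As a sketch, assuming a hypothetical dat in which 2004 is absent and x is missing for 2003 and 2006 to 2008 (values chosen to be consistent with the output above), the rules can be reproduced end to end:

```r
library(dplyr)

# Hypothetical input, reconstructed so that it is consistent with the output
# shown above; the original question's data may have differed.
dat <- data.frame(
  year = c(2000, 2001, 2002, 2003, 2005, 2006, 2007, 2008, 2009, 2010),
  x    = c(1, 2, 3, NA, 5, NA, NA, NA, 9, 10)
)

res <- dat %>%
  mutate(x = case_when(
    # previous row is strictly closer in year and not NA
    is.na(x) & !is.na(lag(x)) & year - lag(year) < lead(year) - year ~ lag(x),
    # next row is strictly closer in year and not NA
    is.na(x) & !is.na(lead(x)) & year - lag(year) > lead(year) - year ~ lead(x),
    # previous row is NA: take the next row (which may itself be NA)
    is.na(x) & is.na(lag(x)) ~ lead(x),
    # next row is NA: take the previous row
    is.na(x) & is.na(lead(x)) ~ lag(x),
    TRUE ~ x
  ))
```

Note that lag(x) and lead(x) refer to the original column, so 2007 stays NA because both of its neighbours were missing in the input. A row whose two neighbours are equally distant and both non-missing matches none of the first four rules and therefore also stays NA.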

Replacing NA values with values from neighbouring rows

You can use complete together with group_by:

library(dplyr)
library(tidyr)

df %>%
  group_by(investor) %>%
  complete(dealyear = min(dealyear):max(dealyear),
           fill = list(dealcounts = 0)) %>%
  ungroup()

# investor dealyear dealcounts strategy region
# <chr> <int> <dbl> <chr> <chr>
#1 123IM 2002 5 buyout europe
#2 123IM 2003 5 buyout europe
#3 123IM 2004 0 NA NA
#4 123IM 2005 5 buyout europe
#5 123IM 2006 5 buyout europe

If you also want to replace the NAs in the strategy and region columns, use fill.

df %>%
  group_by(investor) %>%
  complete(dealyear = min(dealyear):max(dealyear),
           fill = list(dealcounts = 0)) %>%
  fill(strategy, region) %>%
  ungroup()

# investor dealyear dealcounts strategy region
# <chr> <int> <dbl> <chr> <chr>
#1 123IM 2002 5 buyout europe
#2 123IM 2003 5 buyout europe
#3 123IM 2004 0 buyout europe
#4 123IM 2005 5 buyout europe
#5 123IM 2006 5 buyout europe
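The input data frame is not shown in this answer. A self-contained sketch, assuming a hypothetical df with a single investor and no row for 2004, looks like this:

```r
library(dplyr)
library(tidyr)

# Hypothetical input: one investor, with the year 2004 missing entirely.
df <- data.frame(
  investor   = "123IM",
  dealyear   = c(2002L, 2003L, 2005L, 2006L),
  dealcounts = 5,
  strategy   = "buyout",
  region     = "europe"
)

res <- df %>%
  group_by(investor) %>%
  # add one row per missing year; the new rows get dealcounts = 0
  complete(dealyear = min(dealyear):max(dealyear),
           fill = list(dealcounts = 0)) %>%
  # carry the last non-missing strategy/region down into the new rows
  fill(strategy, region) %>%
  ungroup()
```

Because the grouping is by investor, both the year range and the fill are computed separately for each investor, so values never leak from one investor to another.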

Replace a value NA with the value from another column in R

Perhaps the easiest answer to read and understand in the R lexicon is ifelse. Borrowing Richard's data frame, we could do:

df <- structure(list(A = c(56L, NA, NA, 67L, NA),
B = c(75L, 45L, 77L, 41L, 65L),
Year = c(1921L, 1921L, 1922L, 1923L, 1923L)),.Names = c("A",
"B", "Year"), class = "data.frame", row.names = c(NA, -5L))
df$A <- ifelse(is.na(df$A), df$B, df$A)
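An alternative to ifelse, sketched here in base R, is logical indexing: assign into only the missing positions of A. This is a different technique from the one above, offered as an option because ifelse does not keep attributes such as a Date class, whereas subset assignment leaves the column's type alone.

```r
df <- data.frame(
  A    = c(56L, NA, NA, 67L, NA),
  B    = c(75L, 45L, 77L, 41L, 65L),
  Year = c(1921L, 1921L, 1922L, 1923L, 1923L)
)

# Positions where A is missing
missing_A <- is.na(df$A)

# Overwrite only those positions with the corresponding values of B
df$A[missing_A] <- df$B[missing_A]
```

For this data both approaches give the same column A: 56, 45, 77, 67, 65.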

Replace NA in a POSIXct series by adjacent values

I believe this solves your problem:

library(tidyr)

na_inds_begin <- as.numeric((is.na(df$begin)))
na_inds_end <- as.numeric((is.na(df$end)))

na_diffs_lead <- c(0, diff(na_inds_begin))
na_diffs_lag <- c(diff(na_inds_end), 0)

first_nas <- na_inds_begin == 1 & na_diffs_lead > 0
first_nas[1] <- na_inds_begin[1] == 1

last_nas <- na_inds_end == 1 & na_diffs_lag < 0
last_nas[length(last_nas)] <- na_inds_end[length(na_inds_end)] == 1

df$begin[first_nas] <- df$date_time[first_nas]
df$end[last_nas] <- df$date_time[last_nas]

df <-
  df %>%
  fill(begin, .direction = "down") %>%
  fill(end, .direction = "up")

First, we find the first NA in each group of NAs in begin, and the last NA in each group of NAs in end. We also need to handle cases where the first element in begin or the last element in end are NA. Then we replace only those elements with the desired replacements. Finally, we fill the rest of each group downward for begin and upward for end.

This is the result:

> df
# A tibble: 5 x 4
individ_id date_time begin end
<chr> <dttm> <dttm> <dttm>
1 NOS_4214433 2017-11-22 09:01:49 2017-11-21 11:54:59 2017-11-22 09:07:27
2 NOS_4214433 2017-11-22 09:06:49 2017-11-21 11:54:59 2017-11-22 09:07:27
3 NOS_4214433 2017-11-22 09:11:49 2017-11-22 09:11:49 2017-11-22 09:16:49
4 NOS_4214433 2017-11-22 09:16:49 2017-11-22 09:11:49 2017-11-22 09:16:49
5 NOS_4214433 2018-01-24 12:12:18 2018-01-24 12:08:28 2018-01-25 09:33:10

Edit: I updated the example code to be robust to the case where begin and end have different NA indices or the first/last elements are NA.
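To try the approach without the original data, here is a sketch on a hypothetical five-row input (timestamps chosen to be consistent with the result above, individ_id omitted). It restates the "first NA of each run" and "last NA of each run" tests with shifted logical vectors, which is meant to be equivalent to the diff-based version, including the first/last element edge cases.

```r
library(tidyr)

utc <- function(x) as.POSIXct(x, tz = "UTC")

# Hypothetical input: begin and end are missing in rows 3 and 4.
df <- data.frame(
  date_time = utc(c("2017-11-22 09:01:49", "2017-11-22 09:06:49",
                    "2017-11-22 09:11:49", "2017-11-22 09:16:49",
                    "2018-01-24 12:12:18")),
  begin     = utc(c("2017-11-21 11:54:59", "2017-11-21 11:54:59",
                    NA, NA, "2018-01-24 12:08:28")),
  end       = utc(c("2017-11-22 09:07:27", "2017-11-22 09:07:27",
                    NA, NA, "2018-01-25 09:33:10"))
)

na_begin <- is.na(df$begin)
na_end   <- is.na(df$end)

# First NA of each run in begin: NA here, and the previous element is not NA
first_nas <- na_begin & !c(FALSE, head(na_begin, -1))
# Last NA of each run in end: NA here, and the next element is not NA
last_nas  <- na_end & !c(tail(na_end, -1), FALSE)

# Seed the runs with date_time, then spread the seeds through each run
df$begin[first_nas] <- df$date_time[first_nas]
df$end[last_nas]    <- df$date_time[last_nas]
df <- fill(df, begin, .direction = "down")
df <- fill(df, end, .direction = "up")
```

After this, rows 3 and 4 both have begin equal to the date_time of row 3 and end equal to the date_time of row 4.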

Replace missing values (NA) with most recent non-NA by group

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows
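If you would rather not depend on zoo, a grouped tidyr::fill is another way to carry the last observation forward. This is a swapped-in alternative, not one of the na.locf solutions; the sketch below repeats the input so that it runs on its own.

```r
library(dplyr)
library(tidyr)

# Same input as used for the other examples in this answer
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L, NA, NA, NA)
)

res <- df %>%
  group_by(houseID) %>%
  fill(price, .direction = "down") %>%  # last observation carried forward within each house
  ungroup()
```

As with na.locf(x, na.rm = FALSE), leading NAs in each group are left as NA because there is no earlier value to carry forward.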

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]
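Recent versions of data.table also ship their own nafill function, so the zoo dependency can be dropped for numeric columns. A minimal sketch, assuming a data.table version that provides nafill, and repeating the input so that it runs on its own:

```r
library(data.table)

# Same input as used for the other examples in this answer
df <- data.frame(
  houseID = rep(1:3, each = 5),
  year    = rep(1995:1999, times = 3),
  price   = c(NA, 100L, NA, 120L, NA, NA, NA, NA, 30L, NA, NA, 44L, NA, NA, NA)
)

dt <- as.data.table(df)

# Last observation carried forward within each house, updating price by reference
dt[, price := nafill(price, type = "locf"), by = houseID]
```

Unlike the na.locf2(.SD) version, this touches only the price column rather than every column in the group.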

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))

REVISED: Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.

Replace NA in column with value in adjacent column

It didn't work because STATUS was a factor. When a factor is assigned into a numeric column, R uses the factor's underlying integer codes rather than its labels. Converting STATUS to character first gives the result you're after, and UNIT then becomes a character vector:

TEST$UNIT[is.na(TEST$UNIT)] <- as.character(TEST$STATUS[is.na(TEST$UNIT)])

## UNIT STATUS TERMINATED START STOP
## 1 ACTIVE ACTIVE 1999-07-06 2007-04-23 2008-12-05
## 2 INACTIVE INACTIVE 2008-12-05 2008-12-06 4712-12-31
## 3 200 ACTIVE 2000-08-18 2004-06-01 2007-01-31
## 4 200 ACTIVE 2000-08-18 2007-02-01 2008-04-18
## 5 200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
## 6 200 ACTIVE 2008-08-18 2010-11-29 2010-12-29
## 7 200 INACTIVE 2008-08-18 2010-12-30 4712-12-31
## 8 300 ACTIVE 2006-09-19 2007-10-29 2008-02-04
## 9 300 ACTIVE 2006-09-19 2008-02-05 2008-06-29
## 10 300 ACTIVE 2006-09-19 2008-06-30 2009-02-06
## 11 300 INACTIVE 1999-03-15 2009-02-07 4712-12-31
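The difference is easy to see on a small case. The sketch below uses a hypothetical three-row TEST (not the data from the question) and compares assigning the factor directly with assigning its character labels:

```r
# Hypothetical miniature of the data: UNIT is numeric, STATUS is a factor.
TEST <- data.frame(
  UNIT   = c(NA, NA, 200),
  STATUS = factor(c("ACTIVE", "INACTIVE", "ACTIVE"))
)
missing_unit <- is.na(TEST$UNIT)

# Assigning the factor directly stores its integer codes, not its labels.
bad <- TEST$UNIT
bad[missing_unit] <- TEST$STATUS[missing_unit]

# Converting to character first stores the labels; the whole vector
# becomes character as a result.
good <- TEST$UNIT
good[missing_unit] <- as.character(TEST$STATUS[missing_unit])
```

Here bad holds the level codes 1 and 2 in the first two positions, while good holds "ACTIVE", "INACTIVE" and "200".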

