Replace NA Value with the Group Value

Replace NA with previous or next value, by group, using dplyr


library(dplyr) # group_by() and %>%
library(tidyr) # fill() is part of tidyr

ps1 %>%
  group_by(userID) %>%
  # fill(color, age, gender) %>% # default direction is "down"
  fill(color, age, gender, .direction = "downup")

Which gives you:

Source: local data frame [9 x 4]
Groups: userID [3]

userID color age gender
<dbl> <fctr> <fctr> <fctr>
1 21 blue 3yrs F
2 21 blue 2yrs F
3 21 red 2yrs M
4 22 blue 3yrs F
5 22 blue 3yrs F
6 22 blue 3yrs F
7 23 red 4yrs F
8 23 red 4yrs F
9 23 gold 4yrs F

Replace NA with minimum Group Value R

I changed the column name to 'Year' because the original name caused a collision:

dat %>% 
dplyr::group_by(name) %>%
dplyr::mutate(start = dplyr::if_else(start == "na", min(Year), start))
# A tibble: 8 x 3
# Groups: name [3]
Year name start
<chr> <chr> <chr>
1 2010 Emma 1998
2 2011 Emma 1998
3 2012 Emma 1998
4 2009 John 2009
5 2010 John 2009
6 2012 John 2009
7 2007 Louis 2007
8 2012 Louis 2007
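
The answer above compares and fills character columns, so min(Year) is a string minimum. That is fine for four-digit years but fragile in general. Here is a minimal base R sketch of the same idea with integer columns. The input dat is hypothetical (built to be consistent with the output shown, not taken from the original question), and it assumes missing starts are stored as the literal string "na".

```r
# Hypothetical input: 'start' holds the string "na" where the value is missing
dat <- data.frame(
  Year  = c("2010", "2011", "2012", "2009", "2010", "2012", "2007", "2012"),
  name  = c("Emma", "Emma", "Emma", "John", "John", "John", "Louis", "Louis"),
  start = c("1998", "1998", "1998", "na", "na", "na", "na", "na"),
  stringsAsFactors = FALSE
)

# Convert to integers; "na" cannot be parsed and becomes a real NA
year  <- as.integer(dat$Year)
start <- suppressWarnings(as.integer(dat$start))

# Minimum year within each name, repeated for every row of that group
group_min <- ave(year, dat$name, FUN = min)

# Keep an existing start, otherwise fall back to the group's minimum year
dat$start <- ifelse(is.na(start), group_min, start)
dat
```

Working with integers means the minimum is numeric rather than alphabetical, and later code can test for missing values with is.na() instead of matching a string.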

Replace NAs for a group of values with a non-NA character in group in R

Here is an alternative way using na.locf from zoo package:

library(zoo)
library(dplyr)
df %>%
group_by(participant_id) %>%
arrange(participant_id, test) %>%
mutate(test = zoo::na.locf(test, na.rm=FALSE))
   participant_id test 
<chr> <chr>
1 ps1 test1
2 ps1 test1
3 ps1 test1
4 ps1 test1
5 ps2 test2
6 ps2 test2
7 ps3 test3
8 ps3 test3
9 ps3 test3
10 ps3 test3

Replace NA value with the group value

Try ave. It applies a function to groups. Have a look at ?ave for details, e.g.:

df$med_card_new <- ave(df$med_card, df$hhold_no, FUN=function(x)unique(x[!is.na(x)]))

# person_id hhold_no med_card med_card_new
#1 1 1 1 1
#2 2 1 1 1
#3 3 1 NA 1
#4 4 1 NA 1
#5 5 1 NA 1
#6 6 2 0 0
#7 7 2 0 0
#8 8 2 0 0
#9 9 2 0 0

Please note that this only works if every household has at least one non-NA value and the non-NA values within a household do not differ (it fails, for example, if person 1 has 1 and person 2 has 0 in the same household).
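
One way to make the ave approach tolerate those two cases is to fill only when a household has exactly one distinct non-NA value, and to leave the group untouched otherwise. The following is a sketch under that assumption; fill_group_value and the all-NA household 3 are illustrative additions, not part of the original answer.

```r
# Hypothetical data mirroring the example above, plus household 3 which is entirely NA
df <- data.frame(
  person_id = 1:11,
  hhold_no  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  med_card  = c(1, 1, NA, NA, NA, 0, 0, 0, 0, NA, NA)
)

# Fill NAs with the group's single non-NA value; return the group unchanged
# when there is no non-NA value or when the non-NA values disagree
fill_group_value <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 1) x[is.na(x)] <- vals
  x
}

df$med_card_new <- ave(df$med_card, df$hhold_no, FUN = fill_group_value)
df
```

Because the helper always returns a vector of the same length as its input, ave can write the result back in place, and the ambiguous households simply keep their NAs for later inspection.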

Replacing NA value in dataframe by first or last value of other columns within group

An option is to use replace_na (from tidyr) after grouping by 'Group' to replace the NA elements with either the first or the last value of the 'Dates' column:

library(dplyr)
library(tidyr)
df1 %>%
group_by(Group) %>%
mutate(Departure = replace_na(Departure, first(Dates)),
Arrival = replace_na(Arrival, last(Dates))) %>%
ungroup

NOTE: Here we assume that 'Dates' are already ordered. If not, take the min and max after converting to Date class

library(lubridate)
df1 %>%
mutate(across(-Group, dmy)) %>%
group_by(Group) %>%
mutate(Departure = replace_na(Departure, min(Dates)),
Arrival = replace_na(Arrival, max(Dates))) %>%
ungroup

Replace NA with sample() by group

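We can write a function that fills NA values by random sampling. If a group has more than one row and at least one non-NA value, each NA is replaced by a value drawn at random from that group's non-NA values. Otherwise (a single-row group, or a group that is entirely NA) the value is drawn from the non-NA values of the entire column. We apply the function to both columns, LENGTH and NUMPOINTS.

replace_func <- function(x, y) {
  inds <- is.na(x)
  if (!any(inds)) return(x)
  # Draw from the group's own non-NA values when it has any, else from the whole column y
  pool <- if (length(x) > 1 && any(!inds)) x[!inds] else y[!is.na(y)]
  # Sample by index, with replacement: works when the group has more NAs than
  # non-NA values and avoids sample() treating a single number n as 1:n
  x[inds] <- pool[sample.int(length(pool), sum(inds), replace = TRUE)]
  x
}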

library(dplyr)

exam %>%
group_by(Outlet) %>%
mutate(nuLENGTH = replace_func(LENGTH, .$LENGTH),
nuNUMPOINTS = replace_func(NUMPOINTS, .$NUMPOINTS))

# LENGTH NUMPOINTS CTRY_ Outlet nuLENGTH nuNUMPOINTS
# <dbl> <dbl> <fct> <fct> <dbl> <dbl>
# 1 NA NA WCY_____ES ESO0244476 7046. 19
# 2 45447. 0 WCY_____FR DSO0213034 45447. 0
# 3 161038. 8 WCY_____FR DSO0213034 161038. 8
# 4 78148. 5 WCY_____FR DSO0213034 78148. 5
# 5 7193. 2 WCY_____FR DSO0045543 7193. 2
# 6 1572. 0 WCY_____FR DSO0045543 1572. 0
# 7 18191. 9 WCY_____FR DSO0045543 18191. 9
# 8 20366. 0 WCY_____FR DSO0045543 20366. 0
# 9 10015. 0 WCY_____FR DSO0045543 10015. 0
#10 1404. 0 WCY_____FR DSO0045543 1404. 0
# … with 32 more rows

Replace missing values (NA) with most recent non-NA by group

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))

REVISED Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.

R replace NA with last value for group ID ordered by date

You can use zoo::na.locf along with dplyr verbs. Make sure you group_by customer id first, and ensure your times are ordered correctly.

library(dplyr)

eg_data %>%
group_by(custID) %>%
arrange(custID, year, quarter) %>%
mutate(orderType = zoo::na.locf(orderType, na.rm = FALSE)) # na.rm = FALSE keeps leading NAs, so the length matches the group

#> # A tibble: 32 x 4
#> # Groups: custID [2]
#> custID year quarter orderType
#> <chr> <chr> <chr> <chr>
#> 1 655321 2018 1 retail
#> 2 655321 2018 2 retail
#> 3 655321 2018 3 wholesale
#> 4 655321 2018 4 wholesale
#> 5 655321 2019 1 commercial
#> 6 655321 2019 2 retail
#> 7 655321 2019 3 retail
#> 8 655321 2019 4 retail
#> 9 655321 2020 1 retail
#> 10 655321 2020 2 wholesale
#> # ... with 22 more rows

How to replace NA with most recent non-NA by group?

As another base R solution, here is a poor man's na.locf

fill_down <- function(v) {
if (length(v) > 1) {
keep <- c(TRUE, !is.na(v[-1]))
v[keep][cumsum(keep)]
} else v
}

To fill down by group, the approach is to use tapply() to split the vector and apply fill_down() to each group, then split<- to put the results back in their original positions. The built-in ave() does exactly this:

fill_down_by_group <- function(v, grp) {
## original 'by hand':
## split(v, grp) <- tapply(v, grp, fill_down)
## v
## done by built-in function `ave()`
ave(v, grp, FUN=fill_down)
}

To process multiple columns, one might

elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)
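
As a quick check of the pieces above, here is a small self-contained run. The data frame is hypothetical (the original question's data is not shown here), and the two functions are repeated in compact form so the sketch runs on its own.

```r
# fill_down() and fill_down_by_group() as defined above, repeated for self-containment
fill_down <- function(v) {
  if (length(v) > 1) {
    keep <- c(TRUE, !is.na(v[-1]))   # always keep the first element, then every non-NA
    v[keep][cumsum(keep)]            # repeat each kept value until the next kept one
  } else v
}
fill_down_by_group <- function(v, grp) ave(v, grp, FUN = fill_down)

# Hypothetical input: attributes are recorded on only some rows per person
df <- data.frame(
  name       = c("Ann", "Ann", "Ann", "Bob", "Bob"),
  age        = c(31, NA, NA, NA, 45),
  birthplace = c("Oslo", NA, NA, NA, "Lima"),
  stringsAsFactors = FALSE
)

elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)
df
```

Ann's values are carried down to her later rows, while Bob's first row stays NA: a fill-down never looks ahead, and it never borrows a value from the previous group.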

Notes

  1. I was curious how a dplyr solution handles many columns without hard-coding each one. Answering my own question: fill_() did this originally but is now deprecated in tidyr, so select the columns with all_of() instead

    library(dplyr); library(tidyr)
    df %>% group_by(name) %>% fill(all_of(elts))
  2. A more efficient base solution when the groups are already 'grouped' (e.g., identical(grp, sort(grp))) is

    fill_down_by_grouped <- function(v, grp) {
    if (length(v) > 1) {
    ## keep the first row of every group (even if NA) and every non-NA value
    keep <- !(duplicated(grp) & is.na(v))
    v[keep][cumsum(keep)]
    } else v
    }
  3. For me, fill_down() on a vector with about 10M elements takes ~225ms; fill_down_by_grouped() takes ~300ms independent of the number of groups; fill_down_by_group() scales with the number of groups; for 10000 groups ~2s, 10M groups about 36s

Replace NaN with existing value of the group

You can groupby and transform with max then fillna:

df['Value'] = df['Value'].fillna(df.groupby("Name")['Value'].transform('max'))
print(df)


  Name  Value
0 AA 33.0
1 AA 24.0
2 BB 23.0
3 BB 23.0
4 CC 45.0
5 CC 23.0
6 CC 45.0

