Replace NA with previous or next value, by group, using dplyr
library(dplyr) # %>% and group_by
library(tidyr) # fill is part of tidyr
ps1 %>%
  group_by(userID) %>%
  # fill(color, age, gender) %>% # default direction is "down"
  fill(color, age, gender, .direction = "downup")
Which gives you:
Source: local data frame [9 x 4]
Groups: userID [3]
userID color age gender
<dbl> <fctr> <fctr> <fctr>
1 21 blue 3yrs F
2 21 blue 2yrs F
3 21 red 2yrs M
4 22 blue 3yrs F
5 22 blue 3yrs F
6 22 blue 3yrs F
7 23 red 4yrs F
8 23 red 4yrs F
9 23 gold 4yrs F
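To make the effect of .direction easier to see, here is a minimal sketch on a hypothetical toy tibble (toy and its values are made up for illustration and are not the ps1 data above). It compares the default "down" fill with "downup" inside each group.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two users, each with gaps in `color`.
toy <- tibble(
  userID = c(21, 21, 21, 22, 22, 22),
  color  = c(NA, "blue", NA, NA, "red", NA)
)

# Default direction ("down"): a leading NA in a group has nothing above it
# inside that group, so it stays NA. Values never leak across groups.
down <- toy %>%
  group_by(userID) %>%
  fill(color) %>%
  ungroup()

# "downup": fill downwards first, then fill whatever is left upwards.
downup <- toy %>%
  group_by(userID) %>%
  fill(color, .direction = "downup") %>%
  ungroup()

down$color    # NA "blue" "blue" NA "red" "red"
downup$color  # "blue" "blue" "blue" "red" "red" "red"
```

With "down" alone, the first row of each user stays NA because there is no earlier value in that group to carry forward.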
Fill missing values with previous values by row using dplyr
One solution is to use the na.locf function from the zoo package, combined with the pmap function from purrr for the row-wise operation. na.locf takes the most recent non-NA value and replaces all the subsequent NA values with it. As a reminder, c(...) in both solutions captures all the values of V1:V4 in each row on every iteration. I excluded the id column in both, as it is not involved in the calculations.
library(dplyr)
library(zoo)
library(purrr)
df %>%
  mutate(pmap_df(., ~ na.locf(c(...)[-1])))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
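As a rough sketch of the same row-wise idea without pmap, one could run na.locf over the rows with base apply(). The data frame wide below is hypothetical (it is not the df from the question), and na.rm = FALSE is set so that every row keeps its full length.

```r
library(zoo)

# Hypothetical input: V1 is complete, later columns have gaps.
wide <- data.frame(
  id = c("01", "02", "03"),
  V1 = c(1, 2, 3),
  V2 = c(NA, 5, NA),
  V3 = c(NA, NA, 6),
  stringsAsFactors = FALSE
)

filled <- wide
# apply(..., 1, f) returns one column per input row, so transpose it back.
filled[-1] <- t(apply(wide[-1], 1, na.locf, na.rm = FALSE))

filled
#   id V1 V2 V3
# 1 01  1  1  1
# 2 02  2  5  5
# 3 03  3  3  6
```

This only works cleanly when the value columns share one type, because apply() converts them to a matrix first.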
Or we can use the coalesce function from dplyr. We can replace every NA value in each row with the last non-NA value, as we did earlier with na.locf. However, this solution is a bit verbose:
df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])}))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or you could also use this:
library(purrr)
df %>%
mutate(across(!id, ~ replace(., is.na(.), invoke(coalesce, rev(df[-1])))))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
The warning message can be ignored. It is produced because there are 6 NA values to fill, but applying coalesce across the columns yields one value per row, so there are only 4 replacement values for the 6 slots.
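The warning does point at a subtlety: replace() fills the NA slots positionally, starting from the first replacement value, so it lines up with the rows only when the NAs happen to sit in the leading rows. A hedged sketch that aligns by row instead, and raises no warning, is to coalesce each column with the per-row last non-NA value. The data frame gaps and the name last_non_na are hypothetical.

```r
library(dplyr)

# Hypothetical input where an NA sits in row 2 but not in row 1 (column V2).
gaps <- data.frame(
  id = c("01", "02", "03", "04"),
  V1 = c(1, 2, 3, 4),
  V2 = c(5, NA, 7, 8),
  V3 = c(NA, NA, 9, NA),
  stringsAsFactors = FALSE
)

# Last non-NA value in each row: coalesce over the columns in reverse order.
last_non_na <- do.call(coalesce, unname(rev(as.list(gaps[-1]))))

# coalesce(.x, last_non_na) works element by element, so row i is always
# filled from last_non_na[i].
filled <- gaps %>%
  mutate(across(!id, ~ coalesce(.x, last_non_na)))

last_non_na  # 5 2 9 8
filled$V2    # 5 2 7 8
filled$V3    # 5 2 9 8
```

With positional replace(), the NA in row 2 of V2 would have received the first replacement value (5) rather than the value for row 2.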
Replace missing values (NA) with most recent non-NA by group
These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except that it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the others.
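The difference between the defaults matters as soon as a vector starts with NA. A small sketch on a made-up vector (x is hypothetical, shaped like one house's prices):

```r
library(zoo)

x <- c(NA, 100, NA, 120, NA)

# Default na.rm = TRUE drops the leading NA, so the result is shorter.
shorter <- na.locf(x)

# na.rm = FALSE keeps the leading NA and therefore the original length.
same_len <- na.locf(x, na.rm = FALSE)

# na.locf0 behaves like the na.rm = FALSE call for a plain vector.
same_len0 <- na.locf0(x)

shorter    # 100 100 120 120
same_len   # NA 100 100 120 120
same_len0  # NA 100 100 120 120
```

Keeping the length is what makes the result safe to assign back to a column.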
dplyr
library(dplyr)
library(zoo)
na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
giving:
Source: local data frame [15 x 3]
Groups: houseID
houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44
A variation of this is:
df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.
Another possibility is to combine the by solution (shown further below) with dplyr:
df %>% by(df$houseID, na.locf2) %>% bind_rows
by
library(zoo)
do.call(rbind, by(df, df$houseID, na.locf2))
ave
library(zoo)
transform(df, price = ave(price, houseID, FUN = na.locf0))
data.table
library(data.table)
library(zoo)
data.table(df)[, na.locf2(.SD), by = houseID]
zoo
This solution uses zoo alone. It returns a wide rather than long result:
library(zoo)
z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)
giving:
1 2 3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44
This solution could be combined with dplyr like this:
library(dplyr)
library(zoo)
df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
input
Here is the input used for the examples above:
df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))
REVISED Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.
Replace NAs for a group of values with a non-NA character in group in R
Here is an alternative way using na.locf from the zoo package:
library(zoo)
library(dplyr)
df %>%
group_by(participant_id) %>%
arrange(participant_id, test) %>%
mutate(test = zoo::na.locf(test, na.rm=FALSE))
participant_id test
<chr> <chr>
1 ps1 test1
2 ps1 test1
3 ps1 test1
4 ps1 test1
5 ps2 test2
6 ps2 test2
7 ps3 test3
8 ps3 test3
9 ps3 test3
10 ps3 test3
replace NA value with the group value
Try ave. It applies a function to groups. Have a look at ?ave for details, e.g.:
df$med_card_new <- ave(df$med_card, df$hhold_no, FUN=function(x)unique(x[!is.na(x)]))
# person_id hhold_no med_card med_card_new
#1 1 1 1 1
#2 2 1 1 1
#3 3 1 NA 1
#4 4 1 NA 1
#5 5 1 NA 1
#6 6 2 0 0
#7 7 2 0 0
#8 8 2 0 0
#9 9 2 0 0
Please note that this will only work if not all values in a household are NA and the non-NA values within a household do not differ (e.g. person 1 == 1, person 2 == 0).
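A more defensive variant, sketched below under the assumption that an all-NA household should simply stay NA and that conflicting values should raise an error. The helper name fill_from_group and the extra household 3 are hypothetical additions for illustration.

```r
# Hypothetical data: households 1 and 2 as above, plus an all-NA household 3.
hh <- data.frame(
  person_id = 1:11,
  hhold_no  = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  med_card  = c(1, 1, NA, NA, NA, 0, 0, 0, 0, NA, NA)
)

fill_from_group <- function(x) {
  known <- unique(x[!is.na(x)])
  if (length(known) == 0) return(x)  # nothing to copy from: leave as NA
  if (length(known) > 1) stop("conflicting non-NA values within a group")
  x[is.na(x)] <- known               # copy the single known value into the gaps
  x
}

hh$med_card_new <- ave(hh$med_card, hh$hhold_no, FUN = fill_from_group)

hh$med_card_new  # 1 1 1 1 1 0 0 0 0 NA NA
```

Whether to stop, warn, or pick a value on conflicts is a design choice; stopping is just the most conservative option here.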
Replacing NA value in dataframe by first or last value of other columns within group
An option is to use replace_na (from tidyr) after grouping by 'Group' to replace the NA elements with either the first or last values of the 'Dates' column:
library(dplyr)
library(tidyr)
df1 %>%
group_by(Group) %>%
mutate(Departure = replace_na(Departure, first(Dates)),
Arrival = replace_na(Arrival, last(Dates))) %>%
ungroup
NOTE: Here we assume that 'Dates' are already ordered. If not, take the min and max after converting to Date class:
library(lubridate)
df1 %>%
mutate(across(-Group, dmy)) %>%
group_by(Group) %>%
mutate(Departure = replace_na(Departure, min(Dates)),
Arrival = replace_na(Arrival, max(Dates))) %>%
ungroup
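To illustrate why the ordering assumption matters, here is a small sketch on hypothetical data (trips and its values are invented, and the columns are already of class Date, so no dmy conversion is needed). In group A the 'Dates' column is deliberately out of order, so first(Dates) and min(Dates) disagree.

```r
library(dplyr)
library(tidyr)

trips <- tibble(
  Group     = c("A", "A", "B", "B"),
  Dates     = as.Date(c("2021-01-05", "2021-01-02", "2021-03-01", "2021-03-09")),
  Departure = as.Date(c(NA, "2021-01-01", NA, NA)),
  Arrival   = as.Date(c("2021-01-06", NA, NA, "2021-03-10"))
)

filled <- trips %>%
  group_by(Group) %>%
  mutate(
    # first(Dates) would give 2021-01-05 for group A; min() gives 2021-01-02.
    Departure = replace_na(Departure, min(Dates)),
    Arrival   = replace_na(Arrival, max(Dates))
  ) %>%
  ungroup()

filled$Departure  # "2021-01-02" "2021-01-01" "2021-03-01" "2021-03-01"
filled$Arrival    # "2021-01-06" "2021-01-05" "2021-03-09" "2021-03-10"
```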
R replace NA with last value for group ID ordered by date
You can use zoo::na.locf along with dplyr verbs. Make sure you group_by customer id first, and ensure your times are ordered correctly.
library(dplyr)
eg_data %>%
group_by(custID) %>%
arrange(custID, year, quarter) %>%
mutate(orderType = zoo::na.locf(orderType, na.rm = FALSE)) # na.rm = FALSE keeps a leading NA, so each group keeps its length
#> # A tibble: 32 x 4
#> # Groups: custID [2]
#> custID year quarter orderType
#> <chr> <chr> <chr> <chr>
#> 1 655321 2018 1 retail
#> 2 655321 2018 2 retail
#> 3 655321 2018 3 wholesale
#> 4 655321 2018 4 wholesale
#> 5 655321 2019 1 commercial
#> 6 655321 2019 2 retail
#> 7 655321 2019 3 retail
#> 8 655321 2019 4 retail
#> 9 655321 2020 1 retail
#> 10 655321 2020 2 wholesale
#> # ... with 22 more rows
Replace value with previous row value
Does this work:
library(dplyr)
library(tidyr)
df %>% mutate(DSWP10 = as.numeric(na_if(DSWP10, '.'))) %>% fill(DSWP10, .direction = 'up')
# A tibble: 7 x 2
Date DSWP10
<chr> <dbl>
1 07/01/2015 2.1
2 06/01/2015 1.99
3 05/01/2015 1.99
4 04/01/2015 1.99
5 03/01/2015 1.98
6 02/01/2015 1.95
7 01/01/2015 1.95
How to replace NA with most recent non-NA by group?
As another base R solution, here is a poor man's na.locf
fill_down <- function(v) {
if (length(v) > 1) {
keep <- c(TRUE, !is.na(v[-1]))
v[keep][cumsum(keep)]
} else v
}
To fill down by group, the approach is to use tapply() to split and apply to each group, and split<- to combine groups to the original geometry, as
fill_down_by_group <- function(v, grp) {
## original 'by hand':
## split(v, grp) <- tapply(v, grp, fill_down)
## v
## done by built-in function `ave()`
ave(v, grp, FUN=fill_down)
}
To process multiple columns, one might
elts <- c("age", "birthplace")
df[elts] <- lapply(df[elts], fill_down_by_group, df$name)
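For a concrete picture of how the keep / cumsum trick behaves, here is a self-contained sketch that repeats the two helpers and applies them to a hypothetical data frame (people and its values are made up for illustration).

```r
fill_down <- function(v) {
  if (length(v) > 1) {
    # Keep the first element and every non-NA element ...
    keep <- c(TRUE, !is.na(v[-1]))
    # ... then cumsum(keep) maps each position to the last kept element.
    v[keep][cumsum(keep)]
  } else v
}

fill_down_by_group <- function(v, grp) ave(v, grp, FUN = fill_down)

people <- data.frame(
  name       = c("a", "a", "b", "b", "b"),
  age        = c(30, NA, NA, 41, NA),
  birthplace = c("Oslo", NA, NA, "Lima", NA),
  stringsAsFactors = FALSE
)

# Trace for group "b" of age, c(NA, 41, NA):
#   keep         = TRUE TRUE FALSE
#   v[keep]      = NA 41
#   cumsum(keep) = 1 2 2   ->  NA 41 41

elts <- c("age", "birthplace")
people[elts] <- lapply(people[elts], fill_down_by_group, people$name)

people$age         # 30 30 NA 41 41
people$birthplace  # "Oslo" "Oslo" NA "Lima" "Lima"
```

The leading NA of group "b" stays NA because there is no earlier value in that group, which matches the behaviour of na.locf with na.rm = FALSE.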
Notes
I would be interested in seeing how a dplyr solution handles many columns without hard-coding each. Answering my own question, I guess this is
library(dplyr); library(tidyr)
# fill_(elts) in older tidyr; the underscore variants are deprecated
df %>% group_by(name) %>% fill(all_of(elts))
A more efficient base solution when the groups are already 'grouped' (e.g., identical(grp, sort(grp))) is
fill_down_by_grouped <- function(v, grp) {
    if (length(v) > 1) {
        ## keep the first element of each group and every non-NA value
        keep <- !(duplicated(grp) & is.na(v))
        v[keep][cumsum(keep)]
    } else v
}
For me, fill_down() on a vector with about 10M elements takes ~225ms; fill_down_by_grouped() takes ~300ms independent of the number of groups; fill_down_by_group() scales with the number of groups: ~2s for 10000 groups and about 36s for 10M groups.
Replace NA values with a fractional increase in previous values
You can use Reduce in base R:
Reduce(function(x, y) if(is.na(y)) x * 1.1 else y, df$value, accumulate = TRUE)
#[1] 10.0 15.0 20.0 22.0 24.2
If you want a tidyverse solution, use accumulate.
library(dplyr)
library(purrr)
df %>% mutate(value = accumulate(value, ~if(is.na(.y)) .x * 1.1 else .y))
# year value
# <dbl> <dbl>
#1 2011 10
#2 2012 15
#3 2013 20
#4 2014 22
#5 2015 24.2
.x and .y in accumulate (or x and y in Reduce) are the current value and the next value respectively. So for the first iteration .x is 10 and .y is 15, for the next .x becomes 15 and .y is 20, and so on for all the values in df$value. We check whether the next value (.y) is NA: if it is, we replace it with 1.1 times the previous value (.x), and if it is not NA we keep it as it is.
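Two consequences of this rule are worth seeing on small inputs: consecutive NAs compound, and a leading NA stays NA because there is no previous value to grow from. A base R sketch, where the helper name grow and the second input vector are hypothetical:

```r
# Same rule as above: an NA becomes 1.1 times the previously accumulated value.
grow <- function(prev, nxt) if (is.na(nxt)) prev * 1.1 else nxt

# Consecutive NAs compound: 20 -> 22 -> 24.2
compounded <- Reduce(grow, c(10, 15, 20, NA, NA), accumulate = TRUE)

# Leading NAs have nothing to build on, so they remain NA until the
# first observed value appears.
leading <- Reduce(grow, c(NA, NA, 5, NA), accumulate = TRUE)

compounded  # 10.0 15.0 20.0 22.0 24.2
leading     # NA NA 5.0 5.5
```

If leading gaps need a value as well, they have to be handled separately, for example by filling upwards from the first observed value.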