How to fill NAs with LOCF by factors in data frame, split by country

Here's a ddply solution (na.locf comes from the zoo package). Try this:

library(plyr)
library(zoo)  # provides na.locf
ddply(DF, .(country), na.locf)
  country value
1     AUT  <NA>
2     AUT     5
3     AUT     5
4     AUT     5
5     GER  <NA>
6     GER  <NA>
7     GER     7
8     GER     7
9     GER     7

Edit

The ddply help page describes the argument as:

.variables: variables to split data frame by, as quoted variables, a formula or character vector.

So other ways to get what you want are:

ddply(DF, "country", na.locf)
ddply(DF, ~country, na.locf)

Note that passing DF$variable as .variables is not allowed, which is why you got an error when you tried it.

Here DF is your data.frame.
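For readers without plyr or zoo, the same split-apply idea can be sketched in base R. The DF below is a made-up stand-in for the asker's data, and locf is a hypothetical helper written for this illustration (it is not zoo's na.locf). As in the output above, leading NAs within a country stay NA.

```r
# Hypothetical stand-in for the asker's DF (the real data is not shown).
DF <- data.frame(
  country = c("AUT", "AUT", "AUT", "AUT", "GER", "GER", "GER", "GER", "GER"),
  value   = c(NA, 5, NA, NA, NA, NA, 7, NA, NA)
)

# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of the latest non-NA so far
  idx[idx == 0] <- NA                      # nothing observed yet: stay NA
  x[idx]
}

# ave() applies locf within each country and keeps the original row order.
DF$value <- ave(DF$value, DF$country, FUN = locf)
DF
```

Because ave returns a vector aligned with the input rows, no rbind step is needed afterwards.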

Replace NA with previous or next value, by group, using dplyr

library(dplyr)  # group_by() and %>%
library(tidyr)  # fill() is part of tidyr

ps1 %>%
  group_by(userID) %>%
  # fill(color, age, gender) %>%  # default direction is "down"
  fill(color, age, gender, .direction = "downup")

Which gives you:

Source: local data frame [9 x 4]
Groups: userID [3]

  userID  color    age gender
   <dbl> <fctr> <fctr> <fctr>
1     21   blue   3yrs      F
2     21   blue   2yrs      F
3     21    red   2yrs      M
4     22   blue   3yrs      F
5     22   blue   3yrs      F
6     22   blue   3yrs      F
7     23    red   4yrs      F
8     23    red   4yrs      F
9     23   gold   4yrs      F

Most efficient way to replace NAs in a data frame based on a subset of other row factors (using median as an estimate) in R

You haven't provided sample data, but based on your question I think this should work.

As @Roland mentioned, there is no need to calculate the median separately.

Assume your data frame is df. For every group (here defined by Fac1 and Fac2) we calculate the median, ignoring NA values. We then select only the positions where Var1 is NA and replace them with the median of their group.

df$Var1[is.na(df$Var1)] <- ave(df$Var1, df$Fac1, df$Fac2,
  FUN = function(x) median(x, na.rm = TRUE))[is.na(df$Var1)]

UPDATE

At the OP's request, here is some information about the ave function.

The first argument to ave is the vector you want to operate on; here it is Var1, for which we want the median. All the following unnamed arguments are grouping variables, and there can be any number of them; here they are Fac1 and Fac2. Finally, FUN is the function applied to the first argument within every unique combination of the grouping variables. So for each unique (Fac1, Fac2) group we compute that group's median, and ave returns a vector of the same length as Var1 with each position holding its group's result.
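To make the behaviour of ave concrete, here is a small sketch on made-up data. The column names follow the answer, but the values are purely illustrative assumptions.

```r
# Hypothetical toy data; only the column names come from the answer.
df <- data.frame(
  Fac1 = c("a", "a", "a", "b", "b", "b"),
  Fac2 = c("x", "x", "x", "y", "y", "y"),
  Var1 = c(1, NA, 3, 10, 20, NA)
)

# One median per (Fac1, Fac2) group, repeated for every row of that group.
group_median <- ave(df$Var1, df$Fac1, df$Fac2,
                    FUN = function(x) median(x, na.rm = TRUE))

# Overwrite only the rows where Var1 was missing.
missing <- is.na(df$Var1)
df$Var1[missing] <- group_median[missing]
df$Var1
```

Storing the result of ave in group_median first is just for readability; it is the same computation as the one-liner above.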

Conditional NA filling by group

This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.

new.locf <- function(x) {
  n <- length(x)
  # Nothing to do unless there is at least one interior element.
  if (n < 3) return(x)
  # Loop through the interior observations of the vector x.
  # The first and last values are never modified; trailing NAs stay NA,
  # so add another case here if you need them filled.
  for (i in 2:(n - 1)) {
    # Find the position of the next non-NA value (stops at n if there is none).
    nextval <- i
    while (nextval < n && is.na(x[nextval])) {
      nextval <- nextval + 1
    }
    # Fill the current value only if it is NA and the previous value equals
    # the next non-NA value. isTRUE() treats an NA comparison (no previous
    # or no later value) as "do not fill", so the value stays NA.
    if (is.na(x[i]) && isTRUE(x[i - 1] == x[nextval])) {
      x[i] <- x[nextval]
    }
  }
  # Return the new vector.
  x
}

Once we have that function, we can use data.table as usual:

dt2 <- dt[, list(year = year,
                 # when I read your data in, associatedid was read as a factor
                 associatedid = new.locf(as.character(associatedid))),
          by = "id"]

This returns:

> dt2
    id year associatedid
 1:  1 2000           NA
 2:  1 2001       ABC123
 3:  1 2002       ABC123
 4:  1 2003       ABC123
 5:  1 2004       ABC123
 6:  1 2005       ABC123
 7:  2 2000           NA
 8:  2 2001       ABC123
 9:  2 2002       ABC123
10:  2 2003           NA
11:  2 2004       DEF456
12:  2 2005       DEF456
13:  3 2000           NA
14:  3 2001       ABC123
15:  3 2002       ABC123
16:  3 2003       ABC123
17:  3 2004       ABC123
18:  3 2005       ABC123

which, as best I understand it, is what you are looking for.

I left some hedging in the comments of new.locf, so you may still have a little thinking to do, but this should get you started.
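As an alternative to the loop, the same rule (fill an NA only when the last value before it and the next value after it agree) can be sketched in vectorized base R. This is a different technique from the answer's function, and the helper names locf and fill_if_agree are hypothetical.

```r
# Hypothetical helper: last observation carried forward (leading NAs stay NA).
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # index of the latest non-NA so far
  idx[idx == 0] <- NA
  x[idx]
}

# Hypothetical helper: fill an NA only where the forward-filled and
# backward-filled values both exist and are equal.
fill_if_agree <- function(x) {
  fwd <- locf(x)             # previous non-NA value at each position
  bwd <- rev(locf(rev(x)))   # next non-NA value at each position
  agree <- is.na(x) & !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[agree] <- fwd[agree]
  x
}

x <- c(NA, "ABC123", NA, "ABC123", NA, "DEF456", NA)
filled <- fill_if_agree(x)
filled
```

Inside data.table this could be used in place of new.locf in the j expression, since it takes and returns a plain vector.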

Filling NA values using the populated values within subgroups

With data.table, we convert the data.frame to a data.table with setDT(df1). Grouped by name, we apply na.locf to destination in the forward direction with na.rm = FALSE, apply it again in reverse (fromLast = TRUE), and assign (:=) the result back to the same column.

library(zoo)
library(data.table)
setDT(df1)[, destination := na.locf(na.locf(destination, na.rm = FALSE),
                                    fromLast = TRUE), by = name]
df1
#    name nav_status destination
# 1:    A          5      MUMBAI
# 2:    A          0      MUMBAI
# 3:    A          1      MUMBAI
# 4:    B          5    NEW YORK
# 5:    B          0    NEW YORK
# 6:    B          1    NEW YORK

I can't get the LOCF method to work the way I want

1) by: Use by to split the data into one component per ID and apply na.locf to each component. Finally, rbind the components back together. No packages other than zoo (which provides na.locf) are used.

library(zoo)

do.call("rbind", by(data_rep, data_rep$ID, na.locf, na.rm = FALSE))

2) ave: Another approach is to use ave on each column. No packages other than zoo are used. Note that na.locf0 is like na.locf but only works on vectors and defaults to na.rm = FALSE.

AVE <- function(x) ave(x, data_rep$ID, FUN = na.locf0)
replace(data_rep, TRUE, lapply(data_rep, AVE))

2a) If it is ok to overwrite the input this can be written compactly as:

AVE <- function(x) ave(x, data_rep$ID, FUN = na.locf0)
data_rep[] <- lapply(data_rep, AVE)

3) dplyr: Yet another approach is to use group_by from the dplyr package:

library(dplyr)
library(zoo)

data_rep %>%
  group_by(ID) %>%
  na.locf(na.rm = FALSE) %>%
  ungroup()

4) data.table

library(data.table)
library(zoo)

DT <- as.data.table(data_rep)
DT[, na.locf(.SD, na.rm = FALSE), by = ID]

Note that this question is similar to "Carry Last Observation Forward by ID in R", except that this one has multiple columns.

Replace missing values (NA) with most recent non-NA by group

These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except it defaults to na.rm = FALSE and requires a single vector argument. na.locf2 defined in the first solution is also used in some of the others.

dplyr

library(dplyr)
library(zoo)

na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup

giving:

Source: local data frame [15 x 3]
Groups: houseID

   houseID year price
1        1 1995    NA
2        1 1996   100
3        1 1997   100
4        1 1998   120
5        1 1999   120
6        2 1995    NA
7        2 1996    NA
8        2 1997    NA
9        2 1998    30
10       2 1999    30
11       3 1995    NA
12       3 1996    44
13       3 1997    44
14       3 1998    44
15       3 1999    44

A variation of this is:

df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup

Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.

Another possibility is to combine the by solution (shown further below) with dplyr:

df %>% by(df$houseID, na.locf2) %>% bind_rows

by

library(zoo)

do.call(rbind, by(df, df$houseID, na.locf2))

ave

library(zoo)

transform(df, price = ave(price, houseID, FUN = na.locf0))

data.table

library(data.table)
library(zoo)

data.table(df)[, na.locf2(.SD), by = houseID]

zoo

This solution uses zoo alone. It returns a wide rather than long result:

library(zoo)

z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)

giving:

       1  2  3
1995  NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44

This solution could be combined with dplyr like this:

library(dplyr)
library(zoo)

df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2

input

Here is the input used for the examples above:

df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))

REVISED: Rearranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.

Insert rows based on min and max year and fill with NAs

I guess you want a row for each combination of year (from 1994 to 2020) and canton_id. You can create full_df with all the combinations and then merge it with your data.frame.

full_df <- expand.grid(canton_id = unique(relative_FTE$canton_id), year = 1994:2020)
merge(relative_FTE, full_df, all = TRUE, by = c("year", "canton_id"))
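A minimal sketch of what this does, on made-up data. The relative_FTE values and the fte column are assumptions for illustration only; the asker's real columns are not shown.

```r
# Hypothetical stand-in for relative_FTE; fte is an assumed value column.
relative_FTE <- data.frame(
  canton_id = c(1L, 1L, 2L),
  year      = c(1995L, 1996L, 2019L),
  fte       = c(0.5, 0.6, 0.7)
)

# Every canton_id paired with every year from 1994 to 2020.
full_df <- expand.grid(canton_id = unique(relative_FTE$canton_id),
                       year = 1994:2020)

# all = TRUE keeps every pair; pairs absent from relative_FTE get NA in fte.
result <- merge(relative_FTE, full_df, all = TRUE, by = c("year", "canton_id"))

nrow(result)             # 2 cantons x 27 years
sum(!is.na(result$fte))  # only the originally observed rows have a value
```

The inserted rows carry NA in every non-key column, so they can then be filled with any of the LOCF approaches above if that is what is wanted.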

