How to fill NAs with LOCF by factors in data frame, split by country
Here's a ddply solution. Try this:
library(plyr)
library(zoo) # provides na.locf
ddply(DF, .(country), na.locf)
country value
1 AUT <NA>
2 AUT 5
3 AUT 5
4 AUT 5
5 GER <NA>
6 GER <NA>
7 GER 7
8 GER 7
9 GER 7
Edit
From the ddply help you can find that
.variables: variables to split data frame by,
as quoted variables, a formula or character vector.
So other ways to get what you want are:
ddply(DF, "country", na.locf)
ddply(DF, ~country, na.locf)
Note that replacing .variables with DF$variable is not allowed; that's why you got an error when doing this. Here DF is your data.frame.
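To see what na.locf does within each group, here is a minimal base R sketch that carries the last observation forward per country. The helper locf_vec and the sample DF are assumptions reconstructed from the output shown above; they are not part of the original answer or of zoo.

```r
# Hypothetical input, reconstructed from the printed result above.
DF <- data.frame(
  country = rep(c("AUT", "GER"), times = c(4, 5)),
  value   = c(NA, 5, NA, NA, NA, NA, 7, NA, NA)
)

# Illustrative helper (not from zoo): carry the last non-NA value forward,
# leaving leading NAs in place.
locf_vec <- function(x) {
  seen <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]    # position 1 is the NA used before any value
}

# ave() applies the helper within each country and keeps the row order.
DF$filled <- ave(DF$value, DF$country, FUN = locf_vec)
DF
```

Leading NAs in each country stay NA because there is no earlier observation to carry forward, which matches the <NA> rows in the output above.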
Replace NA with previous or next value, by group, using dplyr
library(dplyr) # group_by and %>%
library(tidyr) # fill is part of tidyr
ps1 %>%
group_by(userID) %>%
#fill(color, age, gender) %>% #default direction down
fill(color, age, gender, .direction = "downup")
Which gives you:
Source: local data frame [9 x 4]
Groups: userID [3]
userID color age gender
<dbl> <fctr> <fctr> <fctr>
1 21 blue 3yrs F
2 21 blue 2yrs F
3 21 red 2yrs M
4 22 blue 3yrs F
5 22 blue 3yrs F
6 22 blue 3yrs F
7 23 red 4yrs F
8 23 red 4yrs F
9 23 gold 4yrs F
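As a rough illustration of what the "downup" direction means, the sketch below fills downward first and then upward within each group, using base R only. The helpers fill_down and fill_downup and the small ps data frame are hypothetical and only mimic the idea; they are not tidyr's implementation.

```r
# Hypothetical data: user 21 starts with a missing colour, user 22 ends with two.
ps <- data.frame(
  userID = c(21, 21, 21, 22, 22, 22),
  color  = c(NA, "blue", NA, "red", NA, NA),
  stringsAsFactors = FALSE
)

# Illustrative helper: carry the last non-NA value downward.
fill_down <- function(x) {
  seen <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[seen + 1]
}

# Fill downward first, then fill what is still missing upward.
fill_downup <- function(x) {
  x <- fill_down(x)
  rev(fill_down(rev(x)))
}

# Apply the helper separately within each userID.
ps$color <- ave(ps$color, ps$userID, FUN = fill_downup)
ps
```

With a plain downward fill, the first row of user 21 would remain NA; the upward pass is what fills it.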
Most efficient way to replace NAs in a data frame based on a subset of other row factors (using median as an estimate) in R
You haven't provided sample data, but based on your question I think this should work.
As @Roland mentioned, there is no need to calculate the median separately.
Assume your data frame is df. For every group (here Fac1 and Fac2) we calculate the median with the NA values removed. We then select only the indices that have NA values and replace them with their group's median.
df$Var1[is.na(df$Var1)] <- ave(df$Var1, df$Fac1, df$Fac2, FUN = function(x)
median(x, na.rm = TRUE))[is.na(df$Var1)]
UPDATE
At the OP's request, here is some information about the ave function.
The first parameter in ave is the one on which you want to perform the operation. Here the first parameter is Var1, for which we want to find the median. All the parameters following it are the grouping variables; there can be any number of them. Here the grouping variables are Fac1 and Fac2. Last comes the function that we want to apply to the first parameter (Var1) for every group defined by the grouping variables (Fac1 and Fac2). So here, for every unique group, we find the median for that group.
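A small worked example may make the ave call concrete. The data frame below is made up for illustration (the question did not include sample data); the column names Fac1, Fac2 and Var1 follow the answer.

```r
# Hypothetical data: two groups, each with one missing Var1.
df <- data.frame(
  Fac1 = rep(c("a", "b"), each = 3),
  Fac2 = rep(c("x", "y"), each = 3),
  Var1 = c(1, NA, 3, 10, 20, NA)
)

# ave() returns a vector as long as Var1, holding each row's group median.
grp_median <- ave(df$Var1, df$Fac1, df$Fac2,
                  FUN = function(x) median(x, na.rm = TRUE))

# Replace only the missing entries with their group's median.
is_missing <- is.na(df$Var1)
df$Var1[is_missing] <- grp_median[is_missing]
df
```

Here the a/x group has median 2 and the b/y group has median 15, so those are the values that replace the two NAs.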
Conditional NA filling by group
This is all about writing a modified na.locf function. After that you can plug it into data.table like any other function.
new.locf <- function(x){
# might want to think about the end of this loop
# this works here but you might need to add another case
# if there are NA's as the last value.
#
# anyway, loop through observations in a vector, x.
for(i in 2:(length(x)-1)){
nextval = i
# find the next non-NA value
# (the search stops at the last element if there isn't one)
while(nextval <= length(x)-1 && is.na(x[nextval])){
nextval = nextval + 1
}
# if the current value is not NA, great!
if(!is.na(x[i])){
x[i] <- x[i]
}else{
# if the current value is NA, the previous value is not NA, and
# the next value, as found above, is the same as the previous
# value, then give us that value. isTRUE() keeps the test from
# failing when the next value is itself NA.
if(!is.na(x[i-1]) && isTRUE(x[i-1] == x[nextval])){
x[i] <- x[nextval]
}else{
# finally, return NA if neither of these conditions hold
x[i] <- NA
}
}
}
# return the new vector
return(x)
}
Once we have that function, we can use data.table as usual:
dt2 <- dt[,list(year = year,
# when I read your data in, associatedid read as factor
associatedid = new.locf(as.character(associatedid))
),
by = "id"
]
This returns:
> dt2
id year associatedid
1: 1 2000 NA
2: 1 2001 ABC123
3: 1 2002 ABC123
4: 1 2003 ABC123
5: 1 2004 ABC123
6: 1 2005 ABC123
7: 2 2000 NA
8: 2 2001 ABC123
9: 2 2002 ABC123
10: 2 2003 NA
11: 2 2004 DEF456
12: 2 2005 DEF456
13: 3 2000 NA
14: 3 2001 ABC123
15: 3 2002 ABC123
16: 3 2003 ABC123
17: 3 2004 ABC123
18: 3 2005 ABC123
which is what you are looking for as best I understand it.
I provided some hedging in the new.locf definition so you still might have a little thinking to do but this should get you started.
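For comparison, the same rule (fill a run of NAs only when the values on both sides of it agree) can be sketched without a loop by filling forward and backward and keeping only the positions where the two fills match. This is an alternative formulation under that reading of the rule, not the code above; locf_vec and fill_between are hypothetical helper names.

```r
# Illustrative helper: carry the last non-NA value forward.
locf_vec <- function(x) {
  seen <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[seen + 1]
}

# Fill an NA only if the nearest non-NA values before and after it are equal.
fill_between <- function(x) {
  fwd <- locf_vec(x)             # nearest earlier value
  bwd <- rev(locf_vec(rev(x)))   # nearest later value
  fillable <- is.na(x) & !is.na(fwd) & !is.na(bwd) & fwd == bwd
  x[which(fillable)] <- fwd[which(fillable)]
  x
}

# Values taken from ids 1 and 2 in the output above.
fill_between(c(NA, "ABC123", NA, NA, "ABC123", "ABC123"))
fill_between(c(NA, "ABC123", "ABC123", NA, "DEF456", "DEF456"))
```

Leading and trailing NAs are left alone because one of the two fills is NA there, so no extra case is needed for the ends of the vector.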
Filling NA values using the populated values within subgroups
With data.table, we convert the 'data.frame' to 'data.table' (setDT(df1)), group by 'name', apply na.locf to destination in the forward direction with na.rm=FALSE, then apply it again in the reverse direction (fromLast=TRUE), and assign (:=) the output back to the same column.
library(zoo)
library(data.table)
setDT(df1)[, destination := na.locf(na.locf(destination,
na.rm=FALSE), fromLast=TRUE), by = name]
df1
# name nav_status destination
#1: A 5 MUMBAI
#2: A 0 MUMBAI
#3: A 1 MUMBAI
#4: B 5 NEW YORK
#5: B 0 NEW YORK
#6: B 1 NEW YORK
I don't succeed to use LOCF method as I want
1) by Use by to split the data into a component for each ID and apply na.locf to each such component. Finally rbind the components back together. No packages other than zoo (which provides na.locf) are used.
do.call("rbind", by(data_rep, data_rep$ID, na.locf, na.rm = FALSE))
2) ave Another approach is to use ave on each column. No packages other than zoo are used. Note that na.locf0 is like na.locf but only works on vectors and defaults to na.rm = FALSE.
AVE <- function(x) ave(x, data_rep$ID, FUN = na.locf0)
replace(data_rep, TRUE, lapply(data_rep, AVE))
2a) If it is ok to overwrite the input this can be written compactly as:
AVE <- function(x) ave(x, data_rep$ID, FUN = na.locf0)
data_rep[] <- lapply(data_rep, AVE)
3) dplyr Yet another approach is to use group_by in the dplyr package:
library(dplyr)
data_rep %>%
group_by(ID) %>%
na.locf(na.rm = FALSE) %>%
ungroup
4) data.table
library(data.table)
DT <- as.data.table(data_rep)
DT[, na.locf(.SD, na.rm = FALSE), by = ID]
Note that this question is similar to this one except this question has multiple columns -- Carry Last Observation Forward by ID in R
Replace missing values (NA) with most recent non-NA by group
These all use na.locf from the zoo package. Also note that na.locf0 (also defined in zoo) is like na.locf except that it defaults to na.rm = FALSE and requires a single vector argument. na.locf2, defined in the first solution, is also used in some of the others.
dplyr
library(dplyr)
library(zoo)
na.locf2 <- function(x) na.locf(x, na.rm = FALSE)
df %>% group_by(houseID) %>% do(na.locf2(.)) %>% ungroup
giving:
Source: local data frame [15 x 3]
Groups: houseID
houseID year price
1 1 1995 NA
2 1 1996 100
3 1 1997 100
4 1 1998 120
5 1 1999 120
6 2 1995 NA
7 2 1996 NA
8 2 1997 NA
9 2 1998 30
10 2 1999 30
11 3 1995 NA
12 3 1996 44
13 3 1997 44
14 3 1998 44
15 3 1999 44
A variation of this is:
df %>% group_by(houseID) %>% mutate(price = na.locf0(price)) %>% ungroup
Other solutions below give output which is quite similar so we won't repeat it except where the format differs substantially.
Another possibility is to combine the by solution (shown further below) with dplyr:
df %>% by(df$houseID, na.locf2) %>% bind_rows
by
library(zoo)
do.call(rbind, by(df, df$houseID, na.locf2))
ave
library(zoo)
transform(df, price = ave(price, houseID, FUN = na.locf0))
data.table
library(data.table)
library(zoo)
data.table(df)[, na.locf2(.SD), by = houseID]
zoo This solution uses zoo alone. It returns a wide rather than long result:
library(zoo)
z <- read.zoo(df, index = 2, split = 1, FUN = identity)
na.locf2(z)
giving:
1 2 3
1995 NA NA NA
1996 100 NA 44
1997 100 NA 44
1998 120 30 44
1999 120 30 44
This solution could be combined with dplyr like this:
library(dplyr)
library(zoo)
df %>% read.zoo(index = 2, split = 1, FUN = identity) %>% na.locf2
input
Here is the input used for the examples above:
df <- structure(list(houseID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L), year = c(1995L, 1996L, 1997L, 1998L,
1999L, 1995L, 1996L, 1997L, 1998L, 1999L, 1995L, 1996L, 1997L,
1998L, 1999L), price = c(NA, 100L, NA, 120L, NA, NA, NA, NA,
30L, NA, NA, 44L, NA, NA, NA)), .Names = c("houseID", "year",
"price"), class = "data.frame", row.names = c(NA, -15L))
REVISED Re-arranged and added more solutions. Revised the dplyr/zoo solution to conform to the latest changes in dplyr. Applied fixes and factored out na.locf2 from all solutions.
Insert rows based on min and max year and fill with NAs
I guess you want to have a row for each pair of year (from 1994 to 2020) and canton_id. I think you can create full_df with all the pairs and then merge it with your data.frame.
full_df <- expand.grid(canton_id = unique(relative_FTE$canton_id), year = 1994:2020)
merge(relative_FTE, full_df, all = TRUE, by = c("year", "canton_id"))
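To show the effect on a small scale, here is a sketch with a made-up relative_FTE and a shortened year range (1994 to 1996). The fte column and its values are assumptions for illustration only.

```r
# Hypothetical data: canton 1 lacks 1995, canton 2 lacks 1994 and 1996.
relative_FTE <- data.frame(
  canton_id = c(1, 1, 2),
  year      = c(1994, 1996, 1995),
  fte       = c(0.5, 0.7, 0.9)
)

# Every combination of canton and year in the target range.
full_df <- expand.grid(
  canton_id = unique(relative_FTE$canton_id),
  year      = 1994:1996
)

# A full outer join keeps all pairs; pairs without data get NA in fte.
filled <- merge(relative_FTE, full_df, all = TRUE, by = c("year", "canton_id"))
filled
```

The result has one row per canton and year, and the three pairs missing from the input appear with NA in fte.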