Fill in Missing Values by Group in Data.Table

Fill in missing values by group in data.table

There is now a native data.table way of filling missing values (as of 1.12.4).

This question spawned a GitHub issue, which was recently closed with the creation of the functions nafill and setnafill. You can now use:

DT[, value_filled_in := nafill(value, type = "locf")]

It is also possible to fill NA with a constant value (type = "const") or with the next observation carried backward (type = "nocb").
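As a minimal sketch of the three fill types, here is a toy table (the column names grp and value are made up for illustration, not taken from the question). The by argument restricts the carry-forward and carry-backward fills to each group:

```r
library(data.table)

# Hypothetical toy data: `grp` and `value` are illustrative names
DT <- data.table(
  grp   = c("a", "a", "a", "b", "b", "b"),
  value = c(NA, 1, NA, 5, NA, NA)
)

# Last observation carried forward, within each group
DT[, locf := nafill(value, type = "locf"), by = grp]

# Next observation carried backward, within each group
DT[, nocb := nafill(value, type = "nocb"), by = grp]

# Replace every NA with a constant
DT[, const := nafill(value, type = "const", fill = 0)]

print(DT)
```

Note that a leading NA in a group stays NA under "locf" (and a trailing NA stays NA under "nocb"), because there is nothing within the group to carry over.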

One difference from the approach in the question is that these functions currently work only on NA, not NaN, whereas is.na is TRUE for NaN. This is planned to be fixed in the next release through an extra argument.

I have no involvement with the project, but I saw that although the GitHub issue links here, there was no link the other way, so I'm answering on behalf of future visitors.

Update: By default, NaN is now treated the same as NA.
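A small sketch of the NaN behaviour, assuming a data.table version whose nafill has the nan argument (the vector x is made up for illustration):

```r
library(data.table)

x <- c(1, NaN, NA, 4)

# Default (nan = NA): NaN is treated as missing and gets filled
filled_default <- nafill(x, type = "locf")

# nan = NaN: NaN is treated as an ordinary value and is left in place
kept_nan <- nafill(x, type = "locf", nan = NaN)

print(filled_default)
print(kept_nan)
```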

data.table fill missing values from other rows by group

With data.table and zoo:

library(data.table)
library(zoo)

# Next observation carried backward: fill each NA from the next non-NA value in its group
dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA]

# Last observation carried forward: fill the NAs still left at the end of a group
# (na.locf0 always returns a vector of the same length as its input)
dt[, colB := na.locf0(colB), by = colA][]

Or in a single chain:

dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA][
  , colB := na.locf0(colB), by = colA][]

Both return:

    colA colB
 1:    1    4
 2:    1    1
 3:    1    1
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2    3
 8:    2    3
 9:    3    4
10:    3    2
11:    3    2
12:    3    2

Data:

text <- "colA colB
1 4
1 NA
1 NA
1 1
2 4
2 3
2 NA
2 NA
3 4
3 NA
3 2
3 NA"

dt <- fread(input = text, stringsAsFactors = FALSE)

Fill missing values by rolling forward in each group using data.table

You can use the na.locf() function from the zoo package:

DT[, VAL:=zoo::na.locf(VAL, na.rm = FALSE), "CLASS"]
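The na.rm = FALSE part matters: by default na.locf drops leading NAs, so the result can be shorter than the group it came from. A minimal sketch on made-up data (only the column names CLASS and VAL come from the answer above):

```r
library(data.table)
library(zoo)

# Hypothetical toy data; group "a" starts with an NA
DT <- data.table(
  CLASS = c("a", "a", "a", "b", "b"),
  VAL   = c(NA, 1, NA, 2, NA)
)

# With the default na.rm = TRUE the leading NA is dropped: 3 values in, 2 out
shortened <- zoo::na.locf(c(NA, 1, NA))

# With na.rm = FALSE the length is preserved and the leading NA stays NA
DT[, VAL := zoo::na.locf(VAL, na.rm = FALSE), by = "CLASS"]

print(length(shortened))
print(DT)
```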

Fill in missing values (locf/nocb) in character column by group

You can replace empty values with NA and use zoo::na.locf.

library(data.table)

example[, date_data := zoo::na.locf(replace(date_data, date_data == "", NA)), Object]
example

#     Object       date date_data
#  1:      N                 <NA>
#  2:      A 2020-01-01 something
#  3:      A 2020-01-01 something
#  4:      A 2020-01-01 something
#  5:      B 2020-01-01 something
#  6:      B 2020-01-01 something
#  7:      B 2020-01-01 something
#  8:      C 2020-01-01 something
#  9:      C 2020-01-01 something
# 10:      C 2020-01-01 something

Similarly, using tidyr's fill:

library(dplyr)

example %>%
  mutate(date_data = replace(date_data, date_data == "", NA)) %>%
  group_by(Object) %>%
  tidyr::fill(date_data, .direction = "up")

How to fill NA with the other values in the same group

Here is a solution using nafill from data.table with type = 'nocb' and type = 'locf' to carry the values backward and forward.

library(data.table)

df <- data.table(group = c('A', 'B', 'B', 'A', 'B', 'B', 'A', 'A'), value = c(NA, 2, NA, 6, NA, 2, 6, NA))


df[ , value := nafill(nafill(value, type = 'nocb'), type = 'locf'), group]


Output:

group value
A 6
B 2
B 2
A 6
B 2
B 2
A 6
A 6

Original table:

group value
A NA
B 2
B NA
A 6
B NA
B 2
A 6
A NA

Created on 2021-03-08 by the reprex package (v0.3.0)

Fill missing values by group using linear regression in R

Since you already know how to do this for one dataframe with a single country, you are very close to your solution. But to make this easy on yourself, you need to do a few things.

  1. Create a reproducible example using dput. The janitor library has the clean_names() function to fix column names.

  2. Write your own interpolation function that takes a dataframe with one country as the input and returns an interpolated dataframe for that country.

  3. Use pivot_longer to gather all the data columns into one parameterized column.

  4. Use the dplyr function group_split to take your large multicountry dataframe and break it into a list of dataframes, one for each country and parameter.

  5. Use the purrr function map to map each of the dataframes in the list to a new list of interpolated dataframes.

  6. Use dplyr's bind_rows to convert the list of interpolated dataframes back into one dataframe, and pivot_wider to get your original data shape back.


library(tidyverse)
library(purrr)
library(janitor)

my_country_interpolater <- function(single_country_df) {

  data_to_build_model <- single_country_df %>%
    filter(!is.na(value)) %>%
    select(year, value)

  years_to_interpolate <- single_country_df %>%
    filter(is.na(value)) %>%
    select(year)

  fit <- lm(value ~ year, data = data_to_build_model)
  value <- predict(fit, years_to_interpolate)

  interpolated_data <- tibble(years_to_interpolate, value)

  single_country_interpolated_df <- bind_rows(data_to_build_model, interpolated_data) %>%
    mutate(country_code = single_country_df$country_code[1]) %>%
    mutate(parameter = single_country_df$parameter[1]) %>% # added this for the additional parameters
    select(country_code, year, parameter, value) %>%
    arrange(year)

  return(single_country_interpolated_df)
}

interpolated_df <- sampledata2 %>%
  clean_names() %>%
  pivot_longer(cols = c(3:5), names_to = "parameter", values_to = "value") %>%
  group_by(country_code, parameter) %>%
  group_split() %>%
  # map(preprocess_data) %>% if you need a preprocessing step
  map(my_country_interpolater) %>%
  bind_rows() %>%
  pivot_wider(names_from = parameter, values_from = value, names_glue = "{parameter}_interp")
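For comparison, the same idea can be sketched in data.table, which avoids the split/map/bind steps by fitting one model per group inside by. This is an illustrative sketch, not the approach from the answer above: the helper fill_lm and the toy columns country, year and value are hypothetical names. Groups with fewer than two observed points are returned unchanged.

```r
library(data.table)

# Hypothetical helper: replace NAs in `value` with predictions from
# a linear fit of the observed values on `year`
fill_lm <- function(value, year) {
  miss <- is.na(value)
  # Nothing to fill, or too few points to fit a line: return as is
  if (!any(miss) || sum(!miss) < 2) return(value)
  obs <- data.frame(year = year[!miss], value = value[!miss])
  fit <- lm(value ~ year, data = obs)
  value[miss] <- predict(fit, newdata = data.frame(year = year[miss]))
  value
}

# Toy data, made up for illustration
DT <- data.table(
  country = rep(c("A", "B"), each = 4),
  year    = rep(2000:2003, times = 2),
  value   = c(1, NA, 3, 4, 10, 20, NA, NA)
)

# One regression per country; only the missing rows are replaced
DT[, value_interp := fill_lm(value, year), by = country]
print(DT)
```

Be aware that this also extrapolates beyond the observed years (country B here), which may or may not be what you want.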

Efficiently fill NAs by group

This is the code I have used: Your code vs akrun vs mine. Sometimes zoo is not the fastest process but it is the cleanest. Anyway, you can test it.

UPDATE:
It has been tested with more data (100,000 rows), and Process 03 (subset and merge) wins by far.

Last UPDATE
Function comparison with rbenchmark:

library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
library(rbenchmark)

# data.frame of 10,000 groups with 10 observations each (100,000 rows)
data <- data.frame(group = rep(1:10000, each = 10), value = NA)
# The first 500 groups get a value at their fifth observation
# (the 50 random draws are recycled); the remaining groups stay all NA
data$value[seq(5, 5000, 10)] <- rnorm(50)

# Process01: tidyr::fill in both directions
P01 <- function(data) {
  data01 <- data %>%
    group_by(group) %>%              # by group
    fill(value) %>%                  # default direction is down
    fill(value, .direction = "up")   # also fill NAs upwards
  return(data01)
}

# Process02: zoo::na.locf forward, then backward
# Note: setDT() converts its input to a data.table by reference
P02 <- function(data) {
  data02 <- setDT(data)[, value := na.locf(na.locf(value, na.rm = FALSE),
                                           fromLast = TRUE), group]
  return(data02)
}

# Process03: subset the non-NA rows and merge them back by group
P03 <- function(data) {
  dataU <- subset(unique(data), !is.na(value))           # keep the distinct non-NA rows
  dataM <- merge(data, dataU, by = "group", all = TRUE)  # merge tables
  data03 <- data.frame(group = dataM$group, value = dataM$value.y)  # same shape as data
  return(data03)
}

benchmark("P01_dplyr" = {data01 <- P01(data)},
          "P02_zoo" = {data02 <- P02(data)},
          "P03_data.table" = {data03 <- P03(data)},
          replications = 10,
          columns = c("test", "replications", "elapsed")
)

Results with 10,000 groups (100,000 rows), 10 replications, on an Intel Core i5-7400:

            test replications elapsed
1      P01_dplyr           10  257.78
2        P02_zoo           10   10.35
3 P03_data.table           10    0.09

na.locf in data.table when completing by group

Another possible solution with only the (rolling) join capabilities of data.table:

dt[.(min(a):max(a)), on = .(a), roll = Inf]

which gives:

   a b
1: 1 a
2: 2 a
3: 3 b
4: 4 b
5: 5 c

On large datasets this will probably outperform every other solution.
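To apply the same rolling join within groups, one option is to build the full grid of keys per group first and then join on both columns. In a multi-column join, roll applies to the last join column, while the earlier ones are matched exactly. A minimal sketch on made-up data (the grouping column g and the helper table grid are hypothetical):

```r
library(data.table)

# Toy data with gaps in `a` inside each group
dt <- data.table(
  g = c("x", "x", "y", "y"),
  a = c(1L, 3L, 2L, 4L),
  b = c("a", "b", "c", "d")
)

# Every value of `a` between the group's minimum and maximum
grid <- dt[, .(a = seq(min(a), max(a))), by = g]

# Exact match on g, rolling (last observation carried forward) on a
res <- dt[grid, on = .(g, a), roll = Inf]
print(res)
# b comes out as: a a b c c d
```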

Credit to @Mako212, whose answer using seq gave the hint.


First posted solution which works, but gives a warning:

dt[dt[, .(a = Reduce(":", a))], on = .(a), roll = Inf]

