Why Is Using Update on a Lm Inside a Grouped Data.Table Losing Its Model Data


This is not an answer, but it is too long for a comment.

The .Environment attribute of the terms component is identical for each resulting model:

e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE

It appears that data.table reuses the same bit of memory (my non-technical term) for each evaluation of j by group, which is efficient. However, when update is called, it re-evaluates the model call and looks the variables up in that shared environment, which by then contains the values from the last group. Every refitted model therefore ends up using the last group's data.
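
The mechanism can be reproduced without data.table. The following is a minimal sketch, assuming a hand-made environment `e` (a hypothetical stand-in for the environment that appears to be reused across groups) whose contents are overwritten after the first fit:

```r
# Hypothetical stand-in for the environment that is reused for every group.
e <- new.env()
e$x <- c(2, 1, 4, 3, 5)
e$y <- c(1, 2, 3, 4, 5)

f <- y ~ x
environment(f) <- e      # the model will look up x and y in e
m1 <- lm(f)              # fitted on 5 observations

# Overwrite the contents of e, as if the next group had been evaluated.
e$x <- c(3, 1, 2)
e$y <- c(1, 2, 3)

# update() re-evaluates the call; x and y are found in e again,
# so the refit silently uses whatever e holds now.
m2 <- update(m1, . ~ .)

n_before <- nobs(m1)     # observations used by the original fit
n_after  <- nobs(m2)     # observations used by the refit
```

Comparing `n_before` with `n_after` shows that the refit no longer uses the data the model was originally fitted on, even though the formula is unchanged.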

So, if you fudge this by giving each model its own copy of the group's data as its environment, it will work:

fit <- DT[, {
  xx <- list2env(copy(.SD))                      # private copy of this group's data
  mymodel <- lm(Sepal.Length ~ Sepal.Width + Petal.Length)
  attr(mymodel$terms, '.Environment') <- xx      # point the model at that copy
  list(list(mymodel))
}, by = 'Species']

lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[, lapply(V1, nobs)]
##    V1 V2 V3
## 1: 41 39 42
# using your exact diagnostic coding.
lfit2[, nobs(V1[[1]]), by = Species]
##       Species V1
## 1:     setosa 41
## 2: versicolor 39
## 3:  virginica 42

Not a long-term solution, but at least a workaround.

Updating a list of models with a list of data in data.table

Is this what you want?

t1[, list(lapply(tsAll, ets, model = mod1[[1]])), by = group]$V1

I wrapped the result in a list so that its data type is preserved rather than being converted into a vector, and I did the operation by group, since each group has its own model.

Breaking the for loop does not provide the correct output

It only prints 1 because the RI of your second model is larger than that of the third model: the if condition is satisfied in the 2nd iteration, so the loop breaks before printing 2. Instead, try this:

# stop at length(fit) - 1 so that fit[[i + 1]] always exists
for (i in seq_len(length(fit) - 1)) {
  if (RI(fit[[i]], iris[, 5]) > RI(fit[[i + 1]], iris[, 5])) {
    print(i)
    break
  }
}
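
The behaviour of this kind of loop is easier to check on plain numbers. Here is a small sketch, assuming a hypothetical vector `scores` that stands in for the RI value of each model (the values are made up for illustration). It records the first index whose score is larger than the next one, and stops one short of the end so that `i + 1` is always a valid index:

```r
# Hypothetical scores standing in for RI(fit[[i]], iris[, 5]).
scores <- c(0.40, 0.55, 0.50, 0.60)

first_drop <- NA_integer_                 # stays NA if the scores never decrease
for (i in seq_len(length(scores) - 1)) {  # last index has no successor
  if (scores[i] > scores[i + 1]) {
    first_drop <- i                       # record the index before leaving the loop
    break
  }
}
first_drop
```

With these values the second score is the first one that exceeds its successor, so the loop stops there.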

Fill missing values by group using linear regression in R

Since you already know how to do this for one dataframe with a single country, you are very close to your solution. But to make this easy on yourself, you need to do a few things.

1. Create a reproducible example using dput. The janitor library has the clean_names() function to fix column names.
2. Write your own interpolation function that takes a dataframe for one country as input and returns an interpolated dataframe for that country.
3. Use pivot_longer to gather all the data columns into a single value column, with a parameter column naming the original column.
4. Use the dplyr function group_split to break your large multi-country dataframe into a list of dataframes, one for each country and parameter.
5. Use the purrr function map to map each dataframe in the list to an interpolated dataframe, giving a new list.
6. Use dplyr's bind_rows to convert the list of interpolated dataframes back into one dataframe, and pivot_wider to get your original data shape back.


library(tidyverse)
library(purrr)
library(janitor)

my_country_interpolater<-function(single_country_df){
  
  data_to_build_model<-single_country_df %>%
    filter(!is.na(value)) %>%
    select(year,value)
  
  years_to_interpolate<-single_country_df %>%
    filter(is.na(value)) %>%
    select(year)
  
  fit<-lm(value ~ year, data = data_to_build_model)
  value <- predict(fit, years_to_interpolate)
  
  
  interpolated_data<-tibble(years_to_interpolate, value)
  
  single_country_interpolated_df<-bind_rows(data_to_build_model,interpolated_data) %>% 
    mutate(country_code=single_country_df$country_code[1]) %>%
    mutate(parameter=single_country_df$parameter[1]) %>%  # added this for the additional parameters
    select(country_code, year, parameter, value) %>%
    arrange(year) 
  
  return (single_country_interpolated_df)
}

interpolated_df <-sampledata2 %>%
  clean_names() %>% 
  pivot_longer(cols=c(3:5),names_to = "parameter", values_to="value") %>%
  group_by(country_code,parameter) %>%
  group_split() %>% 
 # map(preprocess_data) %>% if you need a preprocessing step
  map(my_country_interpolater) %>%
  bind_rows() %>%
  pivot_wider(names_from = parameter, values_from=value, names_glue = "{parameter}_interp")
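
The core of the interpolation step for one country and one parameter can be checked in isolation. This is a minimal base R sketch, assuming a made-up data frame `toy` with `year` and `value` columns (not the original sampledata2); it fits the line on the observed rows and fills only the missing ones:

```r
# Hypothetical single-country, single-parameter series with two gaps.
toy <- data.frame(
  year  = 2000:2005,
  value = c(1, NA, 3, 4, NA, 6)
)

# lm() drops rows with NA in value by default, so the fit uses observed years only.
fit <- lm(value ~ year, data = toy)

# Predict only for the rows that were missing and write them back in place.
missing <- is.na(toy$value)
toy$value[missing] <- predict(fit, newdata = toy[missing, , drop = FALSE])

toy
```

Because the observed points in this toy series lie exactly on a line, the filled values continue that line, which makes it easy to see whether the step behaves as intended before running it over every group.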

Apply grouped model back onto data

Here is a dplyr method of obtaining a similar answer, following the approach used by @Mike.Gahan :

library(dplyr) 

iris.models <- iris %>%
  group_by(Species) %>%
  do(mod = lm(Sepal.Length ~ Sepal.Width, data = .))

iris %>%
  as_tibble() %>%                              # tbl_df() is deprecated
  left_join(iris.models, by = "Species") %>%   # attach each species' model
  rowwise() %>%
  mutate(Sepal.Length_pred = predict(mod,
                                     newdata = list("Sepal.Width" = Sepal.Width)))

Alternatively, you can do it in one step if you create a predicting function:

m <- function(df) {
  mod <- lm(Sepal.Length ~ Sepal.Width, data = df)
  pred <- predict(mod,newdata = df["Sepal.Width"])
  data.frame(df,pred)
}

iris %>%
  group_by(Species) %>%
  do(m(.))
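
For comparison, the same fit-and-predict-per-group idea can be written without dplyr. This is only a sketch of one possible base R equivalent, using split() and lapply(); the names `pred_list` and `iris_pred` are made up for illustration:

```r
# Fit one model per species and attach its predictions to that species' rows.
pred_list <- lapply(split(iris, iris$Species), function(df) {
  mod <- lm(Sepal.Length ~ Sepal.Width, data = df)
  data.frame(df, pred = predict(mod, newdata = df["Sepal.Width"]))
})

# Stack the per-species data frames back into one data frame.
iris_pred <- do.call(rbind, pred_list)
head(iris_pred)
```

Each row's `pred` comes from the model fitted on its own species only, which is the property the grouped versions above rely on.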

lm() looped over factor variable while dropping single-level factor variables from the model

Your model's formula is conditional on whether or not there are enough levels in each independent variable to be included.

You can create a formula based on these conditions (e.g., using ifelse()) and then feed the formula to the model inside lapply().

Here is a solution:

lapply(unique(df$location), function(z) {
    sub_df = dplyr::filter(df, location == z) # subset by location
    form_x4 = ifelse(length(unique(sub_df$x4)) > 1, "+ x4", "")
    form_x5 = ifelse(length(unique(sub_df$x5)) > 1, "+ x5", "")
    form = as.formula(paste("y ~ x1 + x2 + x3", form_x4, form_x5))
    return(lm(data = sub_df, formula = form))
})

The form inside the above lapply(...) combines the consistent part of the lm() formula with multiple variables that meet the conditions to be used in the formula. If a variable only has a single level, the ifelse() statement allows you to treat it as if it's not there when putting it in the formula.
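
The formula-building step can also be tried on its own. Below is a sketch that assumes a hypothetical helper `build_formula()` and a made-up one-location data frame `sub_df`; it uses reformulate() from the stats package instead of paste() plus as.formula(), which is a swapped-in alternative and not the approach shown above:

```r
# Hypothetical helper: keep an optional predictor only if it varies in d.
build_formula <- function(d, always = c("x1", "x2", "x3"),
                          optional = c("x4", "x5")) {
  varies <- vapply(optional,
                   function(v) length(unique(d[[v]])) > 1,
                   logical(1))
  reformulate(c(always, optional[varies]), response = "y")
}

# Toy data for one location: x4 has a single level, x5 has two.
sub_df <- data.frame(
  y  = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8),
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, 1, 4, 3, 6, 5),
  x3 = c(0, 1, 0, 1, 1, 0),
  x4 = factor(rep("a", 6)),
  x5 = factor(c("u", "v", "u", "v", "u", "v"))
)

form <- build_formula(sub_df)
form_text <- deparse(form)   # text version of the formula, for inspection
```

In this toy case x4 is dropped because it has only one level, while x5 is kept. The resulting formula can then be passed to lm() in the same way as `form` in the lapply() call.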
