Why is using update on a lm inside a grouped data.table losing its model data?
This is not an answer, but is too long for a comment
The .Environment for the terms component is identical for each resulting model:
e1 <- attr(fit[['V1']][[1]]$terms, '.Environment')
e2 <- attr(fit[['V1']][[2]]$terms, '.Environment')
e3 <- attr(fit[['V1']][[3]]$terms, '.Environment')
identical(e1,e2)
## TRUE
identical(e2, e3)
## TRUE
It appears that data.table is using the same bit of memory (my non-technical term) for each evaluation of j by group (which is efficient). However, when update is called, it refits the model using that shared environment, which by then contains the values from the last group.
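The effect can be imitated without data.table. The sketch below is only an analogy, not a description of data.table's internals: it assumes a single environment (hypothetically named `shared`) whose contents are overwritten for every group, fits one model per group from it, and then calls update on each fit.

```r
# Analogy only: one environment reused for every group (`shared` is a made-up name)
shared <- new.env()
groups <- split(iris, iris$Species)

fits <- lapply(groups, function(d) {
  list2env(d, envir = shared)        # overwrite the columns in place
  f <- Sepal.Length ~ Sepal.Width + Petal.Length
  environment(f) <- shared           # the formula looks its variables up in `shared`
  lm(f)
})

# The original fits were computed while `shared` held each group's data ...
sapply(fits, coef)

# ... but update() re-evaluates the call, and `shared` now holds only the last group
refits <- lapply(fits, function(m) update(m, ~ . - Sepal.Width))
sapply(refits, coef)
```

In this sketch every refit should come out the same, because all three models point at the same environment.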
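So, if you fudge this by giving each model its own copy of the group's data, it will work:
fit = DT[, {
  xx <- list2env(copy(.SD))                        # private copy of this group's data
  mymodel <- lm(Sepal.Length ~ Sepal.Width + Petal.Length)
  attr(mymodel$terms, '.Environment') <- xx        # update() will look here when refitting
  list(list(mymodel))
}, by = 'Species']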
lfit2 <- fit[, list(list(update(V1[[1]], ~.-Sepal.Width))), by = Species]
lfit2[,lapply(V1,nobs)]
##    V1 V2 V3
## 1: 41 39 42
# using your exact diagnostic coding.
lfit2[,nobs(V1[[1]]),by = Species]
##       Species V1
## 1:     setosa 41
## 2: versicolor 39
## 3:  virginica 42
Not a long-term solution, but at least a workaround.
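The idea behind the workaround (one private environment per model) can also be sketched in plain base R. This is an illustration that uses the built-in iris data and split() in place of data.table grouping, not a replacement for the code above.

```r
groups <- split(iris, iris$Species)

fits <- lapply(groups, function(d) {
  f <- Sepal.Length ~ Sepal.Width + Petal.Length
  environment(f) <- list2env(d)      # a new environment holding this group's own data
  lm(f)
})

# update() now refits each model against its own group's data
refits <- lapply(fits, function(m) update(m, ~ . - Sepal.Width))
sapply(refits, coef)
```

Because each formula carries its own environment, the refits no longer depend on which group happened to be processed last.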
Updating a list of models with a list of data in data.table
Is this what you want?
t1[, list(lapply(tsAll, ets, model = mod1[[1]])), by = group]$V1
I wrapped the result in a list so that each fitted model keeps its type instead of being coerced into a vector, and ran the operation by group because each group has its own model.
Breaking the for loop does not provide the correct output
It only prints 1 because the RI of your second model is larger than that of your third model: the if condition is satisfied on the 2nd iteration, so the loop breaks before 2 is printed. Try this instead:
for (i in seq_len(length(fit) - 1)) {   # stop one short so fit[[i + 1]] always exists
  if (RI(fit[[i]], iris[,5]) > RI(fit[[i+1]], iris[,5])) {
    print(i)
    break
  }
}
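To see the control flow in isolation, here is a minimal sketch in which a plain numeric vector (hypothetically named `scores`, with made-up values) stands in for the RI of each model. It only illustrates the print-then-break pattern and the loop bound.

```r
# Made-up values standing in for RI(fit[[i]], iris[, 5])
scores <- c(0.60, 0.75, 0.70, 0.80)

first_drop <- NA_integer_
for (i in seq_len(length(scores) - 1)) {   # last index is never used as i, only as i + 1
  if (scores[i] > scores[i + 1]) {
    first_drop <- i                        # remember the index before leaving the loop
    print(i)
    break
  }
}
```

Storing the index in a variable as well as printing it makes the result available after the loop ends.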
Fill missing values by group using linear regression in R
Since you already know how to do this for one dataframe with a single country, you are very close to your solution. But to make this easy on yourself, you need to do a few things.
Create a reproducible example using dput. The janitor library has the clean_names() function to fix column names.
Write your own interpolation function that takes a dataframe with one country as the input, and returns an interpolated dataframe for one country.
Use pivot_longer to gather all the data columns into one parameter column.
Use the dplyr function group_split to take your large multicountry dataframe and break it into a list of dataframes, one for each country and parameter.
Use the purrr function map to map each of the dataframes in the list to a new list of interpolated dataframes.
Use dplyr's bind_rows to convert the list of interpolated dataframes back into one dataframe, and pivot_wider to get your original data shape back.
library(tidyverse)
library(purrr)
library(janitor)
my_country_interpolater <- function(single_country_df) {
  data_to_build_model <- single_country_df %>%
    filter(!is.na(value)) %>%
    select(year, value)
  years_to_interpolate <- single_country_df %>%
    filter(is.na(value)) %>%
    select(year)
  fit <- lm(value ~ year, data = data_to_build_model)
  value <- predict(fit, years_to_interpolate)
  interpolated_data <- tibble(years_to_interpolate, value)
  single_country_interpolated_df <- bind_rows(data_to_build_model, interpolated_data) %>%
    mutate(country_code = single_country_df$country_code[1]) %>%
    mutate(parameter = single_country_df$parameter[1]) %>% # added this for the additional parameters
    select(country_code, year, parameter, value) %>%
    arrange(year)
  return(single_country_interpolated_df)
}
interpolated_df <- sampledata2 %>%
  clean_names() %>%
  pivot_longer(cols = c(3:5), names_to = "parameter", values_to = "value") %>%
  group_by(country_code, parameter) %>%
  group_split() %>%
  # map(preprocess_data) %>% if you need a preprocessing step
  map(my_country_interpolater) %>%
  bind_rows() %>%
  pivot_wider(names_from = parameter, values_from = value, names_glue = "{parameter}_interp")
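For readers who want to check the core step without the tidyverse, the sketch below does the fit-and-predict part in base R. The data frame `toy` and the helper `fill_with_lm` are hypothetical names with made-up values, not the asker's data, and the split/lapply/rbind chain is a swapped-in stand-in for group_split, map and bind_rows.

```r
# Hypothetical toy data: two countries, yearly values with gaps
toy <- data.frame(
  country_code = rep(c("A", "B"), each = 6),
  year         = rep(2000:2005, times = 2),
  value        = c(1, NA, 3, 4, NA, 6,  10, 12, NA, 16, 18, NA)
)

fill_with_lm <- function(d) {
  fit <- lm(value ~ year, data = d)   # NA rows are dropped under the default na.action
  missing <- is.na(d$value)
  d$value[missing] <- predict(fit, newdata = d[missing, , drop = FALSE])
  d
}

# One model per country, then stack the pieces back together
filled <- do.call(rbind, lapply(split(toy, toy$country_code), fill_with_lm))
filled
```

Note that a missing value at the end of a series (country B, 2005 here) is extrapolated by the fitted line rather than interpolated.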
Apply grouped model back onto data
Here is a dplyr method of obtaining a similar answer, following the approach used by @Mike.Gahan:
library(dplyr)
iris.models <- iris %>%
  group_by(Species) %>%
  do(mod = lm(Sepal.Length ~ Sepal.Width, data = .))
iris %>%
  as_tibble() %>%   # tbl_df() is deprecated in current dplyr
  left_join(iris.models, by = "Species") %>%
  rowwise() %>%
  mutate(Sepal.Length_pred = predict(mod,
                                     newdata = list("Sepal.Width" = Sepal.Width)))
Alternatively, you can do it in one step if you create a predicting function:
m <- function(df) {
  mod <- lm(Sepal.Length ~ Sepal.Width, data = df)
  pred <- predict(mod, newdata = df["Sepal.Width"])
  data.frame(df, pred)
}
iris %>%
  group_by(Species) %>%
  do(m(.))
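If dplyr is not available, the same per-group predictions can be sketched in base R with split and unsplit. This is a swapped-in alternative to the approach above, and the names `by_species`, `pred_list` and `iris_pred` are chosen only for illustration.

```r
by_species <- split(iris, iris$Species)

# One model per species; predict for that species' own rows
pred_list <- lapply(by_species, function(d) {
  mod <- lm(Sepal.Length ~ Sepal.Width, data = d)
  unname(predict(mod, newdata = d["Sepal.Width"]))
})

# unsplit() puts the predictions back in the original row order
iris_pred <- iris
iris_pred$Sepal.Length_pred <- unsplit(pred_list, iris$Species)
head(iris_pred)
```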
lm() looped over factor variable while dropping single-level factor variables from the model
Your model's formula is conditional on whether or not there are enough levels in each independent variable to be included.
You can create a formula based on these conditions (e.g., using ifelse()) and then feed the formula to the model inside lapply().
Here is a solution:
lapply(unique(df$location), function(z) {
  sub_df = dplyr::filter(df, location == z) # subset by location
  form_x4 = ifelse(length(unique(sub_df$x4)) > 1, "+ x4", "")
  form_x5 = ifelse(length(unique(sub_df$x5)) > 1, "+ x5", "")
  form = as.formula(paste("y ~ x1 + x2 + x3", form_x4, form_x5))
  return(lm(data = sub_df, formula = form))
})
The form inside the above lapply(...) combines the fixed part of the lm() formula with whichever optional variables meet the condition. If a variable has only a single level in a given subset, the ifelse() statement leaves it out of the formula for that subset.
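The same idea can be written without pasting strings by using reformulate() from the stats package. The sketch below is a base R variant on made-up data: the data frame `df`, its columns and the `candidates` vector are hypothetical, and it assumes at least one candidate varies in every subset (reformulate() fails on an empty set of terms).

```r
set.seed(1)
# Made-up data: x4 varies in location "a" but is constant in location "b"
df <- data.frame(
  location = rep(c("a", "b"), each = 20),
  y        = rnorm(40),
  x1       = rnorm(40),
  x4       = factor(c(rep(c("u", "v"), 10), rep("u", 20)))
)

fits <- lapply(split(df, df$location), function(sub_df) {
  candidates <- c("x1", "x4")
  # keep only predictors that take more than one value in this subset
  varies <- vapply(sub_df[candidates],
                   function(col) length(unique(col)) > 1, logical(1))
  lm(reformulate(candidates[varies], response = "y"), data = sub_df)
})

lapply(fits, formula)
```

Listing the candidate predictors in one vector means adding another optional variable only requires extending `candidates`.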