Create and Call Linear Models from List

How to put a column of list into the linear regression model parameter in R?

Your data frame looks like something written or converted from python. Might be better of working with that.

Essentially amenities is a list, if I run your code:

class(train$sorted_amenities)
[1] "list"

And host_verifications is a character which is useless to you.

We need to onehot encode sorted_amenities, using strsplit like you did:

#create the list
lst = strsplit(as.character(train$amenities),"[ ]*[,][ ]*")
#collect all your possible values
allitems = unique(unlist(lst))
amenities_counts = t(sapply(lst,function(i)as.numeric(allitems %in% i)))
colnames(amenities_counts) = allitems
head(amenities_counts[,1:4])

TV Wifi Air conditioning Kitchen
[1,] 1 1 1 1
[2,] 1 1 0 1
[3,] 1 1 1 1

Then we deal with your host_verifications :

lst = strsplit(gsub("[^a-z,_]*","",train$host_verifications),",")
#collect all your possible values
allitems = unique(unlist(lst))
veri_counts = t(sapply(lst,function(i)as.numeric(allitems %in% i)))
colnames(veri_counts) = allitems
head(veri_counts)

phone email facebook reviews jumio government_id work_email
[1,] 1 0 0 0 0 0 0
[2,] 1 1 1 1 1 1 1
[3,] 1 1 0 1 1 1 0

Now make a final data.frame (I cannot see price in the data you provided):

final = data.frame(host_name = train$host_name,
amenities_counts,veri_counts)

You can run your regression on those after removing columns with say only one '1' etc..

Run linear models by group over list of variables in R

This is how I would do it. Note this is untested as I haven't installed relaimpo. I'm really just re-packaging your code.

The general method is
1. develop a function that works on one group
2. use split to divide your data into groups
3. use lapply to apply the function to each group
4. (if needed) combine the results together

First test on a one-site subset of data:

The only changes I made are (a) to pull out a subset of data for one site and name it one_site. (b) to use one_site in your modeling code. (c) I prefer pasting a formula together as a string to using substitute, so I made that change. (d) White space and formatting for readability (mostly using RStudio's "reformat code").

## set up
varlist <- names(d)[4:9]
library(relaimpo)
sumfun <- function(x) {
c(
coef(x),
summary(x)$adj.r.squared,
sqrt(mean(resid(x) ^ 2, na.rm = TRUE)),
calc.relimp(x, type = "betasq")$betasq[1],
calc.relimp(x, type = "betasq")$betasq[2],
calc.relimp(x, type = "pratt")$pratt[1],
calc.relimp(x, type = "pratt")$pratt[2]
)
}

## Testing: this works for one_site
one_site <- subset(d, SiteName == "bp10")

models <- lapply(varlist, function(x) { # apply the modeling function to our list of air variables
form <- as.formula(sprintf("DMWT ~ DMAT + %s", x))
lm(form, data = one_site) # linear model with air variable substituted
})

## desired result
mod.df <- as.data.frame(t(sapply(models, sumfun)))

Turn it into a function

Once you have code that works for a single site, we turn it into a function. The only inputs seem to be the data for one site and the variables in varlist. Instead of assigning the result at the bottom, we return it:

fit_one_site = function(one_site, varlist) {
models <- lapply(varlist, function(x) {
# apply the modeling function to our list of air variables
form = as.formula(sprintf("DMWT ~ DMAT + %s", x))
lm(form, data = one_site) # linear model with air variable substituted
})
return(as.data.frame(t(sapply(models, sumfun))))
}

Now we can use split to split your data up by SiteName, and lapply to apply the fit_one_site function to each piece.

results = lapply(split(d, d$SiteName), FUN = fit_one_site, varlist = names(d)[4:9])

The results should be list of data frames, one for each site. If you want to combine them into one data frame, see the relevant part of my answer at the list of data frames R-FAQ.

R: How do you loop an linear model over a list of data frames?

Let's step through a different way to imagine doing this kind of thing.

First, know that for loops have their place, and when done properly they can be just as fast as an *apply function. Though your use of the loop is syntactically correct, there are different ways to use it that may make more sense. You are trying to run a series of commands or a function on multiple elements of a list. Imagine this simple plan: for each element in the list, take the first element and then double and square it:

invec <- list(c(21,22),c(23,24),c(25,26))
str(invec)
# List of 3
# $ : num [1:2] 21 22
# $ : num [1:2] 23 24
# $ : num [1:2] 25 26
outvec <- replicate(length(invec), NULL) # preallocate same size
for (i in seq_along(invec)) {
outvec[[i]] <- c(2*invec[[i]][1], invec[[i]][1]^2)
}
str(outvec)
# List of 3
# $ : num [1:2] 42 441
# $ : num [1:2] 46 529
# $ : num [1:2] 50 625

Seems simple enough. Now let's see how to do this same thing with an *apply function:

invec <- list(c(21,22),c(23,24),c(25,26))
outvec <- lapply(invec, function(a) c(2*a[1], a[1]^2))
str(outvec)
# List of 3
# $ : num [1:2] 42 441
# $ : num [1:2] 46 529
# $ : num [1:2] 50 625

The way to read the apply function is "take the vector invec, and call this function on each element, capturing the results into a list names outvec". The function can be "anonymous" (like it is here), or it can be a "named" function, such as

lapply(invec, max)
# [[1]]
# [1] 22
#
# [[2]]
# [1] 24
#
# [[3]]
# [1] 26

So how does this help your sampling problem? Let me diverge for another second.

Are you aware than you can index a vector and list arbitrarily? For instance:

str(invec[c(1,3,2,3,2,3)])
# List of 6
# $ : num [1:2] 21 22
# $ : num [1:2] 25 26
# $ : num [1:2] 23 24
# $ : num [1:2] 25 26
# $ : num [1:2] 23 24
# $ : num [1:2] 25 26

There are dupes, okay. Let's say we want to grab 1000 random samples from this very short list:

set.seed(3)
ind <- sample(length(invec), size=1000, replace=TRUE)
str(outvec[1:4])
# List of 4
# $ : num 42
# $ : num 46
# $ : num 50
# $ : num 46
outvec <- lapply(invec[ind], function(a) 2*a[1])
str(outvec[1:4])
# List of 4
# $ : num 42
# $ : num 50
# $ : num 46
# $ : num 42

Okay, so we've sampled the original list 1000 times and done our processing of it (2*a[1]), and stored the results.

So let's apply this to your scenario. Since your data is sight-unseen, I'll make up some.

set.seed(2)
n <- 20
lst <- lapply(1:185, function(ign) data.frame(x=sample(100,size=n), y=sample(100,size=n)))

str(lst[1:2])
# List of 2
# $ :'data.frame': 20 obs. of 2 variables:
# ..$ x: int [1:20] 19 70 57 17 91 90 13 78 44 51 ...
# ..$ y: int [1:20] 67 39 83 15 34 47 97 96 89 13 ...
# $ :'data.frame': 20 obs. of 2 variables:
# ..$ x: int [1:20] 99 30 12 16 91 76 92 33 47 74 ...
# ..$ y: int [1:20] 78 88 62 26 83 42 37 43 21 7 ...

Now I have a list of 185 data.frames, each with the same two variables x and y. Let's apply your question to this data. Oh, and randomness can be time-consuming. (BTW: it is much faster to get 1000 random numbers once then 1 random number 1000 times.)

ind <- sample(185, size=1000, replace=TRUE)

Now, lst[ind] will be a list, 1000 elements long, each a random selection from the original list.

lms <- lapply(lst[ind], function(a) lm(y~x, data=a))

(The lm part can be whatever you need, as long as it is the same regression applied to all elements. The code in the function can be as long as you need, so perhaps think of it this way:

lms <- lapply(lst[ind], function(a) {
z <- lm(y~x, data=a)
return(z)
})

Does that make sense?) Okay, let's look at some of the output:

summary(lms[[1]])
# Call:
# lm(formula = y ~ x, data = a)
# Residuals:
# Min 1Q Median 3Q Max
# -53.944 -13.463 -1.239 15.473 44.430
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 80.4523 15.3577 5.239 5.56e-05 ***
# x -0.4217 0.2499 -1.687 0.109
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residual standard error: 25.23 on 18 degrees of freedom
# Multiple R-squared: 0.1366, Adjusted R-squared: 0.0886
# F-statistic: 2.847 on 1 and 18 DF, p-value: 0.1088
summary(lms[[2]])
# Call:
# lm(formula = y ~ x, data = a)
# Residuals:
# Min 1Q Median 3Q Max
# -55.108 -20.653 -0.465 18.827 42.747
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 60.7651 12.2366 4.966 1e-04 ***
# x -0.1898 0.2060 -0.922 0.369
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Residual standard error: 27.37 on 18 degrees of freedom
# Multiple R-squared: 0.04506, Adjusted R-squared: -0.007996
# F-statistic: 0.8493 on 1 and 18 DF, p-value: 0.3689

"But I don't need the whole model, I just want the coefficients!" Sure, you're right. When you know you only need one thing, you can obviously just cut-to-the-chase and get that directly (such as coef(lm(y~x,data=a))). So, instead of me re-running the 1000 regressions of random samples, I can just do another lapply:

coefs <- lapply(lms[1:3], coef)
str(coefs[1:3])
# List of 3
# $ : Named num [1:2] 80.452 -0.422
# ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
# $ : Named num [1:2] 60.77 -0.19
# ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
# $ : Named num [1:2] 53.716 -0.189
# ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"

In this case, I actually have a couple of options. I can either stick with this and "rbind" (row-bind) them together, with

head(do.call(rbind, coefs))
# (Intercept) x
# [1,] 80.45230 -0.42173749
# [2,] 60.76507 -0.18979726
# [3,] 53.71643 -0.18883933
# [4,] 49.51803 0.01494021
# [5,] 49.51803 0.01494021
# [6,] 68.25463 -0.25840920

Or I could have used a "simple-apply" earlier that (optionally, but default yes) simplifies the results for you into a matrix or vector. If any of the returned values are of a different size than the others, it will always return a list. (Because of this, it might be more programmatically defensible to not simplify it, do some sanity checks, and then rbind them.)

coefs2 <- t(sapply(lms, coef))
head(coefs2)
# (Intercept) x
# [1,] 80.45230 -0.42173749
# [2,] 60.76507 -0.18979726
# [3,] 53.71643 -0.18883933
# [4,] 49.51803 0.01494021
# [5,] 49.51803 0.01494021
# [6,] 68.25463 -0.25840920

Notice that I had to transpose the output: it's a little kooky and counter-intuitive in that the output (without t(...)) will have 2 rows (one for each regression coefficient) and 1000 columns. So we transpose it, since I for one naturally think of it as row-per-model. This is not required if you can handle it as column-per-model.

So bottom line, your for loop is not syntactically wrong per se, but if you think about doing ONE thing to a vector/list of MANY things in this fashion, you will get significant speed improvements (in this case) and, arguably, once you understand it, much more readable code.

Extracting model coefficients from a nested list (list-columns)

Based on the description, if we need to create some columns with the coefficients, one option is using do with group_by. After grouping by 'id', extract the coefficients as a list within do and rename the columns (if needed)

library(tidyverse)
d1 %>%
group_by(id) %>%
do(data.frame(., as.list(coef(lm(val3 ~ val1 + val2, data = .))))) %>%
rename_at(5:7, ~c("intc", "coef1", "coef2"))
# A tibble: 15 x 7
# Groups: id [3]
# id val1 val2 val3 intc coef1 coef2
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Name1 6.76 2.64 9.85 9.03 -0.0710 0.0212
# 2 Name1 6.78 1.52 6.94 9.03 -0.0710 0.0212
# 3 Name1 4.14 4.31 8.30 9.03 -0.0710 0.0212
# 4 Name1 5.55 2.16 9.97 9.03 -0.0710 0.0212
# 5 Name1 6.18 3.64 8.32 9.03 -0.0710 0.0212
# 6 Name2 6.08 1.12 9.96 7.33 0.468 -0.488
# 7 Name2 3.54 4.71 6.43 7.33 0.468 -0.488
# 8 Name2 5.66 4.54 8.69 7.33 0.468 -0.488
# 9 Name2 6.88 4.15 7.79 7.33 0.468 -0.488
#10 Name2 4.89 1.27 8.72 7.33 0.468 -0.488
#11 Name3 6.41 4.38 6.22 20.1 -2.15 0.118
#12 Name3 5.06 3.28 9.42 20.1 -2.15 0.118
#13 Name3 6.25 3.16 8.15 20.1 -2.15 0.118
#14 Name3 6.03 3.63 7.78 20.1 -2.15 0.118
#15 Name3 6.51 1.46 5.90 20.1 -2.15 0.118

Or we can use broom package functions. Functions like tidy, glance extract columns of interest (and more). After nesting the grouped dataset, build the model, extract the components with tidy, select the columns of interest and unnest

library(broom)
d1 %>%
group_by(id) %>%
nest %>%
mutate(model1 = map(data, ~
lm(val3 ~ val1 + val2, data = .) %>%
tidy %>%
dplyr::select(term, estimate) %>%
spread(term, estimate) %>%
rename_all(~ c("intc", paste0("coef", 1:2))))) %>%
unnest(model1) %>%
unnest(data)
# A tibble: 15 x 7
# id intc coef1 coef2 val1 val2 val3
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Name1 9.03 -0.0710 0.0212 6.76 2.64 9.85
# 2 Name1 9.03 -0.0710 0.0212 6.78 1.52 6.94
# 3 Name1 9.03 -0.0710 0.0212 4.14 4.31 8.30
# 4 Name1 9.03 -0.0710 0.0212 5.55 2.16 9.97
# 5 Name1 9.03 -0.0710 0.0212 6.18 3.64 8.32
# 6 Name2 7.33 0.468 -0.488 6.08 1.12 9.96
# 7 Name2 7.33 0.468 -0.488 3.54 4.71 6.43
# 8 Name2 7.33 0.468 -0.488 5.66 4.54 8.69
# 9 Name2 7.33 0.468 -0.488 6.88 4.15 7.79
#10 Name2 7.33 0.468 -0.488 4.89 1.27 8.72
#11 Name3 20.1 -2.15 0.118 6.41 4.38 6.22
#12 Name3 20.1 -2.15 0.118 5.06 3.28 9.42
#13 Name3 20.1 -2.15 0.118 6.25 3.16 8.15
#14 Name3 20.1 -2.15 0.118 6.03 3.63 7.78
#15 Name3 20.1 -2.15 0.118 6.51 1.46 5.90

Or without using broom

d1 %>%
group_by(id) %>%
nest %>%
mutate(model = map(data, ~ lm(val3 ~ val1 + val2, data = .) %>%
coef %>%
as.list %>%
as_tibble)) %>%
unnest(model) %>%
unnest(data)

If we only need a summarised output for each 'id'

d1 %>% 
group_by(id) %>%
nest %>%
mutate(model1 = map(data, ~
lm(val3 ~ val1 + val2, data = .) %>%
tidy %>%
dplyr::select(term, estimate) %>%
spread(term, estimate) %>%
rename_all(~ c("intc", paste0("coef", 1:2))))) %>%
dplyr::select(-data) %>%
unnest

Or with data.table, we can do this more concisely, after grouping by 'id' and assigning (:=) the new columns by extracting the coef of the model as a list

library(data.table)
setDT(d1)[, c('intc', 'coef1', 'coef2') :=
as.list(coef(lm(val3 ~ val1 + val2))), id]
d1[order(id)]
# id val1 val2 val3 intc coef1 coef2
# 1: Name1 6.755964 2.642874 9.849828 9.032783 -0.07096932 0.02119088
# 2: Name1 6.776666 1.522431 6.937053 9.032783 -0.07096932 0.02119088
# 3: Name1 4.141883 4.307537 8.301940 9.032783 -0.07096932 0.02119088
# 4: Name1 5.551850 2.163882 9.971588 9.032783 -0.07096932 0.02119088
# 5: Name1 6.179506 3.635832 8.319042 9.032783 -0.07096932 0.02119088
# 6: Name2 6.083243 1.116293 9.960934 7.325156 0.46840770 -0.48806159
# 7: Name2 3.536476 4.708967 6.427627 7.325156 0.46840770 -0.48806159
# 8: Name2 5.663909 4.541081 8.691523 7.325156 0.46840770 -0.48806159
# 9: Name2 6.883746 4.150780 7.791050 7.325156 0.46840770 -0.48806159
#10: Name2 4.890291 1.269559 8.723792 7.325156 0.46840770 -0.48806159
#11: Name3 6.414915 4.383609 6.220188 20.106581 -2.14601530 0.11770877
#12: Name3 5.059774 3.276510 9.421862 20.106581 -2.14601530 0.11770877
#13: Name3 6.251416 3.157157 8.147720 20.106581 -2.14601530 0.11770877
#14: Name3 6.028100 3.630858 7.783118 20.106581 -2.14601530 0.11770877
#15: Name3 6.505153 1.460564 5.895564 20.106581 -2.14601530 0.11770877

Or do a join if we don't want to update the initial data

setDT(d1)[, as.list(coef(lm(val3 ~ val1 + val2))), id][d1, on = .(id)]

NOTE: 'd1' is the initial dataset



Related Topics



Leave a reply



Submit