Split on Factor, Sapply, and Lm

split on factor, sapply, and lm

By the sounds of it, this might be what you're trying to do:

sapply(s, FUN= function(x)
  lm(x[ ,"dep"] ~ x[,"ind"])$coefficients[c(1, 2)])
#                      a          b          c
# (Intercept) 0.71379430 -0.6817331  0.5717372
# x[, "ind"]  0.07125591  1.1452096 -1.0303726

Other alternatives, if this is what you're looking for

I've seen it noted that in general, if you're splitting and then using s/lapply, you can usually just jump straight to by and skip the split step:

do.call(rbind, 
        by(data = df, INDICES=df$subj, FUN=function(x) 
          lm(x[, "dep"] ~ x[, "ind"])$coefficients[c(1, 2)]))
#   (Intercept)  x[, "ind"]
# a   0.7137943  0.07125591
# b  -0.6817331  1.14520962
# c   0.5717372 -1.03037257

Or, you can use one of the packages that lets you do such sorts of calculations more conveniently, like "data.table":

library(data.table)
DT <- data.table(df)
DT[, list(Int = lm(dep ~ ind)$coefficients[1],
          Slo = lm(dep ~ ind)$coefficients[2]), by = subj]
#    subj        Int         Slo
# 1:    a  0.7137943  0.07125591
# 2:    b -0.6817331  1.14520962
# 3:    c  0.5717372 -1.03037257

Split, apply linear model, combine

Try using lmList from nlme

library(nlme)
fits <- lmList(Par1 ~ Par2 | Hour, data=df)

This splits the data by hour and fits linear models for you. Then, you can just do summary(fits). Or, you can look at each summary individually with

lapply(fits, summary)

Splitting by two categories in R

The canonical way in R is to use split:

L <- split(df, df[,c("strata","category")])
L
# $`1.A`
#           y x strata category
# 1 -1.120867 1      1        A
# $`2.A`
#           y x strata category
# 4 -1.023001 4      2        A
# $`3.A`
#           y x strata category
# 7 0.5411806 7      3        A
# $`4.A`
#           y  x strata category
# 10 1.546789 10      4        A
# $`1.B`
#           y x strata category
# 2 0.6730641 2      1        B
# $`2.B`
#           y x strata category
# 5 -1.466816 5      2        B
# $`3.B`
#            y x strata category
# 8 -0.1955617 8      3        B
# $`4.B`
#            y  x strata category
# 11 -0.660904 11      4        B
# $`1.C`
#            y x strata category
# 3 -0.9880206 3      1        C
# $`2.C`
#           y x strata category
# 6 0.4111802 6      2        C
# $`3.C`
#             y x strata category
# 9 -0.03311637 9      3        C
# $`4.C`
#            y  x strata category
# 12 0.6799109 12      4        C

The names of the 12-element list (here) are the string-concatenation of the two categorical variables, .-delimited; this can easily be overridden (manually).

From here, to do regression on every element, you'd likely do something like:

models <- lapply(L, function(x) lm(..., data=x))

(or whichever regression tool you are planning to use).

You can do this in one step if you'd like,

results <- by(df, df[,c("strata","category")], function(x) lm(..., data=x))

The benefit is that it does it in one step. The by return can look a bit odd, but it is really just a list with some special print.by methods being used; you can still reference it just like a list as needed.

Another way to do this in dplyr:

library(dplyr)
results <- df %>%
  group_by(strata, category) %>%
  summarize(model = list(lm(y ~ x)))
results
# # A tibble: 12 x 3
# # Groups:   strata [4]
#    strata category model 
#     <int> <chr>    <list>
#  1      1 A        <lm>  
#  2      1 B        <lm>  
#  3      1 C        <lm>  
#  4      2 A        <lm>  
#  5      2 B        <lm>  
#  6      2 C        <lm>  
#  7      3 A        <lm>  
#  8      3 B        <lm>  
#  9      3 C        <lm>  
# 10      4 A        <lm>  
# 11      4 B        <lm>  
# 12      4 C        <lm>  
results$model[[1]]
# Call:
# lm(formula = y ~ x)
# Coefficients:
# (Intercept)            x  
#      -1.121           NA

As pointed out by Onyambu (thank you!), this works well (without data=) because we are explicitly listing the variables, and they will be found. If your regression uses ., for example, you may want to formalize it a little with

results <- df %>%
  group_by(strata, category) %>%
  summarize(model = list(lm(y ~ ., data = cur_data())))

y~x will work without it, but y~. will not, ergo data=cur_data().

R sapply is.factor

Correct way:

nonFactorCols <- sapply(df1, function(col) !is.factor(col))
# or, more efficiently
nonFactorCols <- !sapply(df1, is.factor)
# or, even more efficiently
nonFactorCols <- !factorCols

Sapply with LM returns a Call function that Stargazer can't use. How do I change that?

stargazer has issues while displaying the output for list of models. A "hack" would be to get data in long format before creating the model.

Since there is no data shared here's a way using mtcars dataset keeping only the first 3 columns from it. In place of your x column I am using disp here.

library(stargazer)

df <- mtcars[1:3]
df1 <- tidyr::pivot_longer(df, cols = -disp)
list_df <- split(df1, df1$name)

lm_model_list <- lapply(list_df, function(x) lm(disp~value, x))

Output -

#For one model
stargazer(lm_model_list$cyl, type = 'text')

===============================================
                        Dependent variable:    
                    ---------------------------
                               disp            
-----------------------------------------------
value                        62.599***         
                              (5.469)          
                                               
Constant                    -156.609***        
                             (35.181)          
                                               
-----------------------------------------------
Observations                    32             
R2                             0.814           
Adjusted R2                    0.807           
Residual Std. Error      54.385 (df = 30)      
F Statistic           130.999*** (df = 1; 30)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

#For list of models
stargazer(lm_model_list, type = 'text')

==========================================================
                                  Dependent variable:     
                              ----------------------------
                                          disp            
                                   (1)            (2)     
----------------------------------------------------------
value                           62.599***     -17.429***  
                                 (5.469)        (1.993)   
                                                          
Constant                       -156.609***    580.884***  
                                 (35.181)      (41.740)   
                                                          
----------------------------------------------------------
Observations                        32            32      
R2                                0.814          0.718    
Adjusted R2                       0.807          0.709    
Residual Std. Error (df = 30)     54.385        66.863    
F Statistic (df = 1; 30)        130.999***     76.513***  
==========================================================
Note:                          *p<0.1; **p<0.05; ***p<0.01

Error using sapply and split to have different pvalues and r^2's on a facet wrap ggplot

Here is an example of taking what you have and organizing the results into a data.frame that contains all the variables you need for plotting. In particular the faceting variables must be present in the dataset.

First you can put the labels and names of each group (combo of sex and day) into a data.frame as columns. You will want to add a columns for the position each of each equation, using the names of the original x and y variables.

lab_dat = data.frame(group = names(lm_equation(tips)),
                     tip = min(tips$tip),
                     total_bill = max(tips$total_bill - 10),
                     label = lm_equation(tips))
lab_dat

                  group tip total_bill                                                            label
Female.Fri   Female.Fri   1      40.81   ~~italic(r)^2 ~ "=" ~ "0.72" * "," ~ italic(p) ~ "=" ~ "0.029"
Male.Fri       Male.Fri   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.92" * "," ~ italic(p) ~ "=" ~ "0.00017"
Female.Sat   Female.Sat   1      40.81  ~~italic(r)^2 ~ "=" ~ "0.50" * "," ~ italic(p) ~ "=" ~ "0.0071"
Male.Sat       Male.Sat   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.77" * "," ~ italic(p) ~ "=" ~ "1.4e-12"
Female.Sun   Female.Sun   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.74" * "," ~ italic(p) ~ "=" ~ "0.00041"
Male.Sun       Male.Sun   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.46" * "," ~ italic(p) ~ "=" ~ "0.00032"
Female.Thur Female.Thur   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.87" * "," ~ italic(p) ~ "=" ~ "9.4e-11"
Male.Thur     Male.Thur   1      40.81   ~~italic(r)^2 ~ "=" ~ "0.76" * "," ~ italic(p) ~ "=" ~ "1e-06"

Then you need to take the group variable, which has combined sex and day, and split it back into two separate variables. I use separate() from package tidyr for this. The new variables should be named the same as the variables in the original dataset, since these are the faceting variables and need to be present in a dataset used for any plotting layer.

library(tidyr)
lab_dat = separate(lab_dat, group, c("sex", "day"))
lab_dat

               sex  day tip total_bill                                                            label
Female.Fri  Female  Fri   1      40.81   ~~italic(r)^2 ~ "=" ~ "0.72" * "," ~ italic(p) ~ "=" ~ "0.029"
Male.Fri      Male  Fri   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.92" * "," ~ italic(p) ~ "=" ~ "0.00017"
Female.Sat  Female  Sat   1      40.81  ~~italic(r)^2 ~ "=" ~ "0.50" * "," ~ italic(p) ~ "=" ~ "0.0071"
Male.Sat      Male  Sat   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.77" * "," ~ italic(p) ~ "=" ~ "1.4e-12"
Female.Sun  Female  Sun   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.74" * "," ~ italic(p) ~ "=" ~ "0.00041"
Male.Sun      Male  Sun   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.46" * "," ~ italic(p) ~ "=" ~ "0.00032"
Female.Thur Female Thur   1      40.81 ~~italic(r)^2 ~ "=" ~ "0.87" * "," ~ italic(p) ~ "=" ~ "9.4e-11"
Male.Thur     Male Thur   1      40.81   ~~italic(r)^2 ~ "=" ~ "0.76" * "," ~ italic(p) ~ "=" ~ "1e-06"

Now you can plot one label per facet, using lab_dat for the geom_text() layer.

ggplot(tips, aes(tip, total_bill)) +
     geom_point(size = 5, alpha = 0.9)+
     geom_smooth(method=lm, colour = "#000000", se = FALSE)+
     facet_grid(sex ~ day, scales = "free")+
     geom_text(data = lab_dat, aes(label = label), parse = TRUE, 
               vjust = "inward", hjust = "inward")

Linear Regression and group by in R

Here's one way using the lme4 package.

 library(lme4)
 d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                 year=rep(1:10, 2),
                 response=c(rnorm(10), rnorm(10)))

 xyplot(response ~ year, groups=state, data=d, type='l')

 fits <- lmList(response ~ year | state, data=d)
 fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316

regression on subsets for unique factor combinations using lm

You could use the plyr package:

require(plyr)
list_reg <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2), function(df) 
  {lm(Shear~Gap+Clearance+Void,data=df)})
#We have indeed five different results
length(list_reg)
#That's how you check out one particular regression, in this case the first
summary(list_reg[[1]])

The function dlply takes a data.frame (that's what the d... stands for), in your case df1, and returns a list (that's what the .l... stands for), in your case consisting of five elements, each containing the results of one regression.

Internally, your df1 is split up into five sub-data.frames according to the columns specified by .(Surface, Supplier, ParticleSize, T1, T2) and the function lm(Shear~Gap+Clearance+Void,data=df) is applied to every of these sub-data.frames.

To get a better feeling of what dlply really does, just call

list_sub_df <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2))

and you can look at each sub-data.frame on which the lm will be applied to.

And just a general note at the end: The paper by the package author Hadley Wickham is really great: even if you won't end up using his package, it is still really good to get a feeling about the split-apply-combine approach.

EDIT:

I just did a quick search and as expected, this was already explained better before, so also make sure to read this SO post.

EDIT2:

If you want to use the column numbers directly, try this (taken from this SO post):

 list_reg <- dlply(df1, names(df1[, 1:5]), function(df) 
      {lm(Shear~Gap+Clearance+Void,data=df)})

Apply lm to subset of data frame defined by a third column of the frame

How about

library(nlme) ## OR library(lme4)
lmList(x~y|ID,data=d)

Split on Factor, Sapply, and Lm