Split on Factor, Sapply, and Lm

split on factor, sapply, and lm

By the sounds of it, this might be what you're trying to do:

sapply(s, FUN= function(x)
lm(x[ ,"dep"] ~ x[,"ind"])$coefficients[c(1, 2)])
# a b c
# (Intercept) 0.71379430 -0.6817331 0.5717372
# x[, "ind"] 0.07125591 1.1452096 -1.0303726

Other alternatives, if this is what you're looking for

I've seen it noted that in general, if you're splitting and then using s/lapply, you can usually just jump straight to by and skip the split step:

do.call(rbind, 
by(data = df, INDICES=df$subj, FUN=function(x)
lm(x[, "dep"] ~ x[, "ind"])$coefficients[c(1, 2)]))
# (Intercept) x[, "ind"]
# a 0.7137943 0.07125591
# b -0.6817331 1.14520962
# c 0.5717372 -1.03037257

Or, you can use one of the packages that lets you do such sorts of calculations more conveniently, like "data.table":

library(data.table)
DT <- data.table(df)
DT[, list(Int = lm(dep ~ ind)$coefficients[1],
Slo = lm(dep ~ ind)$coefficients[2]), by = subj]
# subj Int Slo
# 1: a 0.7137943 0.07125591
# 2: b -0.6817331 1.14520962
# 3: c 0.5717372 -1.03037257

Split, apply linear model, combine

Try using lmList from nlme

library(nlme)
fits <- lmList(Par1 ~ Par2 | Hour, data=df)

This splits the data by hour and fits linear models for you. Then, you can just do summary(fits). Or, you can look at each summary individually with

lapply(fits, summary)

Splitting by two categories in R

The canonical way in R is to use split:

L <- split(df, df[,c("strata","category")])
L
# $`1.A`
# y x strata category
# 1 -1.120867 1 1 A
# $`2.A`
# y x strata category
# 4 -1.023001 4 2 A
# $`3.A`
# y x strata category
# 7 0.5411806 7 3 A
# $`4.A`
# y x strata category
# 10 1.546789 10 4 A
# $`1.B`
# y x strata category
# 2 0.6730641 2 1 B
# $`2.B`
# y x strata category
# 5 -1.466816 5 2 B
# $`3.B`
# y x strata category
# 8 -0.1955617 8 3 B
# $`4.B`
# y x strata category
# 11 -0.660904 11 4 B
# $`1.C`
# y x strata category
# 3 -0.9880206 3 1 C
# $`2.C`
# y x strata category
# 6 0.4111802 6 2 C
# $`3.C`
# y x strata category
# 9 -0.03311637 9 3 C
# $`4.C`
# y x strata category
# 12 0.6799109 12 4 C

The names of the 12-element list (here) are the string-concatenation of the two categorical variables, .-delimited; this can easily be overridden (manually).

From here, to do regression on every element, you'd likely do something like:

models <- lapply(L, function(x) lm(..., data=x))

(or whichever regression tool you are planning to use).

You can do this in one step if you'd like,

results <- by(df, df[,c("strata","category")], function(x) lm(..., data=x))

The benefit is that it does it in one step. The by return can look a bit odd, but it is really just a list with some special print.by methods being used; you can still reference it just like a list as needed.

Another way to do this in dplyr:

library(dplyr)
results <- df %>%
group_by(strata, category) %>%
summarize(model = list(lm(y ~ x)))
results
# # A tibble: 12 x 3
# # Groups: strata [4]
# strata category model
# <int> <chr> <list>
# 1 1 A <lm>
# 2 1 B <lm>
# 3 1 C <lm>
# 4 2 A <lm>
# 5 2 B <lm>
# 6 2 C <lm>
# 7 3 A <lm>
# 8 3 B <lm>
# 9 3 C <lm>
# 10 4 A <lm>
# 11 4 B <lm>
# 12 4 C <lm>
results$model[[1]]
# Call:
# lm(formula = y ~ x)
# Coefficients:
# (Intercept) x
# -1.121 NA

As pointed out by Onyambu (thank you!), this works well (without data=) because we are explicitly listing the variables, and they will be found. If your regression uses ., for example, you may want to formalize it a little with

results <- df %>%
group_by(strata, category) %>%
summarize(model = list(lm(y ~ ., data = cur_data())))

y~x will work without it, but y~. will not, ergo data=cur_data().

R sapply is.factor

Correct way:

nonFactorCols <- sapply(df1, function(col) !is.factor(col))
# or, more efficiently
nonFactorCols <- !sapply(df1, is.factor)
# or, even more efficiently
nonFactorCols <- !factorCols

Sapply with LM returns a Call function that Stargazer can't use. How do I change that?

stargazer has issues while displaying the output for list of models. A "hack" would be to get data in long format before creating the model.

Since there is no data shared here's a way using mtcars dataset keeping only the first 3 columns from it. In place of your x column I am using disp here.

library(stargazer)

df <- mtcars[1:3]
df1 <- tidyr::pivot_longer(df, cols = -disp)
list_df <- split(df1, df1$name)

lm_model_list <- lapply(list_df, function(x) lm(disp~value, x))

Output -

#For one model
stargazer(lm_model_list$cyl, type = 'text')

===============================================
Dependent variable:
---------------------------
disp
-----------------------------------------------
value 62.599***
(5.469)

Constant -156.609***
(35.181)

-----------------------------------------------
Observations 32
R2 0.814
Adjusted R2 0.807
Residual Std. Error 54.385 (df = 30)
F Statistic 130.999*** (df = 1; 30)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01

#For list of models
stargazer(lm_model_list, type = 'text')

==========================================================
Dependent variable:
----------------------------
disp
(1) (2)
----------------------------------------------------------
value 62.599*** -17.429***
(5.469) (1.993)

Constant -156.609*** 580.884***
(35.181) (41.740)

----------------------------------------------------------
Observations 32 32
R2 0.814 0.718
Adjusted R2 0.807 0.709
Residual Std. Error (df = 30) 54.385 66.863
F Statistic (df = 1; 30) 130.999*** 76.513***
==========================================================
Note: *p<0.1; **p<0.05; ***p<0.01

Error using sapply and split to have different pvalues and r^2's on a facet wrap ggplot

Here is an example of taking what you have and organizing the results into a data.frame that contains all the variables you need for plotting. In particular the faceting variables must be present in the dataset.

First you can put the labels and names of each group (combo of sex and day) into a data.frame as columns. You will want to add a columns for the position each of each equation, using the names of the original x and y variables.

lab_dat = data.frame(group = names(lm_equation(tips)),
tip = min(tips$tip),
total_bill = max(tips$total_bill - 10),
label = lm_equation(tips))
lab_dat

group tip total_bill label
Female.Fri Female.Fri 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.72" * "," ~ italic(p) ~ "=" ~ "0.029"
Male.Fri Male.Fri 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.92" * "," ~ italic(p) ~ "=" ~ "0.00017"
Female.Sat Female.Sat 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.50" * "," ~ italic(p) ~ "=" ~ "0.0071"
Male.Sat Male.Sat 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.77" * "," ~ italic(p) ~ "=" ~ "1.4e-12"
Female.Sun Female.Sun 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.74" * "," ~ italic(p) ~ "=" ~ "0.00041"
Male.Sun Male.Sun 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.46" * "," ~ italic(p) ~ "=" ~ "0.00032"
Female.Thur Female.Thur 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.87" * "," ~ italic(p) ~ "=" ~ "9.4e-11"
Male.Thur Male.Thur 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.76" * "," ~ italic(p) ~ "=" ~ "1e-06"

Then you need to take the group variable, which has combined sex and day, and split it back into two separate variables. I use separate() from package tidyr for this. The new variables should be named the same as the variables in the original dataset, since these are the faceting variables and need to be present in a dataset used for any plotting layer.

library(tidyr)
lab_dat = separate(lab_dat, group, c("sex", "day"))
lab_dat

sex day tip total_bill label
Female.Fri Female Fri 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.72" * "," ~ italic(p) ~ "=" ~ "0.029"
Male.Fri Male Fri 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.92" * "," ~ italic(p) ~ "=" ~ "0.00017"
Female.Sat Female Sat 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.50" * "," ~ italic(p) ~ "=" ~ "0.0071"
Male.Sat Male Sat 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.77" * "," ~ italic(p) ~ "=" ~ "1.4e-12"
Female.Sun Female Sun 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.74" * "," ~ italic(p) ~ "=" ~ "0.00041"
Male.Sun Male Sun 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.46" * "," ~ italic(p) ~ "=" ~ "0.00032"
Female.Thur Female Thur 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.87" * "," ~ italic(p) ~ "=" ~ "9.4e-11"
Male.Thur Male Thur 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.76" * "," ~ italic(p) ~ "=" ~ "1e-06"

Now you can plot one label per facet, using lab_dat for the geom_text() layer.

ggplot(tips, aes(tip, total_bill)) +
geom_point(size = 5, alpha = 0.9)+
geom_smooth(method=lm, colour = "#000000", se = FALSE)+
facet_grid(sex ~ day, scales = "free")+
geom_text(data = lab_dat, aes(label = label), parse = TRUE,
vjust = "inward", hjust = "inward")

Linear Regression and group by in R

Here's one way using the lme4 package.

 library(lme4)
d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
year=rep(1:10, 2),
response=c(rnorm(10), rnorm(10)))

xyplot(response ~ year, groups=state, data=d, type='l')

fits <- lmList(response ~ year | state, data=d)
fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
(Intercept) year
CA -1.34420990 0.17139963
NY 0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316

regression on subsets for unique factor combinations using lm

You could use the plyr package:

require(plyr)
list_reg <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2), function(df)
{lm(Shear~Gap+Clearance+Void,data=df)})
#We have indeed five different results
length(list_reg)
#That's how you check out one particular regression, in this case the first
summary(list_reg[[1]])

The function dlply takes a data.frame (that's what the d... stands for), in your case df1, and returns a list (that's what the .l... stands for), in your case consisting of five elements, each containing the results of one regression.

Internally, your df1 is split up into five sub-data.frames according to the columns specified by .(Surface, Supplier, ParticleSize, T1, T2) and the function lm(Shear~Gap+Clearance+Void,data=df) is applied to every of these sub-data.frames.

To get a better feeling of what dlply really does, just call

list_sub_df <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2))

and you can look at each sub-data.frame on which the lm will be applied to.

And just a general note at the end: The paper by the package author Hadley Wickham is really great: even if you won't end up using his package, it is still really good to get a feeling about the split-apply-combine approach.

EDIT:

I just did a quick search and as expected, this was already explained better before, so also make sure to read this SO post.

EDIT2:

If you want to use the column numbers directly, try this (taken from this SO post):

 list_reg <- dlply(df1, names(df1[, 1:5]), function(df) 
{lm(Shear~Gap+Clearance+Void,data=df)})

Apply lm to subset of data frame defined by a third column of the frame

How about

library(nlme) ## OR library(lme4)
lmList(x~y|ID,data=d)

?



Related Topics



Leave a reply



Submit