split on factor, sapply, and lm
By the sounds of it, this might be what you're trying to do:
sapply(s, FUN= function(x)
lm(x[ ,"dep"] ~ x[,"ind"])$coefficients[c(1, 2)])
# a b c
# (Intercept) 0.71379430 -0.6817331 0.5717372
# x[, "ind"] 0.07125591 1.1452096 -1.0303726
Other alternatives, if this is what you're looking for
I've seen it noted that in general, if you're splitting and then using s/lapply
, you can usually just jump straight to by
and skip the split
step:
do.call(rbind,
by(data = df, INDICES=df$subj, FUN=function(x)
lm(x[, "dep"] ~ x[, "ind"])$coefficients[c(1, 2)]))
# (Intercept) x[, "ind"]
# a 0.7137943 0.07125591
# b -0.6817331 1.14520962
# c 0.5717372 -1.03037257
Or, you can use one of the packages that lets you do such sorts of calculations more conveniently, like "data.table":
library(data.table)
DT <- data.table(df)
DT[, list(Int = lm(dep ~ ind)$coefficients[1],
Slo = lm(dep ~ ind)$coefficients[2]), by = subj]
# subj Int Slo
# 1: a 0.7137943 0.07125591
# 2: b -0.6817331 1.14520962
# 3: c 0.5717372 -1.03037257
Split, apply linear model, combine
Try using lmList
from nlme
library(nlme)
fits <- lmList(Par1 ~ Par2 | Hour, data=df)
This splits the data by hour and fits linear models for you. Then, you can just do summary(fits)
. Or, you can look at each summary individually with
lapply(fits, summary)
Splitting by two categories in R
The canonical way in R is to use split
:
L <- split(df, df[,c("strata","category")])
L
# $`1.A`
# y x strata category
# 1 -1.120867 1 1 A
# $`2.A`
# y x strata category
# 4 -1.023001 4 2 A
# $`3.A`
# y x strata category
# 7 0.5411806 7 3 A
# $`4.A`
# y x strata category
# 10 1.546789 10 4 A
# $`1.B`
# y x strata category
# 2 0.6730641 2 1 B
# $`2.B`
# y x strata category
# 5 -1.466816 5 2 B
# $`3.B`
# y x strata category
# 8 -0.1955617 8 3 B
# $`4.B`
# y x strata category
# 11 -0.660904 11 4 B
# $`1.C`
# y x strata category
# 3 -0.9880206 3 1 C
# $`2.C`
# y x strata category
# 6 0.4111802 6 2 C
# $`3.C`
# y x strata category
# 9 -0.03311637 9 3 C
# $`4.C`
# y x strata category
# 12 0.6799109 12 4 C
The names of the 12-element list (here) are the string-concatenation of the two categorical variables, .
-delimited; this can easily be overridden (manually).
From here, to do regression on every element, you'd likely do something like:
models <- lapply(L, function(x) lm(..., data=x))
(or whichever regression tool you are planning to use).
You can do this in one step if you'd like,
results <- by(df, df[,c("strata","category")], function(x) lm(..., data=x))
The benefit is that it does it in one step. The by
return can look a bit odd, but it is really just a list
with some special print.by
methods being used; you can still reference it just like a list as needed.
Another way to do this in dplyr
:
library(dplyr)
results <- df %>%
group_by(strata, category) %>%
summarize(model = list(lm(y ~ x)))
results
# # A tibble: 12 x 3
# # Groups: strata [4]
# strata category model
# <int> <chr> <list>
# 1 1 A <lm>
# 2 1 B <lm>
# 3 1 C <lm>
# 4 2 A <lm>
# 5 2 B <lm>
# 6 2 C <lm>
# 7 3 A <lm>
# 8 3 B <lm>
# 9 3 C <lm>
# 10 4 A <lm>
# 11 4 B <lm>
# 12 4 C <lm>
results$model[[1]]
# Call:
# lm(formula = y ~ x)
# Coefficients:
# (Intercept) x
# -1.121 NA
As pointed out by Onyambu (thank you!), this works well (without data=
) because we are explicitly listing the variables, and they will be found. If your regression uses .
, for example, you may want to formalize it a little with
results <- df %>%
group_by(strata, category) %>%
summarize(model = list(lm(y ~ ., data = cur_data())))
y~x
will work without it, but y~.
will not, ergo data=cur_data()
.
R sapply is.factor
Correct way:
nonFactorCols <- sapply(df1, function(col) !is.factor(col))
# or, more efficiently
nonFactorCols <- !sapply(df1, is.factor)
# or, even more efficiently
nonFactorCols <- !factorCols
Sapply with LM returns a Call function that Stargazer can't use. How do I change that?
stargazer
has issues while displaying the output for list of models. A "hack" would be to get data in long format before creating the model.
Since there is no data shared here's a way using mtcars
dataset keeping only the first 3 columns from it. In place of your x
column I am using disp
here.
library(stargazer)
df <- mtcars[1:3]
df1 <- tidyr::pivot_longer(df, cols = -disp)
list_df <- split(df1, df1$name)
lm_model_list <- lapply(list_df, function(x) lm(disp~value, x))
Output -
#For one model
stargazer(lm_model_list$cyl, type = 'text')
===============================================
Dependent variable:
---------------------------
disp
-----------------------------------------------
value 62.599***
(5.469)
Constant -156.609***
(35.181)
-----------------------------------------------
Observations 32
R2 0.814
Adjusted R2 0.807
Residual Std. Error 54.385 (df = 30)
F Statistic 130.999*** (df = 1; 30)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
#For list of models
stargazer(lm_model_list, type = 'text')
==========================================================
Dependent variable:
----------------------------
disp
(1) (2)
----------------------------------------------------------
value 62.599*** -17.429***
(5.469) (1.993)
Constant -156.609*** 580.884***
(35.181) (41.740)
----------------------------------------------------------
Observations 32 32
R2 0.814 0.718
Adjusted R2 0.807 0.709
Residual Std. Error (df = 30) 54.385 66.863
F Statistic (df = 1; 30) 130.999*** 76.513***
==========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Error using sapply and split to have different pvalues and r^2's on a facet wrap ggplot
Here is an example of taking what you have and organizing the results into a data.frame that contains all the variables you need for plotting. In particular the faceting variables must be present in the dataset.
First you can put the labels and names of each group (combo of sex
and day
) into a data.frame as columns. You will want to add a columns for the position each of each equation, using the names of the original x
and y
variables.
lab_dat = data.frame(group = names(lm_equation(tips)),
tip = min(tips$tip),
total_bill = max(tips$total_bill - 10),
label = lm_equation(tips))
lab_dat
group tip total_bill label
Female.Fri Female.Fri 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.72" * "," ~ italic(p) ~ "=" ~ "0.029"
Male.Fri Male.Fri 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.92" * "," ~ italic(p) ~ "=" ~ "0.00017"
Female.Sat Female.Sat 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.50" * "," ~ italic(p) ~ "=" ~ "0.0071"
Male.Sat Male.Sat 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.77" * "," ~ italic(p) ~ "=" ~ "1.4e-12"
Female.Sun Female.Sun 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.74" * "," ~ italic(p) ~ "=" ~ "0.00041"
Male.Sun Male.Sun 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.46" * "," ~ italic(p) ~ "=" ~ "0.00032"
Female.Thur Female.Thur 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.87" * "," ~ italic(p) ~ "=" ~ "9.4e-11"
Male.Thur Male.Thur 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.76" * "," ~ italic(p) ~ "=" ~ "1e-06"
Then you need to take the group
variable, which has combined sex
and day
, and split it back into two separate variables. I use separate()
from package tidyr for this. The new variables should be named the same as the variables in the original dataset, since these are the faceting variables and need to be present in a dataset used for any plotting layer.
library(tidyr)
lab_dat = separate(lab_dat, group, c("sex", "day"))
lab_dat
sex day tip total_bill label
Female.Fri Female Fri 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.72" * "," ~ italic(p) ~ "=" ~ "0.029"
Male.Fri Male Fri 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.92" * "," ~ italic(p) ~ "=" ~ "0.00017"
Female.Sat Female Sat 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.50" * "," ~ italic(p) ~ "=" ~ "0.0071"
Male.Sat Male Sat 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.77" * "," ~ italic(p) ~ "=" ~ "1.4e-12"
Female.Sun Female Sun 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.74" * "," ~ italic(p) ~ "=" ~ "0.00041"
Male.Sun Male Sun 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.46" * "," ~ italic(p) ~ "=" ~ "0.00032"
Female.Thur Female Thur 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.87" * "," ~ italic(p) ~ "=" ~ "9.4e-11"
Male.Thur Male Thur 1 40.81 ~~italic(r)^2 ~ "=" ~ "0.76" * "," ~ italic(p) ~ "=" ~ "1e-06"
Now you can plot one label per facet, using lab_dat
for the geom_text()
layer.
ggplot(tips, aes(tip, total_bill)) +
geom_point(size = 5, alpha = 0.9)+
geom_smooth(method=lm, colour = "#000000", se = FALSE)+
facet_grid(sex ~ day, scales = "free")+
geom_text(data = lab_dat, aes(label = label), parse = TRUE,
vjust = "inward", hjust = "inward")
Linear Regression and group by in R
Here's one way using the lme4
package.
library(lme4)
d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
year=rep(1:10, 2),
response=c(rnorm(10), rnorm(10)))
xyplot(response ~ year, groups=state, data=d, type='l')
fits <- lmList(response ~ year | state, data=d)
fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
(Intercept) year
CA -1.34420990 0.17139963
NY 0.00196176 -0.01852429
Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
regression on subsets for unique factor combinations using lm
You could use the plyr
package:
require(plyr)
list_reg <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2), function(df)
{lm(Shear~Gap+Clearance+Void,data=df)})
#We have indeed five different results
length(list_reg)
#That's how you check out one particular regression, in this case the first
summary(list_reg[[1]])
The function dlply
takes a data.frame
(that's what the d... stands for), in your case df1
, and returns a list (that's what the .l... stands for), in your case consisting of five elements, each containing the results of one regression.
Internally, your df1
is split up into five sub-data.frames according to the columns specified by .(Surface, Supplier, ParticleSize, T1, T2)
and the function lm(Shear~Gap+Clearance+Void,data=df)
is applied to every of these sub-data.frames.
To get a better feeling of what dlply
really does, just call
list_sub_df <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2))
and you can look at each sub-data.frame on which the lm
will be applied to.
And just a general note at the end: The paper by the package author Hadley Wickham is really great: even if you won't end up using his package, it is still really good to get a feeling about the split-apply-combine approach.
EDIT:
I just did a quick search and as expected, this was already explained better before, so also make sure to read this SO post.
EDIT2:
If you want to use the column numbers directly, try this (taken from this SO post):
list_reg <- dlply(df1, names(df1[, 1:5]), function(df)
{lm(Shear~Gap+Clearance+Void,data=df)})
Apply lm to subset of data frame defined by a third column of the frame
How about
library(nlme) ## OR library(lme4)
lmList(x~y|ID,data=d)
?
Related Topics
Shiny Leaflet Easyprint Plugin
Error When Mapping in Ggmap with API Key (403 Forbidden)
Interleave Columns of Two Data Frames
Bar Plot for Count Data by Group in R
How to Use R to Create a Word Co-Occurrence Matrix
Subsetting Data Based on Dynamic Column Names
How to Read Column Names 'As Is' from CSV File
Adding Manual Legend in Ggplot
Get First Entries in Rows of List
Ggplot2: How to Rotate a Graph in a Specific Angle
"Non-Finite Function Value" When Using Integrate() in R
R: Split String into Numeric and Return the Mean as a New Column in a Data Frame
Counting the Number of Values Greater Than 0 in R in Multiple Columns
R - Random Forest and More Than 53 Categories
Duplicate Couples (Id-Time) Error in Plm with Only Two Ids