Linear Model and Dplyr - a Better Solution

You have several issues here.

  1. If you group your data by three variables (or even two), each group has too few distinct values to fit a linear regression model.
  2. Pearson correlation requires two numeric variables, but Time is a factor, and converting it to numeric would not make much sense.
  3. You will need to use do() in order to run the linear model within each group.

Here's an illustration that groups only on Var1:

library(dplyr)

data %>%
  group_by(Var1) %>% # You can add additional grouping variables here if your real data set allows it
  do(mod = lm(Temp ~ Time, data = .)) %>%
  mutate(Slope = summary(mod)$coeff[2]) %>%
  select(-mod)
# Source: local data frame [3 x 2]
# Groups: <by row>
#
# Var1 Slope
# 1 a 12.66667
# 2 b -2.50000
# 3 c -31.33333
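
For comparison, here is a minimal base R sketch of the same "one slope per group" idea. The `toy` data frame and its numeric `Time` column are assumptions made purely for illustration (the original `data` is not shown). It also pulls the slope by name with `coef()`, which avoids relying on the partial matching and positional indexing in `summary(mod)$coeff[2]`.

```r
# Hypothetical data: three groups, numeric Time, known slopes 2, -1 and 0.5
set.seed(1)
toy <- data.frame(
  Var1 = rep(c("a", "b", "c"), each = 6),
  Time = rep(1:6, times = 3)
)
true_slope <- c(a = 2, b = -1, c = 0.5)
toy$Temp <- true_slope[toy$Var1] * toy$Time + rnorm(18, sd = 0.1)

# Split by group, fit one model per piece, keep only the Time coefficient
slopes <- sapply(split(toy, toy$Var1), function(d) {
  unname(coef(lm(Temp ~ Time, data = d))["Time"])
})
slopes
```

The result is a named numeric vector with one entry per level of `Var1`.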

If you do have two numeric variables, you can use do() to calculate the correlation as well. For example (I will create some dummy numeric variables for illustration):

data %>%
  mutate(test1 = sample(1:3, n(), replace = TRUE), # Creating some numeric variables
         test2 = sample(1:3, n(), replace = TRUE)) %>%
  group_by(Var1) %>%
  do(mod = lm(Temp ~ Time, data = .),
     mod2 = cor(.$test1, .$test2, method = "pearson")) %>%
  mutate(Slope = summary(mod)$coeff[2],
         Pearson = mod2[1]) %>%
  select(-mod, -mod2)

# Source: local data frame [3 x 3]
# Groups: <by row>
#
# Var1 Slope Pearson
# 1 a 12.66667 0.25264558
# 2 b -2.50000 -0.09090909
# 3 c -31.33333 0.30151134

Bonus solution: you can do this quite efficiently and easily with the data.table package too

library(data.table)
setDT(data)[, list(Slope = summary(lm(Temp ~ Time))$coeff[2]), Var1]
# Var1 Slope
# 1: a 12.66667
# 2: b -2.50000
# 3: c -31.33333

Or if we want to create some dummy variables too

library(data.table)
setDT(data)[, `:=`(test1 = sample(1:3, .N, replace = TRUE),
                   test2 = sample(1:3, .N, replace = TRUE))][,
  list(Slope = summary(lm(Temp ~ Time))$coeff[2],
       Pearson = cor(test1, test2, method = "pearson")), Var1]
# Var1 Slope Pearson
# 1: a 12.66667 -0.02159168
# 2: b -2.50000 -0.81649658
# 3: c -31.33333 -1.00000000

Fitting several regression models with dplyr

As of mid-2020 (and updated for dplyr 1.0+ as of 2022-04), tchakravarty's answer will fail. To work around the change in how broom and dplyr interact, the following combination of broom::tidy, broom::augment and broom::glance can be used. We just have to use them in conjunction with nest_by() and summarize() (previously they were used inside do(), followed by unnest() on the tibble).

library(dplyr)
library(broom)
library(tidyr)

set.seed(42)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups: hour [24]
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 87.4 15.8 5.55 0.0000289
# 2 1 wind -0.0129 0.0120 -1.08 0.296
# 3 1 temp 0.588 0.849 0.693 0.497
# 4 2 (Intercept) 92.3 21.6 4.27 0.000466
# 5 2 wind -0.0227 0.0134 -1.69 0.107
# 6 2 temp -0.216 0.841 -0.257 0.800
# 7 3 (Intercept) 61.1 18.6 3.29 0.00409
# 8 3 wind 0.00471 0.0128 0.367 0.718
# 9 3 temp 0.425 0.964 0.442 0.664
# 10 4 (Intercept) 31.6 15.3 2.07 0.0529

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups: hour [24]
# hour price wind temp .fitted .resid .hat .sigma .cooksd .std.resid
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 113. 288. -1.75 82.7 30.8 0.123 37.8 0.0359 0.877
# 2 1 117. 2234. 18.4 69.5 47.0 0.201 36.4 0.165 1.40
# 3 1 28.6 1438. 4.75 71.7 -43.1 0.0539 37.1 0.0265 -1.18
# 4 1 102. 366. 9.77 88.5 13.7 0.151 38.4 0.00926 0.395
# 5 1 76.6 2257. -4.69 55.6 21.0 0.245 38.2 0.0448 0.644
# 6 1 60.1 633. -3.18 77.4 -17.3 0.0876 38.4 0.00749 -0.484
# 7 1 89.4 376. -4.16 80.1 9.31 0.119 38.5 0.00314 0.264
# 8 1 8.18 1921. 19.2 74.0 -65.9 0.173 34.4 0.261 -1.93
# 9 1 78.7 575. -6.11 76.4 2.26 0.111 38.6 0.000170 0.0640
# 10 1 85.2 763. -0.618 77.2 7.94 0.0679 38.6 0.00117 0.219
# # … with 494 more rows

df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups: hour [24]
# hour r.squared adj.r.squared sigma statistic p.value df logLik AIC
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.0679 -0.0357 37.5 0.655 0.531 2 -104. 217.
# 2 2 0.139 0.0431 42.7 1.45 0.261 2 -107. 222.
# 3 3 0.0142 -0.0953 43.1 0.130 0.879 2 -107. 222.
# 4 4 0.0737 -0.0293 36.7 0.716 0.502 2 -104. 216.
# 5 5 0.213 0.126 37.8 2.44 0.115 2 -104. 217.
# 6 6 0.0813 -0.0208 33.5 0.796 0.466 2 -102. 212.
# 7 7 0.0607 -0.0437 40.7 0.582 0.569 2 -106. 220.
# 8 8 0.153 0.0592 36.3 1.63 0.224 2 -104. 215.
# 9 9 0.166 0.0736 36.5 1.79 0.195 2 -104. 216.
# 10 10 0.110 0.0108 40.0 1.11 0.351 2 -106. 219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# # df.residual <int>, nobs <int>

Credits to Bob Muenchen's Blog for inspiration on that.
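
A note for newer dplyr versions: from dplyr 1.1.0 onward, summarize() warns when an expression returns more than one row per group, and reframe() is the suggested replacement. The sketch below assumes dplyr >= 1.1.0 and broom are installed, and simply swaps reframe() into the tidy() pipeline above; treat it as one possible adaptation rather than the answer's original code.

```r
library(dplyr)
library(broom)

set.seed(42)
df.h <- data.frame(
  hour  = factor(rep(1:24, each = 21)),
  price = runif(504, min = -10, max = 125),
  wind  = runif(504, min = 0, max = 2500),
  temp  = runif(504, min = -10, max = 25)
)

# Same pipeline as above, but reframe() is allowed to return
# several rows (one per model term) for each hour.
coefs <- df.h %>%
  nest_by(hour) %>%
  mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
  reframe(tidy(mod))   # assumption: dplyr >= 1.1.0

nrow(coefs)  # 24 hours x 3 terms
```

Unlike summarize(), reframe() always returns an ungrouped tibble, so no trailing ungroup() is needed.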

How to use dplyr pipe to perform linear model

Keep it simple. Don't overcomplicate.

library(dplyr)
library(broom)

fit_conc_df <- dat %>% lm(y ~ x, data = .)
fit_conc_df %>%
  glance()

# # A tibble: 1 x 12
#   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
# 1     0.910         0.880  68.0      30.3  0.0118     1  -26.9  59.8  58.7
# # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Your code was failing because the pipe operator passes the data as the first argument to lm(), whose first parameter is the formula, while you were also supplying an argument named fit_conc. Writing data = . tells the pipe where the data frame should go instead.

Also, you can create the data frame/tibble much more concisely as follows:

dat <- tibble(x = c(50, 100, 200, 400, 800),
              y = c(110, 219, 323, 467, 605))
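
To make the argument-position point concrete, here is a small sketch using the same five data points. It assumes the magrittr package is available (dplyr re-exports the same `%>%` operator) and uses a plain data.frame instead of a tibble to keep dependencies minimal.

```r
library(magrittr)  # provides %>%

dat <- data.frame(x = c(50, 100, 200, 400, 800),
                  y = c(110, 219, 323, 467, 605))

# Direct call: the formula is the first argument, the data is named
fit_direct <- lm(y ~ x, data = dat)

# Piped call: the dot placeholder routes `dat` into `data`
# instead of the first (formula) position
fit_piped <- dat %>% lm(y ~ x, data = .)

all.equal(coef(fit_direct), coef(fit_piped))
```

Both calls fit the same model; the placeholder only changes where the piped object lands.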

dplyr get linear regression coefficients

Here is a combination of the tidyverse and the broom package that gets your desired output.

group_split() is very handy here: it returns a list of data frames, one per group. You then iterate over that list with purrr's map_dfr(), fitting lm() on each element (map_dfr() returns a single data frame, whereas map() returns a list). broom's glance() then gives the desired output:

library(tidyverse)
library(broom)

mydata %>%
  pivot_longer(starts_with("Site"),
               names_to = ".value",
               names_pattern = "(^Site)") %>%
  mutate(Site = as.factor(Site)) %>%
  group_by(Site) %>%
  group_split() %>%
  map_dfr(.f = function(df) {
    lm(Outcome ~ Age + Gender, data = df) %>%
      glance() %>%
      add_column(Site = unique(df$Site), .before = 1)
  })
  Site  r.squared adj.r.squared    sigma statistic  p.value    df logLik    AIC     BIC deviance df.residual  nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 0.6 0.44 3.87e- 1 3.75e+ 0 1.01e- 1 2 -1.88 11.8 12.1 7.5 e- 1 5 8
2 2 1 1 2.22e-16 1.01e+31 2.22e-16 2 141. -275. -277. 4.93e-32 1 4
3 3 0.351 -0.946 6.97e- 1 2.71e- 1 8.05e- 1 2 -1.46 10.9 8.47 4.86e- 1 1 4
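
The split-then-iterate pattern can also be sketched in base R alone. The `toy` data below is hypothetical (the reshaped `mydata` is not shown in the answer), and only a few of the statistics that glance() reports are collected by hand.

```r
# Hypothetical long-format data: 3 sites, 10 observations each
set.seed(123)
toy <- data.frame(
  Site   = factor(rep(1:3, each = 10)),
  Age    = round(runif(30, 20, 60)),
  Gender = rep(c(0, 1), times = 15)
)
toy$Outcome <- 0.05 * toy$Age + 0.5 * toy$Gender + rnorm(30)

# split() plays the role of group_split(); lapply() + rbind that of map_dfr()
per_site <- lapply(split(toy, toy$Site), function(d) {
  s <- summary(lm(Outcome ~ Age + Gender, data = d))
  data.frame(Site = d$Site[1],
             r.squared = s$r.squared,
             adj.r.squared = s$adj.r.squared,
             sigma = s$sigma,
             nobs = nrow(d))
})
site_stats <- do.call(rbind, per_site)
site_stats
```

Each list element yields a one-row data frame, and do.call(rbind, ...) stacks them into one row per site.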

R Dplyr Extracting Estimates from Linear Model and use them for Variable Adjustment Mutations - for Multiple Columns

Your residual() function has three arguments, but you don't actually need the data argument. In lm(), the data argument is only required if x and y are column names, but in this case they contain the entire vectors needed to compute the regression.

Just reduce the arguments list of residual() to residual(nutrient, energy), and remove the data arguments from the lm() calls in residual(), and your function will execute without error.

With:

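residual <- function(nutrient, energy) {
  # Regress the nutrient on energy using the vectors directly (no data argument needed)
  mod <- lm(nutrient ~ energy)
  # Residual plus the predicted value at mean energy
  (nutrient - (mod$coefficients[1] + mod$coefficients[2] * energy)) +
    (mod$coefficients[1] + mod$coefficients[2] * mean(energy))
}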

Then:

df %>% mutate_at(vars(protein, fat), funs(residual(., energy)))
energy protein fat
1 3582 70.27792 46.73896
2 3703 70.60333 46.50843
3 3810 72.33200 49.12606
4 3909 72.24825 48.32835
5 4047 76.23757 55.86791
...
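
As a quick sanity check on what this adjustment does, the sketch below applies the same function to hypothetical intake data (the `energy` and `protein` vectors are simulated, not taken from the question). Because an ordinary least squares line passes through the point of means, the adjusted values keep the original mean while becoming uncorrelated with energy.

```r
residual <- function(nutrient, energy) {
  mod <- lm(nutrient ~ energy)
  b <- coef(mod)
  (nutrient - (b[1] + b[2] * energy)) + (b[1] + b[2] * mean(energy))
}

# Hypothetical data: protein intake rises with energy intake
set.seed(7)
energy  <- runif(50, 1500, 4500)
protein <- 20 + 0.015 * energy + rnorm(50, sd = 5)

adj <- residual(protein, energy)

mean(adj) - mean(protein)  # effectively zero
cor(adj, energy)           # effectively zero
```

Note that funs() has been deprecated since dplyr 0.8.0; with a recent dplyr, the same call could likely be written as `df %>% mutate(across(c(protein, fat), ~ residual(.x, energy)))`.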

Manipulating data for Regression Model using dplyr in R

If you want to control the number of days after each month (or in each month) you could filter by the date not the row numbers.

I'm sure it can be tidied up more than this, but you would just need to change forcast_date <- as.Date("2021-04-01") to whichever month you want to forecast.

library(lubridate) # provides days_in_month() and %m-%

##set the forecast month. This should be straightforward to automate with a list or an increment
forcast_date <- as.Date("2021-04-01") # April

##get the forecast month length. This would be used for the data_feb_add and data_mar_add step.
forcast_month_length <- days_in_month(forcast_date) #30 days

##get dates for the previous 3 months
month_1_date <- forcast_date %m-% months(3)
month_2_date <- forcast_date %m-% months(2)
month_3_date <- forcast_date %m-% months(1)

##find the shortest month in that time range.
shortest_month <- min(c(days_in_month(month_1_date),
                        days_in_month(month_2_date),
                        days_in_month(month_3_date))) #28 days

##select the first 28 days (the shortest month) for each of the months used for the variables
data_month_1 <- mydata[mydata$datex %in% month_1_date:(month_1_date + shortest_month - 1),]
data_month_2 <- mydata[mydata$datex %in% month_2_date:(month_2_date + shortest_month - 1),]
data_month_3 <- mydata[mydata$datex %in% month_3_date:(month_3_date + shortest_month - 1),]

##select the number of days needed for each month for the forecast data (30 days for april)
month_2_forecast_length <- mydata[mydata$datex %in% month_2_date:(month_2_date + forcast_month_length - 1),]
month_3_forecast_length <- mydata[mydata$datex %in% month_3_date:(month_3_date + forcast_month_length - 1),]
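
The "first n days from a start date" selection can also be written with direct Date comparisons, which keeps both sides of the comparison as Date objects. This is a base R sketch on hypothetical data; `first_n_days()` is a helper invented here for illustration, and only the column name `datex` comes from the answer.

```r
# Hypothetical daily data covering January to March 2021
mydata <- data.frame(
  datex = seq(as.Date("2021-01-01"), as.Date("2021-03-31"), by = "day")
)
mydata$value <- seq_len(nrow(mydata))

# Hypothetical helper: keep rows from `start` up to (not including) start + n days
first_n_days <- function(df, start, n) {
  df[df$datex >= start & df$datex < start + n, , drop = FALSE]
}

jan_28 <- first_n_days(mydata, as.Date("2021-01-01"), 28)
feb_28 <- first_n_days(mydata, as.Date("2021-02-01"), 28)
mar_30 <- first_n_days(mydata, as.Date("2021-03-01"), 30)

c(nrow(jan_28), nrow(feb_28), nrow(mar_30))
```

The `n` argument would be shortest_month for the model variables and forcast_month_length for the forecast data.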

Regression imputation with dplyr in R

library(dplyr)

fit <- lm(p ~ ., data = select(df, p, starts_with("var")))

df %>%
  rowwise() %>%
  mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>%
  ungroup()

How it works

  • For starters, when fitting your model, you can subset your data frame using select and any of the tidyselect helpers to pick the response p and your predictor variables (here with starts_with("var")). This subset data frame then allows you to use the ~ . notation, which means "regress p on everything else in the subset data frame".
  • Next you create a row-wise data frame and use your model to predict where p is missing. In this instance across turns each row into a 1x6 tibble that you can pass to the newdata argument. predict then uses the model fit and this new data to predict a value of p.

Output

     id group sub_group     p  var1  var2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 4.3 0.3 0
2 1 1 2 5.7 0.1 0
3 1 2 3 3.60 0.4 0
4 2 1 1 5.10 0.9 1
5 2 1 2 10.7 0.1 1
6 2 2 3 10 0.2 1
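
The same idea, fit on the observed rows and then fill only the missing ones, can be sketched without rowwise(). The data frame below is hypothetical and much smaller than a real use case; the column names merely mirror the ones in the answer.

```r
# Hypothetical data: p is missing in rows 3 and 5
df <- data.frame(
  p    = c(4.3, 5.7, NA, 5.1, NA, 10, 7.2, 6.1),
  var1 = c(0.3, 0.1, 0.4, 0.9, 0.1, 0.2, 0.5, 0.7),
  var2 = c(0, 0, 0, 1, 1, 1, 1, 0)
)

# lm() drops rows with a missing response by default (na.action = na.omit)
fit <- lm(p ~ var1 + var2, data = df)

# Predict for every row at once, then keep predictions only where p is NA
pred <- predict(fit, newdata = df)
df$p_imputed <- ifelse(is.na(df$p), pred, df$p)
df
```

Observed values of p are left untouched; only the two missing entries are replaced by model predictions.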

Benchmarking

As mentioned in the comments, for large data frames the rowwise operation takes significantly longer than some other options:

library(microbenchmark)

set.seed(1)
df1 <- df %>%
  slice_sample(n = 1E5, replace = TRUE)

fit <- lm(p ~ ., data = select(df1, p, starts_with("var")))

dplyr_rowwise <- function() {
  df1 %>%
    rowwise() %>%
    mutate(p = ifelse(is.na(p), predict(fit, newdata = across()), p)) %>%
    ungroup()
}

dplyr_coalesce <- function() {
  df1 %>%
    mutate(p = coalesce(p, predict(fit, newdata = df1)))
}

base_index <- function() {
  isna <- is.na(df1$p)
  df1$p[isna] <- predict(fit, newdata = subset(df1, isna))
}

microbenchmark(
  dplyr_rowwise(),
  dplyr_coalesce(),
  base_index(),
  times = 10L
)

Unit: milliseconds
expr min lq mean median uq
dplyr_rowwise() 63739.9512 64441.0800 66926.46041 65513.51785 66923.0241
dplyr_coalesce() 6.5901 6.9037 8.55971 7.21125 7.7157
base_index() 13.0368 13.1790 15.73682 13.53310 19.3004

dplyr version of grouping a dataframe then creating regression model on each group

Returning a list from dplyr is not possible yet. If you just need the intercept and slope, @jazzurro's answer is the way to go, but if you need the whole model you need to do something like

library(dplyr)
models <- df %>% group_by(country) %>% do(mod = lm(BirthRate ~ US., data = .))

Then if you want to perform ANOVA on each fitted model, you can do it using rowwise

models %>% rowwise %>% do(anova(.$mod))

but again the result is coerced to a data frame and is not quite the same as doing lapply(models$mod, anova).

For now (i.e., until the next version of dplyr), if you need to store the whole result in a list, you can just use dlply from plyr, like plyr::dlply(df, "country", function(d) anova(lm(BirthRate ~ US., data = d))). Of course, if you do not absolutely have to use dplyr, you can go for @SvenHohenstein's answer, which looks like a better way of doing this anyway.
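
For reference, a plain base R way to keep whole model objects in a list is split() followed by lapply(). This is only a sketch on hypothetical data that reuses the column names from the answer (country, BirthRate, US.); it is not necessarily what the other answers referred to above contain.

```r
# Hypothetical data with the same column names as in the answer
set.seed(2)
df <- data.frame(
  country = rep(c("A", "B"), each = 8),
  US.     = rep(1:8, times = 2)
)
df$BirthRate <- 10 + 0.5 * df$US. + rnorm(16)

# One lm object per country, stored in a named list
models <- lapply(split(df, df$country), function(d) lm(BirthRate ~ US., data = d))

# ANOVA tables stay as separate list elements instead of being row-bound
anovas <- lapply(models, anova)
names(anovas)
```

Because the results stay in a list, each element keeps its full class and attributes.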


