Loess Regression on Each Group with Dplyr::Group_By()

loess regression on each group with dplyr::group_by()

You may have already figured this out -- but if not, here's some help.

Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.

So for your case:

fems <- fems %>% 
group_by(CpG) %>%
arrange(CpG, AVGMOrder) %>%
mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.),
data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))

Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.

Predict specific x-value after loess regression on each group with dplyr::group_by()

Feels like I'm making this too simple but don't you just want...

map(model$m, ~ predict(.x, newdata = 70))
[[1]]
[1] 88.66499

[[2]]
[1] 94.66321

Reversing the direction of the prediction since it's bivariate

library(dplyr)
library(purrr)
library(tidyr)
model <- df %>%
nest(-sample) %>%
drop_na() %>%
group_by(sample) %>%
mutate(m = purrr::map(data, loess, # Perform loess calculation on each sample_long group
formula = time ~ measurement, span = 0.25), # Make span as small as possible in order to draw the nearest straighest line
fitted = purrr::map(m, `[[`, "fitted")) # Retrieve the fitted values from each model
#> Warning: Problem with `mutate()` input `m`.
#> x pseudoinverse used at 90.2
#> ℹ Input `m` is `purrr::map(data, loess, formula = time ~ measurement, span = 0.25)`.

names(model$m) <- model$sample
map(model$m, ~ predict(.x, newdata = 70))
#> $cat
#> [1] 17.08772
#>
#> $dog
#> [1] 15.03579

How can apply a loess function and get predictions by groups using dplyr in r?

If you want to get loess predictions for each country, you might want to use a nest()ed data frame. This will let you set up a column that contains data frames for country-specific data, and then run loess() and predict() on those individual data frames, then unnest() to bring the results back into a standard format.

Here's some code that nests your data, runs the analysis on each country, then pulls it back out to a regular data frame:

library(tidyverse)

data.1.with.pred <- data.1 %>%
group_by(country) %>%
arrange(country, year) %>%
nest() %>%
mutate(pred.response = purrr::map(data, function(x)stats::loess(response~year, span= 0.75, data = x) %>%
stats::predict(data.frame(year = seq(min(x$year), max(x$year), 1))))) %>%
unnest(cols = c(data, pred.response))

data.1.with.pred %>%
ggplot() +
geom_point(aes(x = year, y = response, colour = country)) +
geom_line(aes(x = year,y=pred.response, colour = country))

The resulting data frame has annual loess predictions for each country, as opposed to for all countries together, and the plot looks like this:
a ggplot

Is this what you were trying to do?

Loess smooth extracted values by group errors

First group by id, then fit a model to each group's data and use broom::augment to extract the fitted values and, as a bonus, the residuals. This preserves the inputs x and y; it will be easier to plot x vs .fitted for example.

library("tidyverse")

test %>%
group_by(id) %>%
group_modify(
# .x refers to the subset of rows that belong to a group.
# It's a smaller data frame with the same columns as the input
# but fewer rows.
~ loess(y ~ x, span = .5, data = .x) %>% broom::augment()
)
#> # A tibble: 145 × 5
#> # Groups: id [3]
#> id y x .fitted .resid
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.658 -8.10 0.834 -0.176
#> 2 1 0.738 -8.06 0.897 -0.159
#> 3 1 0.846 -8.01 0.989 -0.142
#> 4 1 0.906 -7.96 0.975 -0.0688
#> 5 1 0.964 -7.91 0.982 -0.0180
#> 6 1 0.971 -7.90 0.983 -0.0121
#> 7 1 0.962 -7.90 0.980 -0.0179
#> 8 1 0.953 -7.91 0.981 -0.0289
#> 9 1 0.943 -7.91 0.983 -0.0404
#> 10 1 0.928 -7.92 0.977 -0.0487
#> # … with 135 more rows

The original data frame might have more columns than x and y that are not used in the loess fit. To keep those columns, pass the group data .x to the augment function as well.

test %>%
mutate(
# Extra columns that we don't need for the loess fit but we want to keep.
z = rnorm(n()),
w = rnorm(n())
) %>%
group_by(id) %>%
group_modify(
# Now `broom::augment` appends .fitted and .resid to the original columns.
~ loess(y ~ x, span = .5, data = .x) %>% broom::augment(.x)
)
#> # A tibble: 145 × 7
#> # Groups: id [3]
#> id x y z w .fitted .resid
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 -8.10 0.658 1.31 -0.199 0.834 -0.176
#> 2 1 -8.06 0.738 0.395 -1.74 0.897 -0.159
#> 3 1 -8.01 0.846 -1.34 0.382 0.989 -0.142
#> 4 1 -7.96 0.906 0.376 0.231 0.975 -0.0688
#> 5 1 -7.91 0.964 -0.224 0.327 0.982 -0.0180
#> 6 1 -7.90 0.971 0.678 -0.504 0.983 -0.0121
#> 7 1 -7.90 0.962 0.736 -0.0186 0.980 -0.0179
#> 8 1 -7.91 0.953 0.368 0.0313 0.981 -0.0289
#> 9 1 -7.91 0.943 -1.35 1.02 0.983 -0.0404
#> 10 1 -7.92 0.928 0.280 -0.471 0.977 -0.0487
#> # … with 135 more rows

Created on 2022-03-08 by the reprex package (v2.0.1)

Fit loess smoothers for multiple groups across multiple numeric variables

We can get the data in long format using. pivot_longer, group_by Animal and column name and apply loess to each combinaton.

library(dplyr)
library(tidyr)

TwoVarDF %>%
pivot_longer(cols = starts_with('Var')) %>%
group_by(Animal, name) %>%
mutate(model = loess(value~Day, span = 0.3)$fitted)

regression by group and retain all the columns in R

Use group_split then loop through each group using map_dfr to bind ID, ID2 and augment output using bind_cols

library(dplyr)
library(purrr)
dat %>% group_split(ID3) %>%
map_dfr(~bind_cols(select(.x,ID,ID2), augment(lm(value~yearRef, data=.x))), .id = "ID3")

# A tibble: 18,576 x 12
ID3 ID ID2 value yearRef .fitted .se.fit .resid .hat .sigma .cooksd
<chr> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 196 16 -0.385 2009 -0.0406 0.0308 -0.344 1.00e-3 0.973 6.27e-5
2 1 372 47 -0.793 2012 -0.0676 0.0414 -0.726 1.81e-3 0.973 5.05e-4
3 1 470 15 -0.496 2011 -0.0586 0.0374 -0.438 1.48e-3 0.973 1.50e-4
4 1 242 40 -1.13 2010 -0.0496 0.0338 -1.08 1.21e-3 0.973 7.54e-4
5 1 471 34 1.28 2006 -0.0135 0.0262 1.29 7.26e-4 0.972 6.39e-4
6 1 434 35 -1.09 1998 0.0586 0.0496 -1.15 2.61e-3 0.973 1.82e-3
7 1 467 45 -0.0663 2011 -0.0586 0.0374 -0.00769 1.48e-3 0.973 4.64e-8
8 1 334 27 -1.37 2003 0.0135 0.0305 -1.38 9.86e-4 0.972 9.92e-4
9 1 186 25 -0.0195 2003 0.0135 0.0305 -0.0331 9.86e-4 0.973 5.71e-7
10 1 114 34 1.09 2014 -0.0857 0.0500 1.18 2.64e-3 0.973 1.94e-3
# ... with 18,566 more rows, and 1 more variable: .std.resid <dbl>

Fitting several regression models with dplyr

As of mid 2020 (and updated to fit dplyr 1.0+ as of 2022-04), tchakravarty's answer will fail. In order to circumvent the new approach of broom and dpylr seem to interact, the following combination of broom::tidy, broom::augment and broom::glance can be used. We just have to use them in conjunvtion with nest_by() and summarize() (previously inside do() and later unnest() the tibble).

library(dplyr)
library(broom)
library(tidyr)

set.seed(42)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)

df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(tidy(mod))
# # A tibble: 72 × 6
# # Groups: hour [24]
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 87.4 15.8 5.55 0.0000289
# 2 1 wind -0.0129 0.0120 -1.08 0.296
# 3 1 temp 0.588 0.849 0.693 0.497
# 4 2 (Intercept) 92.3 21.6 4.27 0.000466
# 5 2 wind -0.0227 0.0134 -1.69 0.107
# 6 2 temp -0.216 0.841 -0.257 0.800
# 7 3 (Intercept) 61.1 18.6 3.29 0.00409
# 8 3 wind 0.00471 0.0128 0.367 0.718
# 9 3 temp 0.425 0.964 0.442 0.664
# 10 4 (Intercept) 31.6 15.3 2.07 0.0529

df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(augment(mod))
# # A tibble: 504 × 10
# # Groups: hour [24]
# hour price wind temp .fitted .resid .hat .sigma .cooksd .std.resid
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 113. 288. -1.75 82.7 30.8 0.123 37.8 0.0359 0.877
# 2 1 117. 2234. 18.4 69.5 47.0 0.201 36.4 0.165 1.40
# 3 1 28.6 1438. 4.75 71.7 -43.1 0.0539 37.1 0.0265 -1.18
# 4 1 102. 366. 9.77 88.5 13.7 0.151 38.4 0.00926 0.395
# 5 1 76.6 2257. -4.69 55.6 21.0 0.245 38.2 0.0448 0.644
# 6 1 60.1 633. -3.18 77.4 -17.3 0.0876 38.4 0.00749 -0.484
# 7 1 89.4 376. -4.16 80.1 9.31 0.119 38.5 0.00314 0.264
# 8 1 8.18 1921. 19.2 74.0 -65.9 0.173 34.4 0.261 -1.93
# 9 1 78.7 575. -6.11 76.4 2.26 0.111 38.6 0.000170 0.0640
# 10 1 85.2 763. -0.618 77.2 7.94 0.0679 38.6 0.00117 0.219
# # … with 494 more rows

df.h %>%
nest_by(hour) %>%
mutate(mod = list(lm(price ~ wind + temp, data = data))) %>%
summarize(glance(mod))
# # A tibble: 24 × 13
# # Groups: hour [24]
# hour r.squared adj.r.squared sigma statistic p.value df logLik AIC
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 0.0679 -0.0357 37.5 0.655 0.531 2 -104. 217.
# 2 2 0.139 0.0431 42.7 1.45 0.261 2 -107. 222.
# 3 3 0.0142 -0.0953 43.1 0.130 0.879 2 -107. 222.
# 4 4 0.0737 -0.0293 36.7 0.716 0.502 2 -104. 216.
# 5 5 0.213 0.126 37.8 2.44 0.115 2 -104. 217.
# 6 6 0.0813 -0.0208 33.5 0.796 0.466 2 -102. 212.
# 7 7 0.0607 -0.0437 40.7 0.582 0.569 2 -106. 220.
# 8 8 0.153 0.0592 36.3 1.63 0.224 2 -104. 215.
# 9 9 0.166 0.0736 36.5 1.79 0.195 2 -104. 216.
# 10 10 0.110 0.0108 40.0 1.11 0.351 2 -106. 219.
# # … with 14 more rows, and 4 more variables: BIC <dbl>, deviance <dbl>,
# # df.residual <int>, nobs <int>

Credits to Bob Muenchen's Blog for inspiration on that.

How to get R^2 list when doing regression for each group separately in R

You want broom::glance for this:

df %>% 
group_by(art, shop) %>%
group_modify(~ {
fit <- my_lm(.x)
# tidy(fit)
# glance(fit)
full_join(tidy(fit), glance(fit), by = character())
}) %>%
ungroup()

Here I use a different approach, as what you're asking can be accomplished in a much more concise way. Not that the nest-strategy is bad.
Note that the by = character() in full_join indicates that it should be a Cartesian product.

For loop over a grouped tibble: to run gam LOESS with specific span value for each group

Use index to store the models. In your current code, you are overwriting the models in list_mymodels.

library(dplyr)
library(purrr)

span <- c(0.5, 0.3)
list_mymodels = vector('list', length(span))
for(i in seq_along(list_mymodels)) {
span1 <- span[[i]]
list_mymodels[[i]] <- turning_rate_4954 %>%
group_by(TEMP, SHARK, VIDEO) %>%
do(models= mygamLoess(x=.$Time_s,
y=.$turn_rate_dgs_s,
data=.,
span= span1,
degree=1))

}

Instead of for loop, you can use map or lapply :

list_mymodels <- map(span, ~turning_rate_4954 %>% 
group_by(TEMP, SHARK, VIDEO) %>%
do(models= mygamLoess(x=.$Time_s,
y=.$turn_rate_dgs_s,
data=.,
span= .x,
degree=1)))

If we want to apply only one model in each group we can use map2.

list_mymodels <- turning_rate_4954 %>%
group_split(TEMP, SHARK, VIDEO) %>%
map2(span, ~mygamLoess(x=.x$Time_s,
y=.x$turn_rate_dgs_s,
data=.x,
span= .y,
degree=1))

How to perform a group_by with elements that are contiguous in R and dplyr

If you want to use only base R + tidyverse, this code exactly replicates your desired results

mydf <- tibble(group = c("x", "x", "x", "y", "z", "x", "x", "z"), 
item = c(1, 2, 2, 3, 2, 2, 2, 1))

mydf

# A tibble: 8 × 2
group item
<chr> <dbl>
1 x 1
2 x 2
3 x 2
4 y 3
5 z 2
6 x 2
7 x 2
8 z 1

runs <- rle(mydf$group)

mydf %>%
mutate(run_id = rep(seq_along(runs$lengths), runs$lengths)) %>%
group_by(group, run_id) %>%
summarise(item = sum(item)) %>%
arrange(run_id) %>%
select(-run_id)

Source: local data frame [5 x 2]
Groups: group [3]

group item
<chr> <dbl>
1 x 5
2 y 3
3 z 2
4 x 4
5 z 1


Related Topics



Leave a reply



Submit