Trying to Use Dplyr to Group_By and Apply Scale()

The problem is that base scale() is built for matrices: even on a plain vector it returns a one-column matrix, which doesn't play well inside mutate(). Try writing your own vector version:

scale_this <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
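
Alternatively, you can keep base scale() and simply drop the matrix dimension it returns; a minimal sketch (the helper name is my own; base scale() already ignores NAs when centering and scaling):

# wrap base scale() and coerce its one-column matrix back to a plain vector
scale_vec <- function(x) as.numeric(scale(x))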

Then this works:

library("dplyr")

# reproducible sample data
set.seed(123)
n = 1000
df <- data.frame(stud_ID = sample(LETTERS, size=n, replace=TRUE),
behavioral_scale = runif(n, 0, 10),
cognitive_scale = runif(n, 1, 20),
affective_scale = runif(n, 0, 1) )
scaled_data <-
df %>%
group_by(stud_ID) %>%
mutate(behavioral_scale_ind = scale_this(behavioral_scale),
cognitive_scale_ind = scale_this(cognitive_scale),
affective_scale_ind = scale_this(affective_scale))
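
With current dplyr (>= 1.0), the three mutate() calls can be collapsed with across(); a sketch assuming the same scale_this() helper:

scaled_data <- df %>%
  group_by(stud_ID) %>%
  mutate(across(ends_with("_scale"), scale_this, .names = "{.col}_ind"))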

Or, if you're open to a data.table solution:

library("data.table")

setDT(df)

cols_to_scale <- c("behavioral_scale","cognitive_scale","affective_scale")

df[, lapply(.SD, scale_this), .SDcols = cols_to_scale, keyby = factor(stud_ID)]
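
Note that this returns a new table ordered by the key rather than modifying df; to add the scaled columns to df by reference instead, one option (the _ind suffix is my own choice):

new_cols <- paste0(cols_to_scale, "_ind")
df[, (new_cols) := lapply(.SD, scale_this), .SDcols = cols_to_scale, by = stud_ID]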

scale values within group in R

You could apply the scale() function by group. This can be done in base R:

df$y2 <- with(df, ave(y, x, FUN = scale))
df

#  x y        y2
#1 1 1 -0.707107
#2 1 3  0.707107
#3 2 4  0.707107
#4 2 3 -0.707107
#5 3 5  1.091089
#6 3 2 -0.872872
#7 3 3 -0.218218

dplyr

library(dplyr)
df %>% group_by(x) %>% mutate(y2 = scale(y))
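
Since scale() returns a one-column matrix, you may prefer to wrap it in as.numeric() so y2 stays a plain numeric column (the same trick used in other answers on this page):

df %>% group_by(x) %>% mutate(y2 = as.numeric(scale(y)))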

and in data.table:

library(data.table)
setDT(df)[, y2 := scale(y), x]

data

df <- data.frame(x=c(1,1,2,2,3,3,3),y=c(1,3,4,3,5,2,3))

Scaling by group in R using dplyr: grouping and non-grouping seem to generate the same result

The problem appears to be a bug in RStudio version 1.2.91 (a preview release). After downgrading to the stable build (version 1.1.383), the output of mean(scaledByID$scaledScore == notScaledByID$scale) is now 0.

The R version is the same in both cases (3.4.2).

using dplyr to split-apply-combine to scale vectors within a grouping variable

You can nest the data by group, then map over each nested data frame and scale its wt column.

library(tidyverse)

mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(wt.scaled = map(data, ~as.numeric(scale(.x$wt)))) %>%
  unnest(c(wt.scaled, data))

# cyl mpg disp hp drat wt qsec vs am gear carb wt.scaled
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 6 21 160 110 3.9 2.62 16.5 0 1 4 4 -1.40
# 2 6 21 160 110 3.9 2.88 17.0 0 1 4 4 -0.680
# 3 6 21.4 258 110 3.08 3.22 19.4 1 0 3 1 0.275
# 4 6 18.1 225 105 2.76 3.46 20.2 1 0 3 1 0.962
# 5 6 19.2 168. 123 3.92 3.44 18.3 1 0 4 4 0.906
# 6 6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4 0.906
# 7 6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 -0.974
# 8 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1 0.0602
# 9 4 24.4 147. 62 3.69 3.19 20 1 0 4 2 1.59
#10 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2 1.52
# … with 22 more rows

This is the same as scaling wt by group directly:

mtcars %>%
  group_by(cyl) %>%
  mutate(wt.scaled = as.numeric(scale(wt)))
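
To confirm the two approaches agree, note that nest()/unnest() reorders rows by group, so sort both results before comparing; a quick sketch:

nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(wt.scaled = map(data, ~as.numeric(scale(.x$wt)))) %>%
  unnest(c(wt.scaled, data)) %>%
  arrange(cyl, wt)

direct <- mtcars %>%
  group_by(cyl) %>%
  mutate(wt.scaled = as.numeric(scale(wt))) %>%
  arrange(cyl, wt)

all.equal(nested$wt.scaled, direct$wt.scaled)  # should be TRUE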

R - use group_by() and mutate() in dplyr to apply function that returns a vector the length of groups

How about making use of nest() instead:

foo %>%
  group_by(fac) %>%
  nest() %>%
  mutate(mahal = map(data, ~mahalanobis(
    .x,
    center = colMeans(.x, na.rm = TRUE),
    cov = cov(.x, use = "pairwise.complete.obs")))) %>%
  unnest()
# A tibble: 10 x 4
# fac mahal x y
# <fct> <dbl> <dbl> <dbl>
# 1 A 1.02 -6.26 15.1
# 2 A 0.120 1.84 3.90
# 3 A 2.81 -8.36 -6.21
# 4 A 2.84 16.0 -22.1
# 5 A 1.21 3.30 11.2
# 6 B 2.15 -8.20 -0.449
# 7 B 2.86 4.87 -0.162
# 8 B 1.23 7.38 9.44
# 9 B 0.675 5.76 8.21
#10 B 1.08 -3.05 5.94

Here you avoid an explicit "x", "y" filter of the form temp <- x[, c("x", "y")], since you nest the relevant columns after grouping by fac. Applying mahalanobis is then straightforward.


Update

To respond to your comment, here is a purrr option. Since it's easy to lose track of what's going on, let's go step by step:

  1. Generate sample data with one additional column.

    set.seed(1)
    foo <- data.frame(
      x = rnorm(10, 0, 10),
      y = rnorm(10, 0, 10),
      z = rnorm(10, 0, 10),
      fac = c(rep("A", 5), rep("B", 5)))
  2. We now store the columns which define the subset of the data to be used for the calculation of the Mahalanobis distance in a list:

    cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))

    So we will calculate the Mahalanobis distance (per fac) for the subset of data in columns x+y and then separately for y+z. The names of cols will be used as the column names of the two distance vectors.

  3. Now for the actual purrr command:

    imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>%
      mutate_all(function(lst) map(lst, ~mahalanobis(
        .x,
        center = colMeans(.x, na.rm = TRUE),
        cov = cov(.x, use = "pairwise.complete.obs")))) %>%
      unnest() %>%
      bind_cols(foo, .)
    # x y z fac cols1 cols2
    #1 -6.264538 15.1178117 9.1897737 A 1.0197542 1.3608052
    #2 1.836433 3.8984324 7.8213630 A 0.1199607 1.1141352
    #3 -8.356286 -6.2124058 0.7456498 A 2.8059562 1.5099574
    #4 15.952808 -22.1469989 -19.8935170 A 2.8401953 3.0675228
    #5 3.295078 11.2493092 6.1982575 A 1.2141337 0.9475794
    #6 -8.204684 -0.4493361 -0.5612874 B 2.1517055 1.2284793
    #7 4.874291 -0.1619026 -1.5579551 B 2.8626501 1.1724828
    #8 7.383247 9.4383621 -14.7075238 B 1.2271316 2.5723023
    #9 5.757814 8.2122120 -4.7815006 B 0.6746788 0.6939081
    #10 -3.053884 5.9390132 4.1794156 B 1.0838341 2.3328276

    In short, we

    1. loop over entries in cols,
    2. nest data in foo per fac based on columns defined in cols,
    3. apply mahalanobis on the nested and grouped data generating as many distance columns with nested data as we have entries in cols (i.e. subsets), and
    4. finally unnest the distance data and column-bind it to the original foo data.
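
For comparison, when only one subset of columns is needed (say x and y), a more direct sketch computes the distance inside mutate() on the grouped data, without nesting (the column name mahal_xy is illustrative):

foo %>%
  group_by(fac) %>%
  mutate(mahal_xy = mahalanobis(cbind(x, y),
                                center = colMeans(cbind(x, y), na.rm = TRUE),
                                cov = cov(cbind(x, y), use = "pairwise.complete.obs")))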

Scale relative to a value in each group (via dplyr)

This solution is very similar to @thelatemail's, but I think it's sufficiently different to merit its own answer, because it chooses the index based on a condition:

data %>%
  group_by(category) %>%
  mutate(value = value / value[year == baseYear])

#   category year      value
#   ...      ...       ...
#7  A        2002 1.00000000
#8  B        2002 1.00000000
#9  C        2002 1.00000000
#10 A        2003 0.86462789
#11 B        2003 1.07217943
#12 C        2003 0.82209897

(Data output has been truncated. To replicate these results, set.seed(123) when creating data.)
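
One caveat with value[year == baseYear]: if a category has no row for the base year, the divisor has length zero and mutate() will error. Indexing with match() instead returns NA for such groups; a sketch:

data %>%
  group_by(category) %>%
  mutate(value = value / value[match(baseYear, year)])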

dplyr: group_by, sum various columns, and apply a function based on grouped row sums?

To use dplyr, try the following:

library(dplyr)

df %>%
  group_by(Percent_cover) %>%
  summarise(across(contains("species"), sum)) %>%
  mutate(rs = rowSums(select(., contains("species")))) %>%
  mutate(across(contains("species"), ~ . / rs * 100)) -> result

result

For example, using mtcars:

mtcars %>%
  group_by(cyl) %>%
  summarise(across(disp:wt, sum)) %>%
  mutate(rs = rowSums(select(., disp:wt))) %>%
  mutate(across(disp:wt, ~ . / rs * 100))

# cyl disp hp drat wt rs
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 54.2 42.6 2.10 1.18 2135.
#2 6 58.7 39.2 1.15 0.998 2186.
#3 8 62.0 36.7 0.567 0.702 7974.
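
As a quick sanity check, the rescaled columns in each row should sum to 100; a sketch, assuming the mtcars result above was assigned to res:

res %>% mutate(total = disp + hp + drat + wt)  # total is 100 in every row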

group_by and apply a rolling regression based on window using dplyr

We can do a group_split and then map over the list elements, applying rollapply:

library(zoo)
library(dplyr)
library(purrr)

out <- df %>%
  group_split(stock) %>%
  map(~ rollapply(.x,
                  width = 30,
                  FUN = function(dat) {
                    LinearModel <- lm(formula = Close ~ dates, as.data.frame(dat))
                    LinearModel$coef
                  }, by.column = FALSE, fill = NA_real_, align = "right"))

length(out)
#[1] 2

If we want to update the original dataset with more columns:

out <- df %>%
  group_split(stock) %>%
  map_dfr(~ {
    subdat <- .x
    rollapply(subdat,
              width = 30,
              FUN = function(dat) {
                LinearModel <- lm(formula = Close ~ dates, as.data.frame(dat))
                LinearModel$coef
              }, by.column = FALSE, fill = NA_real_, align = "right") %>%
      as.data.frame %>%
      bind_cols(subdat, .)
  })

ncol(out)
#[1] 38

ncol(df)
#[1] 8

In the development version of dplyr at the time of writing (the experimental condense() verb, which was dropped before the 1.0.0 release), we could also do

out1 <- df %>%
  group_by(stock) %>%
  condense(out = rollapply(cur_data(), width = 30,
                           FUN = function(dat) lm(Close ~ dates, as.data.frame(dat))$coef,
                           by.column = FALSE, fill = NA_real_, align = "right") %>%
            as.data.frame %>%
            bind_cols(cur_data(), .))
out1
# A tibble: 2 x 2
# Rowwise: stock
# stock out
# <chr> <list>
#1 1 <tibble [3,309 × 37]>
#2 2 <tibble [3,309 × 37]>

The list column can be unnested when required:

out1 %>%
  unnest(c(out)) %>%
  head(3)
# A tibble: 3 x 38
# stock Open High Low Close Volumn Adjusted dates `(Intercept)` `dates2007-01-0…
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
#1 1 232. 237. 230. 233. 1.55e7 233. 2007-01-03 NA NA
#2 1 234. 241. 233. 241. 1.58e7 241. 2007-01-04 NA NA
#3 1 240. 243. 238. 243. 1.38e7 243. 2007-01-05 NA NA
# … with 28 more variables: `dates2007-01-05` <dbl>, `dates2007-01-08` <dbl>,
# `dates2007-01-09` <dbl>, `dates2007-01-10` <dbl>, `dates2007-01-11` <dbl>,
# `dates2007-01-12` <dbl>, `dates2007-01-16` <dbl>, `dates2007-01-17` <dbl>,
# `dates2007-01-18` <dbl>, `dates2007-01-19` <dbl>, `dates2007-01-22` <dbl>,
# `dates2007-01-23` <dbl>, `dates2007-01-24` <dbl>, `dates2007-01-25` <dbl>,
# `dates2007-01-26` <dbl>, `dates2007-01-29` <dbl>, `dates2007-01-30` <dbl>,
# `dates2007-01-31` <dbl>, `dates2007-02-01` <dbl>, `dates2007-02-02` <dbl>,
# `dates2007-02-05` <dbl>, `dates2007-02-06` <dbl>, `dates2007-02-07` <dbl>,
# `dates2007-02-08` <dbl>, `dates2007-02-09` <dbl>, `dates2007-02-12` <dbl>,
# `dates2007-02-13` <dbl>, `dates2007-02-14` <dbl>

In the same way, we can apply broom::tidy() to each fitted model inside rollapply to get a tidy coefficient table:

library(broom)

out3 <- df %>%
  group_split(stock) %>%
  map_dfr(~ {
    subdat <- .x
    rollapply(subdat,
              width = 30,
              FUN = function(dat) {
                LinearModel <- lm(formula = Close ~ dates, as.data.frame(dat))
                tidy(LinearModel)
              }, by.column = FALSE, fill = NA_real_, align = "right") %>%
      as.data.frame %>%
      bind_cols(subdat, .)
  })

dim(out3)
#[1] 6618 13
names(out3)
# [1] "Open" "High" "Low" "Close" "Volumn" "Adjusted" "stock"
# [8] "dates" "term" "estimate" "std.error" "statistic" "p.value"

scale/normalize columns by group

The issue is that you are using the wrong dplyr verb. summarise() creates one result per group per variable. What you want is mutate(), which transforms variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below are two approaches using dplyr:

df %>%
  group_by(Store) %>%
  mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

df %>%
  group_by(Store) %>%
  mutate_each(funs(normalit), Temperature, Sum_Sales)

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.
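
Note that mutate_each() and funs() are deprecated in current dplyr; a sketch of the modern equivalent using across(), assuming the same normalit() helper from the question:

df %>%
  group_by(Store) %>%
  mutate(across(c(Temperature, Sum_Sales), normalit))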


