Trying to Use Dplyr to Group_By and Apply Scale()

The problem is that base scale() is built for matrices: even on a plain vector it returns a one-column matrix, which doesn't play well inside mutate(). Try writing your own vector version:

scale_this <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
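
Alternatively, you can keep base scale() and simply drop the matrix dimension it returns; a minimal sketch (the helper name is my own; base scale() already ignores NAs when centering and scaling):

# wrap base scale() and coerce its one-column matrix back to a plain vector
scale_vec <- function(x) as.numeric(scale(x))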

Then this works:

library("dplyr")

# reproducible sample data
set.seed(123)
n = 1000
df <- data.frame(stud_ID = sample(LETTERS, size=n, replace=TRUE),
behavioral_scale = runif(n, 0, 10),
cognitive_scale = runif(n, 1, 20),
affective_scale = runif(n, 0, 1) )
scaled_data <-
df %>%
group_by(stud_ID) %>%
mutate(behavioral_scale_ind = scale_this(behavioral_scale),
cognitive_scale_ind = scale_this(cognitive_scale),
affective_scale_ind = scale_this(affective_scale))
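
With current dplyr (>= 1.0), the three mutate() calls can be collapsed with across(); a sketch assuming the same scale_this() helper:

scaled_data <- df %>%
  group_by(stud_ID) %>%
  mutate(across(ends_with("_scale"), scale_this, .names = "{.col}_ind"))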

Or, if you're open to a data.table solution:

library("data.table")

setDT(df)

cols_to_scale <- c("behavioral_scale","cognitive_scale","affective_scale")

df[, lapply(.SD, scale_this), .SDcols = cols_to_scale, keyby = factor(stud_ID)]
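
Note that this returns a new table ordered by the key rather than modifying df; to add the scaled columns to df by reference instead, one option (the _ind suffix is my own choice):

new_cols <- paste0(cols_to_scale, "_ind")
df[, (new_cols) := lapply(.SD, scale_this), .SDcols = cols_to_scale, by = stud_ID]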

scale values within group in R

You could apply the scale() function by group. This can be done in base R:

df$y2 <- with(df, ave(y, x, FUN = scale))
df

#  x y        y2
#1 1 1 -0.707107
#2 1 3  0.707107
#3 2 4  0.707107
#4 2 3 -0.707107
#5 3 5  1.091089
#6 3 2 -0.872872
#7 3 3 -0.218218

dplyr

library(dplyr)
df %>% group_by(x) %>% mutate(y2 = scale(y))
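
Since scale() returns a one-column matrix, you may prefer to wrap it in as.numeric() so y2 stays a plain numeric column (the same trick used in other answers on this page):

df %>% group_by(x) %>% mutate(y2 = as.numeric(scale(y)))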

and in data.table:

library(data.table)
setDT(df)[, y2 := scale(y), x]

data

df <- data.frame(x=c(1,1,2,2,3,3,3),y=c(1,3,4,3,5,2,3))

Scaling by group in R using dplyr: grouping and non-grouping seem to generate the same result

The problem appears to be a bug in RStudio version 1.2.91 (a preview release). After downgrading to the stable build (version 1.1.383), the output of mean(scaledByID$scaledScore == notScaledByID$scale) is now 0.

The R version is the same in both cases (3.4.2).

using dplyr to split-apply-combine to scale vectors within a grouping variable

You can nest the data by group, then map over each nested data frame and scale its wt column.

library(tidyverse)

mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(wt.scaled = map(data, ~as.numeric(scale(.x$wt)))) %>%
  unnest(c(wt.scaled, data))

# cyl mpg disp hp drat wt qsec vs am gear carb wt.scaled
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 6 21 160 110 3.9 2.62 16.5 0 1 4 4 -1.40
# 2 6 21 160 110 3.9 2.88 17.0 0 1 4 4 -0.680
# 3 6 21.4 258 110 3.08 3.22 19.4 1 0 3 1 0.275
# 4 6 18.1 225 105 2.76 3.46 20.2 1 0 3 1 0.962
# 5 6 19.2 168. 123 3.92 3.44 18.3 1 0 4 4 0.906
# 6 6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4 0.906
# 7 6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 -0.974
# 8 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1 0.0602
# 9 4 24.4 147. 62 3.69 3.19 20 1 0 4 2 1.59
#10 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2 1.52
# … with 22 more rows

This is the same as scaling wt by group directly:

mtcars %>%
  group_by(cyl) %>%
  mutate(wt.scaled = as.numeric(scale(wt)))
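
To confirm the two approaches agree, note that nest()/unnest() reorders rows by group, so sort both results before comparing; a quick sketch:

nested <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(wt.scaled = map(data, ~as.numeric(scale(.x$wt)))) %>%
  unnest(c(wt.scaled, data)) %>%
  arrange(cyl, wt)

direct <- mtcars %>%
  group_by(cyl) %>%
  mutate(wt.scaled = as.numeric(scale(wt))) %>%
  arrange(cyl, wt)

all.equal(nested$wt.scaled, direct$wt.scaled)  # should be TRUE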

R - use group_by() and mutate() in dplyr to apply function that returns a vector the length of groups

How about making use of nest() instead:

foo %>%
  group_by(fac) %>%
  nest() %>%
  mutate(mahal = map(data, ~mahalanobis(
    .x,
    center = colMeans(.x, na.rm = TRUE),
    cov = cov(.x, use = "pairwise.complete.obs")))) %>%
  unnest()
# A tibble: 10 x 4
# fac mahal x y
# <fct> <dbl> <dbl> <dbl>
# 1 A 1.02 -6.26 15.1
# 2 A 0.120 1.84 3.90
# 3 A 2.81 -8.36 -6.21
# 4 A 2.84 16.0 -22.1
# 5 A 1.21 3.30 11.2
# 6 B 2.15 -8.20 -0.449
# 7 B 2.86 4.87 -0.162
# 8 B 1.23 7.38 9.44
# 9 B 0.675 5.76 8.21
#10 B 1.08 -3.05 5.94

Here you avoid an explicit "x", "y" filter of the form temp <- x[, c("x", "y")], since you nest the relevant columns after grouping by fac. Applying mahalanobis is then straightforward.


Update

To respond to your comment, here is a purrr option. Since it's easy to lose track of what's going on, let's go step by step:

  1. Generate sample data with one additional column.

    set.seed(1)
    foo <- data.frame(
      x = rnorm(10, 0, 10),
      y = rnorm(10, 0, 10),
      z = rnorm(10, 0, 10),
      fac = c(rep("A", 5), rep("B", 5)))
  2. We now store the columns which define the subset of the data to be used for the calculation of the Mahalanobis distance in a list:

    cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))

    So we will calculate the Mahalanobis distance (per fac) for the subset of data in columns x+y and then separately for y+z. The names of cols will be used as the column names of the two distance vectors.

  3. Now for the actual purrr command:

    imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>%
      mutate_all(function(lst) map(lst, ~mahalanobis(
        .x,
        center = colMeans(.x, na.rm = TRUE),
        cov = cov(.x, use = "pairwise.complete.obs")))) %>%
      unnest() %>%
      bind_cols(foo, .)
    # x y z fac cols1 cols2
    #1 -6.264538 15.1178117 9.1897737 A 1.0197542 1.3608052
    #2 1.836433 3.8984324 7.8213630 A 0.1199607 1.1141352
    #3 -8.356286 -6.2124058 0.7456498 A 2.8059562 1.5099574
    #4 15.952808 -22.1469989 -19.8935170 A 2.8401953 3.0675228
    #5 3.295078 11.2493092 6.1982575 A 1.2141337 0.9475794
    #6 -8.204684 -0.4493361 -0.5612874 B 2.1517055 1.2284793
    #7 4.874291 -0.1619026 -1.5579551 B 2.8626501 1.1724828
    #8 7.383247 9.4383621 -14.7075238 B 1.2271316 2.5723023
    #9 5.757814 8.2122120 -4.7815006 B 0.6746788 0.6939081
    #10 -3.053884 5.9390132 4.1794156 B 1.0838341 2.3328276

    In short, we

    1. loop over entries in cols,
    2. nest data in foo per fac based on columns defined in cols,
    3. apply mahalanobis on the nested and grouped data generating as many distance columns with nested data as we have entries in cols (i.e. subsets), and
    4. finally unnest the distance data and column-bind it to the original foo data.
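
For comparison, when only one subset of columns is needed (say x and y), a more direct sketch computes the distance inside mutate() on the grouped data, without nesting (the column name mahal_xy is illustrative):

foo %>%
  group_by(fac) %>%
  mutate(mahal_xy = mahalanobis(cbind(x, y),
                                center = colMeans(cbind(x, y), na.rm = TRUE),
                                cov = cov(cbind(x, y), use = "pairwise.complete.obs")))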

Scale relative to a value in each group (via dplyr)

This solution is very similar to @thelatemail's, but I think it's sufficiently different to merit its own answer, because it chooses the index based on a condition:

data %>%
  group_by(category) %>%
  mutate(value = value / value[year == baseYear])

#   category year      value
#   ...      ...       ...
#7  A        2002 1.00000000
#8  B        2002 1.00000000
#9  C        2002 1.00000000
#10 A        2003 0.86462789
#11 B        2003 1.07217943
#12 C        2003 0.82209897

(Data output has been truncated. To replicate these results, set.seed(123) when creating data.)
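
One caveat with value[year == baseYear]: if a category has no row for the base year, the divisor has length zero and mutate() will error. Indexing with match() instead returns NA for such groups; a sketch:

data %>%
  group_by(category) %>%
  mutate(value = value / value[match(baseYear, year)])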

dplyr: group_by, sum various columns, and apply a function based on grouped row sums?

To use dplyr, try the following:

library(dplyr)

df %>%
  group_by(Percent_cover) %>%
  summarise(across(contains("species"), sum)) %>%
  mutate(rs = rowSums(select(., contains("species")))) %>%
  mutate(across(contains("species"), ~ . / rs * 100)) -> result

result

For example, using mtcars:

mtcars %>%
  group_by(cyl) %>%
  summarise(across(disp:wt, sum)) %>%
  mutate(rs = rowSums(select(., disp:wt))) %>%
  mutate(across(disp:wt, ~ . / rs * 100))

# cyl disp hp drat wt rs
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 54.2 42.6 2.10 1.18 2135.
#2 6 58.7 39.2 1.15 0.998 2186.
#3 8 62.0 36.7 0.567 0.702 7974.
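
As a quick sanity check, the rescaled columns in each row should sum to 100; a sketch, assuming the mtcars result above was assigned to res:

res %>% mutate(total = disp + hp + drat + wt)  # total is 100 in every row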

group_by and apply a rolling regression based on window using dplyr

We can do a group_split and then map over the list elements, applying rollapply:

library(zoo)
library(dplyr)
library(purrr)

out <- df %>%
  group_split(stock) %>%
  map(~ rollapply(.x,
                  width = 30,
                  FUN = function(dat) {
                    LinearModel <- lm(formula = Close ~ dates, as.data.frame(dat))
                    LinearModel$coef
                  }, by.column = FALSE, fill = NA_real_, align = "right"))

length(out)
#[1] 2

If we want to update the original dataset with more columns:

out <- df %>%
  group_split(stock) %>%
  map_dfr(~ {
    subdat <- .x
    rollapply(subdat,
              width = 30,
              FUN = function(dat) {
                LinearModel <- lm(formula = Close ~ dates, as.data.frame(dat))
                LinearModel$coef
              }, by.column = FALSE, fill = NA_real_, align = "right") %>%
      as.data.frame %>%
      bind_cols(subdat, .)
  })

ncol(out)
#[1] 38

ncol(df)
#[1] 8

In the development version of dplyr at the time of writing (the experimental condense() verb, which was dropped before the 1.0.0 release), we could also do

out1 <- df %>%
  group_by(stock) %>%
  condense(out = rollapply(cur_data(), width = 30,
                           FUN = function(dat) lm(Close ~ dates, as.data.frame(dat))$coef,
                           by.column = FALSE, fill = NA_real_, align = "right") %>%
            as.data.frame %>%
            bind_cols(cur_data(), .))
out1
# A tibble: 2 x 2
# Rowwise: stock
# stock out
# <chr> <list>
#1 1 <tibble [3,309 × 37]>
#2 2 <tibble [3,309 × 37]>

The list column can be unnested when required:

out1 %>%
  unnest(c(out)) %>%
  head(3)
# A tibble: 3 x 38
# stock Open High Low Close Volumn Adjusted dates `(Intercept)` `dates2007-01-0…
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
#1 1 232. 237. 230. 233. 1.55e7 233. 2007-01-03 NA NA
#2 1 234. 241. 233. 241. 1.58e7 241. 2007-01-04 NA NA
#3 1 240. 243. 238. 243. 1.38e7 243. 2007-01-05 NA NA
# … with 28 more variables: `dates2007-01-05` <dbl>, `dates2007-01-08` <dbl>,
# `dates2007-01-09` <dbl>, `dates2007-01-10` <dbl>, `dates2007-01-11` <dbl>,
# `dates2007-01-12` <dbl>, `dates2007-01-16` <dbl>, `dates2007-01-17` <dbl>,
# `dates2007-01-18` <dbl>, `dates2007-01-19` <dbl>, `dates2007-01-22` <dbl>,
# `dates2007-01-23` <dbl>, `dates2007-01-24` <dbl>, `dates2007-01-25` <dbl>,
# `dates2007-01-26` <dbl>, `dates2007-01-29` <dbl>, `dates2007-01-30` <dbl>,
# `dates2007-01-31` <dbl>, `dates2007-02-01` <dbl>, `dates2007-02-02` <dbl>,
# `dates2007-02-05` <dbl>, `dates2007-02-06` <dbl>, `dates2007-02-07` <dbl>,
# `dates2007-02-08` <dbl>, `dates2007-02-09` <dbl>, `dates2007-02-12` <dbl>,
# `dates2007-02-13` <dbl>, `dates2007-02-14` <dbl>

In the same way, we can apply broom::tidy() to each fitted model inside rollapply to get a tidy coefficient table:

library(broom)

out3 <- df %>%
  group_split(stock) %>%
  map_dfr(~ {
    subdat <- .x
    rollapply(subdat,
              width = 30,
              FUN = function(dat) {
                LinearModel <- lm(formula = Close ~ dates, as.data.frame(dat))
                tidy(LinearModel)
              }, by.column = FALSE, fill = NA_real_, align = "right") %>%
      as.data.frame %>%
      bind_cols(subdat, .)
  })

dim(out3)
#[1] 6618 13
names(out3)
# [1] "Open" "High" "Low" "Close" "Volumn" "Adjusted" "stock"
# [8] "dates" "term" "estimate" "std.error" "statistic" "p.value"

scale/normalize columns by group

The issue is that you are using the wrong dplyr verb. summarise() creates one result per group per variable. What you want is mutate(), which transforms variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below are two approaches using dplyr:

df %>%
  group_by(Store) %>%
  mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

df %>%
  group_by(Store) %>%
  mutate_each(funs(normalit), Temperature, Sum_Sales)

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.
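
Note that mutate_each() and funs() are deprecated in current dplyr; a sketch of the modern equivalent using across(), assuming the same normalit() helper from the question:

df %>%
  group_by(Store) %>%
  mutate(across(c(Temperature, Sum_Sales), normalit))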


