Trying to use dplyr to group_by and apply scale()
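The problem seems to be in the base scale() function, which is designed for matrices: even when given a plain vector it returns a one-column matrix (with centering and scaling attributes) rather than a vector. Try writing your own: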
# z-score a vector, ignoring missing values
scale_this <- function(x){
  (x - mean(x, na.rm=TRUE)) / sd(x, na.rm=TRUE)
}
Then this works:
library("dplyr")
# reproducible sample data
set.seed(123)
n <- 1000
df <- data.frame(stud_ID = sample(LETTERS, size = n, replace = TRUE),
                 behavioral_scale = runif(n, 0, 10),
                 cognitive_scale = runif(n, 1, 20),
                 affective_scale = runif(n, 0, 1))
# standardize each scale within each student
scaled_data <-
  df %>%
  group_by(stud_ID) %>%
  mutate(behavioral_scale_ind = scale_this(behavioral_scale),
         cognitive_scale_ind = scale_this(cognitive_scale),
         affective_scale_ind = scale_this(affective_scale))
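To see how a hand-written helper differs from scale(), a quick check along these lines may help. It is only a sketch on a made-up vector x (not from the question) and uses base R alone:

```r
# Hypothetical toy vector, not taken from the question's data
x <- c(2, 4, 6, 8)

scale_this <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

base_result <- scale(x)       # one-column matrix with extra attributes
own_result  <- scale_this(x)  # plain numeric vector

is.matrix(base_result)   # TRUE
is.matrix(own_result)    # FALSE

# The numbers agree once the matrix structure is dropped
all.equal(as.numeric(base_result), own_result)   # TRUE
```

So wrapping scale() in as.numeric() is an alternative to writing your own function.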
Or, if you're open to a data.table solution:
library("data.table")
setDT(df)
cols_to_scale <- c("behavioral_scale","cognitive_scale","affective_scale")
df[, lapply(.SD, scale_this), .SDcols = cols_to_scale, keyby = factor(stud_ID)]
scale values within group in R
You can apply the scale function by group. This can be done in base R:
df$y2 <- with(df, ave(y, x, FUN = scale))
df
# x y y2
#1 1 1 -0.707107
#2 1 3 0.707107
#3 2 4 0.707107
#4 2 3 -0.707107
#5 3 5 1.091089
#6 3 2 -0.872872
#7 3 3 -0.218218
With dplyr:
library(dplyr)
df %>% group_by(x) %>% mutate(y2 = scale(y))
and in data.table:
library(data.table)
setDT(df)[, y2 := scale(y), x]
Data:
df <- data.frame(x=c(1,1,2,2,3,3,3),y=c(1,3,4,3,5,2,3))
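A simple way to convince yourself that the scaling really happened within groups is to check that every group ends up with mean 0 and standard deviation 1. The sketch below reuses the sample data and base R only; wrapping scale() in as.numeric() is my own choice here, to keep the result a plain vector:

```r
df <- data.frame(x = c(1, 1, 2, 2, 3, 3, 3), y = c(1, 3, 4, 3, 5, 2, 3))

# Scale y within each level of x; as.numeric() drops the matrix structure
df$y2 <- ave(df$y, df$x, FUN = function(v) as.numeric(scale(v)))

# Per-group summaries of the scaled column
group_means <- tapply(df$y2, df$x, mean)
group_sds   <- tapply(df$y2, df$x, sd)

round(group_means, 10)   # all zero (up to floating point)
round(group_sds, 10)     # all one
```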
Scaling by group in R using dplyr: grouping and non-grouping seem to generate the same result
The problem appears to be a bug in version 1.2.91 of RStudio. I downgraded to the stable build (version 1.1.383), and the output of mean(scaledByID$scaledScore == notScaledByID$scale) is now 0.
The version of R is the same for both (3.4.2).
using dplyr to split-apply-combine to scale vectors within a grouping variable
You can pass data to map and scale the wt column of each nested data frame.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
nest() %>%
mutate(wt.scaled = map(data, ~as.numeric(scale(.x$wt)))) %>%
unnest(c(wt.scaled, data))
# cyl mpg disp hp drat wt qsec vs am gear carb wt.scaled
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 6 21 160 110 3.9 2.62 16.5 0 1 4 4 -1.40
# 2 6 21 160 110 3.9 2.88 17.0 0 1 4 4 -0.680
# 3 6 21.4 258 110 3.08 3.22 19.4 1 0 3 1 0.275
# 4 6 18.1 225 105 2.76 3.46 20.2 1 0 3 1 0.962
# 5 6 19.2 168. 123 3.92 3.44 18.3 1 0 4 4 0.906
# 6 6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4 0.906
# 7 6 19.7 145 175 3.62 2.77 15.5 0 1 5 6 -0.974
# 8 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1 0.0602
# 9 4 24.4 147. 62 3.69 3.19 20 1 0 4 2 1.59
#10 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2 1.52
# … with 22 more rows
This is the same as scaling wt by group:
mtcars %>%
group_by(cyl) %>%
mutate(wt.scaled = as.numeric(scale(wt)))
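For comparison, here is a rough base R sketch of the same split-apply-combine idea, with each of the three steps written out. It is not part of the original answer; the intermediate names (pieces, scaled, wt_scaled, wt_ave) are made up for illustration:

```r
# Split: one vector of weights per cylinder group
pieces <- split(mtcars$wt, mtcars$cyl)

# Apply: scale each piece; as.numeric() drops the matrix returned by scale()
scaled <- lapply(pieces, function(v) as.numeric(scale(v)))

# Combine: put the scaled values back in the original row order
wt_scaled <- unsplit(scaled, mtcars$cyl)

# ave() performs all three steps in one call
wt_ave <- ave(mtcars$wt, mtcars$cyl, FUN = function(v) as.numeric(scale(v)))

all.equal(wt_scaled, wt_ave)   # TRUE
```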
R - use group_by() and mutate() in dplyr to apply function that returns a vector the length of groups
How about making use of nest instead:
foo %>%
group_by(fac) %>%
nest() %>%
  # one vector of distances per group, computed on the nested columns
  mutate(mahal = map(data, ~mahalanobis(
    .x,
    center = colMeans(.x, na.rm = TRUE),
    cov = cov(.x, use = "pairwise.complete.obs")))) %>%
  unnest(cols = c(mahal, data))
## A tibble: 10 x 4
# fac mahal x y
# <fct> <dbl> <dbl> <dbl>
# 1 A 1.02 -6.26 15.1
# 2 A 0.120 1.84 3.90
# 3 A 2.81 -8.36 -6.21
# 4 A 2.84 16.0 -22.1
# 5 A 1.21 3.30 11.2
# 6 B 2.15 -8.20 -0.449
# 7 B 2.86 4.87 -0.162
# 8 B 1.23 7.38 9.44
# 9 B 0.675 5.76 8.21
#10 B 1.08 -3.05 5.94
Here you avoid an explicit "x", "y" filter of the form temp <- x[, c("x", "y")], as you nest the relevant columns after grouping by fac. Applying mahalanobis is then straightforward.
Update
To respond to your comment, here is a purrr option. Since it's easy to lose track of what's going on, let's go step-by-step:
Generate sample data with one additional column.
set.seed(1)
foo <- data.frame(
x = rnorm(10, 0, 10),
y = rnorm(10, 0, 10),
z = rnorm(10, 0, 10),
fac = c(rep("A", 5), rep("B", 5)))
We now store the columns which define the subset of the data to be used for the calculation of the Mahalanobis distance in a list:
cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))
So we will calculate the Mahalanobis distance (per fac) for the subset of data in columns x + y and then separately for y + z. The names of cols will be used as the column names of the two distance vectors.
Now for the actual purrr command:
imap_dfc(cols, ~nest(foo %>% group_by(fac), .x, .key = !!.y) %>% select(!!.y)) %>%
mutate_all(function(lst) map(lst, ~mahalanobis(
.x,
center = colMeans(.x, na.rm = T),
cov = cov(., use = "pairwise.complete.obs")))) %>%
unnest() %>%
bind_cols(foo, .)
# x y z fac cols1 cols2
#1 -6.264538 15.1178117 9.1897737 A 1.0197542 1.3608052
#2 1.836433 3.8984324 7.8213630 A 0.1199607 1.1141352
#3 -8.356286 -6.2124058 0.7456498 A 2.8059562 1.5099574
#4 15.952808 -22.1469989 -19.8935170 A 2.8401953 3.0675228
#5 3.295078 11.2493092 6.1982575 A 1.2141337 0.9475794
#6 -8.204684 -0.4493361 -0.5612874 B 2.1517055 1.2284793
#7 4.874291 -0.1619026 -1.5579551 B 2.8626501 1.1724828
#8 7.383247 9.4383621 -14.7075238 B 1.2271316 2.5723023
#9 5.757814 8.2122120 -4.7815006 B 0.6746788 0.6939081
#10 -3.053884 5.9390132 4.1794156 B 1.0838341 2.3328276
In short, we
- loop over entries in cols, nest data in foo per fac based on columns defined in cols,
- apply mahalanobis on the nested and grouped data, generating as many distance columns with nested data as we have entries in cols (i.e. subsets), and
- finally unnest the distance data and column-bind it to the original foo data.
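If the purrr pipeline is hard to follow, the same steps can be sketched in base R. This is only an illustration, not the answer's method: maha_by_group is a hypothetical helper, and the loop over cols replaces imap_dfc. Note that mahalanobis() returns squared distances.

```r
set.seed(1)
foo <- data.frame(
  x = rnorm(10, 0, 10),
  y = rnorm(10, 0, 10),
  z = rnorm(10, 0, 10),
  fac = c(rep("A", 5), rep("B", 5)))
cols <- list(cols1 = c("x", "y"), cols2 = c("y", "z"))

# Hypothetical helper: squared Mahalanobis distance of each row
# to the centre of its own group, using only the columns in vars
maha_by_group <- function(d, vars, group) {
  out <- numeric(nrow(d))
  for (idx in split(seq_len(nrow(d)), d[[group]])) {
    m <- as.matrix(d[idx, vars])
    out[idx] <- mahalanobis(m,
                            center = colMeans(m, na.rm = TRUE),
                            cov = cov(m, use = "pairwise.complete.obs"))
  }
  out
}

# One distance vector per entry in cols, named after that entry
dist_cols <- lapply(cols, function(vars) maha_by_group(foo, vars, "fac"))

# Column-bind the distances to the original data
result <- cbind(foo, as.data.frame(dist_cols))
```

A handy sanity check: with the group mean as centre and the sample covariance, the distances within a group of n rows and p columns sum to (n - 1) * p, which is 8 here.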
Scale relative to a value in each group (via dplyr)
This solution is very similar to @thelatemail's, but I think it's sufficiently different to merit its own answer because it chooses the index based on a condition:
data %>%
group_by(category) %>%
mutate(value = value/value[year == baseYear])
# category year value
#... ... ... ...
#7 A 2002 1.00000000
#8 B 2002 1.00000000
#9 C 2002 1.00000000
#10 A 2003 0.86462789
#11 B 2003 1.07217943
#12 C 2003 0.82209897
(Data output has been truncated. To replicate these results, set.seed(123) when creating data.)
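Since the code that creates data is not shown here, the following base R sketch uses a small made-up data set (and a made-up baseYear) to illustrate the same idea: look up each category's base-year value, then divide every row by it. It assumes exactly one base-year row per category.

```r
# Made-up example data; the original data set is not reproduced here
data <- data.frame(
  category = rep(c("A", "B"), times = 3),
  year     = rep(2001:2003, each = 2),
  value    = c(10, 20, 15, 10, 5, 40))
baseYear <- 2001

# Named lookup vector: base-year value for each category
base_rows  <- data[data$year == baseYear, ]
base_value <- setNames(base_rows$value, as.character(base_rows$category))

# Divide every value by the base-year value of its own category
data$scaled <- data$value / unname(base_value[as.character(data$category)])
data$scaled
# 1.0 1.0 1.5 0.5 0.5 2.0
```

If a category had no row (or several rows) for baseYear, the grouped mutate above would return NA-free results only by accident or fail with a size error, so it is worth checking that condition first.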
dplyr: group_by, sum various columns, and apply a function based on grouped row sums?
To use dplyr, try the following:
library(dplyr)
df %>%
group_by(Percent_cover) %>%
summarise(across(contains("species"), sum)) %>%
mutate(rs = rowSums(select(., contains("species")))) %>%
mutate(across(contains('species'), ~./rs * 100)) -> result
result
For example, using mtcars:
mtcars %>%
group_by(cyl) %>%
summarise(across(disp:wt, sum)) %>%
mutate(rs = rowSums(select(., disp:wt))) %>%
mutate(across(disp:wt, ~./rs * 100))
# cyl disp hp drat wt rs
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 54.2 42.6 2.10 1.18 2135.
#2 6 58.7 39.2 1.15 0.998 2186.
#3 8 62.0 36.7 0.567 0.702 7974.
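As a cross-check, the same row percentages can be sketched in base R with aggregate() and prop.table(). The object names (sums, pct) are made up for this example:

```r
# Sum the four columns within each cyl group
sums <- aggregate(cbind(disp, hp, drat, wt) ~ cyl, data = mtcars, FUN = sum)

# Row-wise shares: each value divided by its row total, times 100
pct <- prop.table(as.matrix(sums[, -1]), margin = 1) * 100

# Each row of percentages should add up to 100
rowSums(pct)

cbind(cyl = sums$cyl, round(pct, 2))
```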
group_by and apply a rolling regression based on window using dplyr
We can do a group_split, map over the list elements, and then apply rollapply:
library(zoo)
library(dplyr)
library(purrr)
out <- df %>%
group_split(stock) %>%
map(~ rollapply(.x,
width = 30,
FUN = function(dat) {
LinearModel = lm(formula = Close ~ dates, as.data.frame(dat))
LinearModel$coef
}, by.column = FALSE, fill = NA_real_, align = "right"))
length(out)
#[1] 2
If we want to update the original dataset with more columns
out <- df %>%
group_split(stock) %>%
map_dfr(~ {
subdat <- .x
rollapply(subdat,
width = 30,
FUN = function(dat) {
LinearModel = lm(formula = Close ~ dates, as.data.frame(dat))
LinearModel$coef
}, by.column = FALSE, fill = NA_real_, align = "right") %>%
as.data.frame %>%
bind_cols(subdat, .)
}
)
ncol(out)
#[1] 38
ncol(df)
#[1] 8
In the devel version of dplyr, we can also do
out1 <- df %>%
group_by(stock) %>%
condense(out =rollapply(cur_data(), width = 30,
FUN = function(dat) lm(Close ~ dates, as.data.frame(dat))$coef,
by.column = FALSE, fill = NA_real_, align = "right") %>%
as.data.frame %>%
bind_cols(cur_data(), .))
out1
# A tibble: 2 x 2
# Rowwise: stock
# stock out
# <chr> <list>
#1 1 <tibble [3,309 × 37]>
#2 2 <tibble [3,309 × 37]>
The list column can be unnested when it is required
out1 %>%
unnest(c(out)) %>%
head(3)
# A tibble: 3 x 38
# stock Open High Low Close Volumn Adjusted dates `(Intercept)` `dates2007-01-0…
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
#1 1 232. 237. 230. 233. 1.55e7 233. 2007-01-03 NA NA
#2 1 234. 241. 233. 241. 1.58e7 241. 2007-01-04 NA NA
#3 1 240. 243. 238. 243. 1.38e7 243. 2007-01-05 NA NA
# … with 28 more variables: `dates2007-01-05` <dbl>, `dates2007-01-08` <dbl>,
# `dates2007-01-09` <dbl>, `dates2007-01-10` <dbl>, `dates2007-01-11` <dbl>,
# `dates2007-01-12` <dbl>, `dates2007-01-16` <dbl>, `dates2007-01-17` <dbl>,
# `dates2007-01-18` <dbl>, `dates2007-01-19` <dbl>, `dates2007-01-22` <dbl>,
# `dates2007-01-23` <dbl>, `dates2007-01-24` <dbl>, `dates2007-01-25` <dbl>,
# `dates2007-01-26` <dbl>, `dates2007-01-29` <dbl>, `dates2007-01-30` <dbl>,
# `dates2007-01-31` <dbl>, `dates2007-02-01` <dbl>, `dates2007-02-02` <dbl>,
# `dates2007-02-05` <dbl>, `dates2007-02-06` <dbl>, `dates2007-02-07` <dbl>,
# `dates2007-02-08` <dbl>, `dates2007-02-09` <dbl>, `dates2007-02-12` <dbl>,
# `dates2007-02-13` <dbl>, `dates2007-02-14` <dbl>
We can also apply tidy (from broom) to the model inside the rolling function
library(broom)
out3 <- df %>%
group_split(stock) %>%
map_dfr(~ {
subdat <- .x
rollapply(subdat,
width = 30,
FUN = function(dat) {
LinearModel = lm(formula = Close ~ dates, as.data.frame(dat))
tidy(LinearModel)
}, by.column = FALSE, fill = NA_real_, align = "right") %>%
as.data.frame %>%
bind_cols(subdat, .)
}
)
dim(out3)
#[1] 6618 13
names(out3)
# [1] "Open" "High" "Low" "Close" "Volumn" "Adjusted" "stock"
# [8] "dates" "term" "estimate" "std.error" "statistic" "p.value"
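The coefficient names in the unnested output earlier in this answer (`dates2007-01-05` and so on) suggest that dates ended up being treated as categorical rather than as a numeric trend. If a single slope per window is what you are after, a rough base R sketch along the following lines may be closer to it. Everything here is an assumption for illustration: roll_coef is a hypothetical helper, the toy data is made up, and the window width is 5 instead of 30.

```r
# Hypothetical helper: rolling OLS of y on a numeric time index t,
# right-aligned windows, NA until a full window is available
roll_coef <- function(y, t, width) {
  n <- length(y)
  out <- matrix(NA_real_, nrow = n, ncol = 2,
                dimnames = list(NULL, c("intercept", "slope")))
  if (n >= width) {
    for (i in width:n) {
      idx <- (i - width + 1):i
      out[i, ] <- coef(lm(y[idx] ~ t[idx]))
    }
  }
  out
}

# Made-up data: two stocks with exactly linear prices
toy <- data.frame(stock = rep(c("1", "2"), each = 8),
                  day   = rep(1:8, times = 2))
toy$Close <- ifelse(toy$stock == "1", 2 + 3 * toy$day, 10 - toy$day)

# Split by stock, add the rolling coefficients, and bind back together
by_stock <- lapply(split(toy, toy$stock),
                   function(d) cbind(d, roll_coef(d$Close, d$day, width = 5)))
out <- do.call(rbind, by_stock)
```

With real dates, as.numeric(dates) could serve as the time index, in which case the slope is in price units per day.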
scale/normalize columns by group
The issue is that you are using the wrong dplyr verb. summarize creates one result per group per variable. What you want is mutate, which changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below are two approaches using dplyr:
df %>%
group_by(Store) %>%
mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))
df %>%
group_by(Store) %>%
mutate(across(c(Temperature, Sum_Sales), normalit))
Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.
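The definition of normalit is not shown in this answer. The sketch below assumes it is plain min-max scaling, which is a guess, and applies it per Store in base R with ave() on made-up data that only borrows the column names:

```r
# Assumption: normalit is min-max scaling; the question's own definition may differ
normalit <- function(m) (m - min(m)) / (max(m) - min(m))

# Made-up data using the column names from the answer
df <- data.frame(Store       = c(1, 1, 1, 2, 2, 2),
                 Temperature = c(10, 20, 30, 5, 10, 25),
                 Sum_Sales   = c(100, 300, 200, 50, 150, 100))

# Base R counterpart of the grouped mutate: normalise each column within Store
for (v in c("Temperature", "Sum_Sales")) {
  df[[v]] <- ave(df[[v]], df$Store, FUN = normalit)
}
df
```

Each column now runs from 0 to 1 within every store, and the data frame keeps its original number of rows, which is the behaviour mutate gives and summarize does not.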