Standardize data columns in R
I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric you can simply call the scale
function on the data to do what you want.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
Using built in functions is classy. Like this cat:
Standardize variables using dplyr [r]
It is not clear why you first make data in long format and then return back to wide, neither why you don't prefer scale(df)
which is computationally much faster.
Anyway, if you really want to use a code similar to the one you like, you need to perform a further unnest
action in order to return the data to the original format.
df %>%
gather() %>%
group_by(key) %>%
mutate(value = value - mean(value)) %>%
ungroup() %>%
pivot_wider(names_from = key, values_from = value) %>%
unnest(everything())
# A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.909 -0.188 -70.7 -36.7 0.303 -0.597 -1.39 -0.438 0.594 0.312 1.19
# 2 0.909 -0.188 -70.7 -36.7 0.303 -0.342 -0.829 -0.438 0.594 0.312 1.19
# 3 2.71 -2.19 -123. -53.7 0.253 -0.897 0.761 0.562 0.594 0.312 -1.81
# 4 1.31 -0.188 27.3 -36.7 -0.517 -0.00225 1.59 0.562 -0.406 -0.688 -1.81
# 5 -1.39 1.81 129. 28.3 -0.447 0.223 -0.829 -0.438 -0.406 -0.688 -0.812
# 6 -1.99 -0.188 -5.72 -41.7 -0.837 0.243 2.37 0.562 -0.406 -0.688 -1.81
# 7 -5.79 1.81 129. 98.3 -0.387 0.353 -2.01 -0.438 -0.406 -0.688 1.19
# 8 4.31 -2.19 -84.0 -84.7 0.0934 -0.0272 2.15 0.562 -0.406 0.312 -0.812
# 9 2.71 -2.19 -89.9 -51.7 0.323 -0.0673 5.05 0.562 -0.406 0.312 -0.812
# 10 -0.891 -0.188 -63.1 -23.7 0.323 0.223 0.451 0.562 -0.406 0.312 1.19
# ... with 22 more rows
EDIT
In order to use the most up-to-date tidyr
functions, you should consider substituting gather
, which is retired now, with pivot_longer
as in the following code. The obtained result is the same.
df %>%
pivot_longer(everything()) %>%
group_by(name) %>%
mutate(value = value - mean(value)) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value) %>%
unnest(everything())
How to standardize values in a column based on grouping by two other columns in R?
What you have done seems to be right. You cannot use summarise()
as it returns a single value and not a vector.
From your question, I'm not sure if you want to scale the value for each group or find the sum of value for each group. I've sampled both the cases.
# Sample data
age sex values
1 <10 M 1
2 <10 M 2
3 >10 F 3
4 >10 F 4
5 >10 M 5
# Scaling value
df %>% group_by(age, sex) %>% mutate(std_value = scale(values))
age sex values std_value
<fct> <fct> <dbl> <dbl>
1 <10 M 1 -0.707
2 <10 M 2 0.707
3 >10 F 3 -0.707
4 >10 F 4 0.707
5 >10 M 5 NaN
# Sum of values
df %>% group_by(age, sex) %>% mutate(sum_value = sum(values))
age sex values sum_value
<fct> <fct> <dbl> <dbl>
1 <10 M 1 3
2 <10 M 2 3
3 >10 F 3 7
4 >10 F 4 7
5 >10 M 5 5
Standardize data columns in R in subgrups
You can use dplyr
for this:
library(dplyr)
new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
To work around rows that return NA
s, you can either keep the original values as they are or filter them out before scaling:
Keeping the original values (by keeping scaling only instances where NROW
is greater than 1):
new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = ifelse(NROW(salesValue) > 1, scale(salesValue), salesValue)) %>%
ungroup
Filtering them out (as suggested by @steveb):
new_df <- df %>% group_by(salesPerson, quater) %>%
filter(n() > 1) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup
I hope this helps.
R Normalize Many Columns
Try scale
like below
cbind(data, `colnames<-`(scale(data[normalize_these]), paste0(normalize_these, "NEW")))
If you would like to use data.table
, below might be an option
setDT(data)
data[, paste0(normalize_these, "NEW") := lapply(.SD, scale), .SDcols = normalize_these]
Standardizing columns by groups using r
Could not use the example that you gave: So will use the iris
dataset:
library(tidyverse)
iris %>%
group_by(Species) %>%
mutate(across(contains(c("Width","Length")),scale))
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length[,1] Sepal.Width[,1] Petal.Length[,1] Petal.Width[,1] Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 0.267 0.190 -0.357 -0.436 setosa
2 -0.301 -1.13 -0.357 -0.436 setosa
3 -0.868 -0.601 -0.933 -0.436 setosa
4 -1.15 -0.865 0.219 -0.436 setosa
5 -0.0170 0.454 -0.357 -0.436 setosa
6 1.12 1.25 1.37 1.46 setosa
7 -1.15 -0.0739 -0.357 0.512 setosa
8 -0.0170 -0.0739 0.219 -0.436 setosa
9 -1.72 -1.39 -0.357 -0.436 setosa
10 -0.301 -0.865 0.219 -1.39 setosa
# ... with 140 more rows
If you are still using the old format, you could do:
iris %>%
group_by(Species) %>%
mutate_at(vars(contains(c("Width","Length"))),scale)
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length[,1] Sepal.Width[,1] Petal.Length[,1] Petal.Width[,1] Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 0.267 0.190 -0.357 -0.436 setosa
2 -0.301 -1.13 -0.357 -0.436 setosa
3 -0.868 -0.601 -0.933 -0.436 setosa
4 -1.15 -0.865 0.219 -0.436 setosa
5 -0.0170 0.454 -0.357 -0.436 setosa
6 1.12 1.25 1.37 1.46 setosa
7 -1.15 -0.0739 -0.357 0.512 setosa
8 -0.0170 -0.0739 0.219 -0.436 setosa
9 -1.72 -1.39 -0.357 -0.436 setosa
10 -0.301 -0.865 0.219 -1.39 setosa
# ... with 140 more rows
How to standardize data in r before running regressions in r?
Here are a couple of options to create a logical index indicating whether or not an outlier is detected. I've considered an outlier to be greater than 3 sds above the mean, as you have indicated (not checking for "lesser than").
Using apply
, specifying the argument margin=2
to apply the function column by column.
numericaldata <- datafinal[2:ncol(datafinal)]
apply(numericaldata, margin=2, function(x) {
x>3 * sd(x, na.rm = TRUE) + mean(x, na.rm = TRUE)
})%>%
cbind.data.frame(Date1=datafinal$Date1,.)
Using dplyr:
datafinal%>%
mutate(across(.cols = 2:ncol(.), ~ . > 3 * sd(., na.rm = TRUE) + mean(., na.rm = TRUE)))
Results:
Date1 A B C D E
1 2017-01-01 FALSE FALSE FALSE NA FALSE
2 2017-02-01 FALSE FALSE FALSE NA FALSE
3 2017-03-01 FALSE FALSE FALSE FALSE FALSE
4 2017-04-01 FALSE FALSE FALSE FALSE FALSE
Should you want to know how many outliers were found, you can summarise the results like so:
results%>%
summarise(across(2:ncol(.), ~sum(., na.rm=TRUE)))
R Standardizing numeric variables in dataframe while retaining factor variables
You need to use rapply
instead of sapply
set.seed(1)
> df=data.frame(A=rnorm(10),b=1:10,C=as.factor(rep(1:2,5)))
> str(df)
'data.frame': 10 obs. of 3 variables:
$ A: num -0.626 0.184 -0.836 1.595 0.33 ...
$ b: int 1 2 3 4 5 6 7 8 9 10
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
The code you need to use:
> D=rapply(df,scale,c("numeric","integer"),how="replace")
> D
A b C
1 -0.97190653 -1.4863011 1
2 0.06589991 -1.1560120 2
3 -1.23987805 -0.8257228 1
4 1.87433300 -0.4954337 2
5 0.25276523 -0.1651446 1
6 -1.22045645 0.1651446 2
7 0.45507643 0.4954337 1
8 0.77649606 0.8257228 2
9 0.56826358 1.1560120 1
10 -0.56059319 1.4863011 2
> str(D)
'data.frame': 10 obs. of 3 variables:
$ A: num [1:10, 1] -0.9719 0.0659 -1.2399 1.8743 0.2528 ...
..- attr(*, "scaled:center")= num 0.132
..- attr(*, "scaled:scale")= num 0.781
$ b: num [1:10, 1] -1.486 -1.156 -0.826 -0.495 -0.165 ...
..- attr(*, "scaled:center")= num 5.5
..- attr(*, "scaled:scale")= num 3.03
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
>
scale/normalize columns by group
The issue is that you are using the wrong dplyr verb. Summarize will create one result per group per variable. What you want is mutate. Mutate changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below two approaches using dplyr:
df %>%
group_by(Store) %>%
mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))
df %>%
group_by(Store) %>%
mutate_each(funs(normalit), Temperature, Sum_Sales)
Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.
How to standardize selected columns in panel data frame
We can use mutate_at
library(dplyr)
df %>%
group_by(unit) %>%
mutate_at(vars(X2, X4, X6, X7, X9), funs(as.vector(scale(.))))
Related Topics
How to See the Source Code of R .Internal or .Primitive Function
Tools For Making Latex Tables in R
How to Count Runs in a Sequence
Unordered Combinations of All Lengths
Why Does X[Y] Join of Data.Tables Not Allow a Full Outer Join, or a Left Join
How to Delete Rows from a Dataframe That Contain N*Na
How to Make Consistent-Width Plots in Ggplot (With Legends)
Putting Mathematical Symbols and Subscripts Mixed With Regular Letters
What Are the Main Differences Between R Data Files
How to Extract a Single Column from a Data.Frame as a Data.Frame
Simpler Population Pyramid in Ggplot2
Scatterplot With Marginal Histograms in Ggplot2
How to Convert Long to Wide Format With Counts
Subset Data Frame Based on Multiple Conditions
Ggplot, Drawing Line Between Points Across Facets
Add a Variable to a Data Frame Containing Max Value of Each Row