Standardize Data Columns in R

Standardize data columns in R

I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric you can simply call the scale function on the data to do what you want.

dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)

# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)

Using built in functions is classy. Like this cat:

Sample Image

Standardize variables using dplyr [r]

It is not clear why you first make data in long format and then return back to wide, neither why you don't prefer scale(df) which is computationally much faster.

Anyway, if you really want to use a code similar to the one you like, you need to perform a further unnest action in order to return the data to the original format.

df %>% 
gather() %>%
group_by(key) %>%
mutate(value = value - mean(value)) %>%
ungroup() %>%
pivot_wider(names_from = key, values_from = value) %>%
unnest(everything())

# A tibble: 32 x 11
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.909 -0.188 -70.7 -36.7 0.303 -0.597 -1.39 -0.438 0.594 0.312 1.19
# 2 0.909 -0.188 -70.7 -36.7 0.303 -0.342 -0.829 -0.438 0.594 0.312 1.19
# 3 2.71 -2.19 -123. -53.7 0.253 -0.897 0.761 0.562 0.594 0.312 -1.81
# 4 1.31 -0.188 27.3 -36.7 -0.517 -0.00225 1.59 0.562 -0.406 -0.688 -1.81
# 5 -1.39 1.81 129. 28.3 -0.447 0.223 -0.829 -0.438 -0.406 -0.688 -0.812
# 6 -1.99 -0.188 -5.72 -41.7 -0.837 0.243 2.37 0.562 -0.406 -0.688 -1.81
# 7 -5.79 1.81 129. 98.3 -0.387 0.353 -2.01 -0.438 -0.406 -0.688 1.19
# 8 4.31 -2.19 -84.0 -84.7 0.0934 -0.0272 2.15 0.562 -0.406 0.312 -0.812
# 9 2.71 -2.19 -89.9 -51.7 0.323 -0.0673 5.05 0.562 -0.406 0.312 -0.812
# 10 -0.891 -0.188 -63.1 -23.7 0.323 0.223 0.451 0.562 -0.406 0.312 1.19
# ... with 22 more rows


EDIT

In order to use the most up-to-date tidyr functions, you should consider substituting gather, which is retired now, with pivot_longer as in the following code. The obtained result is the same.

df %>% 
pivot_longer(everything()) %>%
group_by(name) %>%
mutate(value = value - mean(value)) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = value) %>%
unnest(everything())

How to standardize values in a column based on grouping by two other columns in R?

What you have done seems to be right. You cannot use summarise() as it returns a single value and not a vector.

From your question, I'm not sure if you want to scale the value for each group or find the sum of value for each group. I've sampled both the cases.

# Sample data
age sex values
1 <10 M 1
2 <10 M 2
3 >10 F 3
4 >10 F 4
5 >10 M 5

# Scaling value
df %>% group_by(age, sex) %>% mutate(std_value = scale(values))
age sex values std_value
<fct> <fct> <dbl> <dbl>
1 <10 M 1 -0.707
2 <10 M 2 0.707
3 >10 F 3 -0.707
4 >10 F 4 0.707
5 >10 M 5 NaN

# Sum of values
df %>% group_by(age, sex) %>% mutate(sum_value = sum(values))
age sex values sum_value
<fct> <fct> <dbl> <dbl>
1 <10 M 1 3
2 <10 M 2 3
3 >10 F 3 7
4 >10 F 4 7
5 >10 M 5 5

Standardize data columns in R in subgrups

You can use dplyr for this:

library(dplyr)

new_df <- df %>% group_by(salesPerson, quater) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup

To work around rows that return NAs, you can either keep the original values as they are or filter them out before scaling:

Keeping the original values (by keeping scaling only instances where NROW is greater than 1):

new_df <- df %>% group_by(salesPerson, quater) %>% 
mutate(scaled_Col = ifelse(NROW(salesValue) > 1, scale(salesValue), salesValue)) %>%
ungroup

Filtering them out (as suggested by @steveb):

new_df <- df %>% group_by(salesPerson, quater) %>% 
filter(n() > 1) %>%
mutate(scaled_Col = scale(salesValue)) %>%
ungroup

I hope this helps.

R Normalize Many Columns

Try scale like below

cbind(data, `colnames<-`(scale(data[normalize_these]), paste0(normalize_these, "NEW")))

If you would like to use data.table, below might be an option

setDT(data)
data[, paste0(normalize_these, "NEW") := lapply(.SD, scale), .SDcols = normalize_these]

Standardizing columns by groups using r

Could not use the example that you gave: So will use the iris dataset:

library(tidyverse)
iris %>%
group_by(Species) %>%
mutate(across(contains(c("Width","Length")),scale))
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length[,1] Sepal.Width[,1] Petal.Length[,1] Petal.Width[,1] Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 0.267 0.190 -0.357 -0.436 setosa
2 -0.301 -1.13 -0.357 -0.436 setosa
3 -0.868 -0.601 -0.933 -0.436 setosa
4 -1.15 -0.865 0.219 -0.436 setosa
5 -0.0170 0.454 -0.357 -0.436 setosa
6 1.12 1.25 1.37 1.46 setosa
7 -1.15 -0.0739 -0.357 0.512 setosa
8 -0.0170 -0.0739 0.219 -0.436 setosa
9 -1.72 -1.39 -0.357 -0.436 setosa
10 -0.301 -0.865 0.219 -1.39 setosa
# ... with 140 more rows

If you are still using the old format, you could do:

iris %>%
group_by(Species) %>%
mutate_at(vars(contains(c("Width","Length"))),scale)
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length[,1] Sepal.Width[,1] Petal.Length[,1] Petal.Width[,1] Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 0.267 0.190 -0.357 -0.436 setosa
2 -0.301 -1.13 -0.357 -0.436 setosa
3 -0.868 -0.601 -0.933 -0.436 setosa
4 -1.15 -0.865 0.219 -0.436 setosa
5 -0.0170 0.454 -0.357 -0.436 setosa
6 1.12 1.25 1.37 1.46 setosa
7 -1.15 -0.0739 -0.357 0.512 setosa
8 -0.0170 -0.0739 0.219 -0.436 setosa
9 -1.72 -1.39 -0.357 -0.436 setosa
10 -0.301 -0.865 0.219 -1.39 setosa
# ... with 140 more rows

How to standardize data in r before running regressions in r?

Here are a couple of options to create a logical index indicating whether or not an outlier is detected. I've considered an outlier to be greater than 3 sds above the mean, as you have indicated (not checking for "lesser than").

Using apply, specifying the argument margin=2 to apply the function column by column.

numericaldata <- datafinal[2:ncol(datafinal)]

apply(numericaldata, margin=2, function(x) {
x>3 * sd(x, na.rm = TRUE) + mean(x, na.rm = TRUE)
})%>%
cbind.data.frame(Date1=datafinal$Date1,.)

Using dplyr:

datafinal%>%
mutate(across(.cols = 2:ncol(.), ~ . > 3 * sd(., na.rm = TRUE) + mean(., na.rm = TRUE)))

Results:

       Date1     A     B     C     D     E
1 2017-01-01 FALSE FALSE FALSE NA FALSE
2 2017-02-01 FALSE FALSE FALSE NA FALSE
3 2017-03-01 FALSE FALSE FALSE FALSE FALSE
4 2017-04-01 FALSE FALSE FALSE FALSE FALSE

Should you want to know how many outliers were found, you can summarise the results like so:

results%>%
summarise(across(2:ncol(.), ~sum(., na.rm=TRUE)))

R Standardizing numeric variables in dataframe while retaining factor variables

You need to use rapply instead of sapply

set.seed(1)
> df=data.frame(A=rnorm(10),b=1:10,C=as.factor(rep(1:2,5)))
> str(df)
'data.frame': 10 obs. of 3 variables:
$ A: num -0.626 0.184 -0.836 1.595 0.33 ...
$ b: int 1 2 3 4 5 6 7 8 9 10
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2

The code you need to use:

> D=rapply(df,scale,c("numeric","integer"),how="replace")
> D
A b C
1 -0.97190653 -1.4863011 1
2 0.06589991 -1.1560120 2
3 -1.23987805 -0.8257228 1
4 1.87433300 -0.4954337 2
5 0.25276523 -0.1651446 1
6 -1.22045645 0.1651446 2
7 0.45507643 0.4954337 1
8 0.77649606 0.8257228 2
9 0.56826358 1.1560120 1
10 -0.56059319 1.4863011 2
> str(D)
'data.frame': 10 obs. of 3 variables:
$ A: num [1:10, 1] -0.9719 0.0659 -1.2399 1.8743 0.2528 ...
..- attr(*, "scaled:center")= num 0.132
..- attr(*, "scaled:scale")= num 0.781
$ b: num [1:10, 1] -1.486 -1.156 -0.826 -0.495 -0.165 ...
..- attr(*, "scaled:center")= num 5.5
..- attr(*, "scaled:scale")= num 3.03
$ C: Factor w/ 2 levels "1","2": 1 2 1 2 1 2 1 2 1 2
>

scale/normalize columns by group

The issue is that you are using the wrong dplyr verb. Summarize will create one result per group per variable. What you want is mutate. Mutate changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below two approaches using dplyr:

df %>%
group_by(Store) %>%
mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

df %>%
group_by(Store) %>%
mutate_each(funs(normalit), Temperature, Sum_Sales)

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.

How to standardize selected columns in panel data frame

We can use mutate_at

library(dplyr)
df %>%
group_by(unit) %>%
mutate_at(vars(X2, X4, X6, X7, X9), funs(as.vector(scale(.))))


Related Topics



Leave a reply



Submit