Scale/Normalize Columns by Group

The issue is that you are using the wrong dplyr verb. summarize() produces one result per group per variable. What you want is mutate(), which modifies variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below are two approaches using dplyr:

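df %>%
  group_by(Store) %>%
  mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

df %>%
  group_by(Store) %>%
  mutate_each(funs(normalit), Temperature, Sum_Sales)
# mutate_each() and funs() are deprecated; in dplyr >= 1.0.0 use instead:
#   mutate(across(c(Temperature, Sum_Sales), normalit))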

Note: The Store variable differs between your data and your desired result. I assumed that @jlhoward used the right data.
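
For readers working in pandas, the same summarize-versus-mutate distinction shows up as agg versus transform. The sketch below is only an illustration: the data values are made up, and normalit is assumed here to be a min-max rescaling, which the answer above does not define.

```python
import pandas as pd

# Made-up data for illustration; column names follow the question.
df = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2],
    "Temperature": [10.0, 20.0, 30.0, 5.0, 15.0],
})

def normalit(s):
    # Assumed definition: min-max rescaling to [0, 1].
    return (s - s.min()) / (s.max() - s.min())

# agg collapses each group to a single value (like summarize).
per_group = df.groupby("Store")["Temperature"].agg("mean")

# transform returns one value per original row (like mutate).
per_row = df.groupby("Store")["Temperature"].transform(normalit)

print(len(per_group), len(per_row))
```

The aggregated result has one entry per store, while the transformed result lines up row for row with the original frame, so it can be assigned straight back as a column.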

Using `scale()` to normalize all numeric columns in a data.frame

We can do

j <- sapply(dat, is.numeric)
dat[j] <- scale(dat[j])
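
A rough pandas counterpart, for comparison, might look like the sketch below. The frame and its column names are made up; select_dtypes stands in for the sapply(dat, is.numeric) step.

```python
import pandas as pd

# Made-up frame mixing numeric and non-numeric columns.
dat = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 20.0, 30.0, 40.0],
    "label": ["a", "b", "c", "d"],
})

# Pick out the numeric columns, as sapply(dat, is.numeric) does in R.
num_cols = dat.select_dtypes(include="number").columns

# Center and scale each numeric column. pandas' std() divides by n - 1,
# the same denominator R's sd() uses.
dat[num_cols] = (dat[num_cols] - dat[num_cols].mean()) / dat[num_cols].std()
```

The non-numeric column is left untouched, which is the point of selecting columns first.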

Normalize DataFrame by group

In [10]: df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())

should do it.
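
Here is a self-contained version of that one-liner with made-up data (the value column name "a" is hypothetical), alongside a variant that uses the built-in "mean" and "std" transforms instead of a lambda. Both should give the same z-scores.

```python
import pandas as pd

# Made-up data; 'indx' is the grouping column from the snippet above.
df = pd.DataFrame({
    "indx": [1, 1, 1, 2, 2, 2],
    "a": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

g = df.groupby("indx")["a"]

# Lambda form, as in the answer.
z_lambda = g.transform(lambda x: (x - x.mean()) / x.std())

# Same result from built-in group statistics broadcast back to each row.
z_vector = (df["a"] - g.transform("mean")) / g.transform("std")
```

The second form avoids calling a Python function once per group, which tends to matter when there are many small groups.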

Standardize data columns in R

I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a data frame and all the columns are numeric, you can simply call the scale() function on the data to do what you want.

dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)

# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
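
One thing to watch when reproducing this check outside R: the standard deviation is 1 only under the same denominator that was used for scaling. R's sd() divides by n - 1, pandas' std() does the same by default, but NumPy's std() divides by n unless told otherwise. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_sd = x.std(ddof=1)        # divides by n - 1, like R's sd()
population_sd = x.std()          # NumPy default ddof=0, divides by n
pandas_sd = pd.Series(x).std()   # pandas default ddof=1

print(sample_sd, population_sd, pandas_sd)
```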

Using built-in functions is classy.

Normalize columns of a dataframe

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

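x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)  # scales each column to [0, 1]
# Pass columns and index so the labels of the original frame are kept
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)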

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
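
If pulling in scikit-learn is not desirable, the same column-wise arithmetic that MinMaxScaler applies with its default feature_range of (0, 1) can be written in pandas directly. A minimal sketch with a made-up frame:

```python
import pandas as pd

# Made-up data; index labels are included to show that they survive.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]},
    index=["r1", "r2", "r3"],
)

# Column-wise (x - min) / (max - min); column names and index are kept.
df_scaled = (df - df.min()) / (df.max() - df.min())
```

Unlike a fitted scaler, this does not remember the minima and maxima, so it is not a substitute when the same scaling has to be applied to new data later.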

How to standardize values in a column based on grouping by two other columns in R?

What you have done seems to be right. You cannot use summarise() here because it returns a single value per group, not a vector with one entry per row.

From your question, I'm not sure whether you want to scale the values within each group or find the sum of the values for each group, so I've shown both cases.

# Sample data
  age sex values
1 <10   M      1
2 <10   M      2
3 >10   F      3
4 >10   F      4
5 >10   M      5

# Scaling value
df %>% group_by(age, sex) %>% mutate(std_value = scale(values))
  age   sex   values std_value
  <fct> <fct>  <dbl>     <dbl>
1 <10   M          1    -0.707
2 <10   M          2     0.707
3 >10   F          3    -0.707
4 >10   F          4     0.707
5 >10   M          5       NaN

# Sum of values
df %>% group_by(age, sex) %>% mutate(sum_value = sum(values))
  age   sex   values sum_value
  <fct> <fct>  <dbl>     <dbl>
1 <10   M          1         3
2 <10   M          2         3
3 >10   F          3         7
4 >10   F          4         7
5 >10   M          5         5
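
For comparison, here is a pandas sketch of the same two operations on the sample data. It also shows where the NaN in the scaled output comes from: the (>10, M) group has a single row, and the sample standard deviation of one observation is undefined because the denominator n - 1 is zero.

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["<10", "<10", ">10", ">10", ">10"],
    "sex": ["M", "M", "F", "F", "M"],
    "values": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Group by both columns at once by passing a list of keys.
grouped = df.groupby(["age", "sex"])["values"]

# Scaling within each group; single-row groups get NaN from std().
df["std_value"] = (df["values"] - grouped.transform("mean")) / grouped.transform("std")

# Group sum repeated on every row of the group.
df["sum_value"] = grouped.transform("sum")
```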

Normalize by Group

Your desired output suggests that you want this:

df <- read.table(header=TRUE, text=
'ID Item StrengthCode
7 A 1
7 A 5
7 A 7
8 B 1
8 B 3
9 A 5
9 A 3')
df$Nor <- ave(df$StrengthCode, df$Item, FUN=function(x) x/max(x))
df
# > df
#   ID Item StrengthCode       Nor
# 1  7    A            1 0.1428571
# 2  7    A            5 0.7142857
# 3  7    A            7 1.0000000
# 4  8    B            1 0.3333333
# 5  8    B            3 1.0000000
# 6  9    A            5 0.7142857
# 7  9    A            3 0.4285714

With dplyr you can do the following (thanks to Sotos for the comment and code):

library("dplyr")
(df %>% group_by(Item) %>% mutate(Nor = StrengthCode/max(StrengthCode)))
# > (df %>% group_by(Item) %>% mutate(Nor = StrengthCode/max(StrengthCode)))
# Source: local data frame [7 x 4]
# Groups: Item [2]
#
#      ID   Item StrengthCode       Nor
#   <int> <fctr>        <int>     <dbl>
# 1     7      A            1 0.1428571
# 2     7      A            5 0.7142857
# 3     7      A            7 1.0000000
# 4     8      B            1 0.3333333
# 5     8      B            3 1.0000000
# 6     9      A            5 0.7142857
# 7     9      A            3 0.4285714
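
The same divide-by-group-maximum step can be sketched in pandas, using the data from the example above. transform("max") plays the role that ave() and the grouped mutate() play in R: it broadcasts each group's maximum back to every row.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [7, 7, 7, 8, 8, 9, 9],
    "Item": ["A", "A", "A", "B", "B", "A", "A"],
    "StrengthCode": [1, 5, 7, 1, 3, 5, 3],
})

# Per-row group maximum, aligned with the original index.
group_max = df.groupby("Item")["StrengthCode"].transform("max")
df["Nor"] = df["StrengthCode"] / group_max
```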

Normalize a column of dataframe using min max normalization based on groupby of another column

You are almost there.

>>> df
    Name        Job  Salary
0   john    painter   40000
1  peter   engineer   50000
2    sam    plumber   30000
3   john     doctor  500000
4   john     driver   20000
5    sam  carpenter   10000
6  peter  scientist  100000
>>>
>>> # Select 'Salary' explicitly so the string column 'Job' is not passed to the lambda
>>> result = df.assign(Salary=df.groupby('Name')['Salary'].transform(lambda x: (x - x.min()) / (x.max() - x.min())))
>>> # alternatively, df['Salary'] = df.groupby(... if you don't need a new frame
>>> result
    Name        Job    Salary
0   john    painter  0.041667
1  peter   engineer  0.000000
2    sam    plumber  1.000000
3   john     doctor  1.000000
4   john     driver  0.000000
5    sam  carpenter  0.000000
6  peter  scientist  1.000000

So basically, you just forgot to enclose x.max() - x.min() in parentheses.


Note that this can be done much faster with a series of vectorized operations.

>>> grouper = df.groupby('Name')['Salary']                                                                             
>>> maxes = grouper.transform('max')
>>> mins = grouper.transform('min')
>>>
>>> result = df.assign(Salary=(df.Salary - mins)/(maxes - mins))
>>> result
    Name        Job    Salary
0   john    painter  0.041667
1  peter   engineer  0.000000
2    sam    plumber  1.000000
3   john     doctor  1.000000
4   john     driver  0.000000
5    sam  carpenter  0.000000
6  peter  scientist  1.000000

Timings:

>>> # Setup
>>> df = pd.concat([df]*1000, ignore_index=True)
>>> df.Name = np.arange(len(df)//4).repeat(4) # 4 names per group
>>> df
      Name        Job  Salary
0        0    painter   40000
1        0   engineer   50000
2        0    plumber   30000
3        0     doctor  500000
4        1     driver   20000
...    ...        ...     ...
6995  1748    plumber   30000
6996  1749     doctor  500000
6997  1749     driver   20000
6998  1749  carpenter   10000
6999  1749  scientist  100000

[7000 rows x 3 columns]
>>>
>>> # Tests @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min()))
1.19 s ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
...: grouper = df.groupby('Name')['Salary']
...: maxes = grouper.transform('max')
...: mins = grouper.transform('min')
...: (df.Salary - mins)/(maxes - mins)
...:
...:
3.04 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
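
One edge case worth keeping in mind with either approach: a group whose values are all equal (including a group with a single row) has max - min = 0, so the division is 0/0 and pandas returns NaN for those rows. The sketch below uses made-up data (the name "amy" is hypothetical) and fills those rows with 0.0, which is an arbitrary choice rather than anything the answer prescribes.

```python
import pandas as pd

# Made-up data: 'sam' has two identical salaries, 'amy' has a single row.
df = pd.DataFrame({
    "Name": ["john", "john", "sam", "sam", "amy"],
    "Salary": [40000, 20000, 30000, 30000, 50000],
})

grouper = df.groupby("Name")["Salary"]
mins = grouper.transform("min")
span = grouper.transform("max") - mins

# Constant groups give 0/0, which pandas evaluates to NaN.
scaled = (df["Salary"] - mins) / span

# Arbitrary convention for this sketch: map constant groups to 0.0.
result = df.assign(Salary=scaled.fillna(0.0))
```

Note that fillna would also overwrite NaN values that come from missing salaries, so data with genuine gaps would need a more careful mask.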

Min-max normalization in R, setting groups of min and max based on another column

library(tidyverse)

df %>%
  group_by(y) %>%
  mutate(xnorm = (x - min(x)) / (max(x) - min(x))) %>%
  ungroup()

Output:

# A tibble: 11 x 3
       x     y xnorm
   <dbl> <dbl> <dbl>
 1   0       1 0
 2   0.5     1 0.2
 3   1       1 0.4
 4   2.5     1 1
 5   0.2     2 0
 6   0.3     2 0.333
 7   0.5     2 1
 8   0       3 0
 9   0       3 0
10   0.1     3 0.143
11   0.7     3 1

Or, in the mutate() statement, you could put xnorm = scales::rescale(x).
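
A pandas sketch of the same grouped rescaling follows, using the x and y values from the output above. The helper name rescale_by_group and its to parameter are hypothetical; the parameter loosely mirrors the to argument of scales::rescale so that a target range other than [0, 1] can be requested.

```python
import pandas as pd

def rescale_by_group(frame, value, group, to=(0.0, 1.0)):
    """Min-max rescale `value` within each `group` onto the interval `to`."""
    lo, hi = to
    g = frame.groupby(group)[value]
    mins = g.transform("min")
    maxes = g.transform("max")
    # Map each group's [min, max] linearly onto [lo, hi].
    return lo + (frame[value] - mins) / (maxes - mins) * (hi - lo)

df = pd.DataFrame({
    "x": [0, 0.5, 1, 2.5, 0.2, 0.3, 0.5, 0, 0, 0.1, 0.7],
    "y": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
})

df["xnorm"] = rescale_by_group(df, "x", "y")
df["xsym"] = rescale_by_group(df, "x", "y", to=(-1.0, 1.0))
```

The xnorm column should agree with the tibble output above up to rounding, and xsym shows the same values stretched onto [-1, 1].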


