Scale/Normalize Columns by Group

The issue is that you are using the wrong dplyr verb. summarise() creates one result per group per variable. What you want is mutate(), which changes variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below are two approaches using dplyr:

df %>%
    group_by(Store) %>%
    mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))

# mutate_each() and funs() are deprecated; with dplyr >= 1.0 use across()
df %>%
    group_by(Store) %>%
    mutate(across(c(Temperature, Sum_Sales), normalit))

Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.

Using `scale()` to normalize all numeric columns in a data.frame

We can do

j <- sapply(dat, is.numeric)
dat[j] <- scale(dat[j])
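
For readers working in pandas, the same idea (pick out the numeric columns, then standardize only those) can be sketched as follows. The frame and its column names are made up for illustration, and the sample standard deviation is used so the result matches what scale() does by default.

```python
import pandas as pd

# Hypothetical frame with one non-numeric column that must be left alone.
dat = pd.DataFrame({
    "name": ["a", "b", "c"],
    "x": [1.0, 2.0, 3.0],
    "y": [10.0, 20.0, 60.0],
})

# Select the numeric columns, analogous to sapply(dat, is.numeric) in R.
num_cols = dat.select_dtypes(include="number").columns

# Center each numeric column and divide by its sample standard deviation (ddof=1).
dat[num_cols] = (dat[num_cols] - dat[num_cols].mean()) / dat[num_cols].std()
print(dat)
```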

Normalize DataFrame by group

df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())

should do it.
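
A small worked example may make the one-liner easier to follow. The data below is invented; only the column name indx comes from the answer. Note that x.std() in pandas uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical data: 'indx' is the grouping column, 'value' the column to standardize.
df = pd.DataFrame({
    "indx": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# transform returns a result aligned with the original rows,
# so it can be assigned straight back as a new column.
df["value_z"] = df.groupby("indx")["value"].transform(
    lambda x: (x - x.mean()) / x.std()  # sample standard deviation (ddof=1)
)
print(df)
```

Within each group the standardized values are -1, 0 and 1, even though the raw scales of the two groups differ by a factor of ten.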

Standardize data columns in R

I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric you can simply call the scale function on the data to do what you want.

dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)

# check that we get mean of 0 and sd of 1
colMeans(scaled.dat)  # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
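
As a rough illustration of the arithmetic behind that check, here is a standard-library Python sketch of the default centering and scaling: subtract the mean, then divide by the sample standard deviation (n - 1 denominator). The helper name standardize and the numbers are made up for this example.

```python
import statistics

def standardize(values):
    """Center to mean 0 and scale by the sample standard deviation (n - 1)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation, n - 1 denominator
    return [(v - mean) / sd for v in values]

column = [29.8, 30.1, 30.0, 30.3, 29.9]  # made-up values for illustration
z = standardize(column)
print(statistics.fmean(z))  # approximately 0
print(statistics.stdev(z))  # approximately 1
```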

Using built-in functions is classy.

Normalize columns of a dataframe

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

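x = df.values  # returns a NumPy array (all columns must be numeric)
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)  # scales each column to [0, 1]
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)  # keep the original labels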

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
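
If pulling in scikit-learn is not desirable, the same column-wise scaling to [0, 1] can be sketched with pandas alone. This is a minimal sketch that assumes every column is numeric and non-constant; the frame and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical all-numeric frame with a non-default index.
df = pd.DataFrame(
    {"height": [150.0, 160.0, 170.0], "weight": [50.0, 65.0, 80.0]},
    index=["p1", "p2", "p3"],
)

# Column-wise min-max scaling; min() and max() are computed per column,
# and the column names and index carry over unchanged.
df_scaled = (df - df.min()) / (df.max() - df.min())
print(df_scaled)
```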

How to standardize values in a column based on grouping by two other columns in R?

What you have done seems to be right. You cannot use summarise() here because it returns a single value per group rather than a vector.

From your question, I'm not sure whether you want to scale the values within each group or find the sum of the values for each group, so I've shown both cases.

# Sample data
  age sex values
1 <10   M      1
2 <10   M      2
3 >10   F      3
4 >10   F      4
5 >10   M      5

# Scaling value
df %>% group_by(age, sex) %>% mutate(std_value = scale(values))
  age   sex   values std_value
  <fct> <fct>  <dbl>     <dbl>
1 <10   M          1    -0.707
2 <10   M          2     0.707
3 >10   F          3    -0.707
4 >10   F          4     0.707
5 >10   M          5   NaN

# Sum of values
df %>% group_by(age, sex) %>% mutate(sum_value = sum(values))
  age   sex   values sum_value
  <fct> <fct>  <dbl>     <dbl>
1 <10   M          1         3
2 <10   M          2         3
3 >10   F          3         7
4 >10   F          4         7
5 >10   M          5         5
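
For comparison, the same two computations can be sketched in pandas by grouping on both columns at once. The data mirrors the sample above. As in the R output, a group with a single row has no defined sample standard deviation, so its standardized value is NaN.

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["<10", "<10", ">10", ">10", ">10"],
    "sex": ["M", "M", "F", "F", "M"],
    "values": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Group on both key columns; transform keeps one result per original row.
grouped = df.groupby(["age", "sex"])["values"]

# Scaling within each (age, sex) group using the sample standard deviation.
df["std_value"] = (df["values"] - grouped.transform("mean")) / grouped.transform("std")

# Sum of values within each (age, sex) group, repeated on every row of the group.
df["sum_value"] = grouped.transform("sum")
print(df)
```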

Normalize by Group

From your desired output, it looks like you want this:

df <- read.table(header=TRUE, text=
'ID    Item    StrengthCode
7     A       1
7     A       5
7     A       7
8     B       1
8     B       3
9     A       5
9     A       3')
df$Nor <- ave(df$StrengthCode, df$Item, FUN=function(x) x/max(x)) 
df
# > df
#   ID Item StrengthCode       Nor
# 1  7    A            1 0.1428571
# 2  7    A            5 0.7142857
# 3  7    A            7 1.0000000
# 4  8    B            1 0.3333333
# 5  8    B            3 1.0000000
# 6  9    A            5 0.7142857
# 7  9    A            3 0.4285714

With dplyr you can do this (thanks to Sotos for the comment and code):

library("dplyr")
(df %>% group_by(Item) %>% mutate(Nor = StrengthCode/max(StrengthCode))) 
# > (df %>% group_by(Item) %>% mutate(Nor = StrengthCode/max(StrengthCode)))
# Source: local data frame [7 x 4]
# Groups: Item [2]
# 
#      ID   Item StrengthCode       Nor
#   <int> <fctr>        <int>     <dbl>
# 1     7      A            1 0.1428571
# 2     7      A            5 0.7142857
# 3     7      A            7 1.0000000
# 4     8      B            1 0.3333333
# 5     8      B            3 1.0000000
# 6     9      A            5 0.7142857
# 7     9      A            3 0.4285714
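
A pandas counterpart, for readers following the Python answers on this page, might look like the sketch below. It reuses the same data and divides each value by the maximum of its Item group.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [7, 7, 7, 8, 8, 9, 9],
    "Item": ["A", "A", "A", "B", "B", "A", "A"],
    "StrengthCode": [1, 5, 7, 1, 3, 5, 3],
})

# transform("max") broadcasts each group's maximum back onto its rows,
# so the division is a plain element-wise operation.
df["Nor"] = df["StrengthCode"] / df.groupby("Item")["StrengthCode"].transform("max")
print(df)
```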

Normalize a column of dataframe using min max normalization based on groupby of another column

You are almost there.

>>> df                                                                                                                 
    Name        Job  Salary
0   john    painter   40000
1  peter   engineer   50000
2    sam    plumber   30000
3   john     doctor  500000
4   john     driver   20000
5    sam  carpenter   10000
6  peter  scientist  100000
>>>                                                                                                                    
>>> result = df.assign(Salary=df.groupby('Name')['Salary'].transform(lambda x: (x - x.min()) / (x.max() - x.min())))
>>> # alternatively, df['Salary'] = df.groupby(... if you don't need a new frame
>>> result                                                                                                               
    Name        Job    Salary
0   john    painter  0.041667
1  peter   engineer  0.000000
2    sam    plumber  1.000000
3   john     doctor  1.000000
4   john     driver  0.000000
5    sam  carpenter  0.000000
6  peter  scientist  1.000000

So basically, you just forgot to enclose x.max() - x.min() in parentheses.

Note that this can be done much faster with a series of vectorized operations.

>>> grouper = df.groupby('Name')['Salary']                                                                             
>>> maxes = grouper.transform('max')                                                                                   
>>> mins = grouper.transform('min')                                                                                    
>>>                                                                                                                    
>>> result = df.assign(Salary=(df.Salary - mins)/(maxes - mins))                                                       
>>> result                                                                                                             
    Name        Job    Salary
0   john    painter  0.041667
1  peter   engineer  0.000000
2    sam    plumber  1.000000
3   john     doctor  1.000000
4   john     driver  0.000000
5    sam  carpenter  0.000000
6  peter  scientist  1.000000
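
One edge case worth keeping in mind: when a group has a single row, or all of its salaries are equal, max equals min and the formula divides zero by zero, which pandas reports as NaN. The sketch below shows one way to handle that. Mapping such groups to 0.0 is an arbitrary choice made for illustration, and the data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["john", "john", "sam", "amy"],
    "Salary": [40000, 20000, 30000, 70000],
})

grouper = df.groupby("Name")["Salary"]
mins = grouper.transform("min")
ranges = grouper.transform("max") - mins

# Keep the scaled value where the group's range is non-zero; otherwise use 0.0.
# Using where() on the range (rather than fillna) leaves genuinely missing
# salaries as NaN instead of silently turning them into 0.0.
scaled = ((df["Salary"] - mins) / ranges).where(ranges != 0, 0.0)
result = df.assign(Salary=scaled)
print(result)
```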

Timings:

>>> # Setup
>>> df = pd.concat([df]*1000, ignore_index=True)                                                                       
>>> df.Name = np.arange(len(df)//4).repeat(4) # 4 rows per group
>>> df                                                                                                                 
      Name        Job  Salary
0        0    painter   40000
1        0   engineer   50000
2        0    plumber   30000
3        0     doctor  500000
4        1     driver   20000
...    ...        ...     ...
6995  1748    plumber   30000
6996  1749     doctor  500000
6997  1749     driver   20000
6998  1749  carpenter   10000
6999  1749  scientist  100000

[7000 rows x 3 columns]
>>>
>>> # Tests @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min()))                                 
1.19 s ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
...: grouper = df.groupby('Name')['Salary']
...: maxes = grouper.transform('max')
...: mins = grouper.transform('min')
...: (df.Salary - mins)/(maxes - mins)
3.04 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Min-max normalization in R, setting groups of min and max based on another column

library(tidyverse)

df %>%
  group_by(y) %>%
  mutate(xnorm = (x - min(x)) / (max(x) - min(x))) %>%
  ungroup()

Output:

# A tibble: 11 x 3
       x     y xnorm
   <dbl> <dbl> <dbl>
 1   0       1 0    
 2   0.5     1 0.2  
 3   1       1 0.4  
 4   2.5     1 1    
 5   0.2     2 0    
 6   0.3     2 0.333
 7   0.5     2 1    
 8   0       3 0    
 9   0       3 0    
10   0.1     3 0.143
11   0.7     3 1

Or, in the mutate() statement, you could put xnorm = scales::rescale(x)
