Scale/normalize columns by group
The issue is that you are using the wrong dplyr verb. summarize() returns one result per group per variable. What you want is mutate(), which modifies variables and returns a result of the same length as the original. See http://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html. Below are two approaches using dplyr:
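# Approach 1: name each column explicitly.
# normalit() is assumed to be the normalization function defined in the question.
df %>%
  group_by(Store) %>%
  mutate(Temperature = normalit(Temperature), Sum_Sales = normalit(Sum_Sales))
# Approach 2: apply the same function to several columns.
# mutate_each() and funs() are deprecated; across() requires dplyr >= 1.0.0.
df %>%
  group_by(Store) %>%
  mutate(across(c(Temperature, Sum_Sales), normalit))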
Note: The Store variable is different between your data and desired result. I assumed that @jlhoward got the right data.
Using `scale()` to normalize all numeric columns in a data.frame
We can select the numeric columns and scale only those:
j <- sapply(dat, is.numeric)
dat[j] <- scale(dat[j])
Normalize DataFrame by group
The following should do it:
df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
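As a self-contained sketch of the same idea, the snippet below standardizes one column within each group. The data and the `value` column name are made up for illustration; only the `indx` grouping column comes from the answer above.

```python
import pandas as pd

# Hypothetical data; 'indx' is the grouping column named in the answer above.
df = pd.DataFrame({
    "indx": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Standardize 'value' within each group.
# Series.std() uses the sample standard deviation (ddof=1) by default.
df["value_z"] = df.groupby("indx")["value"].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df)
```

Selecting the column before calling transform() keeps non-numeric columns out of the arithmetic.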
Standardize data columns in R
I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a data frame and all the columns are numeric, you can simply call the scale() function on the data to do what you want.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
Using built-in functions is classy.
Normalize columns of a dataframe
You can use the package sklearn and its associated preprocessing utilities to normalize the data.
import pandas as pd
from sklearn import preprocessing
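x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)  # scales each column to [0, 1]
# pass columns and index so the original labels are not lost
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)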
For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
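If adding scikit-learn as a dependency is not desirable, the same column-wise min-max scaling can be sketched in plain pandas. The data below is hypothetical, and this is only an illustration of the formula (x - min) / (max - min); a constant column would divide by zero and come out as NaN here.

```python
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 30.0, 20.0]})

# Column-wise min-max scaling to [0, 1].
# df.min() and df.max() are per-column, and the arithmetic aligns on column
# names, so the column labels and the index are preserved.
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```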
How to standardize values in a column based on grouping by two other columns in R?
What you have done seems to be right. You cannot use summarise() here because it returns a single value per group rather than a vector of the original length.
From your question, I'm not sure whether you want to scale the values within each group or find the sum of the values for each group, so I've shown both cases.
# Sample data
age sex values
1 <10 M 1
2 <10 M 2
3 >10 F 3
4 >10 F 4
5 >10 M 5
# Scaling value
df %>% group_by(age, sex) %>% mutate(std_value = scale(values))
age sex values std_value
<fct> <fct> <dbl> <dbl>
1 <10 M 1 -0.707
2 <10 M 2 0.707
3 >10 F 3 -0.707
4 >10 F 4 0.707
5 >10 M 5 NaN
# Sum of values
df %>% group_by(age, sex) %>% mutate(sum_value = sum(values))
age sex values sum_value
<fct> <fct> <dbl> <dbl>
1 <10 M 1 3
2 <10 M 2 3
3 >10 F 3 7
4 >10 F 4 7
5 >10 M 5 5
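For comparison, here is a pandas sketch of both cases using the same sample data. It is not part of the original answer. With the default sample standard deviation (ddof=1), a group with a single row has an undefined standard deviation, which is why the last row comes out as NaN, just as in the R output above.

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({
    "age": ["<10", "<10", ">10", ">10", ">10"],
    "sex": ["M", "M", "F", "F", "M"],
    "values": [1.0, 2.0, 3.0, 4.0, 5.0],
})

grouped = df.groupby(["age", "sex"])["values"]

# Scaling: subtract the group mean and divide by the group standard deviation.
df["std_value"] = (df["values"] - grouped.transform("mean")) / grouped.transform("std")

# Sum of values per group, repeated on every row of that group.
df["sum_value"] = grouped.transform("sum")
print(df)
```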
Normalize by Group
Based on your desired output, it looks like you want this:
df <- read.table(header=TRUE, text=
'ID Item StrengthCode
7 A 1
7 A 5
7 A 7
8 B 1
8 B 3
9 A 5
9 A 3')
df$Nor <- ave(df$StrengthCode, df$Item, FUN=function(x) x/max(x))
df
# > df
# ID Item StrengthCode Nor
# 1 7 A 1 0.1428571
# 2 7 A 5 0.7142857
# 3 7 A 7 1.0000000
# 4 8 B 1 0.3333333
# 5 8 B 3 1.0000000
# 6 9 A 5 0.7142857
# 7 9 A 3 0.4285714
With dplyr you can do the following (thanks to Sotos for the comment and code):
library("dplyr")
(df %>% group_by(Item) %>% mutate(Nor = StrengthCode/max(StrengthCode)))
# > (df %>% group_by(Item) %>% mutate(Nor = StrengthCode/max(StrengthCode)))
# Source: local data frame [7 x 4]
# Groups: Item [2]
#
# ID Item StrengthCode Nor
# <int> <fctr> <int> <dbl>
# 1 7 A 1 0.1428571
# 2 7 A 5 0.7142857
# 3 7 A 7 1.0000000
# 4 8 B 1 0.3333333
# 5 8 B 3 1.0000000
# 6 9 A 5 0.7142857
# 7 9 A 3 0.4285714
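For readers working in pandas, the same divide-by-group-maximum step can be sketched as follows. This is an illustrative translation, not part of the original answer; the data matches the table above.

```python
import pandas as pd

# Same data as the R example above.
df = pd.DataFrame({
    "ID": [7, 7, 7, 8, 8, 9, 9],
    "Item": ["A", "A", "A", "B", "B", "A", "A"],
    "StrengthCode": [1, 5, 7, 1, 3, 5, 3],
})

# transform('max') broadcasts each group's maximum back to its rows,
# so the division is done row by row against the group maximum.
df["Nor"] = df["StrengthCode"] / df.groupby("Item")["StrengthCode"].transform("max")
print(df)
```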
Normalize a column of dataframe using min max normalization based on groupby of another column
You are almost there.
>>> df
Name Job Salary
0 john painter 40000
1 peter engineer 50000
2 sam plumber 30000
3 john doctor 500000
4 john driver 20000
5 sam carpenter 10000
6 peter scientist 100000
>>>
>>> # select 'Salary' first so the string column 'Job' is not passed to the lambda
>>> result = df.assign(Salary=df.groupby('Name')['Salary'].transform(lambda x: (x - x.min()) / (x.max() - x.min())))
>>> # alternatively, df['Salary'] = df.groupby(... if you don't need a new frame
>>> result
Name Job Salary
0 john painter 0.041667
1 peter engineer 0.000000
2 sam plumber 1.000000
3 john doctor 1.000000
4 john driver 0.000000
5 sam carpenter 0.000000
6 peter scientist 1.000000
So basically, you just forgot to enclose x.max() - x.min() in parentheses.
Note that this can be done much faster with a series of vectorized operations.
>>> grouper = df.groupby('Name')['Salary']
>>> maxes = grouper.transform('max')
>>> mins = grouper.transform('min')
>>>
>>> result = df.assign(Salary=(df.Salary - mins)/(maxes - mins))
>>> result
Name Job Salary
0 john painter 0.041667
1 peter engineer 0.000000
2 sam plumber 1.000000
3 john doctor 1.000000
4 john driver 0.000000
5 sam carpenter 0.000000
6 peter scientist 1.000000
Timings:
>>> # Setup
>>> df = pd.concat([df]*1000, ignore_index=True)
>>> df.Name = np.arange(len(df)//4).repeat(4) # 4 rows per group
>>> df
Name Job Salary
0 0 painter 40000
1 0 engineer 50000
2 0 plumber 30000
3 0 doctor 500000
4 1 driver 20000
... ... ... ...
6995 1748 plumber 30000
6996 1749 doctor 500000
6997 1749 driver 20000
6998 1749 carpenter 10000
6999 1749 scientist 100000
[7000 rows x 3 columns]
>>>
>>> # Tests @ i5-6200U CPU @ 2.30GHz
>>> %timeit df.groupby('Name').transform(lambda x: (x - x.min()) / (x.max()- x.min()))
1.19 s ± 20.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %%timeit
...: grouper = df.groupby('Name')['Salary']
...: maxes = grouper.transform('max')
...: mins = grouper.transform('min')
...: (df.Salary - mins)/(maxes - mins)
...:
...:
3.04 ms ± 94.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
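One edge case the vectorized version does not handle: if a group has a single row, or all of its values are equal, then max - min is zero and the result is NaN. The sketch below maps such rows to 0.0. That choice is an assumption rather than part of the original answer, and the `ann` row is a hypothetical addition used to trigger the case.

```python
import pandas as pd

df = pd.DataFrame({
    # 'ann' is a hypothetical single-row group added for illustration.
    "Name": ["john", "peter", "sam", "john", "john", "sam", "peter", "ann"],
    "Salary": [40000, 50000, 30000, 500000, 20000, 10000, 100000, 70000],
})

grouper = df.groupby("Name")["Salary"]
mins = grouper.transform("min")
spans = grouper.transform("max") - mins

# Turn zero spans into NaN so the division yields NaN for those rows,
# then fill those NaNs with 0.0 (an arbitrary but common convention).
normalized = ((df["Salary"] - mins) / spans.where(spans != 0)).fillna(0.0)
result = df.assign(Salary=normalized)
print(result)
```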
Min-max normalization in R, setting groups of min and max based on another column
library(tidyverse)
df %>%
  group_by(y) %>%
  mutate(xnorm = (x - min(x)) / (max(x) - min(x))) %>%
  ungroup()
Output:
# A tibble: 11 x 3
x y xnorm
<dbl> <dbl> <dbl>
1 0 1 0
2 0.5 1 0.2
3 1 1 0.4
4 2.5 1 1
5 0.2 2 0
6 0.3 2 0.333
7 0.5 2 1
8 0 3 0
9 0 3 0
10 0.1 3 0.143
11 0.7 3 1
Or, in the mutate() statement, you could put xnorm = scales::rescale(x)