Normalize columns of a dataframe
You can use the package sklearn and its associated preprocessing utilities to normalize the data.
import pandas as pd
from sklearn import preprocessing
x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
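Note that rebuilding the frame this way discards the original column labels and index. A small variant (a sketch, using a hypothetical numeric frame) that keeps them:

```python
import pandas as pd
from sklearn import preprocessing

# hypothetical numeric frame
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 40.0]})

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
# pass the original labels back in so the result stays addressable by name
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
```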
Normalization of a single column of a dataframe
Try:
df1['Normalize'] = df1.groupby('Symbol')['Close'].transform(lambda x: x / x.iloc[0]).fillna(1)
As commented by Shubham, you can divide each group by its first value with
df1['Close'] /= df1.groupby('Symbol')['Close'].transform('first')
:)
df1:
Date Symbol Close Normalize
0 2020-11-23 APLAPOLLO 3247.45 1.000000
1 2020-11-24 APLAPOLLO 3219.95 0.991532
2 2020-11-25 APLAPOLLO 3220.45 0.991686
3 2020-11-26 APLAPOLLO 3178.95 0.978907
4 2020-11-27 APLAPOLLO 3378.90 1.040478
5 2020-12-01 APLAPOLLO 3446.85 1.061402
6 2020-12-02 APLAPOLLO 3514.55 1.082249
7 2020-12-03 APLAPOLLO 3545.80 1.091872
8 2020-12-04 APLAPOLLO 3708.60 1.142004
9 2020-12-07 APLAPOLLO 3868.55 1.191258
10 2020-12-08 APLAPOLLO 3750.30 1.154845
11 2020-12-09 APLAPOLLO 3801.35 1.170565
12 2020-12-10 APLAPOLLO 3766.65 1.159879
13 2020-12-11 APLAPOLLO 3674.30 1.131442
14 2020-12-14 APLAPOLLO 3814.80 1.174706
15 2020-12-15 APLAPOLLO 780.55 0.240358
16 2020-12-16 APLAPOLLO 790.20 0.243329
17 2020-12-17 APLAPOLLO 791.20 0.243637
18 2020-12-18 APLAPOLLO 769.70 0.237017
19 2020-12-21 APLAPOLLO 726.60 0.223745
20 2020-12-22 APLAPOLLO 744.30 0.229195
21 2020-11-23 AUBANK 869.65 1.000000
22 2020-11-24 AUBANK 874.35 1.005404
23 2020-11-25 AUBANK 856.25 0.984592
24 2020-11-26 AUBANK 861.05 0.990111
25 2020-11-27 AUBANK 839.05 0.964813
26 2020-12-01 AUBANK 872.90 1.003737
27 2020-12-02 AUBANK 886.65 1.019548
28 2020-12-03 AUBANK 880.30 1.012246
29 2020-12-04 AUBANK 880.45 1.012419
30 2020-12-07 AUBANK 898.65 1.033347
31 2020-12-08 AUBANK 907.80 1.043868
32 2020-12-09 AUBANK 918.90 1.056632
33 2020-12-10 AUBANK 911.05 1.047605
34 2020-12-11 AUBANK 920.30 1.058242
35 2020-12-14 AUBANK 929.45 1.068763
36 2020-12-15 AUBANK 922.60 1.060887
37 2020-12-16 AUBANK 915.80 1.053067
38 2020-12-17 AUBANK 943.15 1.084517
39 2020-12-18 AUBANK 897.00 1.031449
40 2020-12-21 AUBANK 840.45 0.966423
41 2020-12-22 AUBANK 856.00 0.984304
42 2020-11-23 AARTIDRUGS 711.70 1.000000
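The same idea as a self-contained sketch, with a hypothetical two-symbol frame: each group's Close is divided by that group's first Close.

```python
import pandas as pd

# hypothetical two-symbol frame
df1 = pd.DataFrame({
    'Symbol': ['A', 'A', 'A', 'B', 'B'],
    'Close': [100.0, 110.0, 90.0, 50.0, 55.0]})

# divide each row by the first Close of its Symbol group
df1['Normalize'] = df1.groupby('Symbol')['Close'].transform(lambda x: x / x.iloc[0])
```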
How to normalize columns in a dataframe
Try:
cols = ['2002', '2003', '2004', '2005']
df[cols] = df[cols] / df[cols].sum()
print(df)
term 2002 2003 2004 2005
0 climate 0.043478 0.454545 0.333333 0.466667
1 global 0.521739 0.500000 0.666667 0.400000
2 nuclear 0.434783 0.045455 0.000000 0.133333
How to normalize all columns of pandas data frame but first/key
Convert the first column to the index, e.g. if the name of the first column is date:
print (df_data)
date a b c
0 2019-08-13 00:30:00 1 2 3
1 2019-08-13 01:00:00 2 3 1
2 2019-08-13 01:30:00 1 1 1
3 2019-08-13 02:00:00 1 1 1
from sklearn import preprocessing
df_data = df_data.set_index('date')
x = df_data.to_numpy()
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_data = pd.DataFrame(x_scaled, columns=df_data.columns, index=df_data.index)
print (df_data)
a b c
date
2019-08-13 00:30:00 0.0 0.5 1.0
2019-08-13 01:00:00 1.0 1.0 0.0
2019-08-13 01:30:00 0.0 0.0 0.0
2019-08-13 02:00:00 0.0 0.0 0.0
In your solution, select all columns except the first with DataFrame.iloc: the : means all rows, and 1: selects every column except the first. Apply the same scaling and assign the result back:
from sklearn import preprocessing
x = df_data.iloc[:, 1:].to_numpy()
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_data.iloc[:, 1:] = x_scaled
print (df_data)
date a b c
0 2019-08-13 00:30:00 0.0 0.5 1.0
1 2019-08-13 01:00:00 1.0 1.0 0.0
2 2019-08-13 01:30:00 0.0 0.0 0.0
3 2019-08-13 02:00:00 0.0 0.0 0.0
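The same per-column scaling can also be done without sklearn, using only pandas (a sketch, assuming the non-key columns are numeric):

```python
import pandas as pd

# hypothetical frame matching the example above
df_data = pd.DataFrame({
    'date': pd.to_datetime(['2019-08-13 00:30:00', '2019-08-13 01:00:00',
                            '2019-08-13 01:30:00', '2019-08-13 02:00:00']),
    'a': [1.0, 2.0, 1.0, 1.0],
    'b': [2.0, 3.0, 1.0, 1.0],
    'c': [3.0, 1.0, 1.0, 1.0]})

# min-max scale every column except the first, then assign back
num = df_data.iloc[:, 1:]
df_data.iloc[:, 1:] = (num - num.min()) / (num.max() - num.min())
```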
Normalizing the columns of a dataframe
Your code runs column-wise and works correctly. However, there are other types of normalization you might need:
Mean normalization (z-score standardization, as you did):
normalized_df=(df-df.mean())/df.std()
A B C D
0 0.000000 1.305582 -0.5 0.866025
1 -0.707107 -0.783349 -0.5 -0.866025
2 1.414214 0.261116 1.5 -0.866025
3 -0.707107 -0.783349 -0.5 0.866025
Min-Max normalization:
normalized_df=(df-df.min())/(df.max()-df.min())
A B C D
0 0.333333 1.0 0.0 1.0
1 0.000000 0.0 0.0 0.0
2 1.000000 0.5 1.0 0.0
3 0.000000 0.0 0.0 1.0
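Both formulas can be checked on a small hypothetical frame: mean normalization gives each column mean 0 and standard deviation 1, while min-max maps each column onto [0, 1].

```python
import pandas as pd

# hypothetical frame
df = pd.DataFrame({'A': [1.0, 0.0, 3.0, 0.0], 'B': [2.0, 0.0, 1.0, 0.0]})

# mean normalization (pandas' std uses ddof=1 by default)
z = (df - df.mean()) / df.std()
# min-max normalization
mm = (df - df.min()) / (df.max() - df.min())
```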
In sklearn.preprocessing you will find many normalization methods (and more) ready to use, such as StandardScaler, MinMaxScaler or MaxAbsScaler:
Mean normalization using sklearn:
import pandas as pd
from sklearn import preprocessing
mean_scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
x_scaled = mean_scaler.fit_transform(df.values)
normalized_df = pd.DataFrame(x_scaled)
0 1 2 3
0 0.000000 1.507557 -0.577350 1.0
1 -0.816497 -0.904534 -0.577350 -1.0
2 1.632993 0.301511 1.732051 -1.0
3 -0.816497 -0.904534 -0.577350 1.0
Min-Max normalization using sklearn MinMaxScaler:
import pandas as pd
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
normalized_df = pd.DataFrame(x_scaled)
0 1 2 3
0 0.333333 1.0 0.0 1.0
1 0.000000 0.0 0.0 0.0
2 1.000000 0.5 1.0 0.0
3 0.000000 0.0 0.0 1.0
I hope I have helped you!
Normalize column: sum to 1
You can divide each value in the Weight column by the sum of all the values in the Weight column:
df['Weight']/df['Weight'].sum()
0 0.285714
1 0.142857
2 0.071429
3 0.500000
Name: Weight, dtype: float64
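A runnable sketch with hypothetical weights consistent with the output above; the resulting series sums to 1.

```python
import pandas as pd

# hypothetical weights consistent with the output shown
df = pd.DataFrame({'Weight': [4, 2, 1, 7]})

# each weight divided by the column total
normalized = df['Weight'] / df['Weight'].sum()
```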
How to normalize multiple columns of dicts in a pandas dataframe
- Verify the columns are dict type, not str type. If the columns are str type, convert them with ast.literal_eval.
- Use pandas.json_normalize() to normalize each column of dicts.
- Use a list comprehension to rename the columns.
- Use pandas.concat() with axis=1 to combine the dataframes.
import pandas as pd
from ast import literal_eval
# test dataframe
data = {'time': ['2021-01-12 18:00:00', '2021-01-12 20:15:00', '2021-01-12 20:15:00', '2021-01-13 18:00:00', '2021-01-13 20:15:00', '2021-01-14 20:00:00', '2021-01-15 20:00:00', '2021-01-16 12:30:00', '2021-01-16 15:00:00'], 'home_team': ['Sheff Utd', 'Burnley', 'Wolverhampton', 'Man City', 'Aston Villa', 'Arsenal', 'Fulham', 'Wolverhampton', 'Leeds'], 'away_team': ['Newcastle', 'Man Utd', 'Everton', 'Brighton', 'Tottenham', 'Crystal Palace', 'Chelsea', 'West Brom', 'Brighton'], 'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}", "{'1': 2450, 'X': 3200, '2': 3000}", "{'1': 1180, 'X': 6500, '2': 14000}", "{'1': 2620, 'X': 3500, '2': 2500}", "{'1': 1500, 'X': 4000, '2': 6500}", "{'1': 5750, 'X': 4330, '2': 1530}", "{'1': 1440, 'X': 4200, '2': 7500}", "{'1': 2000, 'X': 3600, '2': 3600}"], 'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", "{'yes': 1900, 'no': 1900}", "{'yes': 1950, 'no': 1800}", "{'yes': 2040, 'no': 1700}", "{'yes': 1570, 'no': 2250}", "{'yes': 1950, 'no': 1800}", "{'yes': 1800, 'no': 1950}", "{'yes': 2250, 'no': 1570}", "{'yes': 1530, 'no': 2370}"], 'double_chance': ["{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 2620, '12': 1180, '2X': 1100}", "{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 1040, '12': 1110, '2X': 4500}", "{'1X': 1500, '12': 1280, '2X': 1440}", "{'1X': 1110, '12': 1220, '2X': 2500}", "{'1X': 2370, '12': 1200, '2X': 1140}", "{'1X': 1100, '12': 1220, '2X': 2620}", "{'1X': 1280, '12': 1280, '2X': 1720}"]}
df = pd.DataFrame(data)
# display(df.head(2))
time home_team away_team full_time_result both_teams_to_score double_chance
0 2021-01-12 18:00:00 Sheff Utd Newcastle {'1': 2400, 'X': 3200, '2': 3100} {'yes': 2000, 'no': 1750} {'1X': 1360, '12': 1360, '2X': 1530}
1 2021-01-12 20:15:00 Burnley Man Utd {'1': 7000, 'X': 4500, '2': 1440} {'yes': 1900, 'no': 1900} {'1X': 2620, '12': 1180, '2X': 1100}
# convert time to datetime
df.time = pd.to_datetime(df.time)
# determine if columns are str or dict type
print(type(df.iloc[0, 3]))
[out]:
str
# convert columns from str to dict only if the columns are str type
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
# normalize columns and rename headers
ftr = pd.json_normalize(df.full_time_result)
ftr.columns = [f'full_time_result_{col}' for col in ftr.columns]
btts = pd.json_normalize(df.both_teams_to_score)
btts.columns = [f'both_teams_to_score_{col}' for col in btts.columns]
dc = pd.json_normalize(df.double_chance)
dc.columns = [f'double_chance_{col}' for col in dc.columns]
# concat the dataframes
df_normalized = pd.concat([df.iloc[:, :3], ftr, btts, dc], axis=1)
display(df_normalized)
time home_team away_team full_time_result_1 full_time_result_X full_time_result_2 both_teams_to_score_yes both_teams_to_score_no double_chance_1X double_chance_12 double_chance_2X
0 2021-01-12 18:00:00 Sheff Utd Newcastle 2400 3200 3100 2000 1750 1360 1360 1530
1 2021-01-12 20:15:00 Burnley Man Utd 7000 4500 1440 1900 1900 2620 1180 1100
2 2021-01-12 20:15:00 Wolverhampton Everton 2450 3200 3000 1950 1800 1360 1360 1530
3 2021-01-13 18:00:00 Man City Brighton 1180 6500 14000 2040 1700 1040 1110 4500
4 2021-01-13 20:15:00 Aston Villa Tottenham 2620 3500 2500 1570 2250 1500 1280 1440
5 2021-01-14 20:00:00 Arsenal Crystal Palace 1500 4000 6500 1950 1800 1110 1220 2500
6 2021-01-15 20:00:00 Fulham Chelsea 5750 4330 1530 1800 1950 2370 1200 1140
7 2021-01-16 12:30:00 Wolverhampton West Brom 1440 4200 7500 2250 1570 1100 1220 2620
8 2021-01-16 15:00:00 Leeds Brighton 2000 3600 3600 1530 2370 1280 1280 1720
Consolidated Code
# convert the columns to dict type if they are str type
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
# normalize all columns
df_list = list()
for col in df.columns[3:]:
v = pd.json_normalize(df[col])
v.columns = [f'{col}_{c}' for c in v.columns]
df_list.append(v)
# combine into one dataframe
df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
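The consolidated loop above, reduced to a self-contained sketch with a single dict-valued column taken from the data in this answer:

```python
import pandas as pd
from ast import literal_eval

# one dict-valued column stored as strings, as in the data above
df = pd.DataFrame({
    'time': ['2021-01-12 18:00:00'],
    'home_team': ['Sheff Utd'],
    'away_team': ['Newcastle'],
    'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}"]})

# convert the string columns to dicts
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)

# normalize each dict column and prefix the new headers with the source column name
df_list = []
for col in df.columns[3:]:
    v = pd.json_normalize(df[col])
    v.columns = [f'{col}_{c}' for c in v.columns]
    df_list.append(v)

# combine into one dataframe
df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
```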
Standardize data columns in R
I have to assume you meant that you want a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric, you can simply call the scale function on the data to do what you want.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
Using built-in functions is classy.
min max normalization dataframe in pandas
Use MinMaxScaler.
df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler = scaler.fit(df)
scaler.transform(df)
Results
array([[0. , 1. , 0.11111111],
[0.25 , 0. , 0.33333333],
[1. , 0.3 , 0. ],
[0.5 , 0.7 , 1. ]])
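MinMaxScaler applies (x - min) / (max - min) column-wise, so the result above can be checked by hand against the same arithmetic in pandas:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
scaler = MinMaxScaler().fit(df)
scaled = scaler.transform(df)

# the same scaling computed by hand
manual = (df - df.min()) / (df.max() - df.min())
```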
Now use the same scaler on new data:
df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})
scaler.transform(df_new)
Results
array([[2.25 , 1.8 , 0.44444444],
[3.5 , 1.7 , 0.55555556],
[4.75 , 1.5 , 0.22222222]])