Normalize Columns of a Dataframe


You can use the scikit-learn package and its preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

For more information, see the scikit-learn documentation on preprocessing data: scaling features to a range.
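One caveat: constructing the new DataFrame without arguments discards the original column names and index. A minimal runnable sketch (the toy `df` below is made up, since the question's data is not shown) that passes the labels back in:

```python
import pandas as pd
from sklearn import preprocessing

# made-up example data (the original df is not shown in the question)
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 40.0]})

x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)

# pass columns= and index= so the labels survive the round-trip
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
print(df)
```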

Normalization of a single column of a dataframe

Try:

df1['Normalize'] = df1.groupby('Symbol')['Close'].transform(lambda x: x / x.iloc[0]).fillna(1)

As commented by Shubham, you can divide each group by its first value with:

df1['Close'] /= df1.groupby('Symbol')['Close'].transform('first')



df1:

          Date      Symbol    Close  Normalize
0   2020-11-23   APLAPOLLO  3247.45   1.000000
1   2020-11-24   APLAPOLLO  3219.95   0.991532
2   2020-11-25   APLAPOLLO  3220.45   0.991686
3   2020-11-26   APLAPOLLO  3178.95   0.978907
4   2020-11-27   APLAPOLLO  3378.90   1.040478
5   2020-12-01   APLAPOLLO  3446.85   1.061402
6   2020-12-02   APLAPOLLO  3514.55   1.082249
7   2020-12-03   APLAPOLLO  3545.80   1.091872
8   2020-12-04   APLAPOLLO  3708.60   1.142004
9   2020-12-07   APLAPOLLO  3868.55   1.191258
10  2020-12-08   APLAPOLLO  3750.30   1.154845
11  2020-12-09   APLAPOLLO  3801.35   1.170565
12  2020-12-10   APLAPOLLO  3766.65   1.159879
13  2020-12-11   APLAPOLLO  3674.30   1.131442
14  2020-12-14   APLAPOLLO  3814.80   1.174706
15  2020-12-15   APLAPOLLO   780.55   0.240358
16  2020-12-16   APLAPOLLO   790.20   0.243329
17  2020-12-17   APLAPOLLO   791.20   0.243637
18  2020-12-18   APLAPOLLO   769.70   0.237017
19  2020-12-21   APLAPOLLO   726.60   0.223745
20  2020-12-22   APLAPOLLO   744.30   0.229195
21  2020-11-23      AUBANK   869.65   1.000000
22  2020-11-24      AUBANK   874.35   1.005404
23  2020-11-25      AUBANK   856.25   0.984592
24  2020-11-26      AUBANK   861.05   0.990111
25  2020-11-27      AUBANK   839.05   0.964813
26  2020-12-01      AUBANK   872.90   1.003737
27  2020-12-02      AUBANK   886.65   1.019548
28  2020-12-03      AUBANK   880.30   1.012246
29  2020-12-04      AUBANK   880.45   1.012419
30  2020-12-07      AUBANK   898.65   1.033347
31  2020-12-08      AUBANK   907.80   1.043868
32  2020-12-09      AUBANK   918.90   1.056632
33  2020-12-10      AUBANK   911.05   1.047605
34  2020-12-11      AUBANK   920.30   1.058242
35  2020-12-14      AUBANK   929.45   1.068763
36  2020-12-15      AUBANK   922.60   1.060887
37  2020-12-16      AUBANK   915.80   1.053067
38  2020-12-17      AUBANK   943.15   1.084517
39  2020-12-18      AUBANK   897.00   1.031449
40  2020-12-21      AUBANK   840.45   0.966423
41  2020-12-22      AUBANK   856.00   0.984304
42  2020-11-23  AARTIDRUGS   711.70   1.000000
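The `groupby`/`transform` pattern above can be checked on a tiny hand-made frame (the symbols and prices below are hypothetical, not the questioner's data):

```python
import pandas as pd

# hypothetical two-symbol frame (not the questioner's data)
df1 = pd.DataFrame({
    'Symbol': ['AAA', 'AAA', 'BBB', 'BBB'],
    'Close':  [100.0, 110.0, 50.0, 45.0],
})

# divide every Close by the first Close of its Symbol group
df1['Normalize'] = df1.groupby('Symbol')['Close'].transform(lambda x: x / x.iloc[0])
print(df1['Normalize'].tolist())  # → [1.0, 1.1, 1.0, 0.9]
```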

How to normalize columns in a dataframe

Try:

In [5]: %paste
cols = ['2002', '2003', '2004', '2005']
df[cols] = df[cols] / df[cols].sum()

## -- End pasted text --

In [6]: df
Out[6]:
      term      2002      2003      2004      2005
0  climate  0.043478  0.454545  0.333333  0.466667
1   global  0.521739  0.500000  0.666667  0.400000
2  nuclear  0.434783  0.045455  0.000000  0.133333
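The same operation as a self-contained script; the input counts below are back-calculated from the normalized output above, so they are an assumption, but any input with the same column sums behaves identically:

```python
import pandas as pd

# counts back-calculated from the normalized output (an assumption)
df = pd.DataFrame({'term': ['climate', 'global', 'nuclear'],
                   '2002': [1, 12, 10], '2003': [10, 11, 1],
                   '2004': [1, 2, 0],  '2005': [7, 6, 2]})

cols = ['2002', '2003', '2004', '2005']
df[cols] = df[cols] / df[cols].sum()   # each listed column now sums to 1
print(df.round(6))
```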

How to normalize all columns of a pandas data frame except the first/key

Convert the first column to the index, e.g. if the name of the first column is date:

print(df_data)
                  date  a  b  c
0  2019-08-13 00:30:00  1  2  3
1  2019-08-13 01:00:00  2  3  1
2  2019-08-13 01:30:00  1  1  1
3  2019-08-13 02:00:00  1  1  1

from sklearn import preprocessing

df_data = df_data.set_index('date')
x = df_data.to_numpy()
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_data = pd.DataFrame(x_scaled, columns=df_data.columns, index=df_data.index)
print(df_data)
                       a    b    c
date
2019-08-13 00:30:00  0.0  0.5  1.0
2019-08-13 01:00:00  1.0  1.0  0.0
2019-08-13 01:30:00  0.0  0.0  0.0
2019-08-13 02:00:00  0.0  0.0  0.0
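End to end, with the sample frame typed in by hand so the snippet runs on its own:

```python
import pandas as pd
from sklearn import preprocessing

df_data = pd.DataFrame({
    'date': pd.to_datetime(['2019-08-13 00:30:00', '2019-08-13 01:00:00',
                            '2019-08-13 01:30:00', '2019-08-13 02:00:00']),
    'a': [1, 2, 1, 1], 'b': [2, 3, 1, 1], 'c': [3, 1, 1, 1],
})

df_data = df_data.set_index('date')          # key column becomes the index
x_scaled = preprocessing.MinMaxScaler().fit_transform(df_data.to_numpy())
df_data = pd.DataFrame(x_scaled, columns=df_data.columns, index=df_data.index)
print(df_data['b'].tolist())  # → [0.5, 1.0, 0.0, 0.0]
```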

Alternatively, keep your solution but select all columns except the first with DataFrame.iloc (the first `:` selects all rows, and `1:` selects all columns excluding the first), scale them, and assign the result back:

from sklearn import preprocessing

x = df_data.iloc[:, 1:].to_numpy()

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_data.iloc[:, 1:] = x_scaled
print(df_data)
                  date    a    b    c
0  2019-08-13 00:30:00  0.0  0.5  1.0
1  2019-08-13 01:00:00  1.0  1.0  0.0
2  2019-08-13 01:30:00  0.0  0.0  0.0
3  2019-08-13 02:00:00  0.0  0.0  0.0
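The iloc variant as a runnable sketch; the value columns are created as floats here because assigning float results into integer columns in place can truncate (or warn) in recent pandas versions:

```python
import pandas as pd
from sklearn import preprocessing

df_data = pd.DataFrame({
    'date': pd.to_datetime(['2019-08-13 00:30:00', '2019-08-13 01:00:00',
                            '2019-08-13 01:30:00', '2019-08-13 02:00:00']),
    'a': [1.0, 2.0, 1.0, 1.0],
    'b': [2.0, 3.0, 1.0, 1.0],
    'c': [3.0, 1.0, 1.0, 1.0],
})

x = df_data.iloc[:, 1:].to_numpy()   # all rows, all columns except the first
df_data.iloc[:, 1:] = preprocessing.MinMaxScaler().fit_transform(x)
print(df_data['c'].tolist())  # → [1.0, 0.0, 0.0, 0.0]
```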

Normalizing the columns of a dataframe

Your code already runs column-wise and works correctly. However, if this was your question, note that there are other types of normalization; here are some you might need:

Z-score standardization (this is what your code does):

normalized_df = (df - df.mean()) / df.std()

          A         B    C         D
0  0.000000  1.305582 -0.5  0.866025
1 -0.707107 -0.783349 -0.5 -0.866025
2  1.414214  0.261116  1.5 -0.866025
3 -0.707107 -0.783349 -0.5  0.866025

Min-Max normalization:

normalized_df = (df - df.min()) / (df.max() - df.min())

          A    B    C    D
0  0.333333  1.0  0.0  1.0
1  0.000000  0.0  0.0  0.0
2  1.000000  0.5  1.0  0.0
3  0.000000  0.0  0.0  1.0
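Both formulas, run on an input reconstructed to match the tables above (the reconstruction is an assumption; any input with the same column statistics would give the same result):

```python
import pandas as pd

# input back-calculated from the outputs shown above (an assumption)
df = pd.DataFrame({'A': [2, 1, 4, 1], 'B': [3, 1, 2, 1],
                   'C': [1, 1, 2, 1], 'D': [1, 0, 0, 1]})

standardized = (df - df.mean()) / df.std()          # pandas .std() uses ddof=1
min_max = (df - df.min()) / (df.max() - df.min())   # rescales each column to [0, 1]
print(min_max['A'].round(6).tolist())
```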

In sklearn.preprocessing you will find many ready-made scalers (and more), such as StandardScaler, MinMaxScaler, or MaxAbsScaler:

Standardization using sklearn's StandardScaler:

import pandas as pd
from sklearn import preprocessing

mean_scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
x_scaled = mean_scaler.fit_transform(df.values)
normalized_df = pd.DataFrame(x_scaled)

          0         1         2    3
0  0.000000  1.507557 -0.577350  1.0
1 -0.816497 -0.904534 -0.577350 -1.0
2  1.632993  0.301511  1.732051 -1.0
3 -0.816497 -0.904534 -0.577350  1.0

Min-Max normalization using sklearn MinMaxScaler:

import pandas as pd
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
normalized_df = pd.DataFrame(x_scaled)

          0    1    2    3
0  0.333333  1.0  0.0  1.0
1  0.000000  0.0  0.0  0.0
2  1.000000  0.5  1.0  0.0
3  0.000000  0.0  0.0  1.0
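Note that StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() uses the sample standard deviation (ddof=1). That is why the sklearn table differs from the (df - df.mean()) / df.std() table by a constant factor of sqrt(n / (n - 1)). A quick check on a made-up column:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'A': [2.0, 1.0, 4.0, 1.0]})   # made-up column (assumption)

pandas_z = ((df - df.mean()) / df.std())['A'].to_numpy()   # ddof=1
sklearn_z = StandardScaler().fit_transform(df)[:, 0]       # ddof=0

# with n = 4 rows the two differ by exactly sqrt(4 / 3)
np.testing.assert_allclose(sklearn_z, pandas_z * np.sqrt(4 / 3))
```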


Normalize column: sum to 1

You can divide each value in the Weight column by the sum of all the values in the Weight column:

df['Weight']/df['Weight'].sum()

0    0.285714
1    0.142857
2    0.071429
3    0.500000
Name: Weight, dtype: float64
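Self-contained, with weights back-calculated from the output above (an assumption):

```python
import pandas as pd

# weights back-calculated from the output above (an assumption)
df = pd.DataFrame({'Weight': [4, 2, 1, 7]})

normalized = df['Weight'] / df['Weight'].sum()
print(normalized.round(6).tolist())  # → [0.285714, 0.142857, 0.071429, 0.5]
```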

How to normalize multiple columns of dicts in a pandas dataframe

  • Verify the columns are dict type, and not str type.
    • If the columns are str type, convert them with ast.literal_eval.
  • Use pandas.json_normalize() to normalize each column of dicts
  • Use a list-comprehension to rename the columns.
  • Use pandas.concat() with axis=1 to combine the dataframes.
import pandas as pd
from ast import literal_eval

# test dataframe
data = {'time': ['2021-01-12 18:00:00', '2021-01-12 20:15:00', '2021-01-12 20:15:00', '2021-01-13 18:00:00', '2021-01-13 20:15:00', '2021-01-14 20:00:00', '2021-01-15 20:00:00', '2021-01-16 12:30:00', '2021-01-16 15:00:00'], 'home_team': ['Sheff Utd', 'Burnley', 'Wolverhampton', 'Man City', 'Aston Villa', 'Arsenal', 'Fulham', 'Wolverhampton', 'Leeds'], 'away_team': ['Newcastle', 'Man Utd', 'Everton', 'Brighton', 'Tottenham', 'Crystal Palace', 'Chelsea', 'West Brom', 'Brighton'], 'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}", "{'1': 2450, 'X': 3200, '2': 3000}", "{'1': 1180, 'X': 6500, '2': 14000}", "{'1': 2620, 'X': 3500, '2': 2500}", "{'1': 1500, 'X': 4000, '2': 6500}", "{'1': 5750, 'X': 4330, '2': 1530}", "{'1': 1440, 'X': 4200, '2': 7500}", "{'1': 2000, 'X': 3600, '2': 3600}"], 'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", "{'yes': 1900, 'no': 1900}", "{'yes': 1950, 'no': 1800}", "{'yes': 2040, 'no': 1700}", "{'yes': 1570, 'no': 2250}", "{'yes': 1950, 'no': 1800}", "{'yes': 1800, 'no': 1950}", "{'yes': 2250, 'no': 1570}", "{'yes': 1530, 'no': 2370}"], 'double_chance': ["{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 2620, '12': 1180, '2X': 1100}", "{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 1040, '12': 1110, '2X': 4500}", "{'1X': 1500, '12': 1280, '2X': 1440}", "{'1X': 1110, '12': 1220, '2X': 2500}", "{'1X': 2370, '12': 1200, '2X': 1140}", "{'1X': 1100, '12': 1220, '2X': 2620}", "{'1X': 1280, '12': 1280, '2X': 1720}"]}
df = pd.DataFrame(data)

# display(df.head(2))
time home_team away_team full_time_result both_teams_to_score double_chance
0 2021-01-12 18:00:00 Sheff Utd Newcastle {'1': 2400, 'X': 3200, '2': 3100} {'yes': 2000, 'no': 1750} {'1X': 1360, '12': 1360, '2X': 1530}
1 2021-01-12 20:15:00 Burnley Man Utd {'1': 7000, 'X': 4500, '2': 1440} {'yes': 1900, 'no': 1900} {'1X': 2620, '12': 1180, '2X': 1100}

# convert time to datetime
df.time = pd.to_datetime(df.time)

# determine if columns are str or dict type
print(type(df.iloc[0, 3]))
[out]:
str

# convert columns from str to dict only if the columns are str type
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)  # in pandas >= 2.1, applymap is renamed DataFrame.map

# normalize columns and rename headers
ftr = pd.json_normalize(df.full_time_result)
ftr.columns = [f'full_time_result_{col}' for col in ftr.columns]

btts = pd.json_normalize(df.both_teams_to_score)
btts.columns = [f'both_teams_to_score_{col}' for col in btts.columns]

dc = pd.json_normalize(df.double_chance)
dc.columns = [f'double_chance_{col}' for col in dc.columns]

# concat the dataframes
df_normalized = pd.concat([df.iloc[:, :3], ftr, btts, dc], axis=1)

display(df_normalized)

                 time      home_team       away_team  full_time_result_1  full_time_result_X  full_time_result_2  both_teams_to_score_yes  both_teams_to_score_no  double_chance_1X  double_chance_12  double_chance_2X
0 2021-01-12 18:00:00      Sheff Utd       Newcastle                2400                3200                3100                     2000                    1750              1360              1360              1530
1 2021-01-12 20:15:00        Burnley         Man Utd                7000                4500                1440                     1900                    1900              2620              1180              1100
2 2021-01-12 20:15:00  Wolverhampton         Everton                2450                3200                3000                     1950                    1800              1360              1360              1530
3 2021-01-13 18:00:00       Man City        Brighton                1180                6500               14000                     2040                    1700              1040              1110              4500
4 2021-01-13 20:15:00    Aston Villa       Tottenham                2620                3500                2500                     1570                    2250              1500              1280              1440
5 2021-01-14 20:00:00        Arsenal  Crystal Palace                1500                4000                6500                     1950                    1800              1110              1220              2500
6 2021-01-15 20:00:00         Fulham         Chelsea                5750                4330                1530                     1800                    1950              2370              1200              1140
7 2021-01-16 12:30:00  Wolverhampton       West Brom                1440                4200                7500                     2250                    1570              1100              1220              2620
8 2021-01-16 15:00:00          Leeds        Brighton                2000                3600                3600                     1530                    2370              1280              1280              1720

Consolidated Code

# convert the columns to dict type if they are str type
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)

# normalize all columns
df_list = list()

for col in df.columns[3:]:
    v = pd.json_normalize(df[col])
    v.columns = [f'{col}_{c}' for c in v.columns]
    df_list.append(v)

# combine into one dataframe
df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
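The loop, exercised on a two-row toy frame (the column names and odds below are hypothetical), just to show the str → dict → flat-columns pipeline in isolation:

```python
import pandas as pd
from ast import literal_eval

# toy frame with one string-encoded dict column (hypothetical data)
df = pd.DataFrame({
    'team': ['A', 'B'],
    'odds': ["{'1': 2400, 'X': 3200}", "{'1': 7000, 'X': 4500}"],
})

df['odds'] = df['odds'].apply(literal_eval)   # str -> dict
v = pd.json_normalize(df['odds'])             # dicts -> one column per key
v.columns = [f'odds_{c}' for c in v.columns]

df_normalized = pd.concat([df[['team']], v], axis=1)
print(list(df_normalized.columns))  # → ['team', 'odds_1', 'odds_X']
```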

Standardize data columns in R

I have to assume you meant to say that you wanted a mean of 0 and a standard deviation of 1. If your data is in a data frame and all the columns are numeric, you can simply call the scale function on the data to do what you want.

dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)

# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)

Using built-in functions is classy.
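For comparison, a pandas analogue of R's scale (a sketch; note that pandas' .std() defaults to ddof=1, the same sample standard deviation R's sd uses):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dat = pd.DataFrame({'x': rng.normal(30, 0.2, 10), 'y': rng.uniform(3, 5, 10)})

scaled = (dat - dat.mean()) / dat.std()   # ddof=1, matching R's scale()

# check that we get mean of 0 and sd of 1
print(scaled.mean().round(12).tolist(), scaled.std().round(12).tolist())
```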

min max normalization dataframe in pandas

Use MinMaxScaler.

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler = scaler.fit(df)
scaler.transform(df)

Results

array([[0.        , 1.        , 0.11111111],
       [0.25      , 0.        , 0.33333333],
       [1.        , 0.3       , 0.        ],
       [0.5       , 0.7       , 1.        ]])

Now apply the same (already fitted) scaler to new data:

df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})
scaler.transform(df_new)

Results

array([[2.25      , 1.8       , 0.44444444],
       [3.5       , 1.7       , 0.55555556],
       [4.75      , 1.5       , 0.22222222]])
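Because transform reuses the min/max learned from the training frame, new values outside that range map outside [0, 1]; passing MinMaxScaler(clip=True) would cap them at the range edges instead. Verifying the first entry by hand:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
scaler = MinMaxScaler().fit(df)   # learns min/max per column from df

df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})
out = scaler.transform(df_new)    # applies df's min/max, not df_new's
print(out[0, 0])  # → 2.25, i.e. (10 - 1) / (5 - 1)
```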

