Normalize columns of a dataframe
You can use the package sklearn and its associated preprocessing utilities to normalize the data.
import pandas as pd
from sklearn import preprocessing
x = df.values  # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
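Note that rebuilding the frame this way discards the original column labels and index. A small variant (a sketch, using a hypothetical numeric frame) that keeps them:

```python
import pandas as pd
from sklearn import preprocessing

# hypothetical numeric frame
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 40.0]})

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
# pass the original labels back in so the result stays addressable by name
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)
```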
Normalization of a single column of a dataframe
Try:
df1['Normalize'] = df1.groupby('Symbol')['Close'].transform(lambda x: x / x.iloc[0]).fillna(1)
As commented by Shubham, you can divide each group by its first value with
df1['Close'] /= df1.groupby('Symbol')['Close'].transform('first')
:)
df1:
Date Symbol Close Normalize
0 2020-11-23 APLAPOLLO 3247.45 1.000000
1 2020-11-24 APLAPOLLO 3219.95 0.991532
2 2020-11-25 APLAPOLLO 3220.45 0.991686
3 2020-11-26 APLAPOLLO 3178.95 0.978907
4 2020-11-27 APLAPOLLO 3378.90 1.040478
5 2020-12-01 APLAPOLLO 3446.85 1.061402
6 2020-12-02 APLAPOLLO 3514.55 1.082249
7 2020-12-03 APLAPOLLO 3545.80 1.091872
8 2020-12-04 APLAPOLLO 3708.60 1.142004
9 2020-12-07 APLAPOLLO 3868.55 1.191258
10 2020-12-08 APLAPOLLO 3750.30 1.154845
11 2020-12-09 APLAPOLLO 3801.35 1.170565
12 2020-12-10 APLAPOLLO 3766.65 1.159879
13 2020-12-11 APLAPOLLO 3674.30 1.131442
14 2020-12-14 APLAPOLLO 3814.80 1.174706
15 2020-12-15 APLAPOLLO 780.55 0.240358
16 2020-12-16 APLAPOLLO 790.20 0.243329
17 2020-12-17 APLAPOLLO 791.20 0.243637
18 2020-12-18 APLAPOLLO 769.70 0.237017
19 2020-12-21 APLAPOLLO 726.60 0.223745
20 2020-12-22 APLAPOLLO 744.30 0.229195
21 2020-11-23 AUBANK 869.65 1.000000
22 2020-11-24 AUBANK 874.35 1.005404
23 2020-11-25 AUBANK 856.25 0.984592
24 2020-11-26 AUBANK 861.05 0.990111
25 2020-11-27 AUBANK 839.05 0.964813
26 2020-12-01 AUBANK 872.90 1.003737
27 2020-12-02 AUBANK 886.65 1.019548
28 2020-12-03 AUBANK 880.30 1.012246
29 2020-12-04 AUBANK 880.45 1.012419
30 2020-12-07 AUBANK 898.65 1.033347
31 2020-12-08 AUBANK 907.80 1.043868
32 2020-12-09 AUBANK 918.90 1.056632
33 2020-12-10 AUBANK 911.05 1.047605
34 2020-12-11 AUBANK 920.30 1.058242
35 2020-12-14 AUBANK 929.45 1.068763
36 2020-12-15 AUBANK 922.60 1.060887
37 2020-12-16 AUBANK 915.80 1.053067
38 2020-12-17 AUBANK 943.15 1.084517
39 2020-12-18 AUBANK 897.00 1.031449
40 2020-12-21 AUBANK 840.45 0.966423
41 2020-12-22 AUBANK 856.00 0.984304
42 2020-11-23 AARTIDRUGS 711.70 1.000000
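The same idea as a self-contained sketch, with a hypothetical two-symbol frame: each group's Close is divided by that group's first Close.

```python
import pandas as pd

# hypothetical two-symbol frame
df1 = pd.DataFrame({
    'Symbol': ['A', 'A', 'A', 'B', 'B'],
    'Close': [100.0, 110.0, 90.0, 50.0, 55.0]})

# divide each row by the first Close of its Symbol group
df1['Normalize'] = df1.groupby('Symbol')['Close'].transform(lambda x: x / x.iloc[0])
```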
How to normalize columns in a dataframe
Try:
cols = ['2002', '2003', '2004', '2005']
df[cols] = df[cols] / df[cols].sum()
print(df)
term 2002 2003 2004 2005
0 climate 0.043478 0.454545 0.333333 0.466667
1 global 0.521739 0.500000 0.666667 0.400000
2 nuclear 0.434783 0.045455 0.000000 0.133333
How to normalize all columns of pandas data frame but first/key
Convert the first column to the index, e.g. if the name of the first column is date:
print (df_data)
date a b c
0 2019-08-13 00:30:00 1 2 3
1 2019-08-13 01:00:00 2 3 1
2 2019-08-13 01:30:00 1 1 1
3 2019-08-13 02:00:00 1 1 1
from sklearn import preprocessing
df_data = df_data.set_index('date')
x = df_data.to_numpy()
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_data = pd.DataFrame(x_scaled, columns=df_data.columns, index=df_data.index)
print (df_data)
a b c
date
2019-08-13 00:30:00 0.0 0.5 1.0
2019-08-13 01:00:00 1.0 1.0 0.0
2019-08-13 01:30:00 0.0 0.0 0.0
2019-08-13 02:00:00 0.0 0.0 0.0
In your solution, select all columns except the first with DataFrame.iloc: the : means all rows, and 1: selects every column except the first. Apply the same scaling and assign the result back:
from sklearn import preprocessing
x = df_data.iloc[:, 1:].to_numpy()
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df_data.iloc[:, 1:] = x_scaled
print (df_data)
date a b c
0 2019-08-13 00:30:00 0.0 0.5 1.0
1 2019-08-13 01:00:00 1.0 1.0 0.0
2 2019-08-13 01:30:00 0.0 0.0 0.0
3 2019-08-13 02:00:00 0.0 0.0 0.0
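The same per-column scaling can also be done without sklearn, using only pandas (a sketch, assuming the non-key columns are numeric):

```python
import pandas as pd

# hypothetical frame matching the example above
df_data = pd.DataFrame({
    'date': pd.to_datetime(['2019-08-13 00:30:00', '2019-08-13 01:00:00',
                            '2019-08-13 01:30:00', '2019-08-13 02:00:00']),
    'a': [1.0, 2.0, 1.0, 1.0],
    'b': [2.0, 3.0, 1.0, 1.0],
    'c': [3.0, 1.0, 1.0, 1.0]})

# min-max scale every column except the first, then assign back
num = df_data.iloc[:, 1:]
df_data.iloc[:, 1:] = (num - num.min()) / (num.max() - num.min())
```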
Normalizing the columns of a dataframe
Your code runs column-wise and works correctly. However, there are other types of normalization you might need:
Mean normalization (z-score standardization, as you did):
normalized_df=(df-df.mean())/df.std()
A B C D
0 0.000000 1.305582 -0.5 0.866025
1 -0.707107 -0.783349 -0.5 -0.866025
2 1.414214 0.261116 1.5 -0.866025
3 -0.707107 -0.783349 -0.5 0.866025
Min-Max normalization:
normalized_df=(df-df.min())/(df.max()-df.min())
A B C D
0 0.333333 1.0 0.0 1.0
1 0.000000 0.0 0.0 0.0
2 1.000000 0.5 1.0 0.0
3 0.000000 0.0 0.0 1.0
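Both formulas can be checked on a small hypothetical frame: mean normalization gives each column mean 0 and standard deviation 1, while min-max maps each column onto [0, 1].

```python
import pandas as pd

# hypothetical frame
df = pd.DataFrame({'A': [1.0, 0.0, 3.0, 0.0], 'B': [2.0, 0.0, 1.0, 0.0]})

# mean normalization (pandas' std uses ddof=1 by default)
z = (df - df.mean()) / df.std()
# min-max normalization
mm = (df - df.min()) / (df.max() - df.min())
```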
In sklearn.preprocessing you will find many normalization methods (and more) ready to use, such as StandardScaler, MinMaxScaler or MaxAbsScaler:
Mean normalization using sklearn:
import pandas as pd
from sklearn import preprocessing
mean_scaler = preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
x_scaled = mean_scaler.fit_transform(df.values)
normalized_df = pd.DataFrame(x_scaled)
0 1 2 3
0 0.000000 1.507557 -0.577350 1.0
1 -0.816497 -0.904534 -0.577350 -1.0
2 1.632993 0.301511 1.732051 -1.0
3 -0.816497 -0.904534 -0.577350 1.0
Min-Max normalization using sklearn MinMaxScaler:
import pandas as pd
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
normalized_df = pd.DataFrame(x_scaled)
0 1 2 3
0 0.333333 1.0 0.0 1.0
1 0.000000 0.0 0.0 0.0
2 1.000000 0.5 1.0 0.0
3 0.000000 0.0 0.0 1.0
I hope I have helped you!
Normalize column: sum to 1
You can divide each value in the Weight column by the sum of all the values in the Weight column:
df['Weight']/df['Weight'].sum()
0 0.285714
1 0.142857
2 0.071429
3 0.500000
Name: Weight, dtype: float64
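A runnable sketch with hypothetical weights consistent with the output above; the resulting series sums to 1.

```python
import pandas as pd

# hypothetical weights consistent with the output shown
df = pd.DataFrame({'Weight': [4, 2, 1, 7]})

# each weight divided by the column total
normalized = df['Weight'] / df['Weight'].sum()
```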
How to normalize multiple columns of dicts in a pandas dataframe
- Verify the columns are dict type, not str type. If the columns are str type, convert them with ast.literal_eval.
- Use pandas.json_normalize() to normalize each column of dicts.
- Use a list comprehension to rename the columns.
- Use pandas.concat() with axis=1 to combine the dataframes.
import pandas as pd
from ast import literal_eval
# test dataframe
data = {'time': ['2021-01-12 18:00:00', '2021-01-12 20:15:00', '2021-01-12 20:15:00', '2021-01-13 18:00:00', '2021-01-13 20:15:00', '2021-01-14 20:00:00', '2021-01-15 20:00:00', '2021-01-16 12:30:00', '2021-01-16 15:00:00'], 'home_team': ['Sheff Utd', 'Burnley', 'Wolverhampton', 'Man City', 'Aston Villa', 'Arsenal', 'Fulham', 'Wolverhampton', 'Leeds'], 'away_team': ['Newcastle', 'Man Utd', 'Everton', 'Brighton', 'Tottenham', 'Crystal Palace', 'Chelsea', 'West Brom', 'Brighton'], 'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}", "{'1': 7000, 'X': 4500, '2': 1440}", "{'1': 2450, 'X': 3200, '2': 3000}", "{'1': 1180, 'X': 6500, '2': 14000}", "{'1': 2620, 'X': 3500, '2': 2500}", "{'1': 1500, 'X': 4000, '2': 6500}", "{'1': 5750, 'X': 4330, '2': 1530}", "{'1': 1440, 'X': 4200, '2': 7500}", "{'1': 2000, 'X': 3600, '2': 3600}"], 'both_teams_to_score': ["{'yes': 2000, 'no': 1750}", "{'yes': 1900, 'no': 1900}", "{'yes': 1950, 'no': 1800}", "{'yes': 2040, 'no': 1700}", "{'yes': 1570, 'no': 2250}", "{'yes': 1950, 'no': 1800}", "{'yes': 1800, 'no': 1950}", "{'yes': 2250, 'no': 1570}", "{'yes': 1530, 'no': 2370}"], 'double_chance': ["{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 2620, '12': 1180, '2X': 1100}", "{'1X': 1360, '12': 1360, '2X': 1530}", "{'1X': 1040, '12': 1110, '2X': 4500}", "{'1X': 1500, '12': 1280, '2X': 1440}", "{'1X': 1110, '12': 1220, '2X': 2500}", "{'1X': 2370, '12': 1200, '2X': 1140}", "{'1X': 1100, '12': 1220, '2X': 2620}", "{'1X': 1280, '12': 1280, '2X': 1720}"]}
df = pd.DataFrame(data)
# display(df.head(2))
time home_team away_team full_time_result both_teams_to_score double_chance
0 2021-01-12 18:00:00 Sheff Utd Newcastle {'1': 2400, 'X': 3200, '2': 3100} {'yes': 2000, 'no': 1750} {'1X': 1360, '12': 1360, '2X': 1530}
1 2021-01-12 20:15:00 Burnley Man Utd {'1': 7000, 'X': 4500, '2': 1440} {'yes': 1900, 'no': 1900} {'1X': 2620, '12': 1180, '2X': 1100}
# convert time to datetime
df.time = pd.to_datetime(df.time)
# determine if columns are str or dict type
print(type(df.iloc[0, 3]))
[out]:
str
# convert columns from str to dict only if the columns are str type
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
# normalize columns and rename headers
ftr = pd.json_normalize(df.full_time_result)
ftr.columns = [f'full_time_result_{col}' for col in ftr.columns]
btts = pd.json_normalize(df.both_teams_to_score)
btts.columns = [f'both_teams_to_score_{col}' for col in btts.columns]
dc = pd.json_normalize(df.double_chance)
dc.columns = [f'double_chance_{col}' for col in dc.columns]
# concat the dataframes
df_normalized = pd.concat([df.iloc[:, :3], ftr, btts, dc], axis=1)
display(df_normalized)
time home_team away_team full_time_result_1 full_time_result_X full_time_result_2 both_teams_to_score_yes both_teams_to_score_no double_chance_1X double_chance_12 double_chance_2X
0 2021-01-12 18:00:00 Sheff Utd Newcastle 2400 3200 3100 2000 1750 1360 1360 1530
1 2021-01-12 20:15:00 Burnley Man Utd 7000 4500 1440 1900 1900 2620 1180 1100
2 2021-01-12 20:15:00 Wolverhampton Everton 2450 3200 3000 1950 1800 1360 1360 1530
3 2021-01-13 18:00:00 Man City Brighton 1180 6500 14000 2040 1700 1040 1110 4500
4 2021-01-13 20:15:00 Aston Villa Tottenham 2620 3500 2500 1570 2250 1500 1280 1440
5 2021-01-14 20:00:00 Arsenal Crystal Palace 1500 4000 6500 1950 1800 1110 1220 2500
6 2021-01-15 20:00:00 Fulham Chelsea 5750 4330 1530 1800 1950 2370 1200 1140
7 2021-01-16 12:30:00 Wolverhampton West Brom 1440 4200 7500 2250 1570 1100 1220 2620
8 2021-01-16 15:00:00 Leeds Brighton 2000 3600 3600 1530 2370 1280 1280 1720
Consolidated Code
# convert the columns to dict type if they are str type
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)
# normalize all columns
df_list = list()
for col in df.columns[3:]:
v = pd.json_normalize(df[col])
v.columns = [f'{col}_{c}' for c in v.columns]
df_list.append(v)
# combine into one dataframe
df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
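The consolidated loop above, reduced to a self-contained sketch with a single dict-valued column taken from the data in this answer:

```python
import pandas as pd
from ast import literal_eval

# one dict-valued column stored as strings, as in the data above
df = pd.DataFrame({
    'time': ['2021-01-12 18:00:00'],
    'home_team': ['Sheff Utd'],
    'away_team': ['Newcastle'],
    'full_time_result': ["{'1': 2400, 'X': 3200, '2': 3100}"]})

# convert the string columns to dicts
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(literal_eval)

# normalize each dict column and prefix the new headers with the source column name
df_list = []
for col in df.columns[3:]:
    v = pd.json_normalize(df[col])
    v.columns = [f'{col}_{c}' for c in v.columns]
    df_list.append(v)

# combine into one dataframe
df_normalized = pd.concat([df.iloc[:, :3]] + df_list, axis=1)
```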
Standardize data columns in R
I have to assume you meant that you want a mean of 0 and a standard deviation of 1. If your data is in a dataframe and all the columns are numeric, you can simply call the scale function on the data to do what you want.
dat <- data.frame(x = rnorm(10, 30, .2), y = runif(10, 3, 5))
scaled.dat <- scale(dat)
# check that we get mean of 0 and sd of 1
colMeans(scaled.dat) # faster version of apply(scaled.dat, 2, mean)
apply(scaled.dat, 2, sd)
Using built-in functions is classy.
min max normalization dataframe in pandas
Use MinMaxScaler.
df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler = scaler.fit(df)
scaler.transform(df)
Results
array([[0. , 1. , 0.11111111],
[0.25 , 0. , 0.33333333],
[1. , 0.3 , 0. ],
[0.5 , 0.7 , 1. ]])
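MinMaxScaler applies (x - min) / (max - min) column-wise, so the result above can be checked by hand against the same arithmetic in pandas:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})
scaler = MinMaxScaler().fit(df)
scaled = scaler.transform(df)

# the same scaling computed by hand
manual = (df - df.min()) / (df.max() - df.min())
```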
Now use the same scaler on new data:
df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})
scaler.transform(df_new)
Results
array([[2.25 , 1.8 , 0.44444444],
[3.5 , 1.7 , 0.55555556],
[4.75 , 1.5 , 0.22222222]])