Normalize Data in Pandas

Normalize columns of a dataframe

You can use the scikit-learn package and its preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

x = df.values  # convert the DataFrame to a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()  # scales each column to [0, 1] by default
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
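
Note that the last line above rebuilds the DataFrame without the original column labels and index. A minimal tweak, assuming you want to keep both, is to pass them back in (replacing that last line):

df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)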

For more information, see the scikit-learn documentation on preprocessing data, in particular the section on scaling features to a range.

Normalize data in pandas

In [92]: df
Out[92]:
            a         b           c          d
A   -0.488816  0.863769    4.325608  -4.721202
B  -11.937097  2.993993  -12.916784  -1.086236
C   -5.569493  4.672679   -2.168464  -9.315900
D    8.892368  0.932785    4.535396   0.598124

In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())

In [94]: df_norm
Out[94]:
           a          b          c          d
A   0.085789  -0.394348   0.337016  -0.109935
B  -0.463830   0.164926  -0.650963   0.256714
C  -0.158129   0.605652  -0.035090  -0.573389
D   0.536170  -0.376229   0.349037   0.426611

In [95]: df_norm.mean()
Out[95]:
a -2.081668e-17
b 4.857226e-17
c 1.734723e-17
d -1.040834e-17

In [96]: df_norm.max() - df_norm.min()
Out[96]:
a 1
b 1
c 1
d 1
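
The expression above gives zero-mean columns with unit range. If you want the classic 0-1 min-max scaling instead, the same pandas-only pattern works (a sketch using the df from In [92]):

df_minmax = (df - df.min()) / (df.max() - df.min())

Each column of df_minmax then runs from 0 to 1.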

Min-max normalization of a dataframe in pandas

Use MinMaxScaler.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'A': [1, 2, 5, 3], 'B': [10, 0, 3, 7], 'C': [100, 200, 50, 500]})

scaler = MinMaxScaler()
scaler.fit(df)        # learn each column's min and max
scaler.transform(df)

Results

array([[0.        , 1.        , 0.11111111],
       [0.25      , 0.        , 0.33333333],
       [1.        , 0.3       , 0.        ],
       [0.5       , 0.7       , 1.        ]])

Now apply the same fitted scaler to new data:

df_new = pd.DataFrame({'A': [10, 15, 20], 'B': [18, 17, 15], 'C': [250, 300, 150]})
scaler.transform(df_new)

Results

array([[2.25      , 1.8       , 0.44444444],
       [3.5       , 1.7       , 0.55555556],
       [4.75      , 1.5       , 0.22222222]])
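
scaler.transform returns a plain NumPy array. If you prefer to keep the column labels, a small sketch that wraps the result back into a DataFrame (continuing the example above):

df_new_scaled = pd.DataFrame(scaler.transform(df_new),
                             columns=df_new.columns,
                             index=df_new.index)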

How can I normalize data in a pandas dataframe to the starting value of a time series?

If I understand correctly, use GroupBy.transform('first') to get each patient's first Parameter value and divide by it:

df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first'))
print(df)
  Patient  Visit  Parameter  Normalized
0       A      1         44    1.000000
1       A      2         47    1.068182
2       A      3         64    1.454545
3       B      1         67    1.000000
4       B      2         67    1.000000
5       B      3          9    0.134328
6       C      1         83    1.000000
7       C      2         21    0.253012
8       C      3         36    0.433735


df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first')).round(2)
print(df)
  Patient  Visit  Parameter  Normalized
0       A      1         44        1.00
1       A      2         47        1.07
2       A      3         64        1.45
3       B      1         67        1.00
4       B      2         67        1.00
5       B      3          9        0.13
6       C      1         83        1.00
7       C      2         21        0.25
8       C      3         36        0.43

If you need to create a new DataFrame:

df2 = df.assign(Normalized = df['Parameter'].div(df.groupby('Patient')['Parameter'].transform('first')))

A lambda inside transform would also work; see the sketch after the next snippet.

Or:

df2 = df.copy()
df2['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first'))
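
The lambda variant mentioned above could look like this (a sketch; transform passes each patient's Parameter values as a Series, so .iloc[0] is that group's first value):

df['Normalized'] = df.groupby('Patient')['Parameter'].transform(lambda s: s / s.iloc[0])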

Normalize/scale dataframe in a certain range

We can use MinMaxScaler to perform feature scaling. MinMaxScaler supports a feature_range parameter, which lets us specify the desired range of the transformed data.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0.6, 8.4))
df['normalized'] = scaler.fit_transform(df['wind power [W]'].values[:, None])

Alternatively, if you don't want to use MinMaxScaler, here is a way to scale the data in pandas only:

w = df['wind power [W]'].agg(['min', 'max'])
norm = (df['wind power [W]'] - w['min']) / (w['max'] - w['min'])
df['normalized'] = norm * (8.4 - 0.6) + 0.6


print(df)

              DateTime  wind power [W]  normalized
0  2022-02-08 00:00:00            83.9    8.400000
1  2022-02-08 00:10:00            57.2    2.598886
2  2022-02-08 00:20:00            58.2    2.816156
3  2022-02-08 00:30:00            48.0    0.600000
4  2022-02-08 00:40:00            69.5    5.271309
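
The manual formula generalizes to any target range. A small helper (hypothetical name rescale, same math as above):

def rescale(s, lo=0.6, hi=8.4):
    # map a numeric Series linearly onto [lo, hi]
    return (s - s.min()) / (s.max() - s.min()) * (hi - lo) + lo

df['normalized'] = rescale(df['wind power [W]'])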

Un-Normalise Data Frame in Pandas

Transformers in sklearn have an inverse_transform method that does exactly that. However, you seem to normalize the features and the target together, so it can't be used as is. Instead, scale them separately:

from sklearn import preprocessing

# prepare two scalers
X_scaler = preprocessing.MinMaxScaler()
y_scaler = preprocessing.MinMaxScaler()

# features are everything but the target
X = df.drop(columns="target")
y = df["target"]

# scale them separately (MinMaxScaler expects 2-D input, so reshape the 1-D target)
X_scaled = X_scaler.fit_transform(X)
y_scaled = y_scaler.fit_transform(y.values.reshape(-1, 1))

# training..
# ...

# prediction time (preds must have shape (n_samples, 1))
preds = ...
unnormalized_preds = y_scaler.inverse_transform(preds)
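
A minimal round-trip sketch with made-up numbers, showing why the target gets its own scaler and why it must stay 2-D with shape (n_samples, 1):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

y = np.array([[10.0], [20.0], [40.0]])         # target as an (n_samples, 1) array
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y)           # scaled into [0, 1]
y_back = y_scaler.inverse_transform(y_scaled)  # recovers the original values
print(np.allclose(y, y_back))                  # True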

I want to normalize data by dividing every row by the price in the first row

Use:

bova['Norm Close'] = bova['Close'] / bova['Close'].iloc[0]  # divide by the first row's close
print(bova[['Close', 'Norm Close']])

# Output
                 Close  Norm Close
Date
2014-01-02   49.080002    1.000000
2014-01-03   49.259998    1.003667
2014-01-06   49.840000    1.015485
2014-01-07   49.230000    1.003056
2014-01-08   49.279999    1.004075
...                ...         ...
2021-12-23  100.849998    2.054808
2021-12-27  101.599998    2.070090
2021-12-28  101.059998    2.059087
2021-12-29  100.250000    2.042583
2021-12-30  100.800003    2.053790

[1986 rows x 2 columns]
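
The same idea extends to every numeric column at once: dividing the whole frame by its first row rebases each series to 1.0 on the first date (a sketch, assuming all columns are prices):

bova_norm = bova / bova.iloc[0]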

Normalization of a single column of a DataFrame

Try:

df1['Normalize'] = df1.groupby('Symbol')['Close'].transform(lambda x: x / x.iloc[0]).fillna(1)

As commented by Shubham, you can also divide each group by its first value in place:

df1['Close'] /= df1.groupby('Symbol')['Close'].transform('first')


df1:

          Date      Symbol    Close  Normalize
 0  2020-11-23   APLAPOLLO  3247.45   1.000000
 1  2020-11-24   APLAPOLLO  3219.95   0.991532
 2  2020-11-25   APLAPOLLO  3220.45   0.991686
 3  2020-11-26   APLAPOLLO  3178.95   0.978907
 4  2020-11-27   APLAPOLLO  3378.90   1.040478
 5  2020-12-01   APLAPOLLO  3446.85   1.061402
 6  2020-12-02   APLAPOLLO  3514.55   1.082249
 7  2020-12-03   APLAPOLLO  3545.80   1.091872
 8  2020-12-04   APLAPOLLO  3708.60   1.142004
 9  2020-12-07   APLAPOLLO  3868.55   1.191258
10  2020-12-08   APLAPOLLO  3750.30   1.154845
11  2020-12-09   APLAPOLLO  3801.35   1.170565
12  2020-12-10   APLAPOLLO  3766.65   1.159879
13  2020-12-11   APLAPOLLO  3674.30   1.131442
14  2020-12-14   APLAPOLLO  3814.80   1.174706
15  2020-12-15   APLAPOLLO   780.55   0.240358
16  2020-12-16   APLAPOLLO   790.20   0.243329
17  2020-12-17   APLAPOLLO   791.20   0.243637
18  2020-12-18   APLAPOLLO   769.70   0.237017
19  2020-12-21   APLAPOLLO   726.60   0.223745
20  2020-12-22   APLAPOLLO   744.30   0.229195
21  2020-11-23      AUBANK   869.65   1.000000
22  2020-11-24      AUBANK   874.35   1.005404
23  2020-11-25      AUBANK   856.25   0.984592
24  2020-11-26      AUBANK   861.05   0.990111
25  2020-11-27      AUBANK   839.05   0.964813
26  2020-12-01      AUBANK   872.90   1.003737
27  2020-12-02      AUBANK   886.65   1.019548
28  2020-12-03      AUBANK   880.30   1.012246
29  2020-12-04      AUBANK   880.45   1.012419
30  2020-12-07      AUBANK   898.65   1.033347
31  2020-12-08      AUBANK   907.80   1.043868
32  2020-12-09      AUBANK   918.90   1.056632
33  2020-12-10      AUBANK   911.05   1.047605
34  2020-12-11      AUBANK   920.30   1.058242
35  2020-12-14      AUBANK   929.45   1.068763
36  2020-12-15      AUBANK   922.60   1.060887
37  2020-12-16      AUBANK   915.80   1.053067
38  2020-12-17      AUBANK   943.15   1.084517
39  2020-12-18      AUBANK   897.00   1.031449
40  2020-12-21      AUBANK   840.45   0.966423
41  2020-12-22      AUBANK   856.00   0.984304
42  2020-11-23  AARTIDRUGS   711.70   1.000000
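
A quick sanity check on df1: the first Normalize value within each symbol should be exactly 1.0 (a sketch using the frame above):

print(df1.groupby('Symbol')['Normalize'].first())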

