Python: Scaling Numbers Column by Column With Pandas

Python: Scaling numbers column by column with pandas

You could subtract the min, then divide by the max (beware of 0/0, which occurs when a column is constant). Note that after subtracting the min, the new max is the original max - min.

In [11]: df
Out[11]:
a b
A 14 103
B 90 107
C 90 110
D 96 114
E 91 114

In [12]: df -= df.min() # equivalent to df = df - df.min()

In [13]: df /= df.max() # equivalent to df = df / df.max()

In [14]: df
Out[14]:
a b
A 0.000000 0.000000
B 0.926829 0.363636
C 0.926829 0.636364
D 1.000000 1.000000
E 0.939024 1.000000

To switch the order of a column (from 1 to 0 rather than 0 to 1):

In [15]: df['b'] = 1 - df['b']

An alternative method is to negate the b column first (df['b'] = -df['b']).
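
The 0/0 caveat mentioned above arises when a column is constant, because both the shifted values and the range are then zero. Below is a minimal sketch of one way to guard against that, assuming a constant column should simply map to 0. The helper name min_max_scale and the extra constant column k are hypothetical and not part of the original answer.

```python
import pandas as pd


def min_max_scale(frame: pd.DataFrame) -> pd.DataFrame:
    """Scale each column to [0, 1]; constant columns become 0 instead of NaN."""
    shifted = frame - frame.min()
    span = frame.max() - frame.min()
    # A zero span would give 0/0; dividing by 1 instead yields 0 for that column.
    return shifted / span.replace(0, 1)


df = pd.DataFrame(
    {"a": [14, 90, 90, 96, 91], "b": [103, 107, 110, 114, 114], "k": [5, 5, 5, 5, 5]},
    index=list("ABCDE"),
)
scaled = min_max_scale(df)
print(scaled)
```

Unlike the in-place -= and /= above, this returns a new DataFrame and leaves df untouched.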

How to scale columns by a column from another pandas DataFrame

Assuming the columns are unique, and there are no duplicates in scaling, you could use map:

df.mul(df.columns.map(scaling.set_index("id").scaling))

A B C
0 0.2 1.2 2.8
1 0.4 1.5 3.2
2 0.6 1.8 3.6
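
The original answer does not show its inputs, so here is a self-contained sketch of the same idea using label alignment instead of map: multiplying by a Series indexed by id lets pandas match each factor to the column of the same name. The two frames below are assumptions chosen to be consistent with the printed result.

```python
import pandas as pd

# Assumed inputs (not shown in the original answer).
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
scaling = pd.DataFrame({"id": ["C", "A", "B"], "scaling": [0.4, 0.2, 0.3]})

# One factor per column label; the row order in `scaling` does not matter.
factors = scaling.set_index("id")["scaling"]

# axis="columns" aligns the Series index with df's column labels.
result = df.mul(factors, axis="columns")
print(result)
```

Because the match is by label, a column that has no entry in scaling would come out as NaN rather than raise an error.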

Scaling pandas column to be between specified min and max numbers

Just change a, b = 10, 50 to a, b = 0, 1 in the linked answer; a and b are the lower and upper bounds of the target scale:

a, b = 0, 1
x, y = df.Frequency.min(), df.Frequency.max()
df['normal'] = (df.Frequency - x) / (y - x) * (b - a) + a
print (df)
Frequency normal
0 20 1.000000
1 14 0.684211
2 10 0.473684
3 8 0.368421
4 6 0.263158
5 2 0.052632
6 1 0.000000
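
The same formula can be wrapped in a small function so that the target bounds become parameters. This is only a sketch: the function name rescale and the extra column wide are hypothetical, and it assumes the column is not constant (otherwise y - x is zero).

```python
import pandas as pd


def rescale(values: pd.Series, a: float = 0.0, b: float = 1.0) -> pd.Series:
    """Linearly map the minimum of `values` to a and the maximum to b."""
    x, y = values.min(), values.max()
    return (values - x) / (y - x) * (b - a) + a


df = pd.DataFrame({"Frequency": [20, 14, 10, 8, 6, 2, 1]})
df["normal"] = rescale(df["Frequency"])            # bounds 0 and 1
df["wide"] = rescale(df["Frequency"], a=10, b=50)  # bounds 10 and 50
print(df)
```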

Scaling / Normalizing pandas column

Option 1

sklearn

This problem comes up time and time again, and the error message indicates what you need to do: the scaler expects a two-dimensional input, but df["TOTAL"] is a one-dimensional Series. Change df["TOTAL"] to df[["TOTAL"]], which selects a one-column DataFrame.

# scaler comes from the question; the output below matches MinMaxScaler(feature_range=(10, 50))
df['SIZE'] = scaler.fit_transform(df[["TOTAL"]])
df
TOTAL Name SIZE
0 3232 Jane 24.413959
1 382 Jack 10.000000
2 8291 Jones 50.000000
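
The two selections differ only in dimensionality, which can be inspected without scikit-learn. A small sketch using the three rows shown above (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"TOTAL": [3232, 382, 8291], "Name": ["Jane", "Jack", "Jones"]})

one_d = df["TOTAL"]    # Series: one dimension
two_d = df[["TOTAL"]]  # DataFrame with a single column: two dimensions

print(one_d.shape)  # (3,)
print(two_d.shape)  # (3, 1)

# Reshaping the underlying array gives the same 2D shape.
reshaped = df["TOTAL"].to_numpy().reshape(-1, 1)
print(reshaped.shape)  # (3, 1)
```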

Option 2

pandas

Preferably, I would bypass sklearn and just do the min-max scaling myself.

a, b = 10, 50
x, y = df.TOTAL.min(), df.TOTAL.max()
df['SIZE'] = (df.TOTAL - x) / (y - x) * (b - a) + a
df
TOTAL Name SIZE
0 3232 Jane 24.413959
1 382 Jack 10.000000
2 8291 Jones 50.000000

This is essentially what the min-max scaler does, but without the overhead of importing scikit-learn (don't import it unless you have to; it's a heavy library).

pandas dataframe columns scaling with sklearn

I am not sure whether previous versions of pandas prevented this, but the following snippet now works for me and produces the desired result without having to use apply:

>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler


>>> scaler = MinMaxScaler()

>>> dfTest = pd.DataFrame({'A': [14.00, 90.20, 90.95, 96.27, 91.21],
...                        'B': [103.02, 107.26, 110.35, 114.23, 114.68],
...                        'C': ['big', 'small', 'big', 'small', 'small']})

>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])

>>> dfTest
A B C
0 0.000000 0.000000 big
1 0.926219 0.363636 small
2 0.935335 0.628645 big
3 1.000000 0.961407 small
4 0.938495 1.000000 small

Scaling euler number in Pandas Column

This is pandas' scientific notation, which is its way of displaying very large or very small floats.

Although not necessary, multiple methods exist if you wish to convert your floats to another format:

1. use apply() with map() to format every value (the result contains strings, not floats)

df.apply(lambda col: col.map(lambda x: '%.5f' % x))

2. set the global options of pandas

pd.set_option('display.float_format', lambda x: '%.5f' %x)

3. use df.round(). This only helps when the numbers have a lot of decimals; very large values are still shown in scientific notation

df.round(2)
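
To see how the three approaches differ, here is a short sketch on made-up values (the column name x and the numbers are assumptions). Formatting returns strings, the display option only changes how floats are printed, and rounding changes the stored values.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.5e-7, 2.25e8]})

# Formatting produces strings; the floats stored in df are unchanged.
as_text = df["x"].map("{:.5f}".format)
print(as_text.tolist())

# option_context limits the display setting to the with-block,
# unlike pd.set_option, which changes it globally.
with pd.option_context("display.float_format", "{:.5f}".format):
    print(df)

# round() returns a new DataFrame with altered values.
rounded = df.round(2)
print(rounded)
```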

Normalize/scale dataframe in a certain range

We can use MinMaxScaler to perform feature scaling. It supports a parameter called feature_range, which lets us specify the desired range of the transformed data:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0.6, 8.4))
df['normalized'] = scaler.fit_transform(df['wind power [W]'].values[:, None])

Alternatively, if you don't want to use MinMaxScaler, here is a way to scale the data in pandas only:

w = df['wind power [W]'].agg(['min', 'max'])
norm = (df['wind power [W]'] - w['min']) / (w['max'] - w['min'])
df['normalized'] = norm * (8.4 - 0.6) + 0.6


print(df)

DateTime wind power [W] normalized
0 2022-02-08 00:00:00 83.9 8.400000
1 2022-02-08 00:10:00 57.2 2.598886
2 2022-02-08 00:20:00 58.2 2.816156
3 2022-02-08 00:30:00 48.0 0.600000
4 2022-02-08 00:40:00 69.5 5.271309

Normalize columns of pandas data frame

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

import pandas as pd
from sklearn import preprocessing

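x = df.values  # returns a NumPy array; all columns must be numeric
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# pass the original labels back in, since the NumPy array does not carry them
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)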

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
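
Passing df.values to the scaler only works when every column is numeric. As a pandas-only alternative (not the scikit-learn approach above), the sketch below scales just the numeric columns and leaves the others alone. It reuses the sample data from the earlier MinMaxScaler example; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [14.00, 90.20, 90.95, 96.27, 91.21],
    "B": [103.02, 107.26, 110.35, 114.23, 114.68],
    "C": ["big", "small", "big", "small", "small"],
})

# Pick the numeric columns automatically; "C" is skipped.
num_cols = df.select_dtypes(include="number").columns

lo = df[num_cols].min()
hi = df[num_cols].max()

scaled = df.copy()
scaled[num_cols] = (df[num_cols] - lo) / (hi - lo)
print(scaled)
```

Column names and the index are preserved because the result never leaves pandas.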

How to re-scale a column by percentage change and start from a given number

c[1] = 100
for i in range(2, 5):
    c[i] = c[i-1] * (1+b[i])

The way you allocate/assign to c is incorrect.

You first need to allocate c with the appropriate length, then assign the first element, whose index is 0, not 1, and start the loop from 1. Arrays and lists in Python are 0-indexed, meaning an array of length 5 is indexed from 0 to 4.

Try this:

a = pd.Series([4, 5, 6, 3, 2])

# no need for the fillna, as the first element is never used
# it is better to leave it as NaN to avoid confusion with no change
b = a.pct_change()

c = pd.Series([0] * len(a))
c[0] = 100
for i in range(1, len(a)):
    c[i] = c[i-1] * (1 + b[i])

For the chosen a, you get the following c:

0 100
1 125
2 150
3 75
4 50
dtype: int64

Note that the calculation has a sequential dependence (each element depends on the previous one), which is why it is written as a for-loop here. The recurrence is a running product, though, so a cumulative product such as 100 * (1 + b.fillna(0)).cumprod() should give the same values without the loop.
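
Since each c[i] is c[i-1] multiplied by a factor, the whole series is the starting value times a running product of those factors. Here is a minimal vectorised sketch under that observation, using the same a as above; the names c_vectorised and c_direct are hypothetical.

```python
import pandas as pd

a = pd.Series([4, 5, 6, 3, 2])
b = a.pct_change()

# The first pct_change is NaN; treating it as "no change" makes the
# running product start at 1, so the series starts at 100.
c_vectorised = 100 * (1 + b.fillna(0)).cumprod()

# Equivalent shortcut: the product of the growth factors telescopes
# to a[i] / a[0], so rescaling a directly gives the same result.
c_direct = 100 * a / a.iloc[0]

print(c_vectorised)
print(c_direct)
```

Both results are float Series, so they match the loop's values up to floating-point rounding rather than having dtype int64.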

Python : Scale columns in pandas dataframe

Multiply the DataFrame by the dictionary; this works well if the keys are the same as the column names:

df = df.mul(scalingDictionary)    
print (df)
a b c
0 20.0 15.0 0.1
1 40.0 30.0 0.2
2 60.0 45.0 0.3
3 80.0 60.0 0.4

If some columns do not have a matching key:

scalingDictionary = {'a': 10, 'b': 5} 

df = pd.DataFrame({'a':[2,4,6,8], 'b':[3,6,9,12], 'c':[1,2,3,4]})

df = df.mul(pd.Series(scalingDictionary).reindex(df.columns, fill_value=1))
print (df)
a b c
0 20 15 1
1 40 30 2
2 60 45 3
3 80 60 4

Or:

df = df.mul({**dict.fromkeys(df.columns, 1), **scalingDictionary})
print (df)
a b c
0 20 15 1
1 40 30 2
2 60 45 3
3 80 60 4

