
# Python: Scaling Numbers Column by Column With Pandas

## Python: Scaling numbers column by column with pandas

You could subtract the min, then divide by the max (beware of 0/0 if a column is constant). Note that after subtracting the min, the new max is the original max minus the min.

```
In : df
Out:
    a    b
A  14  103
B  90  107
C  90  110
D  96  114
E  91  114

In : df -= df.min()  # equivalent to df = df - df.min()

In : df /= df.max()  # equivalent to df = df / df.max()

In : df
Out:
          a         b
A  0.000000  0.000000
B  0.926829  0.363636
C  0.926829  0.636364
D  1.000000  1.000000
E  0.939024  1.000000
```

To switch the order of a column (from 1 to 0 rather than 0 to 1):

```
In : df['b'] = 1 - df['b']
```

An alternative method is to negate the b columns first (`df['b'] = -df['b']`).
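Putting the steps above together, a minimal runnable sketch (the sample data mirrors the frame shown above):

```python
import pandas as pd

# Sample data matching the frame in the answer above
df = pd.DataFrame({'a': [14, 90, 90, 96, 91],
                   'b': [103, 107, 110, 114, 114]},
                  index=list('ABCDE'))

df -= df.min()         # new minimum of every column is 0
df /= df.max()         # new maximum of every column is 1
df['b'] = 1 - df['b']  # flip b so it runs from 1 down to 0

print(df)
```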

## how to scale columns by column from another Pandas dataframe

Assuming the columns are unique, and there are no duplicates in `scaling`, you could use `map`:

```
df.mul(df.columns.map(scaling.set_index("id").scaling))

     A    B    C
0  0.2  1.2  2.8
1  0.4  1.5  3.2
2  0.6  1.8  3.6
```
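As a self-contained sketch, with hypothetical inputs chosen to reproduce the output above (a `scaling` frame keyed by column name in an `id` column):

```python
import pandas as pd

# Hypothetical data: per-column factors stored in a separate frame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
scaling = pd.DataFrame({'id': ['A', 'B', 'C'],
                        'scaling': [0.2, 0.3, 0.4]})

# Map each column label to its factor, then multiply column-wise
factors = df.columns.map(scaling.set_index('id')['scaling'])
out = df.mul(factors)
print(out)
```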

## Scaling pandas column to be between specified min and max numbers

Just change `a, b = 10, 50` to `a, b = 0, 1` in the linked answer to set the lower and upper values for the scale:

```
a, b = 0, 1
x, y = df.Frequency.min(), df.Frequency.max()
df['normal'] = (df.Frequency - x) / (y - x) * (b - a) + a
print (df)
   Frequency    normal
0         20  1.000000
1         14  0.684211
2         10  0.473684
3          8  0.368421
4          6  0.263158
5          2  0.052632
6          1  0.000000
```

## Scaling / Normalizing pandas column

Option 1

`sklearn`

You see this problem time and time again, and the error really should be indicative of what you need to do: the scaler expects 2-D input, but `df["TOTAL"]` is a 1-D Series. Change `df["TOTAL"]` to `df[["TOTAL"]]` to pass a single-column DataFrame instead.

```
df['SIZE'] = scaler.fit_transform(df[["TOTAL"]])
```

```
df

   TOTAL   Name       SIZE
0   3232   Jane  24.413959
1    382   Jack  10.000000
2   8291  Jones  50.000000
```

Option 2

`pandas`

Preferably, I would bypass sklearn and just do the min-max scaling myself.

```
a, b = 10, 50
x, y = df.TOTAL.min(), df.TOTAL.max()
df['SIZE'] = (df.TOTAL - x) / (y - x) * (b - a) + a
```

```
df

   TOTAL   Name       SIZE
0   3232   Jane  24.413959
1    382   Jack  10.000000
2   8291  Jones  50.000000
```

This is essentially what the min-max scaler does, but without the overhead of importing scikit-learn (don't do it unless you have to; it's a heavy library).
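Wrapped as a small helper (the function name and sample values below are mine), the pandas-only approach generalises to any target range:

```python
import pandas as pd

def minmax_scale(s, lo, hi):
    """Scale a Series linearly so its min maps to lo and its max to hi."""
    x, y = s.min(), s.max()
    return (s - x) / (y - x) * (hi - lo) + lo

# Sample values from the answer above
total = pd.Series([3232, 382, 8291])
scaled = minmax_scale(total, 10, 50)
print(scaled)
```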

## pandas dataframe columns scaling with sklearn

I am not sure if previous versions of `pandas` prevented this, but now the following snippet works perfectly for me and produces exactly what you want, without having to use `apply`:

```
>>> import pandas as pd
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> dfTest = pd.DataFrame({'A':[14.00,90.20,90.95,96.27,91.21],
...                        'B':[103.02,107.26,110.35,114.23,114.68],
...                        'C':['big','small','big','small','small']})
>>> dfTest[['A', 'B']] = scaler.fit_transform(dfTest[['A', 'B']])
>>> dfTest
          A         B      C
0  0.000000  0.000000    big
1  0.926219  0.363636  small
2  0.935335  0.628645    big
3  1.000000  0.961407  small
4  0.938495  1.000000  small
```

## Scaling euler number in Pandas Column

This is scientific notation, which is pandas' way of displaying very large or very small floats.

Although not necessary, several methods exist if you wish to display your floats in another format:

1. use `apply()`

```
df.apply(lambda x: '%.5f' % x, axis=1)
```

2. set the global options of pandas

```
pd.set_option('display.float_format', lambda x: '%.5f' % x)
```

3. use `df.round()`. This only works if you have very small numbers with a lot of decimals

```
df.round(2)
```
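A minimal sketch of option 2, using made-up values that pandas would otherwise display in scientific notation. Note it changes display formatting only; the stored floats are untouched:

```python
import pandas as pd

# Values pandas would normally display as 1.500000e-07 and 2.300000e+08
df = pd.DataFrame({'x': [1.5e-7, 2.3e8]})

# Change the global display format; the underlying data is unchanged
pd.set_option('display.float_format', lambda v: '%.5f' % v)
formatted = repr(df)
print(formatted)

pd.reset_option('display.float_format')  # restore the default
```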

## Normalize/scale dataframe in a certain range

We can use `MinMaxScaler` to perform feature scaling. `MinMaxScaler` supports a parameter called `feature_range`, which allows us to specify the desired range of the transformed data:

```
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0.6, 8.4))
df['normalized'] = scaler.fit_transform(df['wind power [W]'].values[:, None])
```

Alternatively, if you don't want to use `MinMaxScaler`, here is a way to scale the data in pandas only:

```
w = df['wind power [W]'].agg(['min', 'max'])
norm = (df['wind power [W]'] - w['min']) / (w['max'] - w['min'])
df['normalized'] = norm * (8.4 - 0.6) + 0.6
```

```
print(df)

             DateTime  wind power [W]  normalized
0 2022-02-08 00:00:00            83.9    8.400000
1 2022-02-08 00:10:00            57.2    2.598886
2 2022-02-08 00:20:00            58.2    2.816156
3 2022-02-08 00:30:00            48.0    0.600000
4 2022-02-08 00:40:00            69.5    5.271309
```

## Normalize columns of pandas data frame

You can use the package sklearn and its associated preprocessing utilities to normalize the data.

```
import pandas as pd
from sklearn import preprocessing

x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
```

For more information look at the scikit-learn documentation on preprocessing data: scaling features to a range.
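Note that `fit_transform` returns a bare NumPy array, so the column names and index are lost; pass `columns=df.columns, index=df.index` when rebuilding the DataFrame to keep them. A pandas-only equivalent (sketched below with made-up sample values) keeps the labels automatically:

```python
import pandas as pd

# Made-up sample data for illustration
df = pd.DataFrame({'A': [14.0, 90.2, 91.2], 'B': [103.0, 107.3, 114.7]},
                  index=['r1', 'r2', 'r3'])

# Min-max normalization per column; labels are preserved
scaled = (df - df.min()) / (df.max() - df.min())
print(scaled)
```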

## How to re-scale a column by percentage change and start from a given number

```
c = 100
for i in range(2, 5):
    c[i] = c[i-1] * (1+b[i])
```

The way you allocate/assign to `c` is incorrect.

You first need to allocate `c` with the appropriate length, then assign the first element, which is at index `0`, not `1`; the loop should then start from 1. Arrays/lists in Python are 0-indexed, so an array of length 5 is indexed from 0 to 4.

Try this:

```
a = pd.Series([4, 5, 6, 3, 2])

# no need for fillna, as the first element is never used;
# it is better to leave it as NaN to avoid confusion with "no change"
b = a.pct_change()

c = pd.Series([0] * len(a))
c[0] = 100
for i in range(1, len(a)):
    c[i] = c[i-1] * (1 + b[i])
```

For the chosen `a`, you get the following `c`:

```
0    100
1    125
2    150
3     75
4     50
dtype: int64
```

Note that you cannot get rid of the for-loop, because your calculation has a sequential dependence (depends on the previous element); vectorisation requires every element be calculated independently. If someone else has a vectorised solution, I would be happy to know.
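That said, this particular recurrence unrolls into a running product (c[i] = 100 times the product of (1 + b[k]) for k up to i), so `cumprod()` can replace the loop in this specific case. A sketch under that observation:

```python
import pandas as pd

# c[i] = c[i-1] * (1 + b[i]) unrolls to a cumulative product,
# so cumprod() reproduces the loop for this recurrence
a = pd.Series([4, 5, 6, 3, 2])
b = a.pct_change()
c = 100 * (1 + b.fillna(0)).cumprod()
print(c)
```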

## Python: Scale columns in pandas dataframe

Multiply the DataFrame by a dictionary; this works well if the keys are the same as the column names:

```
df = df.mul(scalingDictionary)
print (df)
      a     b    c
0  20.0  15.0  0.1
1  40.0  30.0  0.2
2  60.0  45.0  0.3
3  80.0  60.0  0.4
```

If some columns don't match:

```
scalingDictionary = {'a': 10, 'b': 5}
df = pd.DataFrame({'a':[2,4,6,8], 'b':[3,6,9,12], 'c':[1,2,3,4]})

df = df.mul(pd.Series(scalingDictionary).reindex(df.columns, fill_value=1))
print (df)
    a   b  c
0  20  15  1
1  40  30  2
2  60  45  3
3  80  60  4
```

Or:

```
df = df.mul({**dict.fromkeys(df.columns, 1), **scalingDictionary})
print (df)
    a   b  c
0  20  15  1
1  40  30  2
2  60  45  3
3  80  60  4
```