Weighted standard deviation in NumPy

How about the following short "manual calculation"?

import math
import numpy

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    values, weights -- NumPy ndarrays with the same shape.
    """
    average = numpy.average(values, weights=weights)
    # Fast and numerically precise:
    variance = numpy.average((values - average)**2, weights=weights)
    return (average, math.sqrt(variance))
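
For example, a quick check with made-up numbers (the data here are illustrative, not from the original question):

values = numpy.array([1.0, 2.0, 3.0, 4.0])
weights = numpy.array([1.0, 1.0, 1.0, 5.0])

avg, std = weighted_avg_and_std(values, weights)
print(avg)  # 3.25 -- pulled toward 4 by its large weight
print(std)  # ~1.0897, the square root of the weighted variance 1.1875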

Calculate weighted statistical moments in Python

I think you have already listed all the ingredients that you need, following the formulas in the link you provided:

import numpy as np

a = np.array([[1,2],[2,5],[3,6],[4,12],[5,1]])
values, weights = a.T

def n_weighted_moment(values, weights, n):
    # Note: `and`, not `&` -- with `&` the shape check is silently dropped
    # due to operator precedence.
    assert n > 0 and values.shape == weights.shape
    w_avg = np.average(values, weights=weights)
    w_var = np.sum(weights * (values - w_avg)**2) / np.sum(weights)

    if n == 1:
        return w_avg
    elif n == 2:
        return w_var
    else:
        w_std = np.sqrt(w_var)
        return np.sum(weights * ((values - w_avg)/w_std)**n) / np.sum(weights)
        # Same as np.average(((values - w_avg)/w_std)**n, weights=weights)

Which results in:

for n in range(1, 5):
    print(f'Moment {n} value is {n_weighted_moment(values, weights, n)}')

Moment 1 value is 3.1923076923076925
Moment 2 value is 1.0784023668639053
Moment 3 value is -0.5962505715592139
Moment 4 value is 2.384432138280637

Notice that this returns the raw standardized fourth moment, not the excess kurtosis: the generic n-moment formula does not subtract the 3 that the excess-kurtosis definition calls for.
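
A minimal sketch of that correction, reusing n_weighted_moment from above:

# Excess kurtosis = standardized fourth moment minus 3
# (3 being the kurtosis of a normal distribution).
excess_kurtosis = n_weighted_moment(values, weights, 4) - 3
print(excess_kurtosis)  # 2.384432... - 3 = -0.615567...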

Getting weighted average and standard deviation on several columns in Pandas

You could use EOL's NumPy-based code to calculate weighted averages and standard deviations. To use it in a Pandas groupby/apply operation, make weighted_average_std return a DataFrame:

import numpy as np
import pandas as pd

def weighted_average_std(grp):
    """
    Based on http://stackoverflow.com/a/2415343/190597 (EOL)
    """
    tmp = grp.select_dtypes(include=[np.number])
    weights = tmp['Weight']
    values = tmp.drop('Weight', axis=1)
    average = np.ma.average(values, weights=weights, axis=0)
    variance = np.dot(weights, (values - average)**2) / weights.sum()
    std = np.sqrt(variance)
    return pd.DataFrame({'mean': average, 'std': std}, index=values.columns)

np.random.seed(0)
df = pd.DataFrame({
    "Date": pd.date_range(start='2018-01-01', end='2018-01-03 18:00:00', freq='6H'),
    "Weight": np.random.uniform(3, 5, 12),
    "V1": np.random.uniform(10, 15, 12),
    "V2": np.random.uniform(10, 15, 12),
    "V3": np.random.uniform(10, 15, 12)})

df.index = df["Date"]
df_agg = df.groupby(pd.Grouper(freq='1D')).apply(weighted_average_std).unstack(-1)
print(df_agg)

yields

                 mean                             std
                   V1         V2         V3         V1        V2        V3
Date
2018-01-01  12.105253  12.314079  13.566136  1.803014  1.725761  0.679279
2018-01-02  13.223172  12.534893  11.860456  1.709583  0.950338  1.153895
2018-01-03  13.782625  12.013557  12.105231  0.969099  1.189149  1.249064
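
Because df_agg has two-level columns (the statistic on the first level, the variable on the second), individual pieces are easy to pull out; for example:

print(df_agg['mean'])          # weighted means of V1, V2, V3 per day
print(df_agg[('std', 'V2')])   # weighted std of V2 per day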

Python - calculate weighted rolling standard deviation

As far as I understand, the function chained after the rolling method takes an array and returns a single number, and it is evaluated once per window. So, if we have a function that calculates the weighted std, we can wrap it in a lambda to get a rolling weighted std. Here is my take (I hope I didn't make a mistake in the weighted-std calculation you provided):

import pandas as pd
import numpy as np

def weighted_std(values, weights):
    # For simplicity, assume len(values) == len(weights)
    # and that all weights are > 0.
    sum_of_weights = np.sum(weights)
    weighted_average = np.sum(values * weights) / sum_of_weights
    n = len(weights)
    # The n/(n - 1) factor acts as a Bessel-style correction
    # on the weighted variance.
    numerator = np.sum(n * weights * (values - weighted_average) ** 2.0)
    denominator = (n - 1) * sum_of_weights
    return np.sqrt(numerator / denominator)

def rolling_std(s, weights):
    window_size = len(weights)
    return s.rolling(center=False, window=window_size).apply(
        lambda win: weighted_std(win, weights))

s = pd.Series(np.random.random([10])) # generate random data
w = np.array([1., 3., 5.]) # choose weights
print(s.values)
print(rolling_std(s, w).values)

Example output:

[ 0.08101966  0.57133241  0.29491028  0.25139964  0.26151065  0.45768199
  0.94459935  0.21534497  0.35999294  0.60242746]
[        nan         nan  0.19701963  0.11936639  0.01539041  0.12097725
  0.33346742  0.40784167  0.25884732  0.17709334]

Here lambda win: weighted_std(win, weights) is a function that takes an array as an input and returns a number.
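
As a quick sanity check (using the s and w defined above), the first non-NaN rolling value should equal a direct call on the first window:

# The first full window covers s[0:3], so these two numbers should agree.
print(weighted_std(s.values[:3], w))
print(rolling_std(s, w).values[2])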

Calculating weighted average in Pandas using NumPy function

Are you looking to group the weighted average by id?

df.groupby('id').apply(lambda x: np.average(x['b'], weights=x['a'])).reset_index(name='Weighted Average')
Out[1]:
   id  Weighted Average
0   2         23.878049
1   3         33.166667
2   5         20.975610

Or if you want to do the weighted average of a / b:

(df.groupby('id').apply(lambda x: np.average(x['a']/x['b'], weights=x['a']))
   .reset_index(name='Weighted Average'))
Out[2]:
   id  Weighted Average
0   2          1.754146
1   3          1.504274
2   5          1.962528
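
For reference, a minimal self-contained sketch of the same pattern; the id/a/b values below are made up, so the numbers will not match the output above:

import numpy as np
import pandas as pd

# Hypothetical data with the same column names as in the question.
df = pd.DataFrame({'id': [2, 2, 3, 3, 5],
                   'a':  [1.0, 3.0, 2.0, 4.0, 5.0],
                   'b':  [10.0, 20.0, 30.0, 40.0, 50.0]})

result = (df.groupby('id')
            .apply(lambda x: np.average(x['b'], weights=x['a']))
            .reset_index(name='Weighted Average'))
print(result)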

Calculating weighted mean and standard deviation

R provides a weighted mean out of the box. In fact, ?weighted.mean shows this example:

 ## GPA from Siegel 1994
wt <- c(5, 5, 4, 1)/15
x <- c(3.7,3.3,3.5,2.8)
xm <- weighted.mean(x, wt)

One more step gives the weighted variance:

v <- sum(wt * (x - xm)^2)
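
For comparison, a direct NumPy translation of those two R snippets (the data are the Siegel 1994 GPA example quoted above; the weighted standard deviation is the square root of v):

import numpy as np

wt = np.array([5, 5, 4, 1]) / 15    # weights, summing to 1
x = np.array([3.7, 3.3, 3.5, 2.8])

xm = np.average(x, weights=wt)      # weighted mean
v = np.sum(wt * (x - xm) ** 2)      # weighted variance
sd = np.sqrt(v)                     # weighted standard deviation
print(xm, v, sd)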

Numpy std (standard deviation) function weird behavior

The NumPy documentation for std states:

The standard deviation is the square root of the average of the
squared deviations from the mean, i.e., std = sqrt(mean(abs(x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N,
where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1
provides an unbiased estimator of the variance of the infinite
population. ddof=0 provides a maximum likelihood estimate of the
variance for normally distributed variables. The standard deviation
computed in this function is the square root of the estimated
variance, so even with ddof=1, it will not be an unbiased estimate
of the standard deviation per se.

Note that, for complex numbers, std takes the absolute value before
squaring, so that the result is always real and nonnegative.

For floating-point input, the std is computed using the same precision
the input has. Depending on the input data, this can cause the results
to be inaccurate, especially for float32 (see example below).
Specifying a higher-accuracy accumulator using the dtype keyword can
alleviate this issue.

a = np.zeros((2, 512*512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
np.std(a)
>>> 0.45000005

but for float64:

a = np.zeros((2, 512*512), dtype=np.float64) 
a[0, :] = 1.0
a[1, :] = 0.1
np.std(a)
>>> 0.45
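
As the quoted documentation suggests, you can also keep the float32 data and just ask np.std for a higher-precision accumulator; a short sketch:

a = np.zeros((2, 512*512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1

# dtype sets the precision of the intermediate accumulation,
# so the result lands much closer to the exact 0.45.
print(np.std(a, dtype=np.float64))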

