Weighted standard deviation in NumPy
How about the following short "manual calculation"?
import math
import numpy

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.
    values, weights -- NumPy ndarrays with the same shape.
    """
    average = numpy.average(values, weights=weights)
    # Fast and numerically precise:
    variance = numpy.average((values - average)**2, weights=weights)
    return (average, math.sqrt(variance))
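As a quick sanity check (a sketch, not part of the original answer): when the weights are non-negative integer counts, the result should match the ordinary population statistics of the data with each value repeated that many times. The sample arrays below are made up for illustration.

```python
import math
import numpy as np

def weighted_avg_and_std(values, weights):
    """Return the weighted average and the population weighted std."""
    average = np.average(values, weights=weights)
    variance = np.average((values - average) ** 2, weights=weights)
    return average, math.sqrt(variance)

# Hypothetical sample data: integer weights act as repeat counts.
values = np.array([1.0, 2.0, 4.0])
weights = np.array([3, 1, 2])

avg, std = weighted_avg_and_std(values, weights)

# Expanding the data by its counts gives the same mean and std (ddof=0).
expanded = np.repeat(values, weights)
print(np.isclose(avg, expanded.mean()), np.isclose(std, expanded.std()))
```

Note that this is the population form (divisor equal to the sum of the weights); it applies no small-sample bias correction.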
Calculate weighted statistical moments in Python
I think you have already listed all the ingredients that you need, following the formulas in the link you provided:
import numpy as np
a = np.array([[1,2],[2,5],[3,6],[4,12],[5,1]])
values, weights = a.T
def n_weighted_moment(values, weights, n):
    # Use `and`, not `&`: `&` binds tighter than `>`, so the shape check was silently skipped.
    assert n > 0 and values.shape == weights.shape
    w_avg = np.average(values, weights=weights)
    w_var = np.sum(weights * (values - w_avg)**2) / np.sum(weights)
    if n == 1:
        return w_avg
    elif n == 2:
        return w_var
    else:
        w_std = np.sqrt(w_var)
        # Same as np.average(((values - w_avg)/w_std)**n, weights=weights)
        return np.sum(weights * ((values - w_avg)/w_std)**n) / np.sum(weights)
Which results in:
for n in range(1, 5):
    print(f'Moment {n} value is {n_weighted_moment(values, weights, n)}')
Moment 1 value is 3.1923076923076925
Moment 2 value is 1.0784023668639053
Moment 3 value is -0.5962505715592139
Moment 4 value is 2.384432138280637
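The fourth value printed above is the plain (non-excess) kurtosis. A minimal sketch of getting the excess kurtosis from it, assuming the same population-weighted definition; the helper name weighted_standardized_moment is made up for this illustration:

```python
import numpy as np

def weighted_standardized_moment(values, weights, n):
    """n-th weighted standardized moment (no bias correction)."""
    w_avg = np.average(values, weights=weights)
    w_std = np.sqrt(np.average((values - w_avg) ** 2, weights=weights))
    return np.average(((values - w_avg) / w_std) ** n, weights=weights)

a = np.array([[1, 2], [2, 5], [3, 6], [4, 12], [5, 1]])
values, weights = a.T

kurtosis = weighted_standardized_moment(values, weights, 4)
# A normal distribution has kurtosis 3, so excess kurtosis subtracts 3.
excess_kurtosis = kurtosis - 3.0
print(kurtosis, excess_kurtosis)
```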
Note that you are calculating the excess kurtosis, but the generic n-th moment implemented here does not account for that: for n = 4 it returns the plain kurtosis, so subtract 3 if you need the excess value.

Getting weighted average and standard deviation on several columns in Pandas
You could use EOL's NumPy-based code to calculate weighted averages and standard deviations. To use this in a Pandas groupby/apply operation, make weighted_average_std return a DataFrame:
import numpy as np
import pandas as pd
def weighted_average_std(grp):
    """
    Based on http://stackoverflow.com/a/2415343/190597 (EOL)
    """
    # Keep only numeric columns, then split the weights from the values.
    tmp = grp.select_dtypes(include=[np.number])
    weights = tmp['Weight']
    values = tmp.drop('Weight', axis=1)
    average = np.ma.average(values, weights=weights, axis=0)
    variance = np.dot(weights, (values - average) ** 2) / weights.sum()
    std = np.sqrt(variance)
    return pd.DataFrame({'mean': average, 'std': std}, index=values.columns)

np.random.seed(0)
df = pd.DataFrame({
    # pandas >= 2.2 prefers the lowercase alias '6h'
    "Date": pd.date_range(start='2018-01-01', end='2018-01-03 18:00:00', freq='6H'),
    "Weight": np.random.uniform(3, 5, 12),
    "V1": np.random.uniform(10, 15, 12),
    "V2": np.random.uniform(10, 15, 12),
    "V3": np.random.uniform(10, 15, 12)})
df.index = df["Date"]
df_agg = df.groupby(pd.Grouper(freq='1D')).apply(weighted_average_std).unstack(-1)
print(df_agg)
yields

                 mean                             std
                   V1         V2         V3        V1        V2        V3
Date
2018-01-01  12.105253  12.314079  13.566136  1.803014  1.725761  0.679279
2018-01-02  13.223172  12.534893  11.860456  1.709583  0.950338  1.153895
2018-01-03  13.782625  12.013557  12.105231  0.969099  1.189149  1.249064
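The core of weighted_average_std is the np.dot call, which handles every value column in one step. A NumPy-only sketch of that step, using made-up random data rather than the DataFrame above, cross-checked against a column-by-column computation:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.uniform(3, 5, size=4)        # one weight per row
values = rng.uniform(10, 15, size=(4, 3))  # three value columns

# Weighted mean of every column at once.
average = np.average(values, weights=weights, axis=0)

# np.dot sums weights[i] * (values[i, j] - average[j])**2 over the rows i.
variance = np.dot(weights, (values - average) ** 2) / weights.sum()
std = np.sqrt(variance)

# Cross-check: compute the weighted std one column at a time.
expected = [
    np.sqrt(np.average((col - np.average(col, weights=weights)) ** 2,
                       weights=weights))
    for col in values.T
]
print(np.allclose(std, expected))
```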
Python - calculate weighted rolling standard deviation
As far as I understand, the function chained after the rolling method takes an array and returns a number, and it is evaluated once for each window. So, if we have a function that calculates the weighted std, we can wrap it in a lambda to get the rolling weighted std. Here is my take. (I hope I didn't make a mistake with the weighted-std calculation you provided.)
import pandas as pd
import numpy as np
def weighted_std(values, weights):
    # For simplicity, assume len(values) == len(weights)
    # assume all weights > 0
    sum_of_weights = np.sum(weights)
    weighted_average = np.sum(values * weights) / sum_of_weights
    n = len(weights)
    numerator = np.sum(n * weights * (values - weighted_average) ** 2.0)
    denominator = (n - 1) * sum_of_weights
    weighted_std = np.sqrt(numerator / denominator)
    return weighted_std

def rolling_std(s, weights):
    window_size = len(weights)
    return s.rolling(center=False, window=window_size).apply(lambda win: weighted_std(win, weights))
s = pd.Series(np.random.random([10])) # generate random data
w = np.array([1., 3., 5.]) # choose weights
print(s.values)
print(rolling_std(s, w).values)
Example output:
[ 0.08101966  0.57133241  0.29491028  0.25139964  0.26151065  0.45768199
  0.94459935  0.21534497  0.35999294  0.60242746]
[        nan         nan  0.19701963  0.11936639  0.01539041  0.12097725
  0.33346742  0.40784167  0.25884732  0.17709334]
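One way to check the weighted_std formula above is to give it equal weights, in which case the n / (n - 1) factor should reduce it to the ordinary sample standard deviation (ddof=1). The sketch below does this without pandas, using numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later) to build the windows; the data are made up.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def weighted_std(values, weights):
    """Weighted std with the n / (n - 1) correction used above."""
    n = len(weights)
    mean = np.sum(values * weights) / np.sum(weights)
    numerator = np.sum(n * weights * (values - mean) ** 2)
    return np.sqrt(numerator / ((n - 1) * np.sum(weights)))

data = np.array([0.2, 0.5, 0.1, 0.9, 0.4])  # made-up sample data
equal = np.ones(3)

# One row per full window of length 3.
windows = sliding_window_view(data, 3)
rolled = np.array([weighted_std(win, equal) for win in windows])

# With equal weights this is the ordinary sample std of each window.
print(np.allclose(rolled, windows.std(axis=1, ddof=1)))
```

Unlike the pandas version, this returns only the full windows, so there are no leading NaN entries.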
Here lambda win: weighted_std(win, weights) is a function that takes an array as input and returns a number.

Calculating weighted average in Pandas using NumPy function
Are you looking to group the weighted average by id?
df.groupby('id').apply(lambda x: np.average(x['b'],weights=x['a'])).reset_index(name='Weighted Average')
Out[1]:
id Weighted Average
0 2 23.878049
1 3 33.166667
2 5 20.975610
Or if you want to do the weighted average of a / b:
(df.groupby('id').apply(lambda x: np.average(x['a']/x['b'], weights=x['a']))
   .reset_index(name='Weighted Average'))
Out[2]:
id Weighted Average
0 2 1.754146
1 3 1.504274
2 5 1.962528
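For readers who want to see what the groupby/apply computes per group, here is a pandas-free sketch. The arrays ids, a, and b are hypothetical stand-ins for the DataFrame columns, since the original data are not shown.

```python
import numpy as np

# Hypothetical stand-ins for the columns 'id', 'a' (weights), and 'b' (values).
ids = np.array([2, 2, 3, 3, 3])
a = np.array([1.0, 3.0, 2.0, 2.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# For each distinct id, average b over that group's rows, weighted by a.
result = {
    int(g): float(np.average(b[ids == g], weights=a[ids == g]))
    for g in np.unique(ids)
}
print(result)
```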
Calculating weighted mean and standard deviation
R provides weighted mean. In fact, ?weighted.mean shows this example:
## GPA from Siegel 1994
wt <- c(5, 5, 4, 1)/15
x <- c(3.7,3.3,3.5,2.8)
xm <- weighted.mean(x, wt)
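One more step gives the weighted variance, since these weights already sum to 1; take sqrt(v) for the standard deviation:
v <- sum(wt * (x - xm)^2)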
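For comparison with the NumPy answers above, the same GPA example can be sketched in Python. This is a direct translation of the R lines, not something from the original answer:

```python
import numpy as np

wt = np.array([5, 5, 4, 1]) / 15
x = np.array([3.7, 3.3, 3.5, 2.8])

xm = np.average(x, weights=wt)  # counterpart of weighted.mean(x, wt)
v = np.sum(wt * (x - xm) ** 2)  # weights already sum to 1
sd = np.sqrt(v)
print(xm, v, sd)
```

If the weights did not sum to 1, v would need to be divided by their sum, which is what np.average(..., weights=wt) does.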
Numpy std (standard deviation) function weird behavior
The NumPy documentation for std states:

The standard deviation is the square root of the average of the squared deviations from the mean, i.e., std = sqrt(mean(abs(x - x.mean())**2)).

The average squared deviation is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead. In standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Note that, for complex numbers, std takes the absolute value before squaring, so that the result is always real and nonnegative.

For floating-point input, the std is computed using the same precision the input has. Depending on the input data, this can cause the results to be inaccurate, especially for float32 (see example below). Specifying a higher-accuracy accumulator using the dtype keyword can alleviate this issue.

a = np.zeros((2, 512*512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1
np.std(a)
# 0.45000005

but for float64:

a = np.zeros((2, 512*512), dtype=np.float64)
a[0, :] = 1.0
a[1, :] = 0.1
np.std(a)
# 0.45
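The dtype keyword mentioned in the quoted documentation can be sketched as follows. The array stays float32, and only the accumulation is done in float64, so the result should be close to 0.45 up to the rounding of 0.1 in float32:

```python
import numpy as np

a = np.zeros((2, 512 * 512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1

std32 = np.std(a)                    # accumulates in float32
std64 = np.std(a, dtype=np.float64)  # accumulates in float64

print(std32, std64)
```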