Calculating Weighted Mean and Standard Deviation

Calculating weighted mean and standard deviation

Base R provides weighted.mean. In fact, ?weighted.mean shows this example:

 ## GPA from Siegel 1994
wt <- c(5, 5, 4, 1)/15
x <- c(3.7,3.3,3.5,2.8)
xm <- weighted.mean(x, wt)

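One more step gives the weighted variance (the weights already sum to 1, so no further division is needed); its square root is the weighted standard deviation: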

v <- sum(wt * (x - xm)^2)
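For readers working in Python, the same GPA calculation can be sketched with NumPy. This is only an illustrative translation of the R lines above; the variable names (weights, grades, w_mean, w_var, w_std) are mine, not from the original answer.

```python
import numpy as np

# GPA example from the R documentation: credit weights and grades.
weights = np.array([5, 5, 4, 1]) / 15      # normalized, so they sum to 1
grades = np.array([3.7, 3.3, 3.5, 2.8])

# np.average divides by sum(weights), so it also works for unnormalized weights.
w_mean = np.average(grades, weights=weights)

# Biased (population-style) weighted variance and standard deviation.
w_var = np.average((grades - w_mean) ** 2, weights=weights)
w_std = np.sqrt(w_var)

print(w_mean, w_var, w_std)
```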

Calculate Weighted Average and Weighted Standard Deviation in DAX

Weighted Average and Standard deviation can be implemented in DAX according to their mathematical definition.

Assuming we have a table Samples with the columns Weight and Value, the formula for the Weighted Average is

WAvg = 
VAR Num = SUMX( Samples, Samples[Weight] * Samples[Value] )
VAR Den = SUM( Samples[Weight] )
RETURN DIVIDE( Num, Den )

and the formula for the Weighted Standard Deviation is

WStdDev = 
VAR WAvg = [WAvg]
VAR Num = SUMX( Samples, Samples[Weight] * (Samples[Value] - WAvg)^2 )
VAR Den = SUM( Samples[Weight] )
VAR WVar = DIVIDE( Num, Den )
RETURN SQRT(WVar)
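To check the two measures outside Power BI, the same definitions can be sketched in pandas. The DataFrame below is hypothetical sample data, and the function names are illustrative. Returning NaN for a zero denominator is my rough stand-in for DAX DIVIDE returning BLANK, not an exact equivalent.

```python
import numpy as np
import pandas as pd

def weighted_avg(samples: pd.DataFrame) -> float:
    """Weighted average of samples['Value'] using samples['Weight']."""
    den = samples["Weight"].sum()
    if den == 0:
        return float("nan")  # rough analogue of DIVIDE returning BLANK
    return float((samples["Weight"] * samples["Value"]).sum() / den)

def weighted_std(samples: pd.DataFrame) -> float:
    """Biased weighted standard deviation, matching the WStdDev measure."""
    den = samples["Weight"].sum()
    if den == 0:
        return float("nan")
    avg = weighted_avg(samples)
    var = (samples["Weight"] * (samples["Value"] - avg) ** 2).sum() / den
    return float(np.sqrt(var))

# Hypothetical data standing in for the Samples table.
samples = pd.DataFrame({"Weight": [1, 2, 3], "Value": [10.0, 20.0, 30.0]})
print(weighted_avg(samples), weighted_std(samples))
```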

Edit:
If I understand your new request correctly, the weight for each Actual Value is the number of rows that share its Set Value. Since there are two pairs of columns, I assume you need one set of measures per pair.

The formulas therefore need a count of the rows in each Set Value group to use as the weight. I imported the sample table as table "V".

Weighted average for Set Value 1 and Actual Value 1

WAvg1 =
VAR Num =
    SUMX(
        ALL( V ),
        CALCULATE( COUNTROWS( V ), ALLEXCEPT( V, V[Set Value 1] ) ) * V[Actual Value 1]
    )
VAR Den =
    SUMX(
        ALL( V ),
        CALCULATE( COUNTROWS( V ), ALLEXCEPT( V, V[Set Value 1] ) )
    )
RETURN
    DIVIDE( Num, Den )

Weighted average for Set Value 2 and Actual Value 2

WAvg2 =
VAR Num =
    SUMX(
        ALL( V ),
        CALCULATE( COUNTROWS( V ), ALLEXCEPT( V, V[Set Value 2] ) ) * V[Actual Value 2]
    )
VAR Den =
    SUMX(
        ALL( V ),
        CALCULATE( COUNTROWS( V ), ALLEXCEPT( V, V[Set Value 2] ) )
    )
RETURN
    DIVIDE( Num, Den )

Weighted standard deviation for Set Value 1 and Actual Value 1

WStdDev1 =
VAR Num =
    SUMX(
        ALL( V ),
        VAR WAvg = [WAvg1]
        RETURN
            CALCULATE( COUNTROWS( V ), ALLEXCEPT( V, V[Set Value 1] ) ) * ( V[Actual Value 1] - WAvg ) ^ 2
    )
VAR Den =
    SUMX(
        ALL( V ),
        CALCULATE( COUNTROWS( V ), ALLEXCEPT( V, V[Set Value 1] ) )
    )
VAR WVariance =
    DIVIDE( Num, Den )
RETURN
    SQRT( WVariance )

Weighted standard deviation for Set Value 2 and Actual Value 2

WStdDev2 =
VAR Num =
    SUMX(
        ALL( V ),
        VAR WAvg = [WAvg2]
        RETURN
            CALCULATE( COUNTROWS( V ), ALLEXCEPT( V, V[Set Value 2] ) ) * ( V[Actual Value 2] - WAvg ) ^ 2
    )
VAR Den =
    SUMX(
        ALL( V ),
        CALCULATE( COUNTROWS( V ), ALLEXCEPT( V, V[Set Value 2] ) )
    )
VAR WVariance =
    DIVIDE( Num, Den )
RETURN
    SQRT( WVariance )

Applying these formulas to the sample table we get these results

[Image: card visual showing the results]
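The idea behind these measures (each row is weighted by the number of rows sharing its Set Value) can also be sketched in pandas for cross-checking. The small table below is made-up data, not the sample table from the question, and count_weighted_stats is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def count_weighted_stats(table, set_col, actual_col):
    """Weighted mean and std of actual_col, weighting each row by the
    number of rows that share its value in set_col."""
    # Per-row weight: size of the group the row belongs to.
    weights = table[set_col].map(table[set_col].value_counts())
    values = table[actual_col]
    avg = (weights * values).sum() / weights.sum()
    var = (weights * (values - avg) ** 2).sum() / weights.sum()
    return float(avg), float(np.sqrt(var))

# Made-up stand-in for table V (one pair of columns only).
v = pd.DataFrame({
    "Set Value 1": [10, 10, 10, 20],
    "Actual Value 1": [9.0, 10.0, 11.0, 20.0],
})
avg1, std1 = count_weighted_stats(v, "Set Value 1", "Actual Value 1")
print(avg1, std1)
```

The second pair of columns would be handled by calling the same helper with "Set Value 2" and "Actual Value 2".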

How do I calculate the standard deviation between weighted measurements?

I just found this Wikipedia page discussing data of equal significance vs. weighted data. The correct way to calculate the biased weighted estimator of variance is

$$s^2 = \frac{\sum_{i=1}^{N} w_i (x_i - \mu^*)^2}{\sum_{i=1}^{N} w_i},$$

though the following on-the-fly implementation is more efficient computationally, as it does not require calculating the weighted average before looping over the sum of the weighted squared differences:

$$s^2 = \frac{\sum_{i=1}^{N} w_i x_i^2}{\sum_{i=1}^{N} w_i} - (\mu^*)^2.$$

Despite my skepticism, I tried both and got the exact same results.

Note, be sure to use the weighted average

$$\mu^* = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}.$$
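A quick way to convince yourself that the two forms agree is to compute both on random data. The sketch below does that: the two-pass version subtracts the weighted average first, and the one-pass version accumulates the weighted sums of x and x squared in a single loop. The function names and test data are illustrative assumptions.

```python
import numpy as np

def weighted_var_two_pass(x, w):
    """Biased weighted variance: compute the weighted mean, then the spread."""
    mu = np.sum(w * x) / np.sum(w)
    return np.sum(w * (x - mu) ** 2) / np.sum(w)

def weighted_var_one_pass(x, w):
    """Same quantity from running sums accumulated in a single loop."""
    sw = swx = swx2 = 0.0
    for xi, wi in zip(x, w):
        sw += wi
        swx += wi * xi
        swx2 += wi * xi * xi
    mu = swx / sw                 # weighted average
    return swx2 / sw - mu * mu    # weighted mean of squares minus squared mean

rng = np.random.default_rng(0)
x = rng.uniform(10, 15, 1000)
w = rng.uniform(3, 5, 1000)

two_pass = weighted_var_two_pass(x, w)
one_pass = weighted_var_one_pass(x, w)
print(two_pass, one_pass)
```

One caveat: the one-pass form subtracts two nearly equal numbers when the mean is large compared with the spread, so it can lose precision in floating point. The two-pass form is the safer default when the data fit in memory.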

Getting weighted average and standard deviation on several columns in Pandas

You could use EOL's NumPy-based code to calculate weighted averages and standard deviations. To use this in a Pandas groupby/apply operation, make weighted_average_std return a DataFrame:

import numpy as np
import pandas as pd

def weighted_average_std(grp):
    """
    Based on http://stackoverflow.com/a/2415343/190597 (EOL)
    """
    tmp = grp.select_dtypes(include=[np.number])
    weights = tmp['Weight']
    values = tmp.drop('Weight', axis=1)
    average = np.ma.average(values, weights=weights, axis=0)
    variance = np.dot(weights, (values - average) ** 2) / weights.sum()
    std = np.sqrt(variance)
    return pd.DataFrame({'mean':average, 'std':std}, index=values.columns)

np.random.seed(0)
df = pd.DataFrame({
    "Date": pd.date_range(start='2018-01-01', end='2018-01-03 18:00:00', freq='6H'),
    "Weight": np.random.uniform(3, 5, 12),
    "V1": np.random.uniform(10, 15, 12),
    "V2": np.random.uniform(10, 15, 12),
    "V3": np.random.uniform(10, 15, 12)})

df.index = df["Date"]
df_agg = df.groupby(pd.Grouper(freq='1D')).apply(weighted_average_std).unstack(-1)
print(df_agg)

yields

                 mean                             std                    
                   V1         V2         V3        V1        V2        V3
Date                                                                     
2018-01-01  12.105253  12.314079  13.566136  1.803014  1.725761  0.679279
2018-01-02  13.223172  12.534893  11.860456  1.709583  0.950338  1.153895
2018-01-03  13.782625  12.013557  12.105231  0.969099  1.189149  1.249064

How can I calculate weighted standard errors and plot them in a bar plot?

There isn't (as far as I know) a built-in R function to calculate the standard error of a weighted mean, but it is fairly straightforward to calculate - with some assumptions that are probably valid in the case you describe.

See, for instance:
https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Standard_error

Standard error of the weighted mean

If the elements used to calculate the weighted mean are samples from populations that all have the same variance v, then the variance of the weighted sample mean is estimated as:

var_m = v * sum( wnorm^2 )   # v = common variance; wnorm = weights normalized to sum to 1

And the standard error of the weighted mean is equal to the square root of the variance.

sem = sqrt( var_m )
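The relationship can be checked numerically with a small simulation. The sketch below treats v as a variance (not a standard deviation), draws many sets of observations with that common variance, and compares the empirical variance of the weighted mean against v times the sum of the squared normalized weights. The weights, the mean of 10, and the number of repetitions are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

w = np.array([1.0, 2.0, 3.0, 4.0])   # arbitrary example weights
wnorm = w / w.sum()                  # normalized to sum to 1
v = 4.0                              # common variance of every element

# 200,000 repeated "experiments", each with one draw per weight.
draws = rng.normal(loc=10.0, scale=np.sqrt(v), size=(200_000, w.size))
weighted_means = draws @ wnorm

empirical_var = weighted_means.var()
predicted_var = v * np.sum(wnorm ** 2)   # variance of the weighted mean
sem = np.sqrt(predicted_var)             # standard error of the weighted mean

print(empirical_var, predicted_var, sem)
```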

So, we need to calculate the sample variance from the weighted data.

Weighted variance

The weighted population variance (or biased sample variance) is calculated as:

pop_v = sum( w * (x-mean)^2 ) / sum( w )

However, if (as in the case you describe) we are working with samples taken from the population, rather than with the population itself, we need to make an adjustment to obtain an unbiased sample variance.

If the weights represent the frequencies of observations underlying each of the elements used to calculate the weighted mean & variance, then the adjustment is:

v = pop_v * sum( w ) / ( sum( w ) -1 )

However, this is not the case here, as the weights are the total frequencies of 911 calls for each policeman, not the calls for each beat. So in this case the weights correspond to the reliabilities of each element, and the adjustment is:

v = pop_v * sum( w )^2 / ( sum( w )^2 - sum( w^2) ) 

weighted.var and weighted.sem functions

Putting all this together, we can define weighted.var and weighted.sem functions, similar to the base R weighted.mean function (note that several R packages, for instance "Hmisc", already include more-versatile functions to calculate the weighted variance):

weighted.var = function(x, w, type="reliability") {
  m = weighted.mean(x, w)
  # "frequency": weights are observation counts; otherwise: reliability weights
  if (type == "frequency") { return( sum(w*(x-m)^2) / (sum(w)-1) ) }
  else { return( sum(w*(x-m)^2) * sum(w) / (sum(w)^2 - sum(w^2)) ) }
}
weighted.sem = function(x, w, ...) { return( sqrt(weighted.var(x, w, ...) * sum(w^2) / sum(w)^2) ) }
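For comparison, here is a rough NumPy port of the same two functions. It is a sketch rather than a drop-in replacement: the names weighted_var and weighted_sem and the kind argument are my own, and no input validation is done.

```python
import numpy as np

def weighted_var(x, w, kind="reliability"):
    """Unbiased weighted variance with frequency or reliability weights."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    m = np.average(x, weights=w)
    ss = np.sum(w * (x - m) ** 2)          # weighted sum of squared deviations
    if kind == "frequency":
        return ss / (w.sum() - 1)          # weights are observation counts
    return ss * w.sum() / (w.sum() ** 2 - np.sum(w ** 2))

def weighted_sem(x, w, kind="reliability"):
    """Standard error of the weighted mean."""
    w = np.asarray(w, dtype=float)
    return np.sqrt(weighted_var(x, w, kind) * np.sum(w ** 2) / w.sum() ** 2)

# Sanity check with integer counts: frequency weights should match the
# ordinary sample variance of the expanded data.
x = np.array([1.0, 2.0, 4.0])
counts = np.array([2, 1, 3])
print(weighted_var(x, counts, kind="frequency"),
      np.var(np.repeat(x, counts), ddof=1))
```

With equal weights, both variants reduce to the usual sample variance and to the familiar standard error (standard deviation divided by the square root of n).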

Applied to the 911 call data in the question

In the case of the question, the elements from which we want to calculate the weighted mean and weighted sem correspond to the proportions of calls in each beat, for each policeman.

So (finally...):

props = t(apply(df,1,function(row) row[-(1:3)]/row[3]))
wmean_props = apply(props,2,function(col) weighted.mean(col,w=df[,3]))
wsem_props = apply(props,2,function(col) weighted.sem(col,w=df[,3]))

Weighted standard deviation in NumPy

How about the following short "manual calculation"?

import math
import numpy

def weighted_avg_and_std(values, weights):
    """
    Return the weighted average and standard deviation.

    values, weights -- Numpy ndarrays with the same shape.
    """
    average = numpy.average(values, weights=weights)
    # Fast and numerically precise:
    variance = numpy.average((values-average)**2, weights=weights)
    return (average, math.sqrt(variance))
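One way to sanity-check this function is with integer weights: weighting a value by k should give the same result as repeating that value k times and using the ordinary (population) mean and standard deviation. The sketch below restates the function so it runs on its own; the sample values and counts are made up.

```python
import math
import numpy as np

def weighted_avg_and_std(values, weights):
    """Weighted average and (biased) weighted standard deviation."""
    average = np.average(values, weights=weights)
    variance = np.average((values - average) ** 2, weights=weights)
    return average, math.sqrt(variance)

values = np.array([2.0, 4.0, 7.0])
counts = np.array([3, 1, 2])             # made-up integer weights

avg, std = weighted_avg_and_std(values, counts)

# Expand the data so each value appears as many times as its weight.
expanded = np.repeat(values, counts)     # 2, 2, 2, 4, 7, 7
print(avg, expanded.mean())
print(std, expanded.std())               # np.std defaults to ddof=0
```

Note that this is the biased (population) standard deviation; it matches np.std with ddof=0, not ddof=1.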

