How to Efficiently Calculate a Running Standard Deviation

How to efficiently calculate a running standard deviation

The answer is to use Welford's algorithm, which is very clearly defined after the "naive methods" in:

  • Wikipedia: Algorithms for calculating variance

It's more numerically stable than either the two-pass or online simple sum of squares collectors suggested in other responses. The stability only really matters when you have lots of values that are close to each other as they lead to what is known as "catastrophic cancellation" in the floating point literature.

You might also want to brush up on the difference between dividing by the number of samples (N) and N-1 in the variance calculation (squared deviation). Dividing by N-1 leads to an unbiased estimate of variance from the sample, whereas dividing by N on average underestimates variance (because it doesn't take into account the variance between the sample mean and the true mean).

I wrote two blog entries on the topic which go into more details, including how to delete previous values online:

  • Computing Sample Mean and Variance Online in One Pass
  • Deleting Values in Welford’s Algorithm for Online Mean and Variance

You can also take a look at my Java implement; the javadoc, source, and unit tests are all online:

  • Javadoc: stats.OnlineNormalEstimator
  • Source: stats.OnlineNormalEstimator.java
  • JUnit Source: test.unit.stats.OnlineNormalEstimatorTest.java
  • LingPipe Home Page

Quicker Mean and Standard Deviation Calculation for Large Array

Use sliding_window_view:

window_view = np.lib.stride_tricks.sliding_window_view(pixel_array, (5, 5))

window_view_mean = window_view.mean(axis=(2, 3))
window_view_std = window_view.std(axis=(2, 3))

How to efficiently calculate a moving Standard Deviation

The answer is yes, you can. In the mid-80's I developed just such an algorithm (probably not original) in FORTRAN for a process monitoring and control application. Unfortunately, that was over 25 years ago and I do not remember the exact formulas, but the technique was an extension of the one for moving averages, with second order calculations instead of just linear ones.


After looking at your code some, I am think that I can suss out how I did it back then. Notice how your inner loop is making a Sum of Squares?:

            for (int x = i; x > (i - period); x--)
{
total_bollinger += Math.Pow(data.Values[x]["close"] - average, 2);
}

in much the same way that your average must have originally had a Sum of Values? The only two differences are the order (its power 2 instead of 1) and that you are subtracting the average each value before you square it. Now that might look inseparable, but in fact they can be separated:

SUM(i=1; n){ (v[i] - k)^2 }

is

SUM(i=1..n){v[i]^2 -2*v[i]*k + k^2}

which becomes

SUM(i=1..n){v[i]^2 -2*v[i]*k} + k^2*n

which is

SUM(i=1..n){v[i]^2} + SUM(i=1..n){-2*v[i]*k} + k^2*n

which is also

SUM(i=1..n){v[i]^2} + SUM(i=1..n){-2*v[i]}*k + k^2*n

Now the first term is just a Sum of Squares, you handle that in the same way that you do the sum of Values for the average. The last term (k^2*n) is just the average squared times the period. Since you divide the result by the period anyway, you can just add the new average squared without the extra loop.

Finally, in the second term (SUM(-2*v[i]) * k), since SUM(v[i]) = total = k*n you can then change it into this:

-2 * k * k * n

or just -2*k^2*n, which is -2 times the average squared, once the period (n) is divided out again. So the final combined formula is:

SUM(i=1..n){v[i]^2} - n*k^2

or

SUM(i=1..n){values[i]^2} - period*(average^2)

(be sure to check the validity of this, since I am deriving it off the top of my head)

And incorporating into your code should look something like this:

public static void AddBollingerBands(ref SortedList<DateTime, Dictionary<string, double>> data, int period, int factor)
{
double total_average = 0;
double total_squares = 0;

for (int i = 0; i < data.Count(); i++)
{
total_average += data.Values[i]["close"];
total_squares += Math.Pow(data.Values[i]["close"], 2);

if (i >= period - 1)
{
double total_bollinger = 0;
double average = total_average / period;

double stdev = Math.Sqrt((total_squares - Math.Pow(total_average,2)/period) / period);
data.Values[i]["bollinger_average"] = average;
data.Values[i]["bollinger_top"] = average + factor * stdev;
data.Values[i]["bollinger_bottom"] = average - factor * stdev;

total_average -= data.Values[i - period + 1]["close"];
total_squares -= Math.Pow(data.Values[i - period + 1]["close"], 2);
}
}
}

calculating mean and standard deviation of the data which does not fit in memory using python

There is a simple online algorithm that computes both the mean and the variance by looking at each datapoint once and using O(1) memory.

Wikipedia offers the following code:

def online_variance(data):
n = 0
mean = 0
M2 = 0

for x in data:
n = n + 1
delta = x - mean
mean = mean + delta/n
M2 = M2 + delta*(x - mean)

variance = M2/(n - 1)
return variance

This algorithm is also known as Welford's method. Unlike the method suggested in the other answer, it can be shown to have nice numerical properties.

Take the square root of the variance to get the standard deviation.

How to efficiently calculate average and standard deviation in pyspark

Assuming you dataframe df with the following schema and rows

df.printSchema()
df.show()

Sample Image

You can get your required use case in one line

df.groupBy('date').agg(F.mean("duration"),F.stddev("duration")).show()

Sample Image

Calculates a standard deviation columns for timedelta elements

Use GroupBy.std and pd.to_timedelta

total_seconds = \
pd.to_timedelta(
df['visit_time'].dt.total_seconds()
.groupby([df['ID'], df['country_ID']]).std(),
unit='S').unstack().fillna(pd.Timedelta(days=0))
print(total_seconds)
country_ID ENG ESP
ID
0 0 days 00:00:00 3 days 19:55:25.973595304
1 0 days 17:43:29.315934274 0 days 00:00:00
2 0 days 00:00:00 0 days 00:00:00


Related Topics



Leave a reply



Submit