How to Do Standard Deviation in Ruby

How can I do standard deviation in Ruby?

module Enumerable

    def sum
      self.inject(0){|accum, i| accum + i }
    end

    def mean
      self.sum/self.length.to_f
    end

    def sample_variance
      m = self.mean
      sum = self.inject(0){|accum, i| accum +(i-m)**2 }
      sum/(self.length - 1).to_f
    end

    def standard_deviation
      Math.sqrt(self.sample_variance)
    end

end

Testing it:

a = [ 20, 23, 23, 24, 25, 22, 12, 21, 29 ]
a.standard_deviation  
# => 4.594682917363407

01/17/2012:

fixing "sample_variance" thanks to Dave Sag

How do I find the standard deviation in Ruby?

Use

variance = variance / contents.size # or contents.length

How do I find the average and standard deviation in ruby?

I think the problem comes from splitting the data using \r\n which is OS dependant: if you're on Linux, it should be contents.split('\n'). Either way, you probably would be better off using IO#each to iterate over every line in the file and let Ruby deal with line ending characters.

data = File.open("test.txt", "r+")

count = 0
sum = 0
variance = 0

data.each do |line|
  value = line.split(',')[1]
  sum = sum  + value.to_f
  count += 1
end

avg = sum / count
puts "The average of your large data set is: #{ avg.round(3)} (Answer is rounded to nearest thousandth place)"

# We need to get back to the top of the file
data.rewind

data.each do |line|
  value = line.split(',')[1]
  variance = variance + (value.to_f - avg)**2
end

variance = variance / count
variance = Math.sqrt(variance)
puts "The standard deviation of your large data set is: #{ variance.round(3)} (Answer is rounded to nearest thousandth place)"

Square root and Looping on Standard Deviation in Ruby

The error is because you are giving Math::sqrt a negative number as argument.

To calculate the difference of num and average, use its absolute value:

Math::sqrt((num - average).abs)

iterating over an array of objects and adding to an earlier element...is any more efficient than iterating over a small portion?

stddev is sqrt(variance). Population variance is the mean of the sum of squares of the population. You say you want the running stddev over sublists of 20 elements. So you could calculate this faster by starting by calculating the sum of the squares of the first 20 elements, then iterate through the remaining elements, subtracting the square of the n-20th element and adding the square of the new element and calculating sqrt(current_sum_of_squares/20.0) for the stddev. This will result in about a factor of 20 fewer computations as calculating the stddev independently over N-20 20-element sub-lists.

Pushing the stdev onto the n-20th element is trivial as it doesn't involve any major mutation to the big list, just an append to that one element.

I've gotta run to a meeting now or I'd show some code. Perhaps later tonight if this isn't clear.

Algorithm for deviations

It's a very easy thing to calculate, but you will need to tune one parameter. You want to know if any given value is X standard deviations from the mean. To figure this out, calculate the standard deviation (see Wikipedia), then compare each value's deviation abs(mean - value) from the mean to this value. If a value's deviation is say, more than two standard deviations from the mean, flag it.

Edit:

To track deviations by weekday, keep an array of integers, one for each day. Every time you encounter a deviation, increment that day's counter by one. You could also use doubles and instead maintain a percentage of deviations for that day (num_friday_deviations/num_fridays) for example.

How to find cumulative variance or standard deviation in R

You can also check the wiki site for Algorithms for calculating variance and implement Welford's Online algorithm with Rcpp as follows

library(Rcpp)
func <- cppFunction(
  "arma::vec roll_var(arma::vec &X){
    const arma::uword n_max = X.n_elem;
    double xbar = 0, M = 0; 
    arma::vec out(n_max);
    double *x = X.begin(), *o = out.begin();

    for(arma::uword n = 1; n <= n_max; ++n, ++x, ++o){
      double tmp = (*x - xbar);
      xbar += (*x - xbar) / n;
      M += tmp * (*x - xbar);
      if(n > 1L)
        *o = M / (n - 1.);
    }

    if(n_max > 0)
      out[0] = std::numeric_limits<double>::quiet_NaN();

    return out;
  }", depends = "RcppArmadillo")

# it gives the same
x <- c(1, 4, 5, 6, 9)
drop(func(x))
#R [1]  NaN 4.50 4.33 4.67 8.50
sapply(seq_along(x), function(i) var(x[1:i]))
#R [1]   NA 4.50 4.33 4.67 8.50

# it is fast
x <- rnorm(1e+3)
microbenchmark::microbenchmark(
  func = func(x), 
  sapply = sapply(seq_along(x), function(i) var(x[1:i])))
#R Unit: microseconds
#R    expr      min       lq    mean  median      uq   max neval
#R    func     9.09     9.88    30.7    20.5    21.9  1189   100
#R  sapply 43183.49 45040.29 47043.5 46134.4 47309.7 80345   100

Taking the square root gives you the standard deviation.

A major advantage of this method is that there are no issues with cancellation. E.g., compare

# there are no issues with cancellation
set.seed(99858398)
x <- rnorm(1e2, mean = 1e8)
cumvar <- function (x, sd = FALSE) {
  n <- seq_along(x)
  v <- (cumsum(x ^ 2) - cumsum(x) ^ 2 / n) / (n - 1)
  if (sd) v <- sqrt(v)
  v
}
z1 <- drop(func(x))[-1]
z2 <- cumvar(x)[-1]
plot(z1, ylim = range(z1, z2), type = "l", lwd  = 2)
lines(seq_along(z2), z2, col = "DarkBlue")

Sample Image

The blue line is the algorithm where you subtract the squared values from the squared mean.

Finding standard deviation using only mean, min, max?

A standard deviation cannot in general be computed from just the min, max, and mean. This can be demonstrated with two sets of scores that have the same min, and max, and mean but different standard deviations:

1 2 4 5 : min=1 max=5 mean=3 stdev≈1.5811
1 3 3 5 : min=1 max=5 mean=3 stdev≈0.7071

Also, what does an 'overall score' of 90 mean if the maximum is 84?