Standard Deviation in R Seems to Be Returning the Wrong Answer - Am I Doing Something Wrong?

Try this

R> sd(c(2,4,4,4,5,5,7,9)) * sqrt(7/8)
[1] 2
R>

and see the rest of the Wikipedia article for the discussion of how standard deviations are estimated. The formula used 'by hand' (denominator N) gives a biased estimate, which is why R's sd() uses the corrected denominator N − 1; multiplying by sqrt((N − 1)/N) converts R's result back to the uncorrected value. Here is a key quote:

The term standard deviation of the sample is used for the uncorrected estimator (using N) while the term sample standard deviation is used for the corrected estimator (using N − 1). The denominator N − 1 is the number of degrees of freedom in the vector of residuals, (x1 − x̄, ..., xN − x̄).
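To make the difference concrete, here is the same vector computed both ways by hand (a quick sketch; the first value matches what sd() returns, the second matches the 'by hand' result):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# corrected ("sample") standard deviation, denominator N - 1: what sd() returns
sqrt(sum((x - mean(x))^2) / (n - 1))
#[1] 2.13809

# uncorrected standard deviation, denominator N: the 'by hand' formula
sqrt(sum((x - mean(x))^2) / n)
#[1] 2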

aggregate function not working with standard deviation and factors

You are passing the entire data frame as the x argument. Instead, pass only the variables that you want to aggregate. There are two ways you can use the aggregate function.

  1. Using values -
aggregate(x = df$Anzahl,
          by = list(df$Fall, df$DRG, df$DRG2),
          FUN = mean, na.rm = TRUE)

  2. Using formula syntax -
aggregate(Anzahl ~ Fall + DRG + DRG2, df, FUN = mean, na.rm = TRUE)

The same would work with the sd function as well.
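For example, with sd (assuming the same df and column names as in the question):

aggregate(Anzahl ~ Fall + DRG + DRG2, df, FUN = sd, na.rm = TRUE)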


In your attempt, the mean/sd is calculated over all the values at once. Note also that mean and sd behave differently when passed a factor variable.

mean(df$Fall)
#[1] NA

and issues a warning:

Warning message:
In mean.default(df$Fall) : argument is not numeric or logical: returning NA

whereas sd returns an error.

sd(df$Fall)

Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.

Hence, mean 'works' (returning NA with a warning) whereas sd fails with an error.

Getting NA when I run a standard deviation

Try sd(data$var, na.rm = TRUE) and then any NAs in the column var will be ignored. It also pays to check your data to make sure the NAs really should be NAs and are not read-in errors; commands like head(data), tail(data), and str(data) should help with that.
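A minimal sketch with a made-up vector standing in for data$var:

x <- c(1.2, 3.4, NA, 5.6)  # hypothetical data with one missing value
sd(x)                      # the NA propagates
#[1] NA
sd(x, na.rm = TRUE)        # the NA is dropped before computing
#[1] 2.2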

Variance results differ in R language


  • var in R uses the unbiased estimator of the variance (sample variance) which has a denominator of n-1.

  • Your calculation uses the population variance formula, which has a denominator of n.

Check this:

vec <- 1:100

#var uses the sample variance where the denominator is n-1 i.e. 99
var(vec)
#[1] 841.6667
1 / 99 * sum((vec - mean(vec))^2)
#[1] 841.6667

#this is what you use to calculate variance, which uses a denominator of n i.e. 100
mean(vec^2) - mean(vec)^2
#[1] 833.25
1 / 100 * sum((vec - mean(vec))^2)
#[1] 833.25
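The two results differ only by the factor (n - 1)/n, so rescaling one recovers the other:

#rescale the sample variance to get the population variance
n <- length(vec)
var(vec) * (n - 1) / n
#[1] 833.25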

R, bit64, problems calculating row mean and standard deviation in data.table

As a short and first recommendation to most readers: do use 'double' instead of 'integer64' unless you have a specific reason to use 64-bit integers. 'double' is an R internal datatype, while 'integer64' is a package extension datatype, which is represented as a 'double' vector with a class attribute 'integer64', i.e. each element's 64 bits are interpreted as a 64-bit integer by code that knows about this class. Unfortunately many core R functions do not know about 'integer64', which then easily leads to wrong results. Hence coercing to 'double'

# coerce every column to double before calling apply
dtind <- dtin
for (i in seq_along(dtind))
  dtind[[i]] <- as.double(dtind[[i]])
b <- apply(dtind, 1, mean)

will give the somewhat expected result

> b
[1] 9.956667e+02 9.893733e+04 9.378069e+06 1.006857e+09 1.032090e+11 9.424525e+12 1.015007e+15 1.050195e+17

although not exactly what you expected, neither looking at the rounded differences

> b - dt$expected_row_mean
integer64
[1] -1 0 -1 -1 0 -1 -3 -392

nor looking at the unrounded differences

> b - as.double(dt$expected_row_mean)
[1] -0.3333333 0.3333333 -0.3333333 -0.1666666 0.1666718 -0.3339844 -2.8750000 -384.0000000
Warning message:
In as.double.integer64(dt$expected_row_mean) :
integer precision lost while converting to double

OK, let's assume you truly want integer64 because your largest numbers are beyond the integer precision of doubles (2^53). Then your problem starts with the fact that 'apply' does not know about integer64 and actually destroys the 'integer64' class attribute:

> apply(dtin, 1, is.integer64)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

It actually destroys the 'integer64' class attribute twice, once when preparing the inputs and once when postprocessing the outputs. We can fix this by

c <- apply(dtin, 1, function(x){
  oldClass(x) <- "integer64" # fix
  mean(x)                    # note that this dispatches to mean.integer64
})
oldClass(c) <- "integer64"   # fix again

Now the result looks reasonable

> c
integer64
[1] 995 98937 9378068 1006857435 103208970152 9424525034851 1015007051886437 105019453390704600

but still is not what you expected

> c - dt$expected_row_mean
integer64
[1] -1 0 -1 -1 0 -1 -3 -400

The small differences (-1) are due to rounding, since the floating-point mean is

> b[1]
[1] 995.6667

whereas you expect

> dt$expected_row_mean[1]
integer64
[1] 996

while mean.integer64 coerces (truncates) to integer64. This behavior of mean.integer64 is debatable, but at least it is consistent:

x <- seq(0, 1, 0.25)
> data.frame(x=x, y=as.integer64(0) + x)
x y
1 0.00 0
2 0.25 0
3 0.50 0
4 0.75 0
5 1.00 1
> mean(as.integer64(0:1))
integer64
[1] 0

The topic of rounding makes clear that implementing sd.integer64 would be even more debatable. Should it return integer64 or double?
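Until such a method exists, one pragmatic workaround (a sketch only, accepting a double result and the precision caveats discussed above) is to coerce inside the apply call:

# row-wise sd, returning double; assumes dtin as above
s <- apply(dtin, 1, function(x){
  oldClass(x) <- "integer64"  # restore the class attribute stripped by apply
  sd(as.double(x))            # may warn if values exceed double's integer precision
})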

Regarding the bigger differences, it is unclear what the rationale behind your expectation is: taking the seventh row of your table and subtracting its minimum

x <- unlist(dtin[7, ])
oldClass(x) <- "integer64"
y <- min(x)
z <- as.double(x - y)

gives numbers in a range where 'double' precisely handles integers

> log2(z)
[1] 43.73759 -Inf 42.98975 45.47960 46.03745 44.92326

averaging those and comparing against your expectation still gives a difference not explained by rounding

> mean(z) - as.double(dt$expected_row_mean[7] - y)
[1] -2.832031

