Geometric Mean: is there a built-in?

Here is a vectorized, zero- and NA-tolerant function for calculating the geometric mean in R. The verbose mean calculation involving length(x) is necessary for the cases where x contains non-positive values.

gm_mean = function(x, na.rm = TRUE){
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
}

Thanks to @ben-bolker for noting the na.rm pass-through and @Gregor for making sure it works correctly.
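
For example (input values chosen for illustration): the zero and the NA drop out of the log sum, but both still count in the denominator length(x):

x <- c(1, 4, 16, 0, NA)   # hypothetical input with a zero and an NA
gm_mean(x)
# [1] 2.297397   == exp(log(1*4*16) / 5)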

I think some of the comments stem from a false equivalence between NA values in the data and zeros. In the application I had in mind they are the same, but of course this is not generally true. Thus, if you want optional propagation of zeros, and to treat length(x) differently when NAs are removed, the following is a slightly longer alternative to the function above.

gm_mean = function(x, na.rm = TRUE, zero.propagate = FALSE){
  if(any(x < 0, na.rm = TRUE)){
    return(NaN)
  }
  if(zero.propagate){
    if(any(x == 0, na.rm = TRUE)){
      return(0)
    }
    exp(mean(log(x), na.rm = na.rm))
  } else {
    exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
  }
}

Note that it also checks for any negative values, and returns a more informative and appropriate NaN, reflecting that the geometric mean is not defined for negative values (though it is for zeros). Thanks to the commenters who stayed on my case about this.
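
A quick comparison of the two behaviours (toy values again):

x <- c(1, 4, 16, 0)                 # hypothetical input containing a zero
gm_mean(x)                          # [1] 2.828427  == (1*4*16)^(1/4), zero kept in the count
gm_mean(x, zero.propagate = TRUE)   # [1] 0         any zero zeroes the result
gm_mean(c(1, -4, 16))               # [1] NaN       undefined for negative values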

Geometric mean functions returning Inf

The problem is that when you do prod(n) in your function, it calculates the result of this call before raising it to the power of (1/length(n)). Since the mean of x is about 100, you can expect this call to return a value with an order of magnitude similar to 100^1000, which is far larger than the largest finite value R can represent (R turns anything above roughly 10^308 into Inf).

Any mathematical operation you attempt on Inf will also return Inf, so your naive implementation will not work once x is longer than about 154 elements:

100^154
#> [1] 1e+308

100^155
#> [1] Inf

In actuality, because the majority of numbers in your sample are less than 100, you might get to an x of length about 180 before you start generating Inf.

In any case, it would be safer to stick to

gmean <- function(n) exp(sum(log(n))/length(n))
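
To see the difference (simulated data, not the OP's): the naive product overflows, while the log-space version stays finite:

set.seed(1)                  # simulated data, not the OP's
n <- runif(1000, 50, 150)    # 1000 values with mean near 100
prod(n)^(1 / length(n))      # Inf: the product overflows before the root is taken
gmean(n)                     # finite: the correct geometric mean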

Geometric mean in SAS for each column of a data set

* Compute log values of x1, x2, x3;
data tab1;
  set have;
  logx1 = log(x1);
  logx2 = log(x2);
  logx3 = log(x3);
run;

* Compute N and SD of logx1, logx2, logx3;
proc means data=tab1 noprint;
  var logx1-logx3;
  output out=tab2 n=n1-n3 std=std1-std3;
run;

* Compute logcv using the formula;
data tab3;
  set tab2;
  logcv1 = 100 * (exp(std1**2) - 1)**0.5;
  logcv2 = 100 * (exp(std2**2) - 1)**0.5;
  logcv3 = 100 * (exp(std3**2) - 1)**0.5;
  putlog 'NOTE: ' logcv1= logcv2= logcv3=;
run;

The result is shown in the log window:

NOTE: logcv1=18.155613536 logcv2=48.09165987 logcv3=32.538955751

This calculation is not difficult to do in SAS; just work through it step by step and you will get there.
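
If you want to sanity-check the formula outside SAS, here is a hypothetical cross-check in R (synthetic data; the column name x1 and the lognormal parameters are invented for illustration):

set.seed(1)
x1 <- rlnorm(100, meanlog = 4, sdlog = 0.18)   # hypothetical lognormal column
s  <- sd(log(x1))
100 * sqrt(exp(s^2) - 1)                       # geometric CV, same formula as logcv1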

Can I calculate the geometric mean with django annotation?

You can do something quite similar: you can calculate the average of the natural logarithms:

from django.db.models import Avg
from django.db.models.functions import Ln

Foo.objects.annotate(geometric_mean=Avg(Ln('bars__value')))

then the geometric mean is the exponential of this. If you really need the geometric mean itself, you can apply Exp to that result:

from django.db.models import Avg
from django.db.models.functions import Exp, Ln

Foo.objects.annotate(geometric_mean=Exp(Avg(Ln('bars__value'))))

but you do not need this if, for example, you only want to order by two geometric means, since exponentiation leaves the order relation intact: e^a ≤ e^b holds exactly when a ≤ b.

This works because ln(x1×x2×…×xn) is equivalent to ln(x1)+ln(x2)+…+ln(xn), and ln(x^y) is equivalent to y×ln(x).
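
The monotonicity claim is easy to sanity-check with toy numbers (shown in R for brevity; any language works):

a <- c(2, 8, 32)                        # geometric mean 8
b <- c(3, 9, 27)                        # geometric mean 9
mean(log(a)) < mean(log(b))             # TRUE
exp(mean(log(a))) < exp(mean(log(b)))   # TRUE: exp() preserves the ordering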

data.table rolling/windowed geometric mean

Just use your own code with the psych package's geometric.mean function (library() calls added for completeness):

library(data.table)
library(zoo)    # rollapply()
library(psych)  # geometric.mean()

mydata[, (my_name_list) := lapply(.SD, function(x)
         rollapply(x, 5, geometric.mean, fill = NA)),
       .SDcols = my_col_list]
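
If you would rather avoid the psych dependency, exp(mean(log(v))) computes the same window directly (a sketch reusing the question's my_name_list and my_col_list variables):

library(data.table)
library(zoo)

mydata[, (my_name_list) := lapply(.SD, function(x)
         rollapply(x, 5, function(v) exp(mean(log(v))), fill = NA)),
       .SDcols = my_col_list]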

Safe computation of Geometric Mean

In general, in a sequence of floating-point operations that also involves contracting operations such as square root or cube root, it is advantageous from an accuracy perspective to perform the contracting operations last. For example, sqrt(1.0/x) is more accurate than 1.0/sqrt(x), sqrt(a*b) is more accurate than sqrt(a)*sqrt(b), and cbrt(a*b*c) is more accurate than cbrt(a)*cbrt(b)*cbrt(c).

As a consequence, unless intermediate computations are in danger of overflowing or underflowing the chosen floating-point format, such as IEEE-754 binary64 (double in C/C++), method [2] should be chosen. An additional aspect relevant to accuracy: if the n-th root is computed by exponentiation, such as pow() in C/C++, additional error will be introduced with every computed root, as explained in the case of the cube root in my answer to this question. Finally, computing an n-th root is slower than a multiplication, so doing only multiplications with a single root computation at the end is also the superior approach performance-wise.

Very accurate results can be achieved with method [2] by using a compensated product (akin to the compensated addition provided by Kahan summation). See the following paper for details:

Stef Graillat, "Accurate Floating-Point Product and Exponentiation", IEEE Transactions on Computers, Vol. 58, No. 7, July 2009, pp. 994-1000 (online)

This compensated product can be computed particularly efficiently on systems that provide the FMA (fused multiply-add) operation in hardware. This is the case for all common modern processor architectures, both CPUs and GPUs. C/C++ provide convenient access to this via the standard math functions fma() and fmaf().

Update: The asker clarified in a comment that the risk of underflow is imminent, since there are on the order of 10^8 factors in [10^-6, 10^-1]. One possible workaround, mentioned by @Yves Daoust in a comment, is to separate the factors into mantissa and exponent and accumulate them separately. Whether this is practical will depend on the floating-point environment. While C and C++ provide the standard function frexp() for performing this splitting, this function may not be very fast.
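
A rough R transliteration of that idea, with floor(log2(x)) standing in for frexp() (an illustrative sketch only; the answer's context is C/C++):

gm_split <- function(x) {
  e <- floor(log2(x))            # power-of-two exponent of each factor
  m <- x / 2^e                   # mantissa in [1, 2)
  exp_sum <- sum(e)              # exponents accumulate exactly as integers
  prod_m  <- 1
  for (mi in m) {
    prod_m <- prod_m * mi
    if (prod_m > 2^512) {        # renormalize before the running product overflows
      prod_m  <- prod_m / 2^512
      exp_sum <- exp_sum + 512
    }
  }
  n <- length(x)
  2^(exp_sum / n) * prod_m^(1 / n)
}

gm_split(runif(1e6, 1e-6, 1e-1))   # finite, where prod(x)^(1/length(x)) underflows to 0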

Calculating geometric mean every 10 min using dplyr or aggregate function

One option is to use the lubridate::floor_date function to create a group for every 10 minutes round the clock. All data between the 20th and 30th minute will be grouped under the 20th minute, and so on.

library(dplyr)
library(lubridate)

mydata %>%
  mutate(TimeDate = as.POSIXct(TimeDate, format = "%Y-%m-%d %H:%M")) %>%
  group_by(Diff_10 = floor_date(TimeDate, "10 minutes")) %>%
  summarise(Geo.Mean = exp(mean(log(diam))),
            Geo.SD   = exp(sd(log(diam))))

# # A tibble: 1 x 3
# Diff_10 Geo.Mean Geo.SD
# <dttm> <dbl> <dbl>
# 1 2016-05-11 08:20:00 125 1.28

#Result with modified data
# # A tibble: 6 x 3
# Diff_10 Geo.Mean Geo.SD
# <dttm> <dbl> <dbl>
# 1 2016-05-11 08:20:00 118 1.14
# 2 2016-05-11 08:30:00 141 1.69
# 3 2016-05-11 08:40:00 127 1.16
# 4 2016-05-11 08:50:00 150 1.28
# 5 2016-05-11 09:10:00 98.0 1.00
# 6 2016-05-11 09:20:00 115 1.29

cut can be used if the data are to be grouped every 10 minutes from the starting time. With the OP's data, the groups will then be 2016-05-11 08:25, 2016-05-11 08:35, and so on; a sketch follows the sample data below.

The OP's data, modified:

mydata <- read.table(text = 
"TimeDate diam ratio
'2016-05-11 8:25' 134.491 1.83074
'2016-05-11 8:25' 117.777 1.34712
'2016-05-11 8:25' 104.27 0.927635
'2016-05-11 8:35' 204.085 1.43079
'2016-05-11 8:35' 96.8011 0.991716
'2016-05-11 8:42' 119.152 1.09884
'2016-05-11 8:45' 113.871 0.932493
'2016-05-11 8:46' 150.468 0.710525
'2016-05-11 8:56' 116.576 1.11207
'2016-05-11 8:56' 192.257 1.61558
'2016-05-11 8:56' 128.071 0.756608
'2016-05-11 8:59' 177.667 0.73309
'2016-05-11 9:17' 97.7377 0.862858
'2016-05-11 9:17' 98.3195 1.00681
'2016-05-11 9:27' 91.3603 0.95051
'2016-05-11 9:27' 152.95 0.842145
'2016-05-11 9:27' 133.125 1.28365
'2016-05-11 9:27' 95.2516 0.573588",
header = TRUE, stringsAsFactors = FALSE)
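
A minimal sketch of that cut() alternative (same pipeline as above; cut.POSIXt anchors the 10-minute breaks at the first timestamp, with seconds truncated):

library(dplyr)

mydata %>%
  mutate(TimeDate = as.POSIXct(TimeDate, format = "%Y-%m-%d %H:%M")) %>%
  group_by(Diff_10 = cut(TimeDate, breaks = "10 mins")) %>%
  summarise(Geo.Mean = exp(mean(log(diam))),
            Geo.SD   = exp(sd(log(diam))))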

