Geometric Mean: is there a built-in?
Here is a vectorized, zero- and NA-tolerant function for calculating the geometric mean in R. The verbose mean calculation involving length(x) is necessary for the cases where x contains non-positive values.
gm_mean = function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
Thanks to @ben-bolker for noting the na.rm pass-through and @Gregor for making sure it works correctly.
I think some of the comments stem from a false equivalence between NA values in the data and zeros. In the application I had in mind they are the same, but of course this is not generally true. Thus, if you want to include optional propagation of zeros, and treat the length(x) differently in the case of NA removal, the following is a slightly longer alternative to the function above.
gm_mean = function(x, na.rm=TRUE, zero.propagate = FALSE){
  if(any(x < 0, na.rm = TRUE)){
    return(NaN)
  }
  if(zero.propagate){
    if(any(x == 0, na.rm = TRUE)){
      return(0)
    }
    exp(mean(log(x), na.rm = na.rm))
  } else {
    exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
  }
}
Note that it also checks for any negative values, and returns a more informative and appropriate NaN, respecting that the geometric mean is not defined for negative values (but is for zeros). Thanks to commenters who stayed on my case about this.
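For readers outside R, the same branching can be sketched in Python. This is only a rough port under stated assumptions: NaN stands in for NA, and the names gm_mean, na_rm and zero_propagate are carried over for illustration rather than taken from any library.

```python
import math

def gm_mean(xs, na_rm=True, zero_propagate=False):
    """Geometric mean following the branches of the R function above.

    Assumption: float NaN plays the role of R's NA.
    """
    values = list(xs)
    if not values:
        return math.nan
    present = [x for x in values if not math.isnan(x)]
    if any(x < 0 for x in present):
        return math.nan  # undefined for negative input
    if len(present) < len(values) and not na_rm:
        return math.nan  # a missing value propagates when na_rm is False
    if zero_propagate:
        if any(x == 0 for x in present):
            return 0.0
        if not present:
            return math.nan
        return math.exp(sum(math.log(x) for x in present) / len(present))
    # Zeros and removed NaNs add nothing to the log-sum,
    # but they still count in the divisor, as length(x) does in R.
    positives = [x for x in present if x > 0]
    return math.exp(sum(math.log(x) for x in positives) / len(values))
```

With zero_propagate left at False, a zero pulls the result down without forcing it to 0: gm_mean([0.0, 4.0, 4.0]) is the cube root of 16.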
Geometric mean functions returning Inf
The problem is that when you do prod(n) in your function, it calculates the result of this call before raising it to the power of (1/length(n)). Since the mean of x is about 100, you can expect this call to return a value with a similar order of magnitude to 100^1000, which is far larger than the largest number R can represent (anything above about 1.8 × 10^308 becomes Inf).
Raising Inf to the power of (1/length(n)) still gives Inf, so your naive implementation will not work once x has more than about 154 elements of this size:
100^154
#> [1] 1e+308
100^155
#> [1] Inf
In practice, because the majority of the numbers in your sample are less than 100, you might get to a length of about 180 before prod(n) starts returning Inf.
In any case, it would be safer to stick to
gmean <- function(n) exp(sum(log(n))/length(n))
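The same overflow is easy to reproduce in any language that uses IEEE-754 doubles. Here is a small Python sketch of both approaches; the function names are made up for the illustration.

```python
import math

def gmean_naive(xs):
    """Multiply everything first, then take the n-th root (overflows easily)."""
    product = 1.0
    for x in xs:
        product *= x  # silently becomes inf once the product exceeds ~1.8e308
    return product ** (1.0 / len(xs))

def gmean_log(xs):
    """Average the logs, then exponentiate (same idea as gmean in R above)."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

short = [100.0] * 154   # product is about 1e308, still representable
longer = [100.0] * 155  # product would be about 1e310, which overflows

print(gmean_naive(short), gmean_log(short))
print(gmean_naive(longer), gmean_log(longer))
```

The log version never forms the huge intermediate product, so the length of the input stops mattering.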
Geometric mean in SAS for each column of a data set
*Compute log value of x1, x2, x3;
data tab1;
set have;
logx1=log(x1);
logx2=log(x2);
logx3=log(x3);
run;
*Compute SD of logx1, logx2, logx3;
proc means data=tab1 noprint;
var logx1-logx3;
output out=tab2 n=n1-n3 std=std1-std3;
run;
*Compute logcv using formula;
data tab3;
set tab2;
logcv1=100*(exp(std1**2)-1)**0.5;
logcv2=100*(exp(std2**2)-1)**0.5;
logcv3=100*(exp(std3**2)-1)**0.5;
putlog 'NOTE: ' logcv1= logcv2= logcv3=;
run;
The result is shown in the log window:
NOTE: logcv1=18.155613536 logcv2=48.09165987 logcv3=32.538955751
The calculation is not difficult in SAS; just do it step by step and you will get there.
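To cross-check the numbers outside SAS, the three steps above (take logs, take the standard deviation of the logs, apply the formula) can be sketched in Python. The function name geometric_cv is an assumption for this example; statistics.stdev uses the n - 1 divisor, which is also the PROC MEANS default.

```python
import math
import statistics

def geometric_cv(xs):
    """100 * sqrt(exp(sd_log^2) - 1), where sd_log is the sample SD of log(x)."""
    logs = [math.log(x) for x in xs]       # step 1: log values
    sd_log = statistics.stdev(logs)        # step 2: sample SD (n - 1 divisor)
    return 100.0 * math.sqrt(math.exp(sd_log ** 2) - 1.0)  # step 3: formula

# One call per column, e.g. for a column holding 1 and e (logs 0 and 1):
print(geometric_cv([1.0, math.e]))
```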
Can I calculate the geometric mean with django annotation?
You can do something quite similar: you can calculate the average of the natural logarithms:
from django.db.models import Avg
from django.db.models.functions import Ln
Foo.objects.annotate(geometric_mean=Avg(Ln('bars__value')))
The geometric mean is then the exponential of this. If you really need the geometric mean itself, you can apply Exp to that result:
from django.db.models import Avg
from django.db.models.functions import Exp, Ln
Foo.objects.annotate(geometric_mean=Exp(Avg(Ln('bars__value'))))
You do not need this if, for example, you only want to order by two geometric means, since the exponential leaves the order relation intact: e^a ≤ e^b implies a ≤ b and vice versa.
This works because ln(x1×x2×…×xn) is equal to ln(x1)+ln(x2)+…+ln(xn), and ln(x^y) is equal to y×ln(x).
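Both identities are easy to check numerically without a database. The snippet below is plain Python rather than ORM code, and the helper name mean_log is made up for the illustration.

```python
import math

def mean_log(values):
    """Average of natural logs: what Avg(Ln(...)) computes per group."""
    return sum(math.log(v) for v in values) / len(values)

group_a = [2.0, 8.0, 4.0]   # product 64, so the geometric mean is 4
group_b = [1.0, 3.0, 9.0]   # product 27, so the geometric mean is 3

gm_a = math.exp(mean_log(group_a))
gm_b = math.exp(mean_log(group_b))
print(gm_a, gm_b)

# Ordering by the mean log gives the same order as ordering by the geometric mean,
# because exp is strictly increasing.
groups = {"a": group_a, "b": group_b}
by_mean_log = sorted(groups, key=lambda k: mean_log(groups[k]))
by_gmean = sorted(groups, key=lambda k: math.exp(mean_log(groups[k])))
```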
data.table rolling/windowed geometric mean
You can keep your own code and use the geometric.mean function from the psych package (rollapply comes from zoo):
mydata[, (my_name_list) := unlist(lapply(.SD,
                                         function(x) rollapply(x,
                                                               5,
                                                               geometric.mean,
                                                               na.pad = TRUE)),
                                  recursive = FALSE),
       .SDcols = my_col_list]
Safe computation of Geometric Mean
In general, in a sequence of floating-point operations that also involves contracting operations such as square root or cube root, it is advantageous from an accuracy perspective to perform the contracting operations last. For example, sqrt(1.0/x) is more accurate than 1.0/sqrt(x), sqrt(a*b) is more accurate than sqrt(a)*sqrt(b), and cbrt(a*b*c) is more accurate than cbrt(a)*cbrt(b)*cbrt(c).
As a consequence, unless there is a danger of overflowing or underflowing the chosen floating-point format, such as IEEE-754 binary64 (e.g. double in C/C++), in intermediate computation, method [2] should be chosen. An additional aspect relevant to accuracy: if the n-th root is computed by exponentiation, such as with pow() in C/C++, additional error is introduced with every computed root, as explained for the case of the cube root in my answer to this question. Finally, computing an n-th root is slower than a multiplication, so doing only multiplies with a single root computation at the end is also the better approach for performance.
Very accurate results can be achieved with method [2] by using a compensated product (akin to the compensated addition provided by Kahan summation). See the following paper for details:
Stef Graillat, "Accurate Floating-Point Product and Exponentiation", IEEE Transactions on Computers, Vol. 58, No. 7, July 2009, pp. 994-1000
This compensated product can be computed particularly efficiently on systems that provide the FMA (fused multiply-add) operation in hardware. This is the case for all the common modern processor architectures, both CPUs and GPUs. C/C++ provide convenient access to this via the standard math functions fma() and fmaf().
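As a rough illustration of the idea, here is a compensated product sketched in Python. Note the substitution: instead of FMA, the error-free product below uses Veltkamp/Dekker splitting, which works without hardware support but is slower. It also assumes no overflow or underflow in intermediate results. The function names are made up for this sketch, and Fraction is used only to measure the error exactly.

```python
from fractions import Fraction

_SPLITTER = 134217729.0  # 2**27 + 1, the splitting constant for binary64

def _split(a):
    """Split a double into high and low parts with a = hi + lo exactly."""
    c = _SPLITTER * a
    hi = c - (c - a)
    return hi, a - hi

def two_product(a, b):
    """Return (p, err) with p = fl(a*b) and a*b = p + err exactly (no FMA)."""
    p = a * b
    a_hi, a_lo = _split(a)
    b_hi, b_lo = _split(b)
    err = a_lo * b_lo - (((p - a_hi * b_hi) - a_lo * b_hi) - a_hi * b_lo)
    return p, err

def compensated_product(xs):
    """Product that carries a running correction term alongside the result."""
    p, e = 1.0, 0.0
    for x in xs:
        p, pi = two_product(p, x)
        e = e * x + pi  # accumulate the rounding errors, scaled by later factors
    return p + e

def relative_error(approx, xs):
    """Exact relative error of approx against the true product of xs."""
    exact = Fraction(1)
    for x in xs:
        exact *= Fraction(x)
    return float(abs(Fraction(approx) - exact) / exact)

factors = [1.1, 1.2, 1.3, 0.7, 0.9, 1.5, 0.3, 2.7]
print(relative_error(compensated_product(factors), factors))
```

The geometric mean is then the n-th root of this product, taken once at the end.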
Update: Asker clarified in comment that the risk of underflow is imminent since there are on the order of 10^8 factors in [10^-6, 10^-1]. One possible workaround mentioned by @Yves Daoust in a comment is to separate the factors into mantissa and exponent and accumulate them separately. Whether this is practical will depend on the floating-point environment. While C and C++ provide the standard function frexp() for performing this splitting, this function may not be very fast.
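A minimal sketch of that workaround, written in Python for brevity (math.frexp mirrors the C function). It assumes all factors are strictly positive; gmean_frexp is a hypothetical name.

```python
import math

def gmean_frexp(xs):
    """Geometric mean that keeps mantissa and binary exponent separately."""
    mant, expo, n = 1.0, 0, 0
    for x in xs:
        m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= m < 1
        mant *= m
        expo += e
        mant, shift = math.frexp(mant)  # renormalise so mant stays in [0.5, 1)
        expo += shift
        n += 1
    # The full product equals mant * 2**expo, so take the root in the log2 domain.
    return 2.0 ** ((math.log2(mant) + expo) / n)

tiny = [1e-3] * 1000
print(math.prod(tiny))    # the plain product underflows
print(gmean_frexp(tiny))  # the split accumulation does not
```

The exponent is an ordinary integer here, so it cannot underflow, and the mantissa never leaves [0.5, 1).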
Calculating geometric mean every 10 min using dplyr or aggregate function
One option is to use the lubridate::floor_date function to create a group for every 10 minutes round the clock. All data between minute 20 and minute 30 will be grouped under minute 20, and so on.
library(dplyr)
library(lubridate)
mydata %>% mutate(TimeDate = as.POSIXct(TimeDate, format = "%Y-%m-%d %H:%M")) %>%
  group_by(Diff_10 = floor_date(TimeDate, "10minute")) %>%
  summarise(Geo.Mean=exp(mean(log(diam))),
            Geo.SD=exp(sd(log(diam))))
# # A tibble: 1 x 3
# Diff_10 Geo.Mean Geo.SD
# <dttm> <dbl> <dbl>
# 1 2016-05-11 08:20:00 125 1.28
#Result with modified data
# # A tibble: 6 x 3
# Diff_10 Geo.Mean Geo.SD
# <dttm> <dbl> <dbl>
# 1 2016-05-11 08:20:00 118 1.14
# 2 2016-05-11 08:30:00 141 1.69
# 3 2016-05-11 08:40:00 127 1.16
# 4 2016-05-11 08:50:00 150 1.28
# 5 2016-05-11 09:10:00 98.0 1.00
# 6 2016-05-11 09:20:00 115 1.29
cut can be used instead if the data should be grouped into 10-minute intervals counted from the starting time. For the OP's data, the groups will then be 2016-05-11 08:25, 2016-05-11 08:35, and so on.
Modified OP's data:
mydata <- read.table(text =
"TimeDate diam ratio
'2016-05-11 8:25' 134.491 1.83074
'2016-05-11 8:25' 117.777 1.34712
'2016-05-11 8:25' 104.27 0.927635
'2016-05-11 8:35' 204.085 1.43079
'2016-05-11 8:35' 96.8011 0.991716
'2016-05-11 8:42' 119.152 1.09884
'2016-05-11 8:45' 113.871 0.932493
'2016-05-11 8:46' 150.468 0.710525
'2016-05-11 8:56' 116.576 1.11207
'2016-05-11 8:56' 192.257 1.61558
'2016-05-11 8:56' 128.071 0.756608
'2016-05-11 8:59' 177.667 0.73309
'2016-05-11 9:17' 97.7377 0.862858
'2016-05-11 9:17' 98.3195 1.00681
'2016-05-11 9:27' 91.3603 0.95051
'2016-05-11 9:27' 152.95 0.842145
'2016-05-11 9:27' 133.125 1.28365
'2016-05-11 9:27' 95.2516 0.573588",
header = TRUE, stringsAsFactors = FALSE)
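The difference between the two grouping rules (clock-aligned bins versus bins counted from the first timestamp) can be sketched with Python's standard library. This is only an illustration on the first five rows of the modified data; the helper names are made up.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta

def floor_to_10min(ts):
    """Clock-aligned bin, like floor_date(ts, "10minute")."""
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)

def bin_from_start(ts, start, width=timedelta(minutes=10)):
    """Bin counted from a starting time, like cutting at 10-minute breaks."""
    return start + ((ts - start) // width) * width

def gmean_by_bin(rows, key):
    """Group (timestamp, value) rows by key(timestamp) and take geometric means."""
    groups = defaultdict(list)
    for ts, value in rows:
        groups[key(ts)].append(value)
    return {k: math.exp(sum(math.log(v) for v in vs) / len(vs))
            for k, vs in sorted(groups.items())}

rows = [
    (datetime(2016, 5, 11, 8, 25), 134.491),
    (datetime(2016, 5, 11, 8, 25), 117.777),
    (datetime(2016, 5, 11, 8, 25), 104.27),
    (datetime(2016, 5, 11, 8, 35), 204.085),
    (datetime(2016, 5, 11, 8, 35), 96.8011),
]

clock_bins = gmean_by_bin(rows, floor_to_10min)
start = rows[0][0]
start_bins = gmean_by_bin(rows, lambda ts: bin_from_start(ts, start))
```

With clock-aligned bins the keys are 08:20 and 08:30; with bins counted from the first timestamp they are 08:25 and 08:35.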