How to Repeat the Grubbs Test and Flag the Outliers

How to repeat the Grubbs test and flag the outliers

Looks like you would need a short function to do what you want:

library(outliers)
library(ggplot2)

X <- c(152.36,130.38,101.54,96.26,88.03,85.66,83.62,76.53,
       74.36,73.87,73.36,73.35,68.26,65.25,63.68,63.05,57.53)

grubbs.flag <- function(x) {
  outliers <- NULL
  test <- x
  grubbs.result <- grubbs.test(test)
  pv <- grubbs.result$p.value
  while(pv < 0.05) {
    outliers <- c(outliers,as.numeric(strsplit(grubbs.result$alternative," ")[[1]][3]))
    test <- x[!x %in% outliers]
    grubbs.result <- grubbs.test(test)
    pv <- grubbs.result$p.value
  }
  return(data.frame(X=x,Outlier=(x %in% outliers)))
}

Here's the output:

grubbs.flag(X)
         X Outlier
1   152.36    TRUE
2   130.38    TRUE
3   101.54   FALSE
4    96.26   FALSE
5    88.03   FALSE
6    85.66   FALSE
7    83.62   FALSE
8    76.53   FALSE
9    74.36   FALSE
10   73.87   FALSE
11   73.36   FALSE
12   73.35   FALSE
13   68.26   FALSE
14   65.25   FALSE
15   63.68   FALSE
16   63.05   FALSE
17   57.53   FALSE

And if you want a histogram with different colors, you can use the following:

ggplot(grubbs.flag(X),aes(x=X,color=Outlier,fill=Outlier))+
  geom_histogram(binwidth=diff(range(X))/30)+
  theme_bw()

Outlier Histogram

Several Grubbs tests simultaneously in R

You can use lapply:

library(outliers)

df = data.frame(a=runif(20),b=runif(20),c=runif(20))
tests = lapply(df,grubbs.test) 
# or with parameters:
tests = lapply(df,grubbs.test,opposite=T)

Results:

> tests
$a

    Grubbs test for one outlier

data:  X[[i]]
G = 1.80680, U = 0.81914, p-value = 0.6158
alternative hypothesis: highest value 0.963759744539857 is an outlier

$b

    Grubbs test for one outlier

data:  X[[i]]
G = 1.53140, U = 0.87008, p-value = 1
alternative hypothesis: highest value 0.975481075001881 is an outlier

$c

    Grubbs test for one outlier

data:  X[[i]]
G = 1.57910, U = 0.86186, p-value = 1
alternative hypothesis: lowest value 0.0136249314527959 is an outlier

You can access the results as follows:

> tests$a$statistic
        G         U 
1.8067906 0.8191417

Hope this helps.

R: bugs in flagging outliers (how R recognizes the length of a infinite decimal)

There are a few ways to deal with the problem. You can use all.equal or just test to see if numbers are nearly the same.

grubbs.flag <- function(x, tol=1e-9) {
    check <- function(a, b) any(abs(a - b) < tol)                    # check for nearly equal
    outliers <- NULL
    test <- x
    grubbs.result <- grubbs.test(test)
    pv <- grubbs.result$p.value
    while(pv < 0.05) {
        outliers <- c(outliers,as.numeric(strsplit(grubbs.result$alternative," ")[[1]][3]))
        inds <- sapply(test, check, outliers)                        # replace the %in% test
        test <- test[!inds]
        grubbs.result <- grubbs.test(test)
        pv <- grubbs.result$p.value
    }
    return(data.frame(X=x,Outlier=sapply(x, check, outliers)))       # replace %in% test
}

a=c(-1e6, 1,5,7,9,110, 1000)
b=3
c=a/b
grubbs.flag(c)

#              X Outlier
# 1 -3.333333e+05  TRUE
# 2 3.333333e-01   FALSE
# 3 1.666667e+00   FALSE
# 4 2.333333e+00   FALSE
# 5 3.000000e+00   FALSE
# 6 3.666667e+01    TRUE
# 7 3.333333e+02    TRUE

Several Grubbs tests simultaneously in R

You can use lapply:

library(outliers)

df = data.frame(a=runif(20),b=runif(20),c=runif(20))
tests = lapply(df,grubbs.test) 
# or with parameters:
tests = lapply(df,grubbs.test,opposite=T)

Results:

> tests
$a

    Grubbs test for one outlier

data:  X[[i]]
G = 1.80680, U = 0.81914, p-value = 0.6158
alternative hypothesis: highest value 0.963759744539857 is an outlier

$b

    Grubbs test for one outlier

data:  X[[i]]
G = 1.53140, U = 0.87008, p-value = 1
alternative hypothesis: highest value 0.975481075001881 is an outlier

$c

    Grubbs test for one outlier

data:  X[[i]]
G = 1.57910, U = 0.86186, p-value = 1
alternative hypothesis: lowest value 0.0136249314527959 is an outlier

You can access the results as follows:

> tests$a$statistic
        G         U 
1.8067906 0.8191417

Hope this helps.

Replacing values in df using index

You should add this before the while loop:

na.vect <- test

Because if it breaks beforehand, your na.vect won't exist and will thus throw an error. And then just run it on your dataframe like this:

apply(rock,2,grubbs.flag)

The second argument 2 tells to apply it to the columns of the dataframe. Use 1 for rows.

How to Repeat the Grubbs Test and Flag the Outliers