Calculating the Outliers in R

How to calculate outliers by columns in R?

To get position index of outliers (per column):

pos <- lapply(df, FindOutliers)

To get number of outliers (per column):

lengths(pos)
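FindOutliers is not defined in this excerpt. A minimal sketch of what such a helper could look like, assuming the common 1.5 * IQR rule and a return value of row positions (both the rule and the toy data frame below are assumptions made for illustration, not the original definition):

```r
# Hypothetical helper: positions of values outside the 1.5 * IQR fences
FindOutliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  which(x < q[[1]] - 1.5 * iqr | x > q[[2]] + 1.5 * iqr)
}

# Made-up data: column a has one extreme value, column b has none
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100), b = 1:10)

pos <- lapply(toy, FindOutliers)  # positions of outliers, per column
counts <- lengths(pos)            # number of outliers, per column
print(pos)
print(counts)
```

Any function that takes a numeric vector and returns positions can be dropped in here; the lapply and lengths calls stay the same.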

It is not a good idea to work with a small sample size. For instance, in your example df with a sample size of 6, only 851 is detected as an outlier in the last column, while 158 is not picked out.

Identifying the outliers in a data set in R

You can get this using boxplot. If your variable is x,

OutVals = boxplot(x)$out
which(x %in% OutVals)

If you are annoyed by the plot, you could use

OutVals = boxplot(x, plot=FALSE)$out
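As a quick illustration of the two calls above, here is a small sketch on a made-up vector (the values in x are an assumption chosen so that one point lies well beyond the whiskers):

```r
# Made-up data with one obvious extreme value
x <- c(1, 2, 3, 4, 5, 100)

# Values beyond the whiskers, without drawing the plot
OutVals <- boxplot(x, plot = FALSE)$out

# Positions of those values in x
OutPos <- which(x %in% OutVals)

print(OutVals)
print(OutPos)
```

Note that which(x %in% OutVals) returns every position whose value matches an outlying value, so duplicated values are all reported.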

Calculate outliers of a group of specific columns then identify ids which have 5 columns with outliers

You could create a function that returns the ids of the outliers in one column:

find_outlier <- function(df, x) {
  # ids with upper outliers (at least 3 SD above the mean)
  uthr = mean(x)+3*sd(x)
  rm_u_ids = df$id[which(x >= uthr)]
  # ids with lower outliers (at least 3 SD below the mean)
  lthr = mean(x)-3*sd(x)
  rm_l_ids = df$id[which(x <= lthr)]
  # combine the upper and lower outlier ids
  unique(sort(c(rm_u_ids, rm_l_ids)))
}

Apply it to every column named in colors, count how often each id appears with table, and remove the ids that occur 5 or more times.

all_ids <- lapply(df[colors], find_outlier, df = df)

temp_tab <- table(unlist(all_ids))
remove_ids <- names(temp_tab[temp_tab >= 5])
subset(df, !id %in% remove_ids)
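To see the whole pipeline run end to end, here is a self-contained sketch on made-up data. The data frame toy, the column names c1 to c6, and the compact restatement of find_outlier are all assumptions for illustration. It uses 20 rows because a single value can never sit 3 SD from the mean in a very small sample.

```r
# Compact restatement of the same 3 SD rule (hypothetical, for the demo)
find_outlier <- function(df, x) {
  upper <- mean(x) + 3 * sd(x)
  lower <- mean(x) - 3 * sd(x)
  sort(unique(df$id[x >= upper | x <= lower]))
}

# Made-up data: id 7 is extreme in five columns, id 12 in only one
base  <- rep(c(9, 10, 11), length.out = 20)
spike <- function(pos) replace(base, pos, 1000)
toy <- data.frame(id = 1:20,
                  c1 = spike(7), c2 = spike(7), c3 = spike(7),
                  c4 = spike(7), c5 = spike(7), c6 = spike(12))
colors <- paste0("c", 1:6)  # names of the columns to check

all_ids    <- lapply(toy[colors], find_outlier, df = toy)
temp_tab   <- table(unlist(all_ids))          # outlier count per id
remove_ids <- names(temp_tab[temp_tab >= 5])  # ids flagged in 5+ columns
kept       <- subset(toy, !id %in% remove_ids)

print(temp_tab)
print(remove_ids)
```

Here id 7 is dropped because it is flagged in five columns, while id 12 stays because it is flagged in only one.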

Identifying outliers in R data by factor levels and columns

To start with the data you provided...

df = data.frame(
  species = c("a","b","a","b","a","b","a","b","a","b"),
  uniqueID = c("x01","x02","x03","x04","x05","x06","x07","x08","x09","x10"),
  metric1 = c(1,2,3,1,2,3,1,2,3,11),
  metric2 = c(4,5,6,4,5,6,55,4,5,6),
  metric3 = c(0.7,7,8,9,7,8,9,77,8,9)
)

I'm going to use the tidyverse liberally here...

library(tidyverse)

Then to triple the rows so the standard deviation calculation doesn't die on us, and to add another outlier row...

df2 <- df %>% 
  bind_rows(df) %>% 
  bind_rows(df) %>% 
  add_row(
    species = "a",
    uniqueID = "x01",
    metric1 = 1,
    metric2 = 4,
    metric3 = 1e12
  )

What if you tried something like this?

df2 %>% 
  gather(key = "metric", value = "value", -species, -uniqueID) %>% 
  group_by(species, uniqueID, metric) %>% 
  arrange(species, uniqueID, metric) %>% # just to make the results easy to scan
  mutate(
    mean_obs = mapply(function(x) mean(value[-x]), 1:n()),
    stdev    = mapply(function(x)   sd(value[-x]), 1:n()),
    minimum  = mean_obs - stdev * 2,
    maximum  = mean_obs + stdev * 2,
    outlier  = value < minimum | value > maximum
  ) %>% 
  filter(outlier) %>% 
  glimpse()

It borrows a leave-one-out approach from another answer: the mean and standard deviation are computed excluding the current record, and a row is marked as an outlier if its value is more than 2 SD from that mean.

Excluding the current record can give odd results when that record is not an outlier but removing it still changes the mean and standard deviation appreciably. When the record is an outlier, though, excluding it is exactly what you want. :)
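The leave-one-out idea can be sketched in base R on a single group. The vector below holds the metric3 values that df2 contains for species "a", uniqueID "x01" (three copies of 0.7 plus the added 1e12 row); the variable names are assumptions for illustration.

```r
# metric3 values for species "a", uniqueID "x01" in df2
value <- c(0.7, 0.7, 0.7, 1e12)

# For each record, compute mean and SD of all the *other* records
idx      <- seq_along(value)
mean_obs <- vapply(idx, function(i) mean(value[-i]), numeric(1))
stdev    <- vapply(idx, function(i) sd(value[-i]), numeric(1))

# Flag the record if it lies more than 2 SD from the leave-one-out mean
outlier <- value < mean_obs - 2 * stdev | value > mean_obs + 2 * stdev

print(data.frame(value, mean_obs, stdev, outlier))
```

Only the 1e12 record is flagged. For each 0.7 record, the remaining values still include 1e12, which inflates the leave-one-out SD so much that 0.7 stays well inside the 2 SD band.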

How to identify and remove outliers in a data.frame using R?

The identify_outliers function (from the rstatix package) expects a data.frame as input, i.e. the usage is

identify_outliers(data, ..., variable = NULL)

where

... - One unquoted expression (or variable name). Used to select a variable of interest. Alternative to the argument variable.

So, to drop the rows whose score is flagged as an outlier:

df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)

Finding outliers for all specific subsets in R?

You're already using dplyr, so I suggest you use group_by, as it (to me) is a more natural way of dealing with the data.

Also, this part is incorrect syntax:

    data %>%
      filter(measure == d, condition ==c) %>%
      o <- outlier(data$value) %>%
      print(o)

Why?

The filter(...) %>% should be piping to something that accepts a data frame, but instead you're sending the output of filter into an assignment, o <- outlier(...), and then on to print(o), which really means print(., o), where . is the output of the previous command.
Further, since o is not yet defined the first time this runs, you should get an error about object 'o' not found. You won't get it on subsequent passes through the loop, since o then exists, but by then it holds the outliers from the previous iteration. Certainly not what you should be using.

A direct correction of that code might be:

for (...) {
  for (...) {
    o <- data %>%
      filter(measure == d, condition ==c) %>%
      do({ data.frame(outliers = outlier(.$value)) })
    print(o)
  }
}

As written, with filter and no group_by, o will be a data.frame with a single column, outliers, for the current measure/condition pair. With grouped data, do would also keep the grouping columns (measure and condition). do is used here because most non-tidyverse functions ignore group_by groupings, so do side-steps that problem.

Perhaps this, though, to replace both loops with a single command:

data %>%
  group_by(measure, condition) %>%
  summarize(outliers = outlier(value)) %>%
  ungroup()

I'm assuming that what you want is all outlier values for each unique combination of measure and condition, and that the outlier(.) function returns a vector (of some length >= 1). If no outliers are found, the measure/condition pair will not be included in the result. If that matters to you, use something like

data %>%
  group_by(measure, condition) %>%
  summarize(outliers = list(outlier(value))) %>%
  tidyr::unnest(outliers, keep_empty = TRUE) %>%
  ungroup()
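For comparison, the same "one result per measure/condition pair, empty groups kept" idea can be sketched in base R without dplyr. The data below are made up, and boxplot.stats(v)$out is used as a stand-in for outlier(), which is not defined in this excerpt; both are assumptions for illustration.

```r
# Made-up data: 2 measures x 2 conditions, 6 values per group
data <- data.frame(
  measure   = rep(c("m1", "m1", "m2", "m2"), each = 6),
  condition = rep(c("a", "b", "a", "b"), each = 6),
  value     = c(1, 2, 3, 4, 5, 100,     # m1/a: 100 is extreme
                1, 2, 3, 4, 5, 6,       # m1/b: nothing extreme
                10, 11, 12, 13, 14, 15, # m2/a: nothing extreme
                -50, 10, 11, 12, 13, 14) # m2/b: -50 is extreme
)

# Split values by measure/condition, then find outliers in each group
groups <- split(data$value, list(data$measure, data$condition))
out    <- lapply(groups, function(v) boxplot.stats(v)$out)

print(out)
```

Groups without outliers stay in the list as zero-length vectors, which is the base R counterpart of keep_empty = TRUE.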

Find outliers by Standard Deviation from mean, replace with NA in large dataset (6000+ columns)

You can use the ifelse function. Here is an example using dplyr that applies ifelse over all columns whose names contain the term HUMAN, replacing values more than 2 standard deviations from the column mean with NA:

library(dplyr)
data %>% mutate_at(.vars = vars(contains("HUMAN")), 
                   .funs = ~ifelse(abs(. - mean(.)) > 2*sd(.), NA, .))
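If dplyr is not available, the same replacement can be sketched in base R. The rule used here flags values more than 2 SD from the column mean in either direction; the data frame and its column names are made up for illustration.

```r
# Made-up data: two HUMAN columns with one extreme value each,
# plus a column that should be left untouched
data <- data.frame(
  HUMAN_a = c(10, 11, 9, 10, 11, 9, 10, 11, 9, 200),
  HUMAN_b = c(-200, 10, 11, 9, 10, 11, 9, 10, 11, 9),
  other   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 500)
)

# Replace values more than 2 SD from the column mean with NA
na_outliers <- function(x) {
  replace(x, abs(x - mean(x)) > 2 * sd(x), NA)
}

# Apply only to columns whose names contain "HUMAN"
cols <- grep("HUMAN", names(data))
data[cols] <- lapply(data[cols], na_outliers)

print(data)
```

Both the high value in HUMAN_a and the low value in HUMAN_b become NA, while the other column keeps its 500 because its name does not match. If the columns already contain NA, mean and sd would need na.rm = TRUE.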