Identifying the Outliers in a Data Set in R

You can find them with boxplot(), which returns the outlying values in its $out component. If your variable is x,

OutVals = boxplot(x)$out
which(x %in% OutVals)

If you don't want the plot drawn, you can use

OutVals = boxplot(x, plot=FALSE)$out
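
To make that concrete, here is a small sketch with a made-up vector (the values in x are illustrative only). It also uses boxplot.stats(), which is the function boxplot() relies on for the computation, so it should return the same outlying values.

```r
# Made-up data: five ordinary values and one obvious outlier
x <- c(2, 3, 4, 5, 6, 100)

# Outlying values, computed without drawing a plot
OutVals <- boxplot(x, plot = FALSE)$out

# Positions of those values in x
OutIdx <- which(x %in% OutVals)

# boxplot.stats() does the underlying computation directly
OutVals2 <- boxplot.stats(x)$out
```

Here the default rule (points more than 1.5 times the box height beyond the hinges) picks out only the value 100, at position 6.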

How can I identify inconsistencies and outliers in a dataset in R

For the specific examples you've given:

An ID must be four digits followed by four letters (separated by a hyphen, as the pattern below assumes):

!grepl("^[0-9]{4}-[[:alpha:]]{4}$", ID)

will be TRUE for inconsistent values (^ and $ mean beginning- and end-of-string respectively; {4} means "previous pattern repeats exactly four times"; [0-9] means "any symbol between 0 and 9" (i.e. any numeral); [[:alpha:]] means "any alphabetic character"). If you only want uppercase letters you could use [A-Z] instead (assuming you are not working in a locale such as Estonian, where the letters do not sort in the usual A-to-Z order).
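
A quick way to convince yourself the pattern does what you want is to run it on a few hand-made IDs. The vector below is invented for illustration; only the first entry matches the "four digits, hyphen, four letters" format.

```r
# Hypothetical IDs: one valid, three malformed in different ways
ID <- c("1234-abcd",   # valid
        "12345-abcd",  # five digits
        "1234-ab1d",   # digit among the letters
        "1234abcd")    # missing hyphen

# TRUE marks an inconsistent ID
bad_id <- !grepl("^[0-9]{4}-[[:alpha:]]{4}$", ID)

# The offending values themselves
ID[bad_id]
```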

If you need a numeric value to be 0 or 1, then !num_val %in% c(0,1) will work (this works for any set of allowed values; you can use it for a specific set of allowed character values as well).
If you need a numeric value to be strictly between a and b, then !(a < num_val & num_val < b) ...
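
Both checks can be tried on toy vectors. The values and the bounds a and b below are made up for the example. One thing worth noticing is how missing values behave: %in% treats NA as "not in the set", so it gets flagged, while the range comparison returns NA for it.

```r
# Allowed-set check: anything other than 0 or 1 is flagged
num_val <- c(0, 1, 2, NA)
bad_set <- !num_val %in% c(0, 1)

# Range check: flag values not strictly between a and b
a <- 0
b <- 10
rng_val <- c(-1, 5, 10, NA)
bad_range <- !(a < rng_val & rng_val < b)
```

If the endpoints should count as valid, use <= instead of <, and decide explicitly what to do with NA (for example with is.na()).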

Identifying outliers in R data by factor levels and columns

To start with the data you provided...

df = data.frame(
  species = c("a","b","a","b","a","b","a","b","a","b"),
  uniqueID = c("x01","x02","x03","x04","x05","x06","x07","x08","x09","x10"),
  metric1 = c(1,2,3,1,2,3,1,2,3,11),
  metric2 = c(4,5,6,4,5,6,55,4,5,6),
  metric3 = c(0.7,7,8,9,7,8,9,77,8,9)
)

I'm going to use the tidyverse liberally here...

library(tidyverse)

Then to triple the rows so the standard deviation calculation doesn't die on us, and to add another outlier row...

df2 <- df %>% 
  bind_rows(df) %>% 
  bind_rows(df) %>% 
  add_row(
    species = "a",
    uniqueID = "x01",
    metric1 = 1,
    metric2 = 4,
    metric3 = 1e12
  )

What if you tried something like this?

df2 %>% 
  gather(key = "metric", value = "value", -species, -uniqueID) %>% 
  group_by(species, uniqueID, metric) %>% 
  arrange(species, uniqueID, metric) %>% # just to make the results easy to scan
  mutate(
    mean_obs = mapply(function(x) mean(value[-x]), 1:n()),
    stdev    = mapply(function(x)   sd(value[-x]), 1:n()),
    minimum  = mean_obs - stdev * 2,
    maximum  = mean_obs + stdev * 2,
    outlier  = value < minimum | value > maximum
  ) %>% 
  filter(outlier) %>% 
  glimpse()

It borrows from this answer to find the mean and standard deviation excluding the current record, and then marks a row as an outlier if the row is more than 2 SD from the mean.

It can get weird when the record you exclude is not an outlier but leaving it out still changes the mean and standard deviation appreciably. But if the record is an outlier, excluding it is exactly what you want. :)
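
To see why leaving out the current record matters, here is a base-R sketch on a single made-up group of four values (the numbers mirror the metric3 values for x01 above, but the helper names are mine). With the plain "2 SD from the mean" rule, the huge value inflates the standard deviation so much that it hides itself; with the leave-one-out version it is flagged.

```r
# One group of values, with one wildly large observation
value <- c(0.7, 0.7, 0.7, 1e12)

# Plain rule: mean and SD computed from all values, including the suspect one
plain_flag <- abs(value - mean(value)) > 2 * sd(value)

# Leave-one-out rule: mean and SD computed from the other values only
loo_flag <- sapply(seq_along(value), function(i) {
  rest <- value[-i]
  abs(value[i] - mean(rest)) > 2 * sd(rest)
})
```

Here plain_flag is FALSE everywhere, while loo_flag is TRUE only for the last value.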

How to identify and remove outliers in a data.frame using R?

identify_outliers expects a data.frame as input, i.e. the usage is

identify_outliers(data, ..., variable = NULL)

where

... - One unquoted expression (or variable name). Used to select a variable of interest. Alternative to the argument variable.

df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
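
If you would rather not depend on rstatix for the removal step, something along these lines should behave similarly in base R. This is a sketch under the assumption that outliers are defined as values more than 1.5 times the IQR beyond the quartiles, which is my understanding of the default rule; the data frame and the names df_demo and keep are made up for the example.

```r
# Made-up scores with one obvious outlier
df_demo <- data.frame(id = 1:8,
                      score = c(10, 12, 11, 13, 12, 11, 10, 95))

# Quartiles and the 1.5 * IQR fences (assumed outlier rule)
q <- unname(quantile(df_demo$score, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Keep only the rows whose score lies within the fences
keep <- df_demo$score >= lower & df_demo$score <= upper
df_clean <- df_demo[keep, ]
```

With these numbers only the row with score 95 is dropped.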

How to calculate outliers by columns in R?

To get position index of outliers (per column):

pos <- lapply(df, FindOutliers)

To get number of outliers (per column):

lengths(pos)
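
FindOutliers is a user-defined helper rather than a built-in function. If you don't have it at hand, a stand-in along these lines would work with the two calls above. This is only a sketch assuming the usual 1.5 times IQR rule, and the data frame df_demo is invented for the example.

```r
# Hypothetical helper: positions of values beyond 1.5 * IQR from the quartiles
FindOutliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# Made-up data: column a has one outlier, column b has none
df_demo <- data.frame(a = c(1, 2, 3, 4, 5, 100),
                      b = c(10, 11, 12, 13, 14, 15))

pos <- lapply(df_demo, FindOutliers)  # position index per column
n_out <- lengths(pos)                 # number of outliers per column
```

In this toy case pos$a is 6 and pos$b is empty, so n_out is 1 for a and 0 for b.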

It is not a good idea to work with a small sample size. For instance, with your example df of sample size 6, only 851 is detected as an outlier in the last column, and 158 is not picked out.

Identifying several variable outliers with rstatix

The beauty of rstatix is that it is pipe friendly, so you can use it with the tidyverse. To check several variables at once, the data needs to be in long form first. You can use the following code

library(tidyverse)
library(rstatix)

ef.personality %>% 
  mutate(id = row_number()) %>% # unique row id, needed to reshape to long form
  pivot_longer(-id) %>% # long form: one row per id and original column
  group_by(name) %>% # check each original column separately
  identify_outliers(value) %>% 
  select(name, is.extreme)
