Identifying the outliers in a data set in R
You can get this using boxplot. If your variable is x,
OutVals = boxplot(x)$out
which(x %in% OutVals)
If you are annoyed by the plot, you could use
OutVals = boxplot(x, plot=FALSE)$out
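As a quick check of what gets flagged, here is a minimal sketch on a made-up vector (the values in x are invented for illustration and are not from the question). By default, boxplot marks points lying more than 1.5 times the interquartile range beyond the hinges; that multiplier is set by its range argument.

```r
# Made-up data: five ordinary values and one extreme value
x <- c(2, 3, 4, 5, 6, 50)

# plot = FALSE returns the box statistics without drawing anything
out_vals <- boxplot(x, plot = FALSE)$out

# Positions of the flagged values in the original vector
out_idx <- which(x %in% out_vals)

print(out_vals)
print(out_idx)
```

Here the hinges are 3 and 6, so anything above 10.5 is flagged, which is only the last value.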
How can I identify inconsistencies and outliers in a dataset in R
For the specific examples you've given:
- an ID must be four numbers and four letters:
!grepl("^[0-9]{4}-[[:alpha:]]{4}$", ID)
will be TRUE for inconsistent values (^ and $ mean beginning- and end-of-string respectively; {4} means "previous pattern repeats exactly four times"; [0-9] means "any symbol between 0 and 9" (i.e. any numeral); [[:alpha:]] means "any alphabetic character"). If you only want uppercase letters you could use [A-Z] instead (assuming you are not working in some weird locale like Estonian).
- If you need a numeric value to be 0 or 1, then
!num_val %in% c(0,1)
will work (this works for any set of allowed values; you can use it for a specific set of allowed character values as well).
- If you need a numeric value to be between a and b, then
!(a < num_val & num_val < b)
...
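The three checks above can be tried together on some toy input. This is only a sketch: the vectors ID and num_val and the bounds a and b are hypothetical values invented for illustration.

```r
# Hypothetical inputs, invented for illustration
ID <- c("1234-abcd", "12a4-abcd", "1234-ABCDE", "1234abcd")
num_val <- c(0, 1, 2, 0.5)
a <- 0
b <- 1

# TRUE where the ID does not match "four digits, hyphen, four letters"
bad_id <- !grepl("^[0-9]{4}-[[:alpha:]]{4}$", ID)

# TRUE where the value is not in the allowed set {0, 1}
not_binary <- !num_val %in% c(0, 1)

# TRUE where the value is not strictly between a and b
out_of_range <- !(a < num_val & num_val < b)

print(data.frame(ID, bad_id))
print(data.frame(num_val, not_binary, out_of_range))
```

Note that the strict inequalities treat a and b themselves as out of range; use <= if the endpoints should be allowed.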
Identifying outliers in R data by factor levels and columns
To start with the data you provided...
df = data.frame(
  species = c("a","b","a","b","a","b","a","b","a","b"),
  uniqueID = c("x01","x02","x03","x04","x05","x06","x07","x08","x09","x10"),
  metric1 = c(1,2,3,1,2,3,1,2,3,11),
  metric2 = c(4,5,6,4,5,6,55,4,5,6),
  metric3 = c(0.7,7,8,9,7,8,9,77,8,9)
)
I'm going to use the tidyverse liberally here...
library(tidyverse)
Then to triple the rows so the standard deviation calculation doesn't die on us, and to add another outlier row...
df2 <- df %>%
  bind_rows(df) %>%
  bind_rows(df) %>%
  add_row(
    species = "a",
    uniqueID = "x01",
    metric1 = 1,
    metric2 = 4,
    metric3 = 1e12
  )
What if you tried something like this?
df2 %>%
  gather(key = "metric", value = "value", -species, -uniqueID) %>%
  group_by(species, uniqueID, metric) %>%
  arrange(species, uniqueID, metric) %>% # just to make the results easy to scan
  mutate(
    mean_obs = mapply(function(x) mean(value[-x]), 1:n()),
    stdev = mapply(function(x) sd(value[-x]), 1:n()),
    minimum = mean_obs - stdev * 2,
    maximum = mean_obs + stdev * 2,
    outlier = value < minimum | value > maximum
  ) %>%
  filter(outlier) %>%
  glimpse()
It borrows from this answer to find the mean and standard deviation excluding the current record, and then marks a row as an outlier if the row is more than 2 SD from the mean.
It can get weird when the excluded record is not an outlier but leaving it out still appreciably changes the mean and standard deviation. But if the record is an outlier, you definitely want to exclude it. :)
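To see the leave-one-out idea in isolation, here is a minimal base R sketch on a single made-up vector (the values are invented for illustration; it uses vapply rather than the mapply call above, but computes the same quantities).

```r
# Made-up measurements with one obvious outlier
value <- c(4, 5, 6, 55)
idx <- seq_along(value)

# Mean and SD of all the *other* values, for each position
mean_obs <- vapply(idx, function(i) mean(value[-i]), numeric(1))
stdev <- vapply(idx, function(i) sd(value[-i]), numeric(1))

# Flag a value if it is more than 2 SD from the leave-one-out mean
is_outlier <- value < mean_obs - 2 * stdev | value > mean_obs + 2 * stdev

print(data.frame(value, mean_obs, stdev, is_outlier))
```

Only 55 is flagged: without it the remaining values have mean 5 and SD 1. For the other three values, 55 stays in the comparison set and inflates the SD so much that nothing else falls outside the band, which is the "weird" effect mentioned above.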
How to identify and remove outliers in a data.frame using R?
identify_outliers expects a data.frame as input, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expression (or variable name). Used to select a variable of interest. Alternative to the argument variable.
To drop the rows whose score is flagged as an outlier:
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
How to calculate outliers by columns in R?
To get position index of outliers (per column):
pos <- lapply(df, FindOutliers)
To get number of outliers (per column):
lengths(pos)
Working with a small sample size is not a good idea. For instance, with your example df (sample size 6), only 851 is detected as an outlier in the last column, and 158 is not picked out.
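FindOutliers is not defined in this excerpt. The sketch below is one plausible definition, assuming the common 1.5 x IQR rule; the function body and the toy data frame df_demo are assumptions made for illustration, not the original answer's code or data.

```r
# Assumed definition: positions of values beyond 1.5 * IQR from the quartiles
FindOutliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  which(x < lower | x > upper)  # which() drops NA comparisons
}

# Toy data frame, invented for illustration
df_demo <- data.frame(
  a = c(1, 2, 3, 4, 5, 100),
  b = c(10, 11, 12, 13, 14, 15)
)

pos <- lapply(df_demo, FindOutliers)  # positions per column
n_out <- lengths(pos)                 # counts per column

print(pos)
print(n_out)
```

With this definition, column a has one flagged position (row 6) and column b has none.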
Identifying several variable outliers with rstatix
The beauty of rstatix is that it is pipe-friendly, so you can use it within the tidyverse framework. To check several variables at once, reshape the data to long form first. You can use the following code:
library(tidyverse)
library(rstatix)
ef.personality %>%
  mutate(id = seq(1, nrow(ef.personality), 1)) %>% # unique row id, needed to reshape to long form
  pivot_longer(-id) %>%                            # one row per (id, variable) pair
  group_by(name) %>%                               # check each original column separately
  identify_outliers(value) %>%
  select(name, is.extreme)