How to calculate outliers by columns in R?
To get position index of outliers (per column):
pos <- lapply(df, FindOutliers)
To get number of outliers (per column):
lengths(pos)
Working with a small sample size is not a good idea. For instance, with your example df (sample size 6), only 851 is detected as an outlier in the last column; 158 is not picked out.
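FindOutliers is not defined in the snippet above. A minimal sketch of what such a helper could look like, assuming the usual 1.5 * IQR rule (the rule, the function body, and the toy data below are assumptions for illustration, not the original definition):

```r
# Hypothetical helper: positions of values outside the 1.5 * IQR fences.
FindOutliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[[2]] - q[[1]])
  which(x < q[[1]] - fence | x > q[[2]] + fence)
}

# Toy data, made up for illustration.
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
                  b = c(5, 6, 5, 6, 5, 6, 5, 6, 5, 6))

pos <- lapply(toy, FindOutliers)  # position index of outliers per column
n_out <- lengths(pos)             # number of outliers per column
```

Here only the last value of column a falls outside the fences, so pos$a holds position 10 and pos$b is empty.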
Identifying the outliers in a data set in R
You can get this using boxplot. If your variable is x,
OutVals = boxplot(x)$out
which(x %in% OutVals)
If you are annoyed by the plot, you could use
OutVals = boxplot(x, plot=FALSE)$out
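If no plot is needed at all, boxplot.stats (from grDevices, which gathers the statistics behind a box plot) gives the same kind of result. A small sketch on a made-up vector x (the data are an assumption for illustration):

```r
x <- c(2, 3, 4, 5, 6, 100)        # made-up data

OutVals <- boxplot.stats(x)$out   # values beyond the whiskers
idx <- which(x %in% OutVals)      # their positions in x
```

With this vector the hinges are 3 and 6, so anything above 10.5 lies beyond the upper whisker and only the value 100 (position 6) is returned.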
Calculate outliers of a group of specific columns then identify ids which have 5 columns with outliers
You could create a function which returns the ids of outliers
find_outlier <- function(df, x) {
  # ids of those with upper outliers (at or above mean + 3 SD)
  uthr = mean(x)+3*sd(x)
  rm_u_ids = df$id[which(x >= uthr)]
  # ids of those with lower outliers (at or below mean - 3 SD)
  lthr = mean(x)-3*sd(x)
  rm_l_ids = df$id[which(x <= lthr)]
  # combine upper and lower outlier ids, sorted and without duplicates
  unique(sort(c(rm_u_ids, rm_l_ids)))
}
Apply it to every colors column, count how often each id appears with table, and remove the ids which occur 5 or more times.
all_ids <- lapply(df[colors], find_outlier, df = df)
temp_tab <- table(unlist(all_ids))
remove_ids <- names(temp_tab[temp_tab >= 5])
subset(df, !id %in% remove_ids)
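To see the whole pipeline run end to end, here is a self-contained sketch on made-up data. The column names, the toy values, and the lower cutoff min_cols = 2 are assumptions chosen so the example stays small; find_outlier is a condensed version of the helper above.

```r
# Condensed helper: ids whose value lies 3 or more SDs from the column mean.
find_outlier <- function(df, x) {
  uthr <- mean(x) + 3 * sd(x)
  lthr <- mean(x) - 3 * sd(x)
  unique(sort(df$id[which(x >= uthr | x <= lthr)]))
}

# Made-up data: id 20 is extreme in two columns, id 1 in one column.
toy <- data.frame(
  id    = 1:20,
  red   = c(rep(0, 19), 100),
  green = c(rep(0, 19), 100),
  blue  = c(100, rep(0, 19))
)
colors <- c("red", "green", "blue")
min_cols <- 2  # hypothetical cutoff; the answer above uses 5

all_ids <- lapply(toy[colors], find_outlier, df = toy)
temp_tab <- table(unlist(all_ids))          # how many columns flag each id
remove_ids <- names(temp_tab[temp_tab >= min_cols])
kept <- subset(toy, !id %in% remove_ids)
```

Only id 20 is flagged in at least two columns, so it is the only row dropped; id 1 is flagged once and stays.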
Identifying outliers in R data by factor levels and columns
To start with the data you provided...
df = data.frame(
  species = c("a","b","a","b","a","b","a","b","a","b"),
  uniqueID = c("x01","x02","x03","x04","x05","x06","x07","x08","x09","x10"),
  metric1 = c(1,2,3,1,2,3,1,2,3,11),
  metric2 = c(4,5,6,4,5,6,55,4,5,6),
  metric3 = c(0.7,7,8,9,7,8,9,77,8,9)
)
I'm going to use the tidyverse liberally here...
library(tidyverse)
Then to triple the rows so the standard deviation calculation doesn't die on us, and to add another outlier row...
df2 <- df %>%
  bind_rows(df) %>%
  bind_rows(df) %>%
  add_row(
    species = "a",
    uniqueID = "x01",
    metric1 = 1,
    metric2 = 4,
    metric3 = 1e12
  )
What if you tried something like this?
df2 %>%
  gather(key = "metric", value = "value", -species, -uniqueID) %>%
  group_by(species, uniqueID, metric) %>%
  arrange(species, uniqueID, metric) %>% # just to make the results easy to scan
  mutate(
    mean_obs = mapply(function(x) mean(value[-x]), 1:n()),
    stdev = mapply(function(x) sd(value[-x]), 1:n()),
    minimum = mean_obs - stdev * 2,
    maximum = mean_obs + stdev * 2,
    outlier = value < minimum | value > maximum
  ) %>%
  filter(outlier) %>%
  glimpse()
It borrows from this answer to find the mean and standard deviation excluding the current record, and then marks a row as an outlier if the row is more than 2 SD from the mean.
It can get weird if you exclude the current record and the record is not an outlier and it appreciably changes the mean and standard deviation. But then if the record is an outlier, you definitely want to do that. :)
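The leave-one-out idea can be seen in isolation with a small base R sketch. The vector below is made up for illustration and stands in for the values of a single group:

```r
value <- c(1, 2, 3, 50)  # made-up observations for one group

# For each position i, summarize all the *other* observations.
loo_mean <- vapply(seq_along(value), function(i) mean(value[-i]), numeric(1))
loo_sd   <- vapply(seq_along(value), function(i) sd(value[-i]), numeric(1))

# Flag a value when it falls outside mean +/- 2 SD of the others.
outlier <- value < loo_mean - 2 * loo_sd | value > loo_mean + 2 * loo_sd
```

Only 50 is flagged: without it the other values have mean 2 and SD 1. Computed on all four values instead, the mean is 14 and the SD is about 24, so 50 would sit inside mean + 2 SD and go unflagged, which is why excluding the current record matters.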
How to identify and remove outliers in a data.frame using R?
The identify_outliers function (from the rstatix package) expects a data.frame as input, i.e. usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expression (or variable name). Used to select a variable of interest. Alternative to the argument variable.
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
Finding outliers for all specific subsets in R?
You're already using dplyr, so I suggest you use group_by, as it (to me) is a more natural way of dealing with the data.
Also, this part is incorrect syntax:
data %>%
filter(measure == d, condition ==c) %>%
o <- outlier(data$value) %>%
print(o)
Why?
The filter(...) %>% should be piping to something that accepts a frame, but ... you're sending the output from filter into an assignment o <- outlier(...) (and then to print(o), which really means print(., o) where . is the output from the previous command).
Further, since o is not yet defined the first time this runs ... you should get an error about object 'o' not found. You won't get it on subsequent passes in the loop, since it does exist ... but if so then it's the outliers from the previous iteration of the loop. Certainly not what you should be using.
A direct correction of that code might be:
for (...) {
  for (...) {
    o <- data %>%
      filter(measure == d, condition == c) %>%
      do({ data.frame(outliers = outlier(.$value)) })
    print(o)
  }
}
where o will be a data.frame (well, a tbl_df tibble) with three columns: measure, condition, and outliers. The use of do is required in this case because most non-tidyverse functions ignore group_by groupings, so we use do to side-step that problem.
Perhaps this, though, to replace both loops with a single command:
data %>%
  group_by(measure, condition) %>%
  summarize(outliers = outlier(value)) %>%
  ungroup()
I'm assuming that what you want is all outlier values for each unique combination of measure and condition, and that the outlier(.) function returns a vector (of some length >= 1). If no outliers are found, the measure/condition pair will not be included ... if this is a factor, then use something like
data %>%
  group_by(measure, condition) %>%
  summarize(outliers = list(outlier(value))) %>%
  tidyr::unnest(outliers, keep_empty = TRUE) %>%
  ungroup()
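For comparison, a base R sketch of the same per-group idea. The outlier function from the question is not shown here, so this sketch substitutes boxplot.stats(v)$out as a stand-in, and the data frame is made up for illustration:

```r
data <- data.frame(
  measure   = rep(c("m1", "m2"), each = 6),
  condition = "c1",
  value     = c(2, 3, 4, 5, 6, 100,  1, 2, 3, 4, 5, 6)
)

# One vector of values per measure/condition combination.
groups <- split(data$value, list(data$measure, data$condition), drop = TRUE)

# Stand-in for outlier(): values beyond the boxplot whiskers.
outs <- lapply(groups, function(v) boxplot.stats(v)$out)
```

Groups without outliers stay in the list as empty vectors (here m2.c1), so no measure/condition pair silently disappears.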
Find outliers by Standard Deviation from mean, replace with NA in large dataset (6000+ columns)
You can use an ifelse function; here is an example using dplyr that applies ifelse over all columns containing the term HUMAN:
library(dplyr)
# set values more than 2 SD away from the column mean to NA
data %>% mutate_at(.vars = vars(contains("HUMAN")),
                   .funs = ~ifelse(abs(. - mean(.)) > 2*sd(.), NA, .))
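A base R sketch of the same idea, assuming the intent is to blank out values that lie more than 2 SD from their column mean. The data frame and its column names are made up for illustration:

```r
# Made-up data; only columns whose name contains "HUMAN" are processed.
data <- data.frame(
  HUMAN_1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100),
  OTHER   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
)

cols <- grep("HUMAN", names(data), value = TRUE)

# Replace values more than 2 SD away from the column mean with NA.
data[cols] <- lapply(data[cols], function(v) {
  replace(v, abs(v - mean(v, na.rm = TRUE)) > 2 * sd(v, na.rm = TRUE), NA)
})
```

Only the value 100 in HUMAN_1 becomes NA; the OTHER column holds the same numbers but is left untouched because its name does not match.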