How to remove outliers from a dataset
You can apply something like the function below to your dataset. Do not overwrite and save the original, or you'll destroy your data! And, by the way, you should (almost) never remove outliers from your data:
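remove_outliers <- function(x, na.rm = TRUE, ...) {
  # First and third quartiles; extra arguments are passed on to quantile()
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  # Fence width: 1.5 times the interquartile range
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  # Replace values outside the fences with NA instead of dropping them
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}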
To see it in action:
set.seed(1)
x <- rnorm(100)
x <- c(-10, x, 10)
y <- remove_outliers(x)
## png()
par(mfrow = c(1, 2))
boxplot(x)
boxplot(y)
## dev.off()
And once again, you should (almost) never do this on your own; outliers are just meant to be! =)
EDIT: I added na.rm = TRUE as default.
EDIT2: Removed quantile function, added subscripting, hence made the function faster! =)
R - how to remove outliers from dataset by two different groups
You could use anti_join() from dplyr after identifying the outliers. Note that my df_outliers only contains IDs 1, 7 and 10.
library(tidyverse)
library(rstatix)
df <- tibble(
ID = c(1L,2L,3L,4L,5L,6L,7L,8L,
9L,10L,11L,12L,13L,14L,15L,16L,17L,18L,19L,
20L,21L,22L,23L,24L,25L,26L,27L,28L,29L),
Treatment = c("A","A","A","A","A","A",
"A","A","A","A","A","A","A","A","A","A","A","A",
"A","A","B","B","B","B","B","B","B","B","B"),
conc = c(40,40,40,40,40,40,20,20,
20,20,20,20,10,10,10,10,10,10,5,5,0.8,0.8,
0.8,0.8,0.8,0.8,0.6,0.6,0.6),
relabs = c(1.0793923,0.6436631,0.5556844,
0.4834845,0.7224756,0.6804259,0.9958288,0.709936,
0.7028124,0.5016352,0.6860346,0.734197,0.8175491,
0.690091,0.5278228,0.7560026,0.8841343,0.6687616,
0.8563232,0.7419997,1.2049695,0.4969811,0.2835814,0.670025,
1.3126651,0.4510617,0.7629639,0.7513716,0.7956074)
)
df_outliers <- df %>%
group_by(Treatment, conc) %>%
identify_outliers("relabs")
# A tibble: 3 x 6
Treatment conc ID relabs is.outlier is.extreme
<chr> <dbl> <int> <dbl> <lgl> <lgl>
1 A 20 7 0.996 TRUE TRUE
2 A 20 10 0.502 TRUE TRUE
3 A 40 1 1.08 TRUE FALSE
# without outliers
df %>%
anti_join(df_outliers, by = "ID") %>%
view()
# A tibble: 26 x 4
ID Treatment conc relabs
<int> <chr> <dbl> <dbl>
1 2 A 40 0.644
2 3 A 40 0.556
3 4 A 40 0.483
4 5 A 40 0.722
5 6 A 40 0.680
6 8 A 20 0.710
7 9 A 20 0.703
8 11 A 20 0.686
9 12 A 20 0.734
10 13 A 10 0.818
# … with 16 more rows
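For readers doing the same thing in pandas, the per-group idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data, the helper name is_outlier and the plain 1.5 * IQR rule are made up for the example and are not taken from rstatix.

```python
import pandas as pd

# Hypothetical toy data: two treatment groups, one obvious outlier in each.
df = pd.DataFrame({
    "ID": range(1, 13),
    "Treatment": ["A"] * 6 + ["B"] * 6,
    "relabs": [0.70, 0.72, 0.68, 0.71, 0.69, 1.50,
               0.40, 0.42, 0.38, 0.41, 0.39, 0.05],
})


def is_outlier(values: pd.Series) -> pd.Series:
    """Flag values outside the 1.5 * IQR fences of their own group."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)


# Compute the flags separately within each Treatment group.
flags = df.groupby("Treatment")["relabs"].transform(is_outlier).astype(bool)
df_outliers = df[flags]

# Keep only rows whose ID is not among the outliers (similar in spirit to anti_join by ID).
df_clean = df[~df["ID"].isin(df_outliers["ID"])]
print(df_outliers["ID"].tolist())
```

Filtering on ID mirrors the anti_join() step, so it relies on the IDs being unique.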
How to remove outliers from a dataset using bivariate boxplot
You can extract the desired subset like so:
How to remove Outliers in Python?
You're having trouble with your code because you're trying to calculate zscore on categorical columns.
To avoid this, you should first split your train DataFrame into a numerical part and a categorical part (the snippets below assume import numpy as np, import pandas as pd and from scipy import stats):
num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])
and only after that calculate the index of rows to keep (note the absolute value, so that large negative z-scores are excluded as well):
idx = np.all(np.abs(stats.zscore(num_train)) < 3, axis=1)
and finally add the two pieces together:
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
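As a minimal, self-contained sketch of the steps above: the train DataFrame and its column names (area, price, city) are made up for illustration, and the z-scores are computed with pandas alone, using ddof=0 to match the default of scipy.stats.zscore.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one categorical column.
train = pd.DataFrame({
    "area": [50.0] * 19 + [5000.0],          # last row is an extreme value
    "price": [float(p) for p in range(100, 120)],
    "city": ["A", "B"] * 10,
})

# Split into numerical and categorical parts.
num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

# Population z-scores (ddof=0) per column.
z = (num_train - num_train.mean()) / num_train.std(ddof=0)

# Keep rows whose every numeric value lies within 3 standard deviations.
idx = (z.abs() < 3).all(axis=1)

# Put the two parts back together, keeping only the selected rows.
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
print(train_cleaned.shape)
```

Because idx is a boolean Series that shares the index of train, the same mask can be applied to both parts before concatenating them.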
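For the IQR part (note that this snippet takes the 2nd and 98th percentiles as Q1 and Q3, which is much wider than the usual 25th and 75th):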
Q1 = num_train.quantile(0.02)
Q3 = num_train.quantile(0.98)
IQR = Q3 - Q1
idx = ~((num_train < (Q1 - 1.5 * IQR)) | (num_train > (Q3 + 1.5 * IQR))).any(axis=1)
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
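For comparison, here is a sketch of the same filter with the conventional quartiles (0.25 and 0.75) as bounds. The toy data and the column names area and city are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier in "area".
train = pd.DataFrame({
    "area": [48.0, 50.0, 52.0, 49.0, 51.0, 50.0, 47.0, 53.0, 50.0, 500.0],
    "city": ["A", "B"] * 5,
})

num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

# Conventional quartiles and interquartile range, computed per column.
q1 = num_train.quantile(0.25)
q3 = num_train.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is an outlier if any of its numeric values falls outside the fences.
outlier = ((num_train < lower) | (num_train > upper)).any(axis=1)

train_cleaned = pd.concat(
    [num_train.loc[~outlier], cat_train.loc[~outlier]], axis=1
)
print(len(train_cleaned))
```

Which percentiles to use as bounds is a judgment call; wider bounds such as 0.02 and 0.98 remove far fewer rows.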
Please let us know if you have any further questions.
PS: You might also consider one more approach for dealing with outliers, pandas.DataFrame.clip, which caps outlying values at a threshold on a case-by-case basis instead of dropping the whole row.
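A small sketch of the clipping approach, assuming IQR-based fences as the thresholds (the choice of thresholds, the toy data and the column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier in "area".
train = pd.DataFrame({
    "area": [48.0, 50.0, 52.0, 49.0, 51.0, 50.0, 47.0, 53.0, 50.0, 500.0],
    "city": ["A", "B"] * 5,
})

num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

# Per-column fences based on the interquartile range.
q1 = num_train.quantile(0.25)
q3 = num_train.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values at the fences; axis=1 aligns the thresholds with the columns.
num_clipped = num_train.clip(lower=lower, upper=upper, axis=1)

# No rows are dropped, so the categorical part can be joined back unchanged.
train_clipped = pd.concat([num_clipped, cat_train], axis=1)
print(train_clipped["area"].max())
```

Unlike the row-dropping variants, this keeps every observation and only changes the values that fall outside the chosen thresholds.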