How to Get Outliers for All the Columns in a Dataframe in R

How to get outliers for all the columns in a dataframe in r

We create a function by selecting only the numeric columns (select_if), loop through those columns (map) and subset the elements that are not outliers. This will output as a list of vectors.

library(dplyr)
library(tidyr)
library(purrr)
outlierremoval <- function(dataframe){
dataframe %>%
select_if(is.numeric) %>% #selects on the numeric columns
map(~ .x[!.x %in% boxplot.stats(.)$out]) #%>%
# not clear whether we need to output as a list or data.frame
# if it is the latter, the columns could be of different length
# so we may use cbind.fill
# { do.call(rowr::cbind.fill, c(., list(fill = NA)))}

}

outlierremoval(Clean_Data)

If we want to keep all the other columns, then use map_if and append with NA at the end using cbind.fill to create a data.frame output. But, this will also result in change of position of rows in each column based on the number of outliers

outlierremoval <- function(dataframe){
dataframe %>%
map_if(is.numeric, ~ .x[!.x %in% boxplot.stats(.)$out]) %>%
{ do.call(rowr::cbind.fill, c(., list(fill = NA)))} %>%
set_names(names(dataframe))


}
res <- outlierremoval(Clean_Data)
head(res)
# X Observation Dist_Taxi Dist_Market Dist_Hospital Carpet Builtup Parking City_Category Rainfall House_Price
#1 1 1 9796 5250 10703 1659 1961 Open CAT B 530 6649000
#2 2 2 8294 8186 12694 1461 1752 Not Provided CAT B 210 3982000
#3 3 3 11001 14399 16991 1340 1609 Not Provided CAT A 720 5401000
#4 4 4 8301 11188 12289 1451 1748 Covered CAT B 620 5373000
#5 5 5 10510 12629 13921 1770 2111 Not Provided CAT B 450 4662000
#6 6 6 6665 5142 9972 1442 1733 Open CAT B 760 4526000

Update

If we need to get the outliers, in the map step we extract the outlier from the boxplot.stats

outliers <- function(dataframe){
dataframe %>%
select_if(is.numeric) %>%
map(~ boxplot.stats(.x)$out)


}
outliers(Clean_Data)

Or to replace the outliers with NA (which will also preserve the row positions)

outlierreplacement <- function(dataframe){
dataframe %>%
map_if(is.numeric, ~ replace(.x, .x %in% boxplot.stats(.x)$out, NA)) %>%
bind_cols



}
outlierreplacement(Clean_Data)

Removing outliers from multiple columns for dataframes in a list

I think @Duck's comment is very useful here. When you calculate the mean and the sd using the entire dataset, you are including the outliers to the calculation. That will not remove the three outliers in your example. You should restrict your data somehow before calculating the mean and the sd, and then based on those calculations, you can remove outliers. That is, you should remove some cases from the high/low end of the range. The question is, how many (or what proportion of) the cases you will exclude before calculating the mean and the sd? There, you can use the quantile function. Here is how I modified your function:

remove_outliers = function(df) {
for (i in 2:ncol(df)) {
dat = df[which(df[,i] > quantile(df[,i], .1) & df[,i] < quantile(df[,i], .9)),i]
mean = mean(dat)
sd = sd(dat)
df[which( abs((df[,i]) - mean) > (sd * 3)), i] = NA
}
return(df)
}

And here is the outcome when you apply that function to df1:

> remove_outliers(df1)
date_time XH_warmed_air_1m XH_ambient_air_1m
1 2019-01-01 25 25
2 2019-01-02 23 23
3 2019-01-03 26 26
4 2019-01-04 30 30
5 2019-01-05 10 10
6 2019-01-06 15 15
7 2019-01-07 12 12
8 2019-01-08 0 0
9 2019-01-09 1 1
10 2019-01-10 5 5
11 2019-01-11 -15 -15
12 2019-01-12 -12 -12
13 2019-01-13 -6 -6
14 2019-01-14 -1 -1
15 2019-01-15 NA NA
16 2019-01-16 NA NA
17 2019-01-17 NA NA

Also, as @dcarlson said, you are applying the function to a date_time column. I excluded that column from the function.

how to detect outliers in the columns of a dataframe? in R

Here's how you can do it for var1:

quantiles<-tapply(var1,names,quantile)
minq <- sapply(names, function(x) quantiles[[x]]["25%"])
maxq <- sapply(names, function(x) quantiles[[x]]["75%"])
var1[var1<minq | var1>maxq] <- NA

Repeat the same for var2 (or df$var2).

How to calculate outliers by columns in R?

To get position index of outliers (per column):

pos <- lapply(df, FindOutliers)

To get number of outliers (per column):

lengths(pos)

It is not a good idea to work with small sample size. Say, with your example df with sample size 6, only 851 is detected as an outlier in the last column, and 158 is not picked out.

Excluding outliers based on multiple columns in R ? IQR method

Separate the concerns:

  1. Identify outliers in a numeric vector using the IQR method. This can be encapsulated in a function taking a vector.
  2. Remove outliers from several columns of a data.frame. This is a function taking a data.frame.

I would suggest returning a boolean vector rather than indices. This way, the returned value is the size of the data which makes it easy to create a new column, for exampledf$outlier <- is_outlier(df$measure1).

Note how the argument names make it clear which type of input is expected: x is a standard name for a numeric vector and df is obviously a data.frame. cols is probably a list or vector of column names.

I made a point to only use base R but in real life I would use the dplyr package to manipulate the data.frame.

#' Detect outliers using IQR method
#'
#' @param x A numeric vector
#' @param na.rm Whether to exclude NAs when computing quantiles
#'
is_outlier <- function(x, na.rm = FALSE) {
qs = quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)

lowerq <- qs[1]
upperq <- qs[2]
iqr = upperq - lowerq

extreme.threshold.upper = (iqr * 3) + upperq
extreme.threshold.lower = lowerq - (iqr * 3)

# Return logical vector
x > extreme.threshold.upper | x < extreme.threshold.lower
}

#' Remove rows with outliers in given columns
#'
#' Any row with at least 1 outlier will be removed
#'
#' @param df A data.frame
#' @param cols Names of the columns of interest. Defaults to all columns.
#'
#'
remove_outliers <- function(df, cols = names(df)) {
for (col in cols) {
cat("Removing outliers in column: ", col, " \n")
df <- df[!is_outlier(df[[col]]),]
}
df
}

Armed with these 2 functions, it becomes very easy:

df <- data.frame(ID = c(1001, 1002, 1003, 1004, 1005,   1006,   1007,   1008,   1009,   1010,   1011),
measure1 = rnorm(11, mean = 8, sd = 4),
measure2 = rnorm(11, mean = 40, sd = 5),
measure3 = rnorm(11, mean = 20, sd = 2),
measure4 = rnorm(11, mean = 9, sd = 3))

vars_of_interest <- c("measure1", "measure3", "measure4")

df_filtered <- remove_outliers(df, vars_of_interest)
#> Removing outliers in column: measure1
#> Removing outliers in column: measure3
#> Removing outliers in column: measure4

df_filtered
#> ID measure1 measure2 measure3 measure4
#> 1 1001 9.127817 40.10590 17.69416 8.6031175
#> 2 1002 18.196182 38.50589 23.65251 7.8630485
#> 3 1003 10.537458 37.97222 21.83248 6.0798316
#> 4 1004 5.590463 46.83458 21.75404 6.9589981
#> 5 1005 14.079801 38.47557 20.93920 -0.6370596
#> 6 1006 3.830089 37.19281 19.56507 6.2165156
#> 7 1007 14.644766 37.09235 19.78774 10.5133674
#> 8 1008 5.462400 41.02952 20.14375 13.5247993
#> 9 1009 5.215756 37.65319 22.23384 7.3131715
#> 10 1010 14.518045 48.97977 20.33128 9.9482211
#> 11 1011 1.594353 44.09224 21.32434 11.1561089

Created on 2020-03-23 by the reprex package (v0.3.0)

Remove outliers for all columns with R

You may try summarise_each from dplyr and apply the median and calcul.mad. Once we got that, we can calculate the uper.interval and lower.interval after reshaping the Sum_f1. Then, apply the function_Data once we get all the values.

library(dplyr)
Sum_f1 <- summarise_each(data_f,funs(median, calcul.mad))

n <- 2*ncol(data_f)

dl <- reshape(Sum_f1, idvar='id', direction='long', sep="_",
varying=split(seq(n), as.numeric(gl(n,n/2,n))))

up_data <- mapply(uper.interval, dl[,2], dl[,3])
low_data <- mapply(lower.interval, dl[,2], dl[,3])
data_f1 <- data_f
data_f1[] <- Map(functionData, data_f, up_data, low_data)


Related Topics



Leave a reply



Submit