How to Detect and Remove Outliers from Each Column of Pandas Dataframe At One Go

How to detect and remove outliers from each column of pandas dataframe at one go?

The problem is that your outliers in each column may occur in different rows (records), so dropping them per column would leave columns of different lengths. I'd advise you to be happy with substituting np.nan.

Setup

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.normal(size=(20, 8)),
    columns=list('ABCDEFGH')
)

df

A B C D E F G H
0 -2.129724 -1.268466 -1.970500 -2.259055 -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 1.494226 -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 -2.137846 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 1.633570
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 2.222507 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 1.690625 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520

pandas.DataFrame.mask

mask replaces values where the condition is True with NaN, so each column is handled independently. Here, any value more than two standard deviations from its column's mean is flagged:

df.mask((df - df.mean()).abs() > 2 * df.std())

A B C D E F G H
0 NaN -1.268466 NaN NaN -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 NaN -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 NaN 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 NaN
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 NaN 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 NaN 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520

+ dropna

If you only want rows for which no outliers exist in any column, you can follow up the above with dropna:

df.mask((df - df.mean()).abs() > 2 * df.std()).dropna()



A B C D E F G H
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
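
If you would rather keep every row, a hedged alternative (my addition, not part of the original answer) is to impute the masked outliers instead of dropping them, for example with each column's median:

# Sketch: mask outliers to NaN, then fill column-wise with the medians
masked = df.mask((df - df.mean()).abs() > 2 * df.std())
imputed = masked.fillna(masked.median())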

Detect Outliers across all columns of Pandas Dataframe

It looks like I just had to change my function input and iterate over each column of the dataframe to do the trick:

def find_outliers(col):
    # Flag values outside the 1.5 * IQR whiskers
    q1 = col.quantile(.25)
    q3 = col.quantile(.75)
    IQR = q3 - q1
    ll = q1 - (1.5 * IQR)  # lower limit
    ul = q3 + (1.5 * IQR)  # upper limit
    upper_outliers = col[col > ul].index.tolist()
    lower_outliers = col[col < ll].index.tolist()
    bad_indices = list(set(upper_outliers + lower_outliers))
    return bad_indices

import numpy as np

bad_indexes = []
for col in df.columns:
    if df[col].dtype in ["int64", "float64"]:
        bad_indexes.append(find_outliers(df[col]))

bad_indexes = set(np.concatenate(bad_indexes).flat)
print(len(bad_indexes))
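
As a follow-up (my addition, not in the original answer), the collected indices can then be dropped in a single call:

# Sketch: remove every row that was flagged in any numeric column
df_clean = df.drop(index=list(bad_indexes))
print(df_clean.shape)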

Eliminate outliers in a dataframe with different dtypes - Pandas

Try using select_dtypes to get all columns from df of a particular type.

To select all numeric types, use np.number or 'number'

import numpy as np
from scipy import stats

new_df = df[
    (np.abs(stats.zscore(df.select_dtypes(include=np.number))) < 3).all(axis=1)
]
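
For reference, the same filter can be written with pandas alone; this is my sketch, not from the original answer (ddof=0 matches stats.zscore's default):

# Equivalent pandas-only formulation, no scipy needed
num = df.select_dtypes(include="number")
z = (num - num.mean()) / num.std(ddof=0)
new_df = df[(z.abs() < 3).all(axis=1)]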

How to identify and highlight outliers in each row of a pandas dataframe

This works for me

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)

mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
df["Count"] = mask.sum(axis=1)
mask["Count"] = False
style_df = mask.applymap(lambda x: "background-color: red" if x else "")

sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(
        writer, sheet_name=sheet_name, index=False
    )

Here the mask is the boolean condition that is True wherever the z-score exceeds the limit. Based on this boolean mask I create a string dataframe style_df with the value "background-color: red" on the deviating cells. The last statement imposes the values of style_df on the style of the df dataframe.
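
As an aside (my addition), the same styling can be previewed inline in a Jupyter notebook before exporting:

# Render the styled frame directly instead of writing to Excel
styled = df.style.apply(lambda x: style_df, axis=None)
styled  # Jupyter displays the table with the red cells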

The resulting Excel file shows the deviating cells highlighted in red.

Remove Outliers in Pandas DataFrame using Percentiles

The initial dataset.

print(df.head())

Col0 Col1 Col2 Col3 Col4 User_id
0 49 31 93 53 39 44
1 69 13 84 58 24 47
2 41 71 2 43 58 64
3 35 56 69 55 36 67
4 64 24 12 18 99 67

First, removing the User_id column.

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)

Col0 Col1 Col2 Col3 Col4
0.05 2.00 3.00 6.9 3.95 4.00
0.95 95.05 89.05 93.0 94.00 97.05

Next, filtering values based on the computed percentiles, using a column-wise apply. Values outside the band are dropped from each column's Series, and pandas realigns the result on the original index, leaving NaN where a value was filtered out:

filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Last, rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)
print(filt_df.head())

User_id Col0 Col1 Col2 Col3 Col4
1 47 69 13 84 58 24
3 67 35 56 69 55 36
5 9 95 79 44 45 69
6 83 69 41 66 87 6
9 87 50 54 39 53 40

Checking the result

For reference, this is what filt_df looked like before the dropna step; NaN marks the values that were filtered out.

print(filt_df.head())

User_id Col0 Col1 Col2 Col3 Col4
0 44 49 31 NaN 53 39
1 47 69 13 84 58 24
2 64 41 71 NaN 43 58
3 67 35 56 69 55 36
4 67 64 24 12 18 NaN

print(filt_df.describe())

User_id Col0 Col1 Col2 Col3 Col4
count 100.000000 89.000000 88.000000 88.000000 89.000000 89.000000
mean 48.230000 49.573034 45.659091 52.727273 47.460674 57.157303
std 28.372292 25.672274 23.537149 26.509477 25.823728 26.231876
min 0.000000 3.000000 5.000000 7.000000 4.000000 5.000000
25% 23.000000 29.000000 29.000000 29.500000 24.000000 36.000000
50% 47.000000 50.000000 40.500000 52.500000 49.000000 59.000000
75% 74.250000 69.000000 67.000000 75.000000 70.000000 79.000000
max 99.000000 95.000000 89.000000 92.000000 91.000000 97.000000

How to generate the test dataset

import numpy as np
import pandas as pd

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = pd.DataFrame.from_dict(d)

Automating removing outliers from a pandas dataframe using IQR as the parameter and putting the variables in a list

Based on comments on the original post, I suggest you do the following and revamp your solution.

I believe this answer provides a quick solution to your problem, so remember to search on SO before posting. This will remove all rows where at least one of the selected column values is an outlier.

cols = ['pdays', 'campaign', 'previous']  # The columns you want to search for outliers in

# Calculate quantiles and IQR
Q1 = dummy_df[cols].quantile(0.25)  # Same as np.percentile but maps (0,1) and not (0,100)
Q3 = dummy_df[cols].quantile(0.75)
IQR = Q3 - Q1

# Boolean Series that is True for rows with no outlier in any of the chosen columns
condition = ~((dummy_df[cols] < (Q1 - 1.5 * IQR)) | (dummy_df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)

# Filter our dataframe based on the condition
filtered_df = dummy_df[condition]
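
A quick sanity check (my addition) is to compare row counts and see how many rows the filter removed:

# Sketch: report how many rows were dropped as outliers
removed = len(dummy_df) - len(filtered_df)
print(f"Removed {removed} of {len(dummy_df)} rows")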

