How to Detect and Remove Outliers from Each Column of Pandas Dataframe At One Go

How to detect and remove outliers from each column of pandas dataframe at one go?

The problem is that your outliers in each column may occur in different rows (records), so dropping them per column would leave columns of different lengths. I'd advise you to be happy with substituting np.nan.

Setup

import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.normal(size=(20, 8)),
    columns=list('ABCDEFGH')
)

df

A B C D E F G H
0 -2.129724 -1.268466 -1.970500 -2.259055 -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 1.494226 -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 -2.137846 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 1.633570
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 2.222507 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 1.690625 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520

pandas.DataFrame.mask

mask replaces values where the condition is True with NaN, so each column is handled independently. Here, any value more than two standard deviations from its column's mean is flagged:

df.mask((df - df.mean()).abs() > 2 * df.std())

A B C D E F G H
0 NaN -1.268466 NaN NaN -0.349286 -0.026955 0.316236 0.348782
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
2 0.746839 -1.504157 0.611362 0.400219 -0.959443 NaN -0.346508 -1.471558
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
7 -0.728821 NaN 1.402702 -0.017209 -0.071309 -0.533061 1.273899 0.348510
8 -0.920391 0.348579 -0.835074 -0.225377 0.206295 -0.582825 -1.511850 NaN
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
13 NaN 0.668158 0.463331 -0.302869 0.226355 -0.966131 1.015160 -0.329008
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
15 0.058956 0.242391 -0.660237 -0.081101 NaN 0.296406 -0.938197 0.225710
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520

+ dropna

If you only want rows for which no outliers exist in any column, you can follow up the above with dropna:

df.mask((df - df.mean()).abs() > 2 * df.std()).dropna()



A B C D E F G H
1 0.715364 0.770763 -0.608208 0.352390 -0.352521 -0.415869 -0.911575 -0.142538
3 1.063243 1.062997 0.591860 0.296212 -0.774732 0.831452 1.486976 0.256220
4 -0.899906 0.375085 -0.519501 0.050101 0.949959 -1.033773 0.948247 0.733776
5 1.236118 0.155475 -1.341267 0.162864 1.258253 0.778040 1.341599 -1.636039
6 -0.195368 0.131820 2.069013 0.048729 -1.500564 0.907342 0.029326 0.066119
9 0.403321 0.992765 0.025249 -1.664999 -1.558044 -0.361630 -1.784971 -0.318569
10 -0.326400 -0.688203 0.506420 -0.386706 -0.368351 -0.293383 -2.086973 -0.807873
11 0.068855 -0.525141 0.745524 0.911930 -0.277785 -0.866313 1.155518 1.421480
12 1.416653 -0.120607 1.367540 -0.811585 -0.205071 -0.450472 -0.993868 -0.084107
14 -1.070002 0.525867 0.616915 0.399136 -0.233075 -0.482919 -1.018142 -1.673869
16 -0.352254 0.170126 -0.943541 0.627847 -0.948773 0.126131 1.162792 -0.492266
17 -0.444413 -0.028003 -0.286051 0.895515 -0.234507 1.005886 -1.350465 -0.959034
18 0.992524 -1.471428 0.270001 -1.197004 -0.324760 -1.383568 0.838075 -1.125205
19 0.024837 0.238895 0.350742 -0.541868 -0.730284 0.113695 0.068872 -0.032520
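
If you would rather keep every row, a hedged alternative (my addition, not part of the original answer) is to impute the masked outliers instead of dropping them, for example with each column's median:

# Sketch: mask outliers to NaN, then fill column-wise with the medians
masked = df.mask((df - df.mean()).abs() > 2 * df.std())
imputed = masked.fillna(masked.median())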

Detect Outliers across all columns of Pandas Dataframe

It looks like I just had to change my function input and iterate over each column of the dataframe to do the trick:

def find_outliers(col):
    # Flag values outside the 1.5 * IQR whiskers
    q1 = col.quantile(.25)
    q3 = col.quantile(.75)
    IQR = q3 - q1
    ll = q1 - (1.5 * IQR)  # lower limit
    ul = q3 + (1.5 * IQR)  # upper limit
    upper_outliers = col[col > ul].index.tolist()
    lower_outliers = col[col < ll].index.tolist()
    bad_indices = list(set(upper_outliers + lower_outliers))
    return bad_indices

import numpy as np

bad_indexes = []
for col in df.columns:
    if df[col].dtype in ["int64", "float64"]:
        bad_indexes.append(find_outliers(df[col]))

bad_indexes = set(np.concatenate(bad_indexes).flat)
print(len(bad_indexes))
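
As a follow-up (my addition, not in the original answer), the collected indices can then be dropped in a single call:

# Sketch: remove every row that was flagged in any numeric column
df_clean = df.drop(index=list(bad_indexes))
print(df_clean.shape)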

Eliminate outliers in a dataframe with different dtypes - Pandas

Try using select_dtypes to get all columns from df of a particular type.

To select all numeric types, use np.number or 'number'

import numpy as np
from scipy import stats

new_df = df[
    (np.abs(stats.zscore(df.select_dtypes(include=np.number))) < 3).all(axis=1)
]
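
For reference, the same filter can be written with pandas alone; this is my sketch, not from the original answer (ddof=0 matches stats.zscore's default):

# Equivalent pandas-only formulation, no scipy needed
num = df.select_dtypes(include="number")
z = (num - num.mean()) / num.std(ddof=0)
new_df = df[(z.abs() < 3).all(axis=1)]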

How to identify and highlight outliers in each row of a pandas dataframe

This works for me

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)

mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
df["Count"] = mask.sum(axis=1)
mask["Count"] = False
style_df = mask.applymap(lambda x: "background-color: red" if x else "")

sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(
        writer, sheet_name=sheet_name, index=False
    )

Here the mask is the boolean condition that is True wherever the z-score exceeds the limit. Based on this boolean mask I create a string dataframe style_df with the value "background-color: red" on the deviating cells. The last statement imposes the values of style_df on the style of the df dataframe.
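
As an aside (my addition), the same styling can be previewed inline in a Jupyter notebook before exporting:

# Render the styled frame directly instead of writing to Excel
styled = df.style.apply(lambda x: style_df, axis=None)
styled  # Jupyter displays the table with the red cells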

The resulting Excel file shows the deviating cells highlighted in red.

Remove Outliers in Pandas DataFrame using Percentiles

The initial dataset.

print(df.head())

Col0 Col1 Col2 Col3 Col4 User_id
0 49 31 93 53 39 44
1 69 13 84 58 24 47
2 41 71 2 43 58 64
3 35 56 69 55 36 67
4 64 24 12 18 99 67

First, removing the User_id column.

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)

Col0 Col1 Col2 Col3 Col4
0.05 2.00 3.00 6.9 3.95 4.00
0.95 95.05 89.05 93.0 94.00 97.05

Next, filtering values based on the computed percentiles, using a column-wise apply. Values outside the band are dropped from each column's Series, and pandas realigns the result on the original index, leaving NaN where a value was filtered out:

filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Last, rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)
print(filt_df.head())

User_id Col0 Col1 Col2 Col3 Col4
1 47 69 13 84 58 24
3 67 35 56 69 55 36
5 9 95 79 44 45 69
6 83 69 41 66 87 6
9 87 50 54 39 53 40

Checking the result

For reference, this is what filt_df looked like before the dropna step; NaN marks the values that were filtered out.

print(filt_df.head())

User_id Col0 Col1 Col2 Col3 Col4
0 44 49 31 NaN 53 39
1 47 69 13 84 58 24
2 64 41 71 NaN 43 58
3 67 35 56 69 55 36
4 67 64 24 12 18 NaN

print(filt_df.describe())

User_id Col0 Col1 Col2 Col3 Col4
count 100.000000 89.000000 88.000000 88.000000 89.000000 89.000000
mean 48.230000 49.573034 45.659091 52.727273 47.460674 57.157303
std 28.372292 25.672274 23.537149 26.509477 25.823728 26.231876
min 0.000000 3.000000 5.000000 7.000000 4.000000 5.000000
25% 23.000000 29.000000 29.000000 29.500000 24.000000 36.000000
50% 47.000000 50.000000 40.500000 52.500000 49.000000 59.000000
75% 74.250000 69.000000 67.000000 75.000000 70.000000 79.000000
max 99.000000 95.000000 89.000000 92.000000 91.000000 97.000000

How to generate the test dataset

import numpy as np
import pandas as pd

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = pd.DataFrame.from_dict(d)

Automating removing outliers from a pandas dataframe using IQR as the parameter and putting the variables in a list

Based on comments on the original post, I suggest you do the following and revamp your solution.

I believe this answer provides a quick solution to your problem, so remember to search on SO before posting. This will remove all rows where at least one of the selected column values is an outlier.

cols = ['pdays', 'campaign', 'previous']  # The columns you want to search for outliers in

# Calculate quantiles and IQR
Q1 = dummy_df[cols].quantile(0.25)  # Same as np.percentile but maps (0,1) and not (0,100)
Q3 = dummy_df[cols].quantile(0.75)
IQR = Q3 - Q1

# Boolean Series that is True for rows with no outlier in any of the chosen columns
condition = ~((dummy_df[cols] < (Q1 - 1.5 * IQR)) | (dummy_df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)

# Filter our dataframe based on the condition
filtered_df = dummy_df[condition]
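
A quick sanity check (my addition) is to compare row counts and see how many rows the filter removed:

# Sketch: report how many rows were dropped as outliers
removed = len(dummy_df) - len(filtered_df)
print(f"Removed {removed} of {len(dummy_df)} rows")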

