Detect and Exclude Outliers in a Pandas Dataframe

Remove Outliers in Pandas DataFrame using Percentiles

The initial dataset.

print(df.head())

   Col0  Col1  Col2  Col3  Col4  User_id
0    49    31    93    53    39       44
1    69    13    84    58    24       47
2    41    71     2    43    58       64
3    35    56    69    55    36       67
4    64    24    12    18    99       67

First, removing the User_id column.

filt_df = df.loc[:, df.columns != 'User_id']

Then, computing percentiles.

low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)

       Col0   Col1  Col2   Col3   Col4
0.05   2.00   3.00   6.9   3.95   4.00
0.95  95.05  89.05  93.0  94.00  97.05

Next, filtering values based on the computed percentiles. To do that, I use an apply by column, and that's it!

filt_df = filt_df.apply(
    lambda x: x[(x > quant_df.loc[low, x.name]) & (x < quant_df.loc[high, x.name])],
    axis=0,
)
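
As a side note (a sketch, not part of the original answer), the same masking can be done without apply by comparing the whole frame against the two quantile rows; pandas aligns the comparison by column name.

# Equivalent vectorized form: values outside the bounds become NaN,
# exactly as with the apply-based version above.
mask = filt_df.gt(quant_df.loc[low]) & filt_df.lt(quant_df.loc[high])
filt_df = filt_df[mask]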

Bringing the User_id back.

filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)

Checking the result. Values that fell outside the percentile bounds are now NaN.

print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
0       44    49    31   NaN    53    39
1       47    69    13    84    58    24
2       64    41    71   NaN    43    58
3       67    35    56    69    55    36
4       67    64    24    12    18   NaN

print(filt_df.describe())

          User_id       Col0       Col1       Col2       Col3       Col4
count  100.000000  89.000000  88.000000  88.000000  89.000000  89.000000
mean    48.230000  49.573034  45.659091  52.727273  47.460674  57.157303
std     28.372292  25.672274  23.537149  26.509477  25.823728  26.231876
min      0.000000   3.000000   5.000000   7.000000   4.000000   5.000000
25%     23.000000  29.000000  29.000000  29.500000  24.000000  36.000000
50%     47.000000  50.000000  40.500000  52.500000  49.000000  59.000000
75%     74.250000  69.000000  67.000000  75.000000  70.000000  79.000000
max     99.000000  95.000000  89.000000  92.000000  91.000000  97.000000

Last, the rows with NaN values can be dropped simply like this.

filt_df.dropna(inplace=True)
print(filt_df.head())

   User_id  Col0  Col1  Col2  Col3  Col4
1       47    69    13    84    58    24
3       67    35    56    69    55    36
5        9    95    79    44    45    69
6       83    69    41    66    87     6
9       87    50    54    39    53    40

How to generate the test dataset

import numpy as np
import pandas as pd

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)

d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)

df = pd.DataFrame.from_dict(d)

Removing outliers in a df containing mixed dtype

I would break the problem into stages:

First, identify the (numeric) columns on which you want to do the outlier removal.

newdf = df.select_dtypes(include=np.number)

Now perform whatever filtering/outlier removal you want on the rows of newdf. Afterwards, newdf should contain only rows you wish to retain.

Then keep only the rows of df whose index is in newdf.

df = df[df.index.isin(newdf.index)]
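
Putting the three stages together, a minimal sketch could look like this (the small frame and the 1.5 * IQR rule are placeholders for illustration; use whatever outlier filter fits your data).

import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame for illustration
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'value': [1.0, 2.0, 3.0, 1000.0],   # 1000.0 is the outlier
    'count': [10, 12, 11, 9],
})

# Stage 1: keep only the numeric columns
newdf = df.select_dtypes(include=np.number)

# Stage 2: any outlier removal you like; here a simple 1.5 * IQR rule per column
Q1, Q3 = newdf.quantile(0.25), newdf.quantile(0.75)
IQR = Q3 - Q1
keep = ~((newdf < (Q1 - 1.5 * IQR)) | (newdf > (Q3 + 1.5 * IQR))).any(axis=1)
newdf = newdf[keep]

# Stage 3: keep only the rows of the original df whose index survived
df = df[df.index.isin(newdf.index)]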

Automating removing outliers from a pandas dataframe using IQR as the parameter and putting the variables in a list

Based on comments on the original post, I suggest you do the following and revamp your solution.

I believe this answer provides a quick solution to your problem, so remember to search on SO before posting. This will remove all rows where one (or more) of the wanted column values is an outlier.

cols = ['pdays', 'campaign', 'previous'] # The columns you want to search for outliers in

# Calculate quantiles and IQR
Q1 = dummy_df[cols].quantile(0.25) # Same as np.percentile but maps (0,1) and not (0,100)
Q3 = dummy_df[cols].quantile(0.75)
IQR = Q3 - Q1

# Return a boolean array of the rows with (any) non-outlier column values
condition = ~((dummy_df[cols] < (Q1 - 1.5 * IQR)) | (dummy_df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)

# Filter our dataframe based on condition
filtered_df = dummy_df[condition]
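
If you want to reuse this for different sets of columns, the same logic can be wrapped in a small helper; remove_outliers_iqr below is a name chosen here for illustration, not something from the original post.

import pandas as pd

def remove_outliers_iqr(df, cols, k=1.5):
    # Drop rows where any of `cols` falls outside [Q1 - k*IQR, Q3 + k*IQR]
    Q1 = df[cols].quantile(0.25)
    Q3 = df[cols].quantile(0.75)
    IQR = Q3 - Q1
    condition = ~((df[cols] < (Q1 - k * IQR)) | (df[cols] > (Q3 + k * IQR))).any(axis=1)
    return df[condition]

# Usage, assuming dummy_df is your dataframe:
# filtered_df = remove_outliers_iqr(dummy_df, ['pdays', 'campaign', 'previous'])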

How do I remove outliers from a dataframe that contains floating integers in Y-axis and dates in X-axis?

You can remove outliers by comparing each value to the mean or median (I suggest using the median). Divide the distance between each value and the median by the distance between the maximum value and the median; if that ratio is greater than a threshold (e.g. 0.98; it depends on your data, and only you can select it), delete that row.
For example, if you set your threshold to 1, only the data farthest from the median will be deleted.
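
A rough sketch of that idea, assuming a numeric column named 'y' (both the column name and the 0.98 threshold are placeholders to adapt to your data):

def drop_far_from_median(df, col, threshold=0.98):
    # Drop rows whose distance from the median, relative to the distance
    # between the maximum and the median, exceeds the threshold
    median = df[col].median()
    max_dist = abs(df[col].max() - median)
    ratio = (df[col] - median).abs() / max_dist
    return df[ratio <= threshold]

# df = drop_far_from_median(df, 'y', threshold=0.98)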

Eliminate outliers in a dataframe with different dtypes - Pandas

Try using select_dtypes to get all columns from df of a particular type.

To select all numeric types, use np.number or 'number'

import numpy as np
from scipy import stats

new_df = df[
    (np.abs(stats.zscore(df.select_dtypes(include=np.number))) < 3).all(axis=1)
]

Checking a Pandas Dataframe for Outliers

Scatter plots or distribution plots are good for spotting outliers, but in the context of a pandas DataFrame, here's how I would do it.

df.describe()

This will give you a good summary of the mean, max, and all percentiles. Look at the max of a column to spot an outlier, i.e. a value far greater than the 75th percentile.

Then df['Sensor Value'].value_counts() should give you the frequency of the values. Outliers will show up here as large values with low frequency.

Get their indexes and just drop them using df.drop(indexes_list, inplace=True)
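
For example (a sketch; the cut-off of 900 is an arbitrary value you would pick after inspecting the output):

# Least frequent values; extreme readings tend to show up here
print(df['Sensor Value'].value_counts().tail(10))

# Suppose inspection shows the suspicious readings are above 900
indexes_list = df[df['Sensor Value'] > 900].index
df.drop(indexes_list, inplace=True)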

EDIT:
You could also flag values outside mean +/- 3 standard deviations as outliers.

Example code:

outliers = df[df[col] > df[col].mean() + 3 * df[col].std()]
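
That one-liner only flags the upper tail; a sketch that checks both sides and drops the flagged rows (col is assumed to hold the name of the numeric column being checked):

# Flag values more than 3 standard deviations from the mean, on either side
mean, std = df[col].mean(), df[col].std()
is_outlier = (df[col] < mean - 3 * std) | (df[col] > mean + 3 * std)

# Keep only the non-outlier rows
df = df[~is_outlier]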

