Remove Outliers in Pandas DataFrame using Percentiles
The initial dataset.
print(df.head())
Col0 Col1 Col2 Col3 Col4 User_id
0 49 31 93 53 39 44
1 69 13 84 58 24 47
2 41 71 2 43 58 64
3 35 56 69 55 36 67
4 64 24 12 18 99 67
First, removing the User_id column.
filt_df = df.loc[:, df.columns != 'User_id']
Then, computing percentiles.
low = .05
high = .95
quant_df = filt_df.quantile([low, high])
print(quant_df)
Col0 Col1 Col2 Col3 Col4
0.05 2.00 3.00 6.9 3.95 4.00
0.95 95.05 89.05 93.0 94.00 97.05
Next, filtering values based on the computed percentiles. To do that, I use an apply by column, and that's it!
filt_df = filt_df.apply(lambda x: x[(x > quant_df.loc[low, x.name]) &
                                    (x < quant_df.loc[high, x.name])], axis=0)
Bringing the User_id column back.
filt_df = pd.concat([df.loc[:,'User_id'], filt_df], axis=1)
Last, rows with NaN values can be dropped simply like this.
filt_df.dropna(inplace=True)
print(filt_df.head())
User_id Col0 Col1 Col2 Col3 Col4
1 47 69 13 84 58 24
3 67 35 56 69 55 36
5 9 95 79 44 45 69
6 83 69 41 66 87 6
9 87 50 54 39 53 40
For comparison, checking the intermediate result before the NaN rows were dropped.
print(filt_df.head())
User_id Col0 Col1 Col2 Col3 Col4
0 44 49 31 NaN 53 39
1 47 69 13 84 58 24
2 64 41 71 NaN 43 58
3 67 35 56 69 55 36
4 67 64 24 12 18 NaN
print(filt_df.describe())
User_id Col0 Col1 Col2 Col3 Col4
count 100.000000 89.000000 88.000000 88.000000 89.000000 89.000000
mean 48.230000 49.573034 45.659091 52.727273 47.460674 57.157303
std 28.372292 25.672274 23.537149 26.509477 25.823728 26.231876
min 0.000000 3.000000 5.000000 7.000000 4.000000 5.000000
25% 23.000000 29.000000 29.000000 29.500000 24.000000 36.000000
50% 47.000000 50.000000 40.500000 52.500000 49.000000 59.000000
75% 74.250000 69.000000 67.000000 75.000000 70.000000 79.000000
max 99.000000 95.000000 89.000000 92.000000 91.000000 97.000000
How to generate the test dataset
import numpy as np
import pandas as pd

np.random.seed(0)
nb_sample = 100
num_sample = (0, 100)
d = dict()
d['User_id'] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
for i in range(5):
    d['Col' + str(i)] = np.random.randint(num_sample[0], num_sample[1], nb_sample)
df = pd.DataFrame.from_dict(d)
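Putting the steps above together, the whole walkthrough can be sketched as one self-contained script (same column names, seed, and .05/.95 thresholds as in the example):

```python
import numpy as np
import pandas as pd

# Generate the test dataset, as above
np.random.seed(0)
d = {'User_id': np.random.randint(0, 100, 100)}
for i in range(5):
    d['Col' + str(i)] = np.random.randint(0, 100, 100)
df = pd.DataFrame.from_dict(d)

# Percentile-based filtering on every column except User_id
low, high = 0.05, 0.95
filt_df = df.loc[:, df.columns != 'User_id']
quant_df = filt_df.quantile([low, high])

# apply() reindexes each shortened column back to the full index,
# so out-of-range values become NaN
filt_df = filt_df.apply(
    lambda x: x[(x > quant_df.loc[low, x.name]) & (x < quant_df.loc[high, x.name])],
    axis=0,
)

# Reattach User_id and drop rows where any column was an outlier
filt_df = pd.concat([df.loc[:, 'User_id'], filt_df], axis=1)
filt_df.dropna(inplace=True)
```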
Removing outliers in a df containing mixed dtype
I would break the problem into stages.
First, identify the (numeric) columns on which you want to perform the outlier removal.
newdf = df.select_dtypes(include=np.number)
Now perform whatever filtering/outlier removal you want on the rows of newdf. Afterwards, newdf should contain only the rows you wish to retain.
Then keep only the rows of df whose index is in newdf.
df = df[df.index.isin(newdf.index)]
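A minimal sketch of the three stages, using a hypothetical mixed-dtype frame and a simple percentile cut as the numeric filtering step:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': list('abcdefghij'),                       # non-numeric column
    'value': [1, 2, 3, 2, 1, 3, 2, 100, 1, 2],        # 100 is the outlier
})

# Stage 1: numeric columns only
newdf = df.select_dtypes(include=np.number)

# Stage 2: any outlier removal you like on newdf
# (here: keep values below the 95th percentile)
newdf = newdf[newdf['value'] < newdf['value'].quantile(0.95)]

# Stage 3: keep only the rows of df whose index survived
df = df[df.index.isin(newdf.index)]
```

The non-numeric 'name' column is untouched; only the row containing 100 is dropped.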
Automating removing outliers from a pandas dataframe using IQR as the parameter and putting the variables in a list
Based on comments on the original post, I suggest you revamp your solution as follows. This removes all rows where one (or more) of the wanted column values is an outlier.
cols = ['pdays', 'campaign', 'previous'] # The columns you want to search for outliers in
# Calculate quantiles and IQR
Q1 = dummy_df[cols].quantile(0.25) # Same as np.percentile but maps (0,1) and not (0,100)
Q3 = dummy_df[cols].quantile(0.75)
IQR = Q3 - Q1
# Return a boolean array of the rows with (any) non-outlier column values
condition = ~((dummy_df[cols] < (Q1 - 1.5 * IQR)) | (dummy_df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)
# Filter our dataframe based on condition
filtered_df = dummy_df[condition]
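As a runnable sketch (dummy_df and its values here are made-up stand-ins; the column names follow the question):

```python
import pandas as pd

dummy_df = pd.DataFrame({
    'pdays': [1, 2, 3, 2, 1, 2, 3, 2, 1, 999],   # 999 is an outlier
    'campaign': [1, 1, 2, 2, 1, 2, 1, 2, 1, 2],
    'previous': [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
cols = ['pdays', 'campaign', 'previous']

# Quantiles and IQR per column
Q1 = dummy_df[cols].quantile(0.25)
Q3 = dummy_df[cols].quantile(0.75)
IQR = Q3 - Q1

# True for rows where no column falls outside the 1.5*IQR fences
condition = ~((dummy_df[cols] < (Q1 - 1.5 * IQR)) |
              (dummy_df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)
filtered_df = dummy_df[condition]
```

Only the row with pdays = 999 falls outside its column's fences, so nine rows remain.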
How do I remove outliers from a dataframe that contains floating integers in Y-axis and dates in X-axis?
You can remove the outliers by comparing each value to the mean or median (I suggest using the median). Divide the distance between each value and the median by the distance between the maximum value and the median; if that ratio is greater than a threshold (e.g. 0.98; it depends on your data and only you can select it), delete that data point.
For example, if you set your threshold to 1, only the farthest data point would be deleted.
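A sketch of that idea, assuming a hypothetical numeric Series indexed by date (the 0.9 threshold is illustrative):

```python
import pandas as pd

s = pd.Series([10, 11, 12, 11, 10, 12, 11, 200],
              index=pd.date_range('2023-01-01', periods=8))

median = s.median()
# Ratio of each point's distance from the median
# to the maximum distance from the median
ratio = (s - median).abs() / (s.max() - median)
cleaned = s[ratio <= 0.9]
```

Note that the maximum always has a ratio of exactly 1, so any threshold below 1 removes at least that point; this heuristic assumes the extreme values really are outliers.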
Eliminate outliers in a dataframe with different dtypes - Pandas
Try using select_dtypes to get all columns from df of a particular type. To select all numeric types, use np.number or 'number'.
from scipy import stats
import numpy as np

new_df = df[
    (np.abs(stats.zscore(df.select_dtypes(include=np.number))) < 3).all(axis=1)
]
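For example, on a hypothetical frame mixing a string column with a numeric one (scipy is required):

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'label': [f'row{i}' for i in range(100)],   # non-numeric, ignored by the filter
    'x': [1.0] * 99 + [100.0],                  # last value is a clear outlier
})

# Keep rows whose numeric columns all have |z-score| < 3
new_df = df[
    (np.abs(stats.zscore(df.select_dtypes(include=np.number))) < 3).all(axis=1)
]
```

The 'label' column survives untouched; only the row with x = 100.0 is removed.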
Checking a Pandas Dataframe for Outliers
Scatter plots or distribution plots are good for spotting outliers, but in the context of pandas data frames, here's how I would do it.
df.describe()
will give you a good summary of the mean, max, and all percentiles. Look at the max of a column to spot an outlier, if it is far above the 75th percentile.
Then df['Sensor Value'].value_counts() should give you the frequency of the values. The outliers will show up right here as extreme values with low frequency.
Get their indexes and just drop them using df.drop(indexes_list, inplace=True)
EDIT:
You could also check for outliers with mean +/- 3 * standard deviation.
Example code:
outliers = df[(df[col] - df[col].mean()).abs() > 3 * df[col].std()]
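As a quick self-contained check (the 'Sensor Value' column name follows the answer; the data here are made up):

```python
import pandas as pd

df = pd.DataFrame({'Sensor Value': [20.0] * 50 + [21.0] * 49 + [500.0]})
col = 'Sensor Value'

# Rows more than 3 standard deviations from the mean, in either direction
outliers = df[(df[col] - df[col].mean()).abs() > 3 * df[col].std()]
```

Only the 500.0 reading exceeds the 3-sigma band, so it is the single flagged row.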