pandas: qcut error: ValueError: Bin edges must be unique:
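The problem is that pandas.qcut chooses the bin edges so that each bin/quantile holds the same number of records, but the same value cannot fall in multiple bins/quantiles. When one value is frequent enough, two or more quantile edges land on it and the edges are no longer unique.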
Here is a list of solutions.
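To make the failure concrete, here is a minimal sketch that reproduces it. The data is made up for illustration (half of the values are 0), and the second call uses the duplicates='drop' option of pd.qcut, which merges the repeated edges at the cost of returning fewer bins than requested.

```python
import pandas as pd

# Illustrative data (an assumption, not from the original posts):
# half of the values are 0, so the 0% and 25% quantiles are both 0.
s = pd.Series([0] * 50 + list(range(1, 51)))

try:
    pd.qcut(s, 4)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # starts with "Bin edges must be unique"

# duplicates='drop' removes the repeated edge instead of raising,
# so fewer than 4 bins come back.
binned = pd.qcut(s, 4, duplicates='drop')
n_bins = len(binned.cat.categories)
print(n_bins)
```

Dropping duplicate edges is the quickest fix, but the resulting bins are neither equal in size nor equal in number to what was asked for, which is what the other solutions try to address.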
qcut with non-unique bin edges produces wrong number of quantiles
What you might be looking for is a way to construct the "quantiles" yourself. You can do this by sorting and then using integer division to define the group.
I'll create data with an excessive mass at 0, such that pd.qcut will complain about duplicates.
import pandas as pd
import numpy as np
np.random.seed(410012)
s = pd.Series(np.random.normal(0, 4, 1000))
s = pd.concat([s, pd.Series([0]*500)])
s = s.to_frame('vals')
N = 10
s = s.sort_values('vals')
s['q'] = np.arange(len(s)) // (len(s)/N)
With q we now get 10 bins regardless.
s.groupby('q').describe()
# vals
# count mean std min 25% 50% 75% max
#q
#0.0 150.0 -6.5934 1.9208 -12.6041 -7.7703 -6.1546 -5.1073 -4.3421
#1.0 150.0 -3.1922 0.5621 -4.3287 -3.6605 -3.1293 -2.7377 -2.2718
#2.0 150.0 -1.4932 0.4203 -2.2561 -1.8196 -1.5262 -1.1364 -0.7451
#3.0 150.0 -0.1831 0.2400 -0.7425 -0.3371 -0.0110 0.0000 0.0000
#4.0 150.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#5.0 150.0 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
#6.0 150.0 0.0238 0.0678 0.0000 0.0000 0.0000 0.0000 0.2856
#7.0 150.0 1.1555 0.4833 0.3353 0.7615 1.1837 1.5819 1.9513
#8.0 150.0 2.9430 0.6016 1.9660 2.4385 2.9665 3.4764 4.0277
#9.0 150.0 6.1692 1.6616 4.0336 4.8805 5.8176 6.9019 12.3437
For comparison, here is pd.qcut with duplicates='drop'. The bins that don't overlap the problematic value are identical, but the bins where 0 is an edge are different, because the duplicate edges have been collapsed and only 8 bins remain:
s.groupby(pd.qcut(s['vals'], 10, duplicates='drop'))['vals'].describe()
# count mean std min 25% 50% 75% max
#vals
#(-12.604999999999999, -4.33] 150.0 -6.5934 1.9208 -12.6041 -7.7703 -6.1546 -5.1073 -4.3421
#(-4.33, -2.259] 150.0 -3.1922 0.5621 -4.3287 -3.6605 -3.1293 -2.7377 -2.2718
#(-2.259, -0.743] 150.0 -1.4932 0.4203 -2.2561 -1.8196 -1.5262 -1.1364 -0.7451
#(-0.743, 0.0] 576.0 -0.0477 0.1463 -0.7425 0.0000 0.0000 0.0000 0.0000
#(0.0, 0.301] 24.0 0.1490 0.1016 0.0024 0.0457 0.1497 0.2485 0.2856
#(0.301, 1.954] 150.0 1.1555 0.4833 0.3353 0.7615 1.1837 1.5819 1.9513
#(1.954, 4.028] 150.0 2.9430 0.6016 1.9660 2.4385 2.9665 3.4764 4.0277
#(4.028, 12.344] 150.0 6.1692 1.6616 4.0336 4.8805 5.8176 6.9019 12.3437
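A related way to get the same equal-sized groups, sketched here as an alternative rather than as the method used above, is to break ties before binning: rank(method='first') gives every row a unique rank, so pd.qcut on the ranks never sees duplicate edges. The seed and variable names below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data in the same spirit as above: normal noise plus a spike at 0.
rng = np.random.default_rng(0)
vals = pd.Series(np.concatenate([rng.normal(0, 4, 1000), np.zeros(500)]))

# method='first' assigns unique ranks 1..len(vals); ties are ordered
# by their position in the series.
ranks = vals.rank(method='first')

# Binning the ranks instead of the raw values cannot produce duplicate edges.
q = pd.qcut(ranks, 10, labels=False)
counts = q.value_counts().sort_index()
print(counts)
```

As with the integer-division approach, identical values (the zeros here) end up spread over several neighbouring bins, which is the price of forcing equal bin sizes.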
Why does pandas qcut return ValueError: Bin edges must be unique?
I ran this in Jupyter and placed exampledata.txt in the same directory as the notebook.
Please note that the first line:
df = pd.DataFrame(datas, columns=['userid', 'recency', 'frequency', 'monetary'])
loads the column 'userid' even though it isn't defined in the data file. I removed this column name.
Solution
import pandas as pd
def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).argmax()
    return series.rank(pct=1).apply(f)
datas = pd.read_csv('./exampledata.txt', delimiter=';')
df = pd.DataFrame(datas, columns=['recency', 'frequency', 'monetary'])
df['recency'] = df['recency'].astype(float)
df['frequency'] = df['frequency'].astype(float)
df['monetary'] = df['monetary'].astype(float)
df['recency'] = pct_rank_qcut(df.recency, 5)
df['frequency'] = pct_rank_qcut(df.frequency, 5)
df['monetary'] = pct_rank_qcut(df.monetary, 5)
Explanation
The problem you were seeing was a result of pd.qcut assuming 5 bins of equal size. In the data you provided, more than 28% of the values in 'frequency' are 1, so several quantile edges fall on the same value. This broke qcut.
I provided a new function, pct_rank_qcut, that addresses this and pushes all 1's into the first bin.
edges = pd.Series([float(i) / n for i in range(n + 1)])
This line defines a series of percentile edges based on the desired number of bins, n. In the case of n = 5 the edges will be [0.0, 0.2, 0.4, 0.6, 0.8, 1.0].
f = lambda x: (edges >= x).argmax()
This line defines a helper function to be applied to another series in the next line. edges >= x returns a boolean series equal in length to edges, where each element is True or False depending on whether x is less than or equal to that edge. In the case of x = 0.14 the resulting (edges >= x) will be [False, True, True, True, True, True]. By taking the argmax() I've identified the first index where the series is True, in this case 1.
return series.rank(pct=1).apply(f)
This line takes the input series and turns it into a percentile ranking. I can compare these rankings to the edges I've created, and that's why I use apply(f). What's returned should be a series of bin numbers numbered 1 to n. This series of bin numbers is the same thing you were trying to get with:
pd.qcut(df['recency'].values, 5).codes + 1
This has consequences in that the bins are no longer equal and that bin 1 borrows completely from bin 2. But some choice had to be made. If you don't like this choice, use the concept to build your own ranking.
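The effect on bin sizes is easiest to see on a small example. The sketch below runs a variant of pct_rank_qcut (using to_numpy().argmax() so the result is positional regardless of pandas version) on made-up data in which 30% of the values are 1; the data and variable names are assumptions for illustration only.

```python
import pandas as pd

def pct_rank_qcut(series, n):
    # Percentile edges 0, 1/n, ..., 1
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    # Position of the first edge that is >= the percentile rank x
    f = lambda x: (edges >= x).to_numpy().argmax()
    # Tied values share one (average) percentile rank, hence one bin
    return series.rank(pct=True).apply(f)

# Illustrative data: six 1's (30% of the rows) followed by 14 distinct values.
frequency = pd.Series([1] * 6 + list(range(2, 16)))

bins = pct_rank_qcut(frequency, 5)
bin_counts = bins.value_counts().sort_index()
print(bin_counts)
```

All six 1's share the percentile rank 0.175 and therefore land in bin 1, which ends up larger than the others, while bin 2 is left with fewer rows than an equal split would give it.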
Demonstration
print(df.head())
recency frequency monetary
0 3 5 5
1 2 5 5
2 2 5 5
3 1 5 5
4 2 5 5
Update
pd.Series.argmax() is now deprecated. Simply switch to pd.Series.values.argmax() to update!
def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).values.argmax()
    return series.rank(pct=1).apply(f)
pd.qcut - ValueError: Bin edges must be unique
Using the solution in the post https://stackoverflow.com/a/36883735/2336654
def pct_rank_qcut(series, n):
    edges = pd.Series([float(i) / n for i in range(n + 1)])
    f = lambda x: (edges >= x).argmax()
    return series.rank(pct=1).apply(f)
q = pct_rank_qcut(df.loss_percent, 10)
I have been trying to qcut an array of values into 4 bins, but I am getting the error below. How can I solve this? I am a beginner in Python.
qcut is not friendly with duplicated data and will throw an error when it sees a duplicate at a splitting point. Imagine you do a qcut on [1]*100: what is the 50th percentile?
You can try rank(pct=True) to calculate the actual percentile for each value, then cut:
wkx_old['Rankings'] = pd.cut(wkx_old['Sales point'].rank(pct=True),
                             bins=4, labels=names)
Output:
0 C
1 C
2 C
3 B
4 B
..
119 A
120 C
121 C
122 A
123 D
Length: 124, dtype: category
Categories (4, object): ['D' < 'C' < 'B' < 'A']
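A self-contained sketch of the rank-then-cut approach follows. The sales figures are invented, and names = ['D', 'C', 'B', 'A'] is an assumption inferred from the category order in the output above; only the 'Sales point' column name and the pd.cut call come from the answer.

```python
import pandas as pd

# Illustrative data: half of the rows share the lowest value,
# which is enough to make pd.qcut(..., 4) raise.
wkx_old = pd.DataFrame({'Sales point': [5, 5, 5, 5, 5, 5, 7, 8, 9, 10, 11, 12]})
names = ['D', 'C', 'B', 'A']  # assumed label order, lowest to highest

try:
    pd.qcut(wkx_old['Sales point'], 4)
    qcut_failed = False
except ValueError:
    qcut_failed = True

# Tied values get the same percentile rank, so they always share a label.
pct = wkx_old['Sales point'].rank(pct=True)
wkx_old['Rankings'] = pd.cut(pct, bins=4, labels=names)
label_counts = wkx_old['Rankings'].value_counts().sort_index()
print(label_counts)
```

Note that bins=4 divides the observed range of percentile ranks into four equal-width intervals, so the groups are generally not equal in size. If fixed quartile boundaries are wanted instead, explicit edges such as bins=[0, 0.25, 0.5, 0.75, 1] can be passed to pd.cut.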