Python: sort function breaks in the presence of nan
The previous answers are useful, but perhaps not clear regarding the root of the problem.
In any language, sort applies a given ordering, defined by a comparison function or in some other way, over the domain of the input values. For example, less-than, a.k.a. operator <, can be used throughout if and only if less-than defines a suitable ordering over the input values.
But this is specifically NOT true for floating point values and less-than:
"NaN is unordered: it is not equal to, greater than, or less than anything, including itself." (Clear prose from the GNU C manual, but it applies to all modern IEEE 754-based floating point.)
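A quick way to see the quoted rule in Python (a minimal sketch; the variable names are only illustrative):

```python
import math

nan = float("nan")

# Every ordering comparison involving NaN is False, and so is equality.
comparisons = [nan < 1.0, nan > 1.0, nan == 1.0, nan == nan, nan < nan]
print(comparisons)  # [False, False, False, False, False]

# Only "not equal" is True, which is why math.isnan() is the reliable test.
not_equal_to_itself = nan != nan
print(not_equal_to_itself)  # True
print(math.isnan(nan))      # True
```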
So the possible solutions are:
- remove the NaNs first, making the input domain well defined via < (or the other sorting function being used)
- define a custom comparison function (a.k.a. predicate) that does define an ordering for NaN, such as less than any number, or greater than any number.
Either approach can be used, in any language.
Practically, in Python, I would prefer to remove the NaNs if either you don't need the fastest possible performance or dropping NaNs is the desired behavior in context.
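As a minimal sketch of the "remove the NaNs first" option, assuming the input is a flat list of floats (the sample values are made up for illustration):

```python
import math

values = [3.0, float("nan"), 1.0, 2.0, float("nan"), 0.5]

# Drop NaNs so that < is a valid ordering over what remains.
clean = [v for v in values if not math.isnan(v)]
result = sorted(clean)
print(result)  # [0.5, 1.0, 2.0, 3.0]
```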
Otherwise you could use a suitable predicate function via the cmp argument in older Python versions (Python 2), or by wrapping the same function with functools.cmp_to_key() in Python 3. The latter is naturally a bit more awkward than removing the NaNs first, and the predicate needs care to avoid degrading performance.
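One possible shape for such a predicate is sketched below. The function name nan_last_cmp and the "NaN sorts last" policy are assumptions chosen for illustration, not the only reasonable choice:

```python
import math
from functools import cmp_to_key

def nan_last_cmp(a, b):
    """Hypothetical comparison function: NaN sorts after every number."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # NaN vs NaN is a tie (0); otherwise the NaN is the larger one.
        return a_nan - b_nan
    return (a > b) - (a < b)

values = [3.0, float("nan"), 1.0, 2.0, float("nan"), 0.5]
by_cmp = sorted(values, key=cmp_to_key(nan_last_cmp))

# A plain key function gives the same order with less call overhead:
# the isnan flag decides NaN vs number before the values are compared.
by_key = sorted(values, key=lambda v: (math.isnan(v), v))

print(by_cmp)  # numbers in ascending order, then the NaNs
print(by_key)
```

The key-based variant is usually the cheaper of the two, because the key is computed once per element instead of a Python function being called for every comparison.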
sort function breaks in the presence of NaN, except when it doesn't
Tuple comparison assumes that the <, ==, and > operations on the tuples' elements are a valid weak ordering, and it compares elements with PyObject_RichCompareBool, which assumes that x is y implies x == y. When you use the same NaN object for all tuples, PyObject_RichCompareBool thinks the NaNs are equal.
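The difference can be observed with two small lists of tuples (a sketch based on CPython's behavior; the sample tuples are invented for illustration):

```python
nan = float("nan")

# Same NaN object in both tuples: the identity shortcut treats it as equal,
# so the comparison moves on to the second element.
same = sorted([(nan, 2), (nan, 1)])
print(same)  # [(nan, 1), (nan, 2)]

# Distinct NaN objects: the first elements differ but are unordered, so
# neither tuple is less than the other and the input order is kept.
different = sorted([(float("nan"), 2), (float("nan"), 1)])
print(different)  # [(nan, 2), (nan, 1)]
```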
Null and Sorting
If the Null object implements comparison behavior, other methods (such as indexing) will get more complicated. Consider:
target = table.sql('select * where sales < 1000.00')
If Null values compare < all other objects, then target could have rows where there were no sales (which is not the goal).
So, I think practicality and purity are both coming down on the same side on this one: Null comparisons yield unknown. Users will have to decide what to do with Null values if they get them.
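SQL itself follows the "comparisons yield unknown" rule, which can be seen with the standard-library sqlite3 module. This is only a stand-in for the table.sql call above; the table name, columns, and rows are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 500.0), ("b", None), ("c", 1500.0)],
)

# NULL < 1000.00 evaluates to NULL (unknown), so row "b" is filtered out.
rows = conn.execute("SELECT name FROM t WHERE sales < 1000.00").fetchall()
print(rows)  # [('a',)]
conn.close()
```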
Why doesn't NaN raise any errors in python?
Membership testing
Two different instances of float('nan') are not equal to each other. They are "Not a Number" so it makes sense that they shouldn't also have to be equal. They are different instances of objects which are not numbers:
print(float('nan') == float('nan')) # False
As documented in the Python language reference:
For container types such as list, tuple, set, frozenset, dict, or
collections.deque, the expression x in y is equivalent to any(x is e
or x == e for e in y).
Identity is checked first! That's why you see that behavior in your question, and why NaN in a returns True while float('nan') in a doesn't.
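A small sketch of both cases (the list a and the variable names here are illustrative, not taken from the question):

```python
nan = float("nan")
a = [1.0, nan, 2.0]

# Found through the identity check (x is e), even though nan == nan is False.
same_object = nan in a

# A fresh NaN is a different object and == is False, so it is not found.
new_object = float("nan") in a

print(same_object, new_object)  # True False
```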
Sorting in Python
Python uses the Timsort algorithm for its sorted() function. I'm not going to go into that; I just want to demonstrate a simple example.
This is my class A. It's going to be our float('nan') object. It acts like float('nan') in that it returns False for all comparison operations. Class B is identical and only exists so that the output shows which object is being called:
class A:
    def __init__(self, n):
        self.n = n

    def __lt__(self, other):
        print(self, 'lt is calling', other)
        return False

    def __gt__(self, other):
        print(self, 'gt is calling', other)
        return False

    def __repr__(self):
        return f'A({self.n})'


class B:
    def __init__(self, n):
        self.n = n

    def __lt__(self, other):
        print(self, 'lt is calling', other)
        return False

    def __gt__(self, other):
        print(self, 'gt is calling', other)
        return False

    def __repr__(self):
        return f'B({self.n})'
When we use the sorted() function (or the .sort() method of a list) without the reverse=True argument, we're asking for the iterable to be sorted in ascending order. To do this, Python calls the __lt__ method successively, starting from the second object in the list, to see whether each object is less than the one before it:
lst = [A(1), B(2), A(3), B(4)]
print(sorted(lst))
output:
B(2) lt is calling A(1)
A(3) lt is calling B(2)
B(4) lt is calling A(3)
[A(1), B(2), A(3), B(4)]
Now, switching back to your example:
lst = [3, A(1), 4, 2, A(1), 1]
print(sorted(lst))
output:
A(1) lt is calling 3
A(1) gt is calling 4
A(1) gt is calling 2
A(1) lt is calling 2
A(1) lt is calling 4
A(1) gt is calling 1
[3, A(1), 1, 2, 4, A(1)]
- A(1).__lt__(3) will return False. This means A(1) is not less than 3, or in other words 3 is in the correct position relative to A(1).
- Then int.__lt__(4, A(1)) gets called, and because it returns the NotImplemented object, Python checks whether A(1) has implemented __gt__. It has, so A(1).__gt__(4) will return False again, and this means the A(1) object is in the correct place relative to 4.
- (Etc.)
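The fallback from int.__lt__ to the reflected __gt__ described in the steps above can be checked directly. This is a minimal sketch; the Unordered class is a hypothetical stand-in for A that records calls instead of printing them:

```python
class Unordered:
    """Hypothetical NaN-like object: every comparison returns False."""

    def __init__(self):
        self.calls = []

    def __lt__(self, other):
        self.calls.append("lt")
        return False

    def __gt__(self, other):
        self.calls.append("gt")
        return False


u = Unordered()

# int does not know how to compare itself with an Unordered instance.
direct = int.__lt__(4, u)
print(direct)  # NotImplemented

# So 4 < u falls back to the reflected operation, u.__gt__(4).
result = 4 < u
print(result, u.calls)  # False ['gt']
```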
This is why the result of sorted() seems weird, but it's predictable. In both cases, that is, when the int class returns NotImplemented and when __lt__ gets called on A(1), the A(1) object returns False.
To trace the remaining steps you would need to work through the Timsort algorithm with these points in mind; I have not read it carefully enough to list them here.
How to sort a list of lists by a specific index of the inner list?
This is a job for itemgetter:
>>> from operator import itemgetter
>>> L=[[0, 1, 'f'], [4, 2, 't'], [9, 4, 'afsd']]
>>> sorted(L, key=itemgetter(2))
[[9, 4, 'afsd'], [0, 1, 'f'], [4, 2, 't']]
It is also possible to use a lambda function here; however, the lambda function is slower in this simple case.
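For comparison, the two spellings side by side (a sketch; the variable names by_itemgetter and by_lambda are only illustrative):

```python
from operator import itemgetter

L = [[0, 1, 'f'], [4, 2, 't'], [9, 4, 'afsd']]

# Both key functions pick index 2 of each inner list.
by_itemgetter = sorted(L, key=itemgetter(2))
by_lambda = sorted(L, key=lambda row: row[2])

print(by_itemgetter == by_lambda)  # True
```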