Python: Sort Function Breaks in the Presence of NaN

Python: sort function breaks in the presence of NaN

The previous answers are useful, but perhaps not clear regarding the root of the problem.

In any language, a sort applies a given ordering, defined by a comparison function or in some other way, over the domain of the input values. For example, less-than (operator <) can be used throughout if and only if it defines a suitable ordering over every input value.

But this is specifically NOT true for floating point values and less-than:
"NaN is unordered: it is not equal to, greater than, or less than anything, including itself." (Clear prose from the GNU C manual, but it applies to all modern IEEE 754 based floating point.)

So the possible solutions are:

1. Remove the NaNs first, making the input domain well defined via < (or whichever other sorting function is being used).

2. Define a custom comparison function (a.k.a. predicate) that does define an ordering for NaN, such as less than any number, or greater than any number.

Either approach can be used, in any language.

In practice, in Python, I would prefer to remove the NaNs if you either don't care much about getting the fastest performance or if removing NaNs is the desired behavior in context.
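As a minimal sketch of the "remove the NaNs first" approach, assuming the input is a flat list of floats (the sample values are made up for illustration):

```python
import math

values = [3.0, float('nan'), 4.0, 2.0, float('nan'), 1.0]

# Drop the NaNs first; the remaining floats are properly ordered by <.
cleaned = sorted(v for v in values if not math.isnan(v))
print(cleaned)  # [1.0, 2.0, 3.0, 4.0]
```

Note that math.isnan() is used for the test, because v == float('nan') is never True.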

Otherwise you could use a suitable predicate function via the cmp argument in older Python versions (Python 2), or, in Python 3, by wrapping it with functools.cmp_to_key() and passing the result as the key argument. The latter is a bit more awkward, naturally, than removing the NaNs first. And the predicate function must be written with care to avoid worse performance.
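A rough sketch of the predicate approach in Python 3 could look like the following. The name nan_last_cmp and the choice to place NaN after every number are assumptions made for illustration; placing NaN first would work just as well. A plain key function is shown next to it as an alternative that is usually cheaper, since it avoids calling a Python-level comparison for every pair.

```python
import math
from functools import cmp_to_key

def nan_last_cmp(a, b):
    """Three-way comparison that treats NaN as greater than any number."""
    a_nan, b_nan = math.isnan(a), math.isnan(b)
    if a_nan or b_nan:
        # NaN vs NaN -> 0 (equal); NaN vs number -> the NaN is larger.
        return a_nan - b_nan
    return (a > b) - (a < b)

values = [3.0, float('nan'), 4.0, 2.0, float('nan'), 1.0]
by_cmp = sorted(values, key=cmp_to_key(nan_last_cmp))

# Alternative: a key that sorts on the isnan flag first. The 0.0 is a
# placeholder so that no NaN is ever compared with anything.
by_key = sorted(values, key=lambda v: (math.isnan(v), 0.0 if math.isnan(v) else v))

print(by_cmp)  # [1.0, 2.0, 3.0, 4.0, nan, nan]
print(by_key)  # [1.0, 2.0, 3.0, 4.0, nan, nan]
```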

sort function breaks in the presence of NaN, except when it doesn't

Tuple comparison assumes that the </==/> operations on their elements are a valid weak ordering, and it compares elements with PyObject_RichCompareBool, which assumes that x is y implies x == y. When you use the same NaN object for all tuples, PyObject_RichCompareBool thinks the NaNs are equal.
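The effect of that identity shortcut can be seen in a small experiment. This is only an illustrative sketch with made-up tuples; it contrasts one shared NaN object with two separately created ones.

```python
nan = float('nan')

# Same NaN object in both tuples: the identity shortcut treats the first
# elements as equal, so the comparison falls through to the second element.
shared = [(nan, 2), (nan, 1)]
same_object_equal = (nan, 1) == (nan, 1)
shared_sorted = sorted(shared)
print(same_object_equal)  # True
print(shared_sorted)      # [(nan, 1), (nan, 2)]

# Distinct NaN objects: the first elements are neither identical nor equal,
# so the tuples are ordered by nan < nan, which is always False.
a, b = (float('nan'), 2), (float('nan'), 1)
print(a < b, b < a, a == b)  # False False False
```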

Null and Sorting

If the Null object implements comparison behavior, other methods (such as indexing) will get more complicated. Consider:

target = table.sql('select * where sales < 1000.00')

If Null values compare < all other objects then target could have rows where there were no sales (which is not the goal).

So, I think practicality and purity are both coming down on the same side on this one: Null comparisons yield unknown. Users will have to decide what to do with Null values if they get them.
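To make "comparisons yield unknown" concrete, here is a hypothetical sketch. NullType and Null are invented names, not the API of any real table library, and the list comprehension merely stands in for the sql query above. Every comparison returns Null itself, and Null is falsy, so a filter silently skips rows whose value is unknown.

```python
class NullType:
    """Hypothetical SQL-style Null: every comparison yields Null (unknown)."""
    _instance = None

    def __new__(cls):
        # Keep a single shared Null object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def _unknown(self, other):
        return self

    __lt__ = __le__ = __gt__ = __ge__ = __eq__ = __ne__ = _unknown
    __hash__ = object.__hash__  # defining __eq__ would otherwise disable hashing

    def __bool__(self):
        return False  # "unknown" never satisfies a filter condition

    def __repr__(self):
        return 'Null'

Null = NullType()

sales = [250.0, Null, 1500.0, 999.99]
target = [s for s in sales if s < 1000.00]
print(target)  # [250.0, 999.99]
```

With this design the row with no sales figure is excluded from target, and the caller still has to decide separately what to do with Null rows.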

Why doesn't NaN raise any errors in python?

Membership testing

Two different instances of float('nan') are not equal to each other. They are "Not a Number", so it makes sense that they don't have to be equal; in fact, under IEEE 754 a NaN does not compare equal to anything, including itself:

print(float('nan') == float('nan'))  # False

As documented in the Python language reference, under membership test operations:

"For container types such as list, tuple, set, frozenset, dict, or collections.deque, the expression x in y is equivalent to any(x is e or x == e for e in y)."

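There is an identity check first! That's why you see that behavior in your question: NaN in a returns True because the very same object is found in a, while float('nan') in a returns False because it creates a new object that is neither identical nor equal to any element.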
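A small sketch of that rule, using a made-up list a that holds one particular NaN object:

```python
nan = float('nan')
a = [1.0, nan, 2.0]

same_object = nan in a            # identity check (x is e) succeeds
new_object = float('nan') in a    # new object, and nan == nan is False

print(same_object)  # True
print(new_object)   # False

# The documented equivalent gives the same answers:
print(any(nan is e or nan == e for e in a))  # True
```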

Sorting in Python

Python uses the Timsort algorithm for sorted() and list.sort(). I'm not going to go into its details. I just want to demonstrate a simple example:

These are my classes A and B. They are going to stand in for our float('nan') object. They act like float('nan') in that they return False for all ordering comparisons:

class A:
    def __init__(self, n):
        self.n = n

    def __lt__(self, other):
        print(self, 'lt is calling', other)
        return False

    def __gt__(self, other):
        print(self, 'gt is calling', other)
        return False

    def __repr__(self):
        return f'A({self.n})'

class B:
    def __init__(self, n):
        self.n = n

    def __lt__(self, other):
        print(self, 'lt is calling', other)
        return False

    def __gt__(self, other):
        print(self, 'gt is calling', other)
        return False

    def __repr__(self):
        return f'B({self.n})'

When we use the sorted() function (or the .sort() method of a list) without the reverse=True argument, we're asking for the iterable to be sorted in ascending order. To do this, Python calls the __lt__ method repeatedly, starting with the second object in the list to see whether it is less than the object before it, and so on:

lst = [A(1), B(2), A(3), B(4)]
print(sorted(lst))

output:

B(2) lt is calling A(1)
A(3) lt is calling B(2)
B(4) lt is calling A(3)
[A(1), B(2), A(3), B(4)]

Now, switching back to your example:

lst = [3, A(1), 4, 2, A(1), 1]
print(sorted(lst))

output:

A(1) lt is calling 3
A(1) gt is calling 4
A(1) gt is calling 2
A(1) lt is calling 2
A(1) lt is calling 4
A(1) gt is calling 1
[3, A(1), 1, 2, 4, A(1)]

A(1).__lt__(3) returns False. This means A(1) is not less than 3, so 3 is already in the correct position relative to A(1). Next, int.__lt__(4, A(1)) gets called, and because it returns the NotImplemented object, Python checks whether A(1) implements __gt__. It does, so A(1).__gt__(4) is called and returns False again, which means the A(1) object is in the correct place relative to 4.
(Etc.)
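The NotImplemented fallback described above can be checked in isolation. The class AlwaysFalse below is a hypothetical stand-in written for this sketch; it records its calls in a list instead of printing them.

```python
class AlwaysFalse:
    """Stand-in for NaN: every ordering comparison returns False."""
    def __init__(self):
        self.calls = []

    def __lt__(self, other):
        self.calls.append(('lt', other))
        return False

    def __gt__(self, other):
        self.calls.append(('gt', other))
        return False

obj = AlwaysFalse()

# int does not know how to compare itself with AlwaysFalse ...
direct = (4).__lt__(obj)
print(direct)     # NotImplemented

# ... so `4 < obj` falls back to the reflected method obj.__gt__(4).
result = 4 < obj
print(result)     # False
print(obj.calls)  # [('gt', 4)]
```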

This is why the result of sorted() seems weird, but it is predictable. In both cases, whether int returns NotImplemented and the reflected __gt__ is used or __lt__ is called on A(1) directly, the A(1) object returns False.

For the full picture, it's better to study the Timsort algorithm with these points in mind. I would need to read the algorithm carefully before I could spell out the remaining steps.

How to sort a list of lists by a specific index of the inner list?

This is a job for itemgetter

>>> from operator import itemgetter
>>> L=[[0, 1, 'f'], [4, 2, 't'], [9, 4, 'afsd']]
>>> sorted(L, key=itemgetter(2))
[[9, 4, 'afsd'], [0, 1, 'f'], [4, 2, 't']]

It is also possible to use a lambda function here; however, the lambda function is slower in this simple case.
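For comparison, the lambda variant mentioned above might be written like this (the variable names are only for illustration); it produces the same ordering as itemgetter(2):

```python
from operator import itemgetter

L = [[0, 1, 'f'], [4, 2, 't'], [9, 4, 'afsd']]

by_itemgetter = sorted(L, key=itemgetter(2))
by_lambda = sorted(L, key=lambda row: row[2])  # same key, written inline

print(by_lambda == by_itemgetter)  # True
```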
