'Order' of Unordered Python Sets

'order' of unordered Python sets

You should watch this video (it is CPython-specific [1] and about dictionaries, but I assume it applies to sets as well).

Basically, Python hashes the elements, takes the last N bits of each hash (where N is determined by the size of the set), and uses those bits as an array index at which to place the object in memory. The objects are then yielded in the order in which they sit in memory. Of course, the picture gets a little more complicated when you need to resolve collisions between hashes, but that's the gist of it.
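As a rough illustration of that description, here is a toy model of a hash table. It is only a sketch: the function name toy_iteration_order, the fixed table size of 8, and the "step to the next slot" collision rule are all assumptions made for the example, not CPython's actual algorithm.

```python
def toy_iteration_order(values, table_size=8):
    """Place values in a toy hash table and return them in slot order.

    Assumes table_size is a power of two, values are not None, and
    there are fewer distinct values than slots.
    """
    slots = [None] * table_size
    mask = table_size - 1
    for value in values:
        index = hash(value) & mask          # keep only the low bits of the hash
        # Toy collision rule (an assumption): try the next slot until one is free.
        while slots[index] is not None and slots[index] != value:
            index = (index + 1) & mask
        slots[index] = value
    # "Iterating" means walking the slots from first to last.
    return [v for v in slots if v is not None]


print(toy_iteration_order([8, 16, 24]))   # [8, 16, 24]
print(toy_iteration_order([24, 16, 8]))   # [24, 16, 8]
print(toy_iteration_order([3, 2, 1]))     # [1, 2, 3]
```

In this model, values that collide keep their arrival order, while values with distinct low bits come out in slot order regardless of how they were inserted.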

Also note that the order in which they are printed can depend on the order in which you put them in (due to collisions). So, if you reorder the list you build the set from, you might get a different order out if there are hash collisions.

For example:

list1 = [8,16,24]
set(list1) #set([8, 16, 24])
list2 = [24,16,8]
set(list2) #set([24, 16, 8])

Note that the fact that the order is preserved in these sets is "coincidence" and has to do with collision resolution (which I don't know anything about). The point is that the last 3 bits of hash(8), hash(16) and hash(24) are the same. Because they are the same, collision resolution takes over and puts the elements in "backup" memory locations instead of the first (best) choice. Whether 8 or 16 occupies a given location is therefore determined by which one arrived at the party first and took the "best seat".

If we repeat the example with 1, 2 and 3, you will get a consistent order no matter what order they have in the input list:

list1 = [1,2,3]
set(list1) # set([1, 2, 3])
list2 = [3,2,1]
set(list2) # set([1, 2, 3])

since the last 3 bits of hash(1), hash(2) and hash(3) are unique.
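The claim about the low bits can be checked directly. The helper name low_three_bits is made up for this sketch, and it relies on small non-negative ints hashing to themselves, as they do in CPython.

```python
def low_three_bits(value):
    """Return the lowest three bits of hash(value) as a binary string."""
    return format(hash(value) & 0b111, "03b")


for value in (8, 16, 24, 1, 2, 3):
    print(value, low_three_bits(value))
```

8, 16 and 24 all report 000, so they compete for the same slot in an 8-slot table, while 1, 2 and 3 report 001, 010 and 011.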


[1] Note: The implementation described here applies to CPython dict and set. I think that the general description is valid for all modern versions of CPython up to 3.6. However, starting with CPython 3.6, there is an additional implementation detail that preserves insertion order when iterating over a dict. It appears that sets still do not have this property. The data structure is described in this blog post by the PyPy folks (who started using it before the CPython folks). The original idea (at least for the Python ecosystem) is archived on the python-dev mailing list.

Why are python sets sorted in ascending order?

The order correlates to the hash of the object, size of the set, binary representation of the number, insertion order and other implementation parameters. It is completely arbitrary and shouldn't be relied upon:

>>> st = {3, 1, 2,4,9,124124,124124124124,123,12,41,15,}
>>> st
{1, 2, 3, 4, 9, 41, 12, 15, 124124, 123, 124124124124}
>>> st.pop()
1
>>> st.pop()
2
>>> st.pop()
3
>>> st.pop()
4
>>> st.pop()
9
>>> st.pop()
41
>>> st.pop()
12
>>> {1, 41, 12}
{1, 12, 41}
>>> {1, 9, 41, 12}
{1, 12, 9, 41} # Looks like 9 wants to go after 12.
>>> hash(9)
9
>>> hash(12)
12
>>> hash(41)
41
>>> {1, 2, 3, 4, 9, 41, 12}
{1, 2, 3, 4, 9, 12, 41} # 12 before 41
>>> {1, 2, 3, 4, 9, 41, 12, 15} # add 15 at the end
{1, 2, 3, 4, 9, 41, 12, 15} # 12 after 41

Why does a set display in same order if sets are unordered?

They are not randomly ordered; they are arbitrarily ordered. This means you should not count on the order of insertions being maintained, as the internal implementation details determine the order instead.

The order depends on the insertion and deletion history of the set.

In CPython, sets use a hash table, where inserted values are slotted into a sparse table based on the value returned from the hash() function, modulo the table size and a collision handling algorithm. Listing the set contents then returns the values as ordered in this table.
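A small sketch of what "depends on the history" means in practice: two sets built through different sequences of operations compare equal, yet nothing promises that they iterate in the same order. The variable names are illustrative only.

```python
first = set()
first.add(0)
first.add(8)

second = set()
second.add(8)
second.add(0)
second.add(16)
second.discard(16)   # deletions are part of the history too

print(first == second)             # equality ignores order
print(list(first), list(second))   # iteration order may or may not match
```

Code that needs to compare or display set contents reproducibly should compare the sets themselves, or sort them first, rather than compare their iteration order.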

If you want to go into the nitty-gritty technical details then look at Why is the order in dictionaries and sets arbitrary?; sets are, at their core, dictionaries where the keys are the set values and there are no associated dictionary values. The actual implementation is a little more complicated, as always, but that answer will suffice to get you most of the way there. Then look at the C source code for set for the rest of those details.

Compare this to lists, which do have a fixed order that you can influence; you can move items around in the list and the new ordering would be maintained for you.

Sets are sorted from 0-9 every single time in python?! Not unordered

Those ints hash to themselves:

>>> [*map(hash, range(10))]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

When you add the numbers 0 to 9 to a set, the set makes room for at least 10 numbers (actually 32, I think). So its internal array has at least the indexes 0 to 9. And because those numbers hash to themselves, they're stored in the set's internal array at their own index (value i gets stored at index hash(i)=i). So when you iterate it, you get them sorted.

Further illustration with smaller examples:

Sets start with internal size 8, and value i wants to go to index hash(i) % 8. So if you add 0 and 8, both want to go to index 0. The one that comes first actually gets to index 0, the other has to go to some other (larger) index. Hence:

>>> {0, 8}, {8, 0}
({0, 8}, {8, 0})

If you instead add 1 and 8, then 1 wants to go to index 1 and 8 wants to go to index 0, so 8 always comes first regardless of insertion order:

>>> {1, 8}, {8, 1}
({8, 1}, {8, 1})

An example with 0 to 9:

>>> s = set()
>>> for i in 8, 9, 0, 1, 2, 3, 4, 5, 6, 7:
...     s.add(i)
...     print(s)
...

{8} # the only element (stored at index 0)
{8, 9} # 9 gets stored at index 1, so after 8
{8, 9, 0} # indices 0 and 1 are already taken, so 0 goes to some higher index
{8, 9, 0, 1} # similar
{0, 1, 2, 8, 9} # the set internally resized and re-added all values, each
# value ends up at its own index (e.g., 8 goes to index 8)
{0, 1, 2, 3, 8, 9} # 3 goes to index 3
{0, 1, 2, 3, 4, 8, 9} # same for the rest, all go to their own index...
{0, 1, 2, 3, 4, 5, 8, 9}
{0, 1, 2, 3, 4, 5, 6, 8, 9}
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
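One way to observe the internal resize from the outside, at least in CPython, is to watch sys.getsizeof while adding elements. This is only a sketch: the reported byte counts are an implementation detail, differ between versions, and other Python implementations may not support sys.getsizeof at all.

```python
import sys

s = set()
sizes = [sys.getsizeof(s)]          # size of the empty set
for i in range(10):
    s.add(i)
    sizes.append(sys.getsizeof(s))  # size after each insertion

print(sizes)
```

A jump in the reported size marks the insertion at which the table was reallocated, which is also the point where the printed order in the example above changes.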

Why don't Python sets preserve insertion order?

Sets and dicts are optimized for different use cases. The primary use of a set is fast membership testing, which is order-agnostic. For dicts, lookup is the most critical operation, and the key is more likely to be present. With sets, the presence or absence of an element is not known in advance, so the set implementation needs to optimize for both the found and the not-found case. Also, some optimizations for common set operations such as union and intersection make it difficult to retain set ordering without degrading performance.

While both data structures are hash-based, it's a common misconception that sets are just implemented as dicts with null values. Even before the compact dict implementation in CPython 3.6, the set and dict implementations already differed significantly, with little code reuse. For example, both use open addressing, but dicts probe pseudo-randomly, whereas sets first try a short run of linear probes before jumping pseudo-randomly, to improve cache locality. The initial linear probe (default 9 steps in CPython) checks a series of adjacent key/hash pairs, which reduces the cost of hash collision handling: consecutive memory access is cheaper than scattered probes.
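To make the probing idea concrete, here is a simplified sketch of a "linear probes first, then jump" sequence. It is loosely modelled on the description above and is not a faithful port of setobject.c: the function name probe_sequence is invented, the perturb shift of 5 and the jump formula are assumptions, and the hash is assumed to be non-negative.

```python
from itertools import islice


def probe_sequence(h, mask, linear_probes=9, perturb_shift=5):
    """Yield the slot indices examined for a non-negative hash h.

    mask is table_size - 1, where table_size is a power of two.
    """
    perturb = h
    i = h & mask
    while True:
        yield i
        # Scan the adjacent slots only if the whole run fits inside the table.
        if i + linear_probes <= mask:
            for step in range(1, linear_probes + 1):
                yield i + step
        # Then jump to a pseudo-random slot derived from the remaining hash bits.
        perturb >>= perturb_shift
        i = (i * 5 + 1 + perturb) & mask


print(list(islice(probe_sequence(hash(8), mask=31), 12)))
```

With a 32-slot table, the first ten slots examined are consecutive, which is the cache-friendly part. With an 8-slot table the linear run never fits, so this sketch falls straight through to the jump.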

  • dictobject.c - master, v3.5.9
  • setobject.c - master, v3.5.9
  • issue18771 - changeset to reduce the cost of hash collisions for set objects in Python 3.4.

It would be possible in theory to change CPython's set implementation to be similar to the compact dict, but in practice there are drawbacks, and notable core developers were opposed to making such a change.

Sets remain unordered. (Why? The usage patterns are different. Also, different implementation.)

– Guido van Rossum

Sets use a different algorithm that isn't as amendable to retaining insertion order.
Set-to-set operations lose their flexibility and optimizations if order is required. Set mathematics are defined in terms of unordered sets. In short, set ordering isn't in the immediate future.

– Raymond Hettinger

A detailed discussion about whether to compactify sets for 3.7, and why it was decided against, can be found in the python-dev mailing lists.

In summary, the main points are: different usage patterns (insertion ordering is useful for dicts, for example for **kwargs, but less so for sets); smaller space savings from compacting sets (because there are only key + hash arrays to densify, as opposed to key + hash + value arrays); and the aforementioned linear probing optimization that sets currently use, which is incompatible with a compact implementation.

I will reproduce Raymond's post below which covers the most important points.

On Sep 14, 2016, at 3:50 PM, Eric Snow wrote:

Then, I'll do same to sets.

Unless I've misunderstood, Raymond was opposed to making a similar
change to set.

That's right. Here are a few thoughts on the subject before people
starting running wild.

  • For the compact dict, the space savings was a net win with the additional space consumed by the indices and the overallocation for
    the key/value/hash arrays being more than offset by the improved
    density of key/value/hash arrays. However for sets, the net was much
    less favorable because we still need the indices and overallocation
    but can only offset the space cost by densifying only two of the three
    arrays. In other words, compacting makes more sense when you have
    wasted space for keys, values, and hashes. If you lose one of those
    three, it stops being compelling.

  • The use pattern for sets is different from dicts. The former has more hit or miss lookups. The latter tends to have fewer missing key
    lookups. Also, some of the optimizations for the set-to-set operations
    make it difficult to retain set ordering without impacting
    performance.

  • I pursued alternative path to improve set performance. Instead of compacting (which wasn't much of space win and incurred the cost of an
    additional indirection), I added linear probing to reduce the cost of
    collisions and improve cache performance. This improvement is
    incompatible with the compacting approach I advocated for
    dictionaries.

  • For now, the ordering side-effect on dictionaries is non-guaranteed, so it is premature to start insisting the sets become ordered as well.
    The docs already link to a recipe for creating an OrderedSet (
    https://code.activestate.com/recipes/576694/ ) but it seems like the
    uptake has been nearly zero. Also, now that Eric Snow has given us a
    fast OrderedDict, it is easier than ever to build an OrderedSet from
    MutableSet and OrderedDict, but again I haven't observed any real
    interest because typical set-to-set data analytics don't really need
    or care about ordering. Likewise, the primary use of fast membership
    testings is order agnostic.

  • That said, I do think there is room to add alternative set implementations to PyPI. In particular, there are some interesting
    special cases for orderable data where set-to-set operations can be
    sped-up by comparing entire ranges of keys (see
    https://code.activestate.com/recipes/230113-implementation-of-sets-using-sorted-lists
    for a starting point). IIRC, PyPI already has code for set-like bloom
    filters and cuckoo hashing.

  • I understanding that it is exciting to have a major block of code accepted into the Python core but that shouldn't open to floodgates to
    engaging in more major rewrites of other datatypes unless we're sure
    that it is warranted.

– Raymond Hettinger

From [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered, Sept 2016.
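The post mentions building an OrderedSet from MutableSet and OrderedDict. A minimal sketch of that idea follows. It swaps in a plain dict for OrderedDict, on the assumption of Python 3.7 or later where dicts keep insertion order, and the class is illustrative rather than the recipe linked above.

```python
from collections.abc import MutableSet


class OrderedSet(MutableSet):
    """Set that remembers insertion order by storing items as dict keys."""

    def __init__(self, iterable=()):
        self._items = dict.fromkeys(iterable)   # values are unused

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        self._items[item] = None    # re-adding keeps the original position

    def discard(self, item):
        self._items.pop(item, None)

    def __repr__(self):
        return f"{type(self).__name__}({list(self._items)})"


s = OrderedSet([3, 1, 3, 2])
s.add(1)
print(s)                                   # OrderedSet([3, 1, 2])
print(OrderedSet("ab") | OrderedSet("bc")) # OrderedSet(['a', 'b', 'c'])
```

The operators such as | and & come from the MutableSet mixin methods, so they go through generic Python-level loops rather than the optimized set-to-set code paths that the post is concerned about.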

Converting a list to a set changes element order

  1. A set is an unordered data structure, so it does not preserve the insertion order.

  2. This depends on your requirements. If you have a normal list and want to remove some set of elements while preserving the order of the list, you can do this with a list comprehension:

    >>> a = [1, 2, 20, 6, 210]
    >>> b = set([6, 20, 1])
    >>> [x for x in a if x not in b]
    [2, 210]

    If you need a data structure that supports both fast membership tests and preservation of insertion order, you can use the keys of a Python dictionary, which starting from Python 3.7 is guaranteed to preserve the insertion order:

    >>> a = dict.fromkeys([1, 2, 20, 6, 210])
    >>> b = dict.fromkeys([6, 20, 1])
    >>> dict.fromkeys(x for x in a if x not in b)
    {2: None, 210: None}

    b doesn't really need to be ordered here – you could use a set as well. Note that a.keys() - b.keys() returns the set difference as a set, so it won't preserve the insertion order.

    In older versions of Python, you can use collections.OrderedDict instead:

    >>> import collections
    >>> a = collections.OrderedDict.fromkeys([1, 2, 20, 6, 210])
    >>> b = collections.OrderedDict.fromkeys([6, 20, 1])
    >>> collections.OrderedDict.fromkeys(x for x in a if x not in b)
    OrderedDict([(2, None), (210, None)])
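If the goal of converting to a set was only to drop duplicates, the same dict-keys idea gives an order-preserving alternative. This is a small sketch assuming Python 3.7 or later, and the helper name unique_in_order is made up for the example.

```python
def unique_in_order(items):
    """Return the distinct items of an iterable, keeping first-seen order."""
    return list(dict.fromkeys(items))


print(unique_in_order([3, 1, 3, 2, 1]))   # [3, 1, 2]
```

Like a set, this requires the items to be hashable.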

Does Python have an ordered set?

There is an ordered set recipe for this, which is referred to from the Python 2 documentation. It runs on Python 2.6 or later and 3.0 or later without any modifications. The interface is almost exactly the same as a normal set, except that initialisation should be done with a list.

OrderedSet([1, 2, 3])

This is a MutableSet, so the signature for .union doesn't match that of set, but since it includes __or__ something similar can easily be added:

@staticmethod
def union(*sets):
    union = OrderedSet()
    union.union(*sets)
    return union

def union(self, *sets):
    for set in sets:
        self |= set

Unordered collection - sets in python

A random order is not the same as unordered. Unordered means there is no defined way in which the data is ordered, i.e. neither the insertion order nor the data itself has any defined correlation with how the data is arranged.

The data appears in a predictable order only because this particular implementation happens to arrange the elements in a way that is dictated by the insertion order. But there is no guarantee [#] that this will happen, and we do see it deviate in the Python 3.X dictionary implementation.

Note: even if we see that the data is sorted,

>>> {1,2,3,4,5}
set([1, 2, 3, 4, 5])

we would still call it unordered unless the documentation explicitly says otherwise and guarantees the order; otherwise there may be surprises waiting for you. I have seen implementations that relied on sets and dictionaries maintaining an order based on the insertion pattern. Such implementations had serious consequences when they were ported to Python 3.X.

[#] What's New In Python 3.3

Security improvements:
Hash randomization is switched on by default.
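The effect of hash randomization can be seen by asking fresh interpreter processes for the hash of the same string under different PYTHONHASHSEED values. This is a sketch: the helper name string_hash_with_seed is invented, and it assumes the current interpreter can be launched as a subprocess.

```python
import os
import subprocess
import sys


def string_hash_with_seed(text, seed):
    """Return hash(text) as computed by a new interpreter using the given seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Same seed: reproducible. Different seeds: the hash is expected to change.
print(string_hash_with_seed("spam", 1) == string_hash_with_seed("spam", 1))
print(string_hash_with_seed("spam", 1) == string_hash_with_seed("spam", 2))
```

Because a set's layout follows the hashes, a set of strings can iterate in a different order from one run to the next. Small ints are not affected, which is why the integer examples look stable.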

How can python iterate over a set if no order is defined?

A temporary order is used to iterate over the set, but you can't reliably predict it (practically speaking, as it depends on the insertion and deletion history of the set). If you need a specific order, use a list.
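A short sketch of that advice: when the output order matters, impose it explicitly, for instance by sorting the set into a list at the point of use. The set contents are just sample data.

```python
tags = {"pear", "apple", "fig"}

# sorted() returns a new list, so the order no longer depends on the set's internals.
for tag in sorted(tags):
    print(tag)
```

This prints apple, fig, pear on every run, regardless of how the set happens to be laid out internally.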


