Are Sets Ordered Like Dicts in Python 3.6?

Are dictionaries ordered in Python 3.6+?


They are insertion ordered[1].

As of Python 3.6, for the CPython implementation of Python, dictionaries remember the order of items inserted. This is considered an implementation detail in Python 3.6; you need to use OrderedDict if you want insertion ordering that's guaranteed across other implementations of Python (and other ordered behavior[1]).

As of Python 3.7, this is a guaranteed language feature, not merely an implementation detail. From a python-dev message by GvR:

Make it so. "Dict keeps insertion order" is the ruling. Thanks!

This simply means that you can depend on it. Other implementations of Python must also offer an insertion ordered dictionary if they wish to be a conforming implementation of Python 3.7.
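A quick way to see the guarantee in action on CPython 3.7+ (the keys come back in insertion order, never sorted):

```python
d = {}
d['banana'] = 3
d['apple'] = 1
d['cherry'] = 2
print(list(d))  # ['banana', 'apple', 'cherry'] — insertion order, not alphabetical
```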



How does the Python 3.6 dictionary implementation perform better[2] than the older one while preserving element order?

Essentially, by keeping two arrays.

  • The first array, dk_entries, holds the entries (of type PyDictKeyEntry) for the dictionary in the order that they were inserted. Order is preserved because this is an append-only array: new items are always inserted at the end (insertion order).

  • The second, dk_indices, holds the indices into the dk_entries array (that is, values that indicate the position of the corresponding entry in dk_entries). This array acts as the hash table. When a key is hashed, it leads to one of the indices stored in dk_indices, and the corresponding entry is fetched by indexing dk_entries. Since only indices are kept, the type of this array depends on the overall size of the dictionary (ranging from int8_t (1 byte) to int32_t/int64_t (4/8 bytes) on 32/64-bit builds).

In the previous implementation, a sparse array of type PyDictKeyEntry and size dk_size had to be allocated. Unfortunately, this resulted in a lot of empty space, since that array was not allowed to be more than 2/3 * dk_size full for performance reasons (and each empty slot still took up a full PyDictKeyEntry!).

This is not the case now: only the required entries are stored (those that have been inserted), and the sparse array that is kept at most 2/3 * dk_size full is of type intX_t (X depending on dict size). The empty space changed from type PyDictKeyEntry to intX_t.

So, obviously, a sparse array of type PyDictKeyEntry is much more memory-demanding than a sparse array of small ints.
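To make the savings concrete, here is some back-of-the-envelope arithmetic for a small table of 8 slots. This assumes a 64-bit build where a PyDictKeyEntry is 24 bytes (hash, key pointer, and value pointer, 8 bytes each); the exact numbers are illustrative only:

```python
ENTRY = 24                 # bytes per PyDictKeyEntry (64-bit build assumption)
SLOTS = 8                  # dk_size for a small dict
USABLE = SLOTS * 2 // 3    # at most 2/3 full -> 5 usable entries

old_layout = SLOTS * ENTRY                # sparse entry array: 8 * 24 = 192 bytes
new_layout = SLOTS * 1 + USABLE * ENTRY   # int8_t indices + dense entries: 8 + 120 = 128 bytes
print(old_layout, new_layout)  # 192 128
```

The gap widens further for larger tables, since the sparse part stays a small-integer array while the dense entry array only grows with the actual number of items.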

You can see the full conversation on Python-Dev regarding this feature if interested; it is a good read.


In the original proposal made by Raymond Hettinger, a visualization of the data structures used can be seen which captures the gist of the idea.

For example, the dictionary:

d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}

is currently stored as [keyhash, key, value]:

entries = [['--', '--', '--'],
           [-8522787127447073495, 'barry', 'green'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           ['--', '--', '--'],
           [-9092791511155847987, 'timmy', 'red'],
           ['--', '--', '--'],
           [-6480567542315338377, 'guido', 'blue']]

Instead, the data should be organized as follows:

indices = [None, 1, None, None, None, 0, None, 2]
entries = [[-9092791511155847987, 'timmy', 'red'],
           [-8522787127447073495, 'barry', 'green'],
           [-6480567542315338377, 'guido', 'blue']]

As you can now see, in the original proposal a lot of space is essentially empty, in order to reduce collisions and make look-ups faster. With the new approach, you reduce the memory required by moving the sparseness to where it's really needed: the indices.
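The layout above can be mimicked in pure Python. This is a toy sketch only (the helper names `toy_insert`/`toy_lookup` are hypothetical, collisions are ignored, and the integer keys are chosen so that `hash(k) == k`); CPython implements the real thing in C with proper open-addressing probing:

```python
def toy_insert(indices, entries, key, value):
    # Dense append preserves insertion order...
    slot = hash(key) % len(indices)
    # ...while the sparse table maps the hash to a position in `entries`.
    indices[slot] = len(entries)
    entries.append((hash(key), key, value))

def toy_lookup(indices, entries, key):
    slot = hash(key) % len(indices)
    h, k, v = entries[indices[slot]]
    return v

indices = [None] * 8   # sparse: only small ints (or None)
entries = []           # dense: insertion-ordered [hash, key, value] triples
for k, v in [(0, 'timmy'), (1, 'barry'), (2, 'guido')]:
    toy_insert(indices, entries, k, v)

print([e[1] for e in entries])          # [0, 1, 2] — insertion order preserved
print(toy_lookup(indices, entries, 1))  # 'barry'
```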




[1]: I say "insertion ordered" and not "ordered" since, with the existence of OrderedDict, "ordered" suggests further behavior that the `dict` object *doesn't provide*. OrderedDicts are reversible, provide order-sensitive methods and, mainly, provide order-sensitive equality tests (`==`, `!=`). `dict`s currently don't offer any of those behaviors/methods.




[2]: The new dictionary implementation performs better **memory-wise** by being designed more compactly; that's the main benefit here. Speed-wise, the difference isn't so drastic: there are places where the new dict might introduce slight regressions (key lookups, for example) while in others (iteration and resizing come to mind) a performance boost should be present.


Overall, the performance of the dictionary, especially in real-life situations, improves due to the compactness introduced.

Does Python have an ordered set?

There is an OrderedSet recipe for this, which is referred to from the Python 2 documentation. It runs on Python 2.6 or later and 3.0 or later without any modifications. The interface is almost exactly the same as a normal set, except that initialisation should be done with a list.

OrderedSet([1, 2, 3])

This is a MutableSet, so the signature for .union doesn't match that of set, but since it includes __or__, something similar can easily be added:

@staticmethod
def union(*sets):
    result = OrderedSet()   # renamed from `union` so the method doesn't call itself
    for s in sets:
        result |= s
    return result

or, alternatively, as an in-place instance method:

def union(self, *sets):
    for s in sets:          # `s` rather than `set`, to avoid shadowing the built-in
        self |= s

Why don't Python sets preserve insertion order?

Sets and dicts are optimized for different use-cases. The primary use of a set is fast membership testing, which is order-agnostic. For dicts, the cost of a lookup is the most critical operation, and the key is more likely to be present. With sets, the presence or absence of an element is not known in advance, so the set implementation needs to optimize for both the found and the not-found case. Also, some optimizations for common set operations, such as union and intersection, make it difficult to retain set ordering without degrading performance.

While both data structures are hash-based, it's a common misconception that sets are just implemented as dicts with null values. Even before the compact dict implementation in CPython 3.6, the set and dict implementations already differed significantly, with little code reuse. For example, dicts use pseudo-randomized probing, while sets combine linear probing with open addressing to improve cache locality. The initial linear probe (by default, 9 steps in CPython) checks a series of adjacent key/hash pairs, improving performance by reducing the cost of hash-collision handling: consecutive memory accesses are cheaper than scattered probes.

  • dictobject.c - master, v3.5.9
  • setobject.c - master, v3.5.9
  • issue18771 - changeset to reduce the cost of hash collisions for set objects in Python 3.4.

It would be possible in theory to change CPython's set implementation to be similar to the compact dict, but in practice there are drawbacks, and notable core developers were opposed to making such a change.

Sets remain unordered. (Why? The usage patterns are different. Also, different implementation.)

– Guido van Rossum

Sets use a different algorithm that isn't as amenable to retaining insertion order.
Set-to-set operations lose their flexibility and optimizations if order is required. Set mathematics are defined in terms of unordered sets. In short, set ordering isn't in the immediate future.

– Raymond Hettinger

A detailed discussion about whether to compactify sets for 3.7, and why it was decided against, can be found on the python-dev mailing list.

In summary, the main points are: different usage patterns (insertion-ordered dicts are useful for things like **kwargs, less so for sets), smaller space savings when compacting sets (because there are only key + hash arrays to densify, as opposed to key + hash + value arrays), and the aforementioned linear-probing optimization that sets currently use, which is incompatible with a compact implementation.

I will reproduce Raymond's post below which covers the most important points.

On Sep 14, 2016, at 3:50 PM, Eric Snow wrote:

Then, I'll do same to sets.

Unless I've misunderstood, Raymond was opposed to making a similar
change to set.

That's right. Here are a few thoughts on the subject before people
start running wild.

  • For the compact dict, the space savings was a net win with the additional space consumed by the indices and the overallocation for
    the key/value/hash arrays being more than offset by the improved
    density of key/value/hash arrays. However for sets, the net was much
    less favorable because we still need the indices and overallocation
    but can only offset the space cost by densifying only two of the three
    arrays. In other words, compacting makes more sense when you have
    wasted space for keys, values, and hashes. If you lose one of those
    three, it stops being compelling.

  • The use pattern for sets is different from dicts. The former has more hit or miss lookups. The latter tends to have fewer missing key
    lookups. Also, some of the optimizations for the set-to-set operations
    make it difficult to retain set ordering without impacting
    performance.

  • I pursued an alternative path to improve set performance. Instead of compacting (which wasn't much of a space win and incurred the cost of an
    additional indirection), I added linear probing to reduce the cost of
    collisions and improve cache performance. This improvement is
    incompatible with the compacting approach I advocated for
    dictionaries.

  • For now, the ordering side-effect on dictionaries is non-guaranteed, so it is premature to start insisting the sets become ordered as well.
    The docs already link to a recipe for creating an OrderedSet (
    https://code.activestate.com/recipes/576694/ ) but it seems like the
    uptake has been nearly zero. Also, now that Eric Snow has given us a
    fast OrderedDict, it is easier than ever to build an OrderedSet from
    MutableSet and OrderedDict, but again I haven't observed any real
    interest because typical set-to-set data analytics don't really need
    or care about ordering. Likewise, the primary use of fast membership
    testings is order agnostic.

  • That said, I do think there is room to add alternative set implementations to PyPI. In particular, there are some interesting
    special cases for orderable data where set-to-set operations can be
    sped-up by comparing entire ranges of keys (see
    https://code.activestate.com/recipes/230113-implementation-of-sets-using-sorted-lists
    for a starting point). IIRC, PyPI already has code for set-like bloom
    filters and cuckoo hashing.

  • I understand that it is exciting to have a major block of code accepted into the Python core, but that shouldn't open the floodgates to
    engaging in more major rewrites of other datatypes unless we're sure
    that it is warranted.

– Raymond Hettinger

From [Python-Dev] Python 3.6 dict becomes compact and gets a private version; and keywords become ordered, Sept 2016.

Python 3.6+: Equality of two dictionaries with same keys but in different order

Dictionaries are hash tables; order was never supposed to matter for them. In Python 3.6+, dictionaries preserve insertion order, but that is purely about iteration: equality comparisons still ignore order, so two dicts with the same key/value pairs compare equal however they were built. If you want order to matter in equality comparisons, use an OrderedDict.

from collections import OrderedDict

d1 = OrderedDict({'foo': 123, 'bar': 789})
d2 = OrderedDict({'bar': 789, 'foo': 123})

print(d1 == d2) # False
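For contrast, plain dicts compare equal regardless of insertion order, even on 3.7+ where iteration order is guaranteed:

```python
d1 = {'foo': 123, 'bar': 789}
d2 = {'bar': 789, 'foo': 123}
print(d1 == d2)  # True — dict equality only compares key/value pairs
```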

Difference between dict and set (python)

There were no set literals in Python 2 before 2.7; historically, curly braces were only used for dictionaries. Sets could be produced from lists (or any iterables):

set([1, 2, 3])
set([i for i in range(1, 3)])

Python 3 introduced set literals and set comprehensions (see PEP 3100, later backported to Python 2.7), which allowed us to avoid intermediate lists:

{1, 2, 3}
{i for i in range(1, 3)}

The empty braces form `{}`, however, remained reserved for dictionaries due to backwards compatibility. A message from the [Python-3000] "sets in P3K?" thread states:

I'm sure we can work something out --- I agree, {} for empty set and {:}
for empty dict would be ideal, were it not for backward compatibility. I
liked the "special empty object" idea when I first wrote the PEP (i.e.,
have {} be something that could turn into either a set or dict), but one
of the instructors here convinced me that it would just lead to confusion
in newcomers' minds (as well as being a pain to implement).

The following message describes these rules better:

I think Guido had the best solution. Use set() for empty sets, use {}
for empty dicts, use {genexp} for set comprehensions/displays, use
{1,2,3} for explicit set literals, and use {k1:v1, k2:v2} for dict
literals. We can always add {/} later if demand exceeds distaste.

Why is dictionary ordering non-deterministic?


Update: In Python 3.6, dict has a new implementation which preserves insertion order. From Python 3.7, this order-preserving behaviour is guaranteed:

the insertion-order preservation nature of dict objects has been declared to be an official part of the Python language spec.


This is the result of a security fix from 2012, which was enabled by default in Python 3.3 (scroll down to "Security improvements").

From the announcement:

Hash randomization causes the iteration order of dicts and sets to be
unpredictable and differ across Python runs. Python has never guaranteed
iteration order of keys in a dict or set, and applications are advised to never
rely on it. Historically, dict iteration order has not changed very often across
releases and has always remained consistent between successive executions of
Python. Thus, some existing applications may be relying on dict or set ordering.
Because of this and the fact that many Python applications which don't accept
untrusted input are not vulnerable to this attack, in all stable Python releases
mentioned here, HASH RANDOMIZATION IS DISABLED BY DEFAULT.

As noted above, the last, capitalized bit is no longer true in Python 3.3.

See also: object.__hash__() documentation ("Note" sidebar).

If absolutely necessary, you can disable hash randomization in versions of Python affected by this behaviour by setting the PYTHONHASHSEED environment variable to 0.


Your counterexample:

list({str(i): i for i in range(10)}.keys())

… does not in fact always give the same result in Python 3.3, although the number of different orderings is limited due to the way hash collisions are handled:

$ for x in {0..999}
> do
> python3.3 -c "print(list({str(i): i for i in range(10)}.keys()))"
> done | sort | uniq -c
61 ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
73 ['1', '0', '3', '2', '5', '4', '7', '6', '9', '8']
62 ['2', '3', '0', '1', '6', '7', '4', '5', '8', '9']
59 ['3', '2', '1', '0', '7', '6', '5', '4', '9', '8']
58 ['4', '5', '6', '7', '0', '1', '2', '3', '8', '9']
55 ['5', '4', '7', '6', '1', '0', '3', '2', '9', '8']
62 ['6', '7', '4', '5', '2', '3', '0', '1', '8', '9']
63 ['7', '6', '5', '4', '3', '2', '1', '0', '9', '8']
60 ['8', '9', '0', '1', '2', '3', '4', '5', '6', '7']
66 ['8', '9', '2', '3', '0', '1', '6', '7', '4', '5']
65 ['8', '9', '4', '5', '6', '7', '0', '1', '2', '3']
53 ['8', '9', '6', '7', '4', '5', '2', '3', '0', '1']
62 ['9', '8', '1', '0', '3', '2', '5', '4', '7', '6']
52 ['9', '8', '3', '2', '1', '0', '7', '6', '5', '4']
73 ['9', '8', '5', '4', '7', '6', '1', '0', '3', '2']
76 ['9', '8', '7', '6', '5', '4', '3', '2', '1', '0']

As noted at the beginning of this answer, that's no longer the case in Python 3.6:

$ for x in {0..999}
> do
> python3.6 -c "print(list({str(i): i for i in range(10)}.keys()))"
> done | sort | uniq -c
1000 ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

Make ordered dicts behave like normal dicts in yaml.dump output

Using ruamel.yaml as a drop-in replacement for PyYAML solved the problem instantly: OrderedDicts are no longer represented as lists in the output.

This code:

import sys
import ruamel.yaml

yaml = ruamel.yaml.YAML()
yaml.dump(data, sys.stdout)  # `data` holds the OrderedDict structure being dumped

Produces the much neater output:

refine: !!omap
- root: Wuhan/Hu-1/2019
- clock_rate: 0.0007
- clock_std_dev: 0.0003

Using ordered dictionary as ordered set

This approach of using a Python 3.7 dictionary as an order-preserving de-dupe is vetted by a core Python developer here. You can't really get a better recommendation than that.

Is there any reason this method shouldn't be used?

No.

Are there better ways to solve this problem?

No.

Is this method Pythonic?

Yes.

The method is simple but relies on implicit ordering.

Your question is tagged python-3.7. There, dictionaries preserving insertion order is guaranteed by the language spec, so the ordering is not implicit.
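A minimal sketch of the technique in question: order-preserving de-duplication via dict.fromkeys, relying on the 3.7+ ordering guarantee (dict keys are unique and insertion-ordered):

```python
items = [3, 1, 2, 3, 1]
deduped = list(dict.fromkeys(items))  # first occurrence of each value wins
print(deduped)  # [3, 1, 2]
```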

In Python, when to use a Dictionary, List or Set?

A list keeps order; dict and set don't. When you care about order, therefore, you must use list (if your choice of containers is limited to these three, of course ;-) ).

dict associates each key with a value, while list and set just contain values: very different use cases, obviously.

set requires items to be hashable, list doesn't: if you have non-hashable items, therefore, you cannot use set and must instead use list.

set forbids duplicates, list does not: also a crucial distinction. (A "multiset", which maps each duplicated item to its count, is available as collections.Counter. You could build one as a plain dict if for some weird reason you couldn't import collections, or, in pre-2.7 Python, as a collections.defaultdict(int), using the items as keys and the associated values as counts.)
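The multiset mentioned above in one line, using the standard library:

```python
from collections import Counter

bag = Counter(['a', 'b', 'a', 'c', 'a'])
print(bag['a'])  # 3 — how many times 'a' appeared
print(bag['z'])  # 0 — missing items count as zero, no KeyError
```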

Checking for membership of a value in a set (or of a key in a dict) is blazingly fast, taking roughly constant time, while in a list it takes time proportional to the list's length in the average and worst cases. So, if you have hashable items, don't care about order or duplicates, and want speedy membership checking, set is better than list.
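Both containers give the same answer to a membership test; only the cost differs. A small illustration (the sizes are arbitrary):

```python
haystack_list = list(range(100_000))
haystack_set = set(haystack_list)

# The set check hashes once (O(1) on average);
# the list check scans element by element (O(n)).
print(99_999 in haystack_set)   # True
print(99_999 in haystack_list)  # True
```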


