Time Complexity of Python Set Operations

In Python, why are the time complexities different between set operations union, intersection, and symmetric difference?

Union

Consider two sets s and t. To build a new set representing the union of s and t, you need to iterate over both of them. This results in a time complexity of O(len(s) + len(t)).

def union(s, t):
    """Simple example for union, ignoring error-handling on inputs, etc."""
    result = set(s)       # copy: O(len(s))
    for el in t:          # iterate over t: O(len(t))
        result.add(el)    # ignoring collisions, O(1) amortized time
    return result

Symmetric Difference

def symmetric_difference(s, t):
    """Simple example for symmetric difference, ignoring error-handling on inputs, etc."""
    result = set(t)       # copy: O(len(t))
    for el in s:          # iterate over s: O(len(s))
        if el not in t:
            result.add(el)
        else:
            result.remove(el)
    return result

What CPython does is start from a copy of t, then iterate over s and add or remove elements from the output set according to the result of the lookup.

In this case as well, assuming the amortized time complexity of the lookup is O(1), the resulting time complexity should be O(len(s) + len(t)), as for the union.

The table, however, lists the average time complexity of the symmetric difference as O(len(s)); the reason might be that it ignores the cost of the make_new_set function (which builds the new set starting from t).

What is the time complexity of a list-to-set conversion?

Iterating over the list is O(n) and adding each element to the hash set is O(1) amortized, so the total operation is O(n).
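For illustration, here is a minimal sketch of what set(lst) effectively has to do (the real conversion happens in C, but the asymptotic cost is the same; list_to_set is a hypothetical name used only for this example):

def list_to_set(lst):
    """Rough sketch of converting a list to a set: O(len(lst)) overall."""
    result = set()
    for el in lst:         # O(len(lst)) iterations
        result.add(el)     # amortized O(1) per element (hash + insert)
    return result

print(list_to_set([3, 1, 4, 1, 5]))   # {1, 3, 4, 5}
print(set([3, 1, 4, 1, 5]))           # builtin equivalent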

Time complexity for adding elements to a list vs a set in Python

Both operations are O(1) amortized time-complexity.

Appending elements to a list has a lower coefficient because it does not need to hash the element first, nor does it need to check/handle hash collisions.

In the case of adding x into a set, Python needs to compute hash(x) first, because keeping the hash of all elements is what allows sets to have fast O(1) membership checks (compared to O(n) membership checks for lists).
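As a hedged illustration of that difference, the made-up class below simply counts how often Python hashes it; appending to a list never triggers a hash, while adding to a set does:

class Traced:
    """Hypothetical class that counts how often Python hashes it."""
    hash_calls = 0

    def __hash__(self):
        Traced.hash_calls += 1
        return 42

t = Traced()
items = []
bucket = set()

items.append(t)               # list.append: no hashing needed
print(Traced.hash_calls)      # 0
bucket.add(t)                 # set.add: computes hash(t) first
print(Traced.hash_calls)      # 1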

What is the run time of the set difference function in Python?

It looks like you're right: difference is performed with O(n) complexity in the best case.

But keep in mind that in the worst case (hash values that maximize collisions) it can rise to O(n**2), since the worst-case lookup is O(n) (see "How is set() implemented?"); in practice, though, you can generally rely on O(1) lookups.

As an aside, speed depends on the type of object in the set. Integers hash quickly (roughly to their own value, reduced by a modulus), whereas strings take more CPU to hash.
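A minimal sketch of how s - t can be computed, in the spirit of the union and symmetric-difference examples above (assuming average O(1) lookups in t):

def difference(s, t):
    """Simple example for s - t, ignoring error-handling on inputs, etc."""
    result = set()
    for el in s:            # iterate over s: O(len(s))
        if el not in t:     # average O(1) lookup; worst case O(len(t)) with heavy collisions
            result.add(el)
    return result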

What is the time complexity of pop() for the set in Python?

On modern CPython implementations, pop takes amortized constant-ish time (I'll explain further). On Python 2, it's usually the same, but performance can degrade heavily in certain cases.

A Python set is based on a hash table, and pop has to find an occupied entry in the table to remove and return. If it searched from the start of the table every time, this would take time proportional to the number of empty leading entries, and it would get slower with every pop.

To avoid this, the standard CPython implementation tries to remember the position of the last popped entry, to speed up sequences of pops. CPython 3.5+ has a dedicated finger member in the set memory layout to store this position, but earlier versions abuse the hash field of the first hash table entry to store this index.

On any Python version, removing all elements from a set with a sequence of pop operations will take time proportional to the size of the underlying hash table, which is usually within a small constant factor of the original number of elements (unless you'd already removed a bunch of elements). Mixing insertions and pop can interfere heavily with this on Python 2 if inserted elements land in hash table index 0, trashing the search finger. This is much less of an issue on Python 3.
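For illustration only (the exact constants depend on the CPython version), draining a set with repeated pops is linear overall:

s = set(range(1000))
drained = []
while s:
    drained.append(s.pop())   # each pop is amortized ~O(1) thanks to the search finger
# Emptying the set this way is O(n) overall, proportional to the hash table size.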

Time complexity of python set.intersection for n sets

The Python wiki on time complexity lists a single intersection as O(min(len(s), len(t))), where s and t are sets (the bound assumes t is actually a set, so lookups in it are O(1)). (In English: the time is bounded by, and linear in, the size of the smaller set.)

Note: based on the comments below, this wiki entry was wrong when the argument passed is not a set; I've since corrected the wiki entry.

If you have n sets (sets, not iterables), you'll do n-1 intersections, and the total time can be (n-1) * O(len(s)), where s is the smallest set.

Note that as you perform each intersection the result may get smaller, so although the bound above is the worst case, in practice the time will be better than this.

However, looking at the specific code, the idea of taking the min() only applies to a single pair of sets and doesn't extend to multiple sets; so in that case we have to be pessimistic and take s as the set with the largest size.
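A hedged sketch of the n-set case, chaining pairwise intersections (intersect_all is a hypothetical helper for this example; the builtin form below does the equivalent work in C):

from functools import reduce

def intersect_all(sets):
    """Intersect a non-empty sequence of sets pairwise: n-1 intersections."""
    return reduce(lambda acc, s: acc & s, sets)

sets = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4, 5}]
print(intersect_all(sets))               # {3, 4}
print(sets[0].intersection(*sets[1:]))   # builtin equivalent: {3, 4}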

When is the CPython set `in` operator O(n)?

Load factor is a red herring. In CPython, sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.

O(N) behavior can occur when a great many elements have exactly the same hash code. Then they map to the same hash bucket, and set lookup degenerates to a slow form of linear search.

The easiest way to contrive such bad elements is to create a class with a horrible hash function. For example:

class C:
    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        return self.val == other.val

    def __hash__(self):
        return 3

Then hash(C(i)) == 3 regardless of the value of i.

To do the same with builtin types requires deep knowledge of their CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code:

>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}

which shows that the ten thousand distinct ints created all have hash code 1.
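As a rough, hedged demonstration (exact timings will vary by machine and CPython version), membership tests on such a colliding set degenerate toward a linear scan, while a set of ordinary ints stays fast:

import sys
import timeit

M = sys.hash_info.modulus
colliding = {1 + i * M for i in range(10000)}   # all elements hash to 1
normal = set(range(10000))                      # well-distributed hashes

probe_bad = 1 + 9999 * M                        # forces a long collision chain
probe_good = 9999

print(timeit.timeit(lambda: probe_bad in colliding, number=1000))
print(timeit.timeit(lambda: probe_good in normal, number=1000))
# The first lookup is typically far slower, reflecting the O(n) behavior.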
