In Python, why are the time complexities different between set operations union, intersection, and symmetric difference?
Union
Consider two sets s
and t
. In order to build a new set which represent the union of s
and t
you need to iterate over them. This results in a time complexity of O(len(s) + len(t))
.
def union(s, t):
"""Simple example for union, ignoring error-handling on inputs, etc."""
result = set(s) # copy: O(len(s))
for el in t: # iterate over t: O(len(t))
result.add(el) # ignoring collisions, O(1) amortized time
return result
Symmetric Difference
def symmetric_difference(s, t):
"""Simple example for symmetric difference, ignoring error-handling on inputs, etc."""
result = set(t) # copy: O(len(t))
for el in s: # iterate over s: O(len(s))
if el not in t:
result.add(el)
else:
result.remove(el)
return result
What CPython does, is to start from a copy of t
, then iterate over s
and add or remove element from the output set according to the result of the lookup.
Also in this case assuming the amortized time complexity for the lookup to be O(1), the resulting time complexity should be O(len(s) + len(t))
, as for the union.
The table indicates a different time complexity for the average time complexity of the symmetric difference as O(s)
and the reason might be that they ignore the time complexity of the make_new_set
function (which builds a new set starting from t
).
What is time complexity of a list to set conversion?
Yes. Iterating over a list is O(n)
and adding each element to the hash set is O(1)
, so the total operation is O(n)
.
Time complexity for adding elements to list vs set in python
Both operations are O(1) amortized time-complexity.
Appending elements to a list has a lower coefficient because it does not need to hash the element first, nor does it need to check/handle hash collisions.
In the case of adding x
into a set
, Python needs to compute hash(x)
first, because keeping the hash of all elements is what allows sets to have fast O(1) membership checks (compared to O(n) membership checks for lists).
What is the run time of the set difference function in Python?
looks that you're right, difference is performed with O(n)
complexity in the best cases
But keep in mind that in worst cases (maximizing collisions with hashes) it can raise to O(n**2)
(since lookup worst case is O(n)
: How is set() implemented?, but it seems that you can generally rely on O(1)
)
As an aside, speed depends on the type of object in the set
. Integers hash well (roughly as themselves, with probably some modulo), whereas strings need more CPU.
What is the time complexity of pop() for the set in Python?
On modern CPython implementations, pop
takes amortized constant-ish time (I'll explain further). On Python 2, it's usually the same, but performance can degrade heavily in certain cases.
A Python set
is based on a hash table, and pop
has to find an occupied entry in the table to remove and return. If it searched from the start of the table every time, this would take time proportional to the number of empty leading entries, and it would get slower with every pop
.
To avoid this, the standard CPython implementation tries to remember the position of the last pop
ped entry, to speed up sequences of pop
s. CPython 3.5+ has a dedicated finger
member in the set memory layout to store this position, but earlier versions abuse the hash field of the first hash table entry to store this index.
On any Python version, removing all elements from a set with a sequence of pop
operations will take time proportional to the size of the underlying hash table, which is usually within a small constant factor of the original number of elements (unless you'd already removed a bunch of elements). Mixing insertions and pop
can interfere heavily with this on Python 2 if inserted elements land in hash table index 0, trashing the search finger. This is much less of an issue on Python 3.
Time complexity of python set.intersection for n sets
The python wiki on time complexity lists a single intersection as O(min(len(s), len(t)) where s and t are the sizes of the sets and t is a set. (In English: the time is bounded by and linear in the size of the smaller set.)
Note: based on the comments below, this wiki entry had been be wrong if the argument passed is not a set. I've corrected the wiki entry.
If you have n sets (sets, not iterables), you'll do n-1 intersections and the time can be
(n-1)O(len(s)) where s is the set with the smallest size.
Note that as you do an intersection the result may get smaller, so although O is the worst case, in practice, the time will be better than this.
However, looking at the specific code this idea of taking the min() only applies to a single pair of sets and doesn't extend to multiple sets. So in this case, we have to be pessimistic and take s as the set with the largest size.
When CPython set `in` operator is O(n)?
Load factor is a red herring. In CPython sets (and dicts) automatically resize to keep the load factor under 2/3. There's nothing you can do in Python code to stop that.
O(N)
behavior can occur when a great many elements have exactly the same hash code. Then they map to the same hash bucket, and set lookup degenerates to a slow form of linear search.
The easiest way to contrive such bad elements is to create a class with a horrible hash function. Like, e.g., and untested:
class C:
def __init__(self, val):
self.val = val
def __eq__(a, b):
return a.val == b.val
def __hash__(self):
return 3
Then hash(C(i)) == 3
regardless of the value of i
.
To do the same with builtin types requires deep knowledge of their CPython implementation details. For example, here's a way to create an arbitrarily large number of distinct ints with the same hash code:
>>> import sys
>>> M = sys.hash_info.modulus
>>> set(hash(1 + i*M) for i in range(10000))
{1}
which shows that the ten thousand distinct ints created all have hash code 1.
Related Topics
Python Scope: "Unboundlocalerror: Local Variable 'C' Referenced Before Assignment"
How to Capture Output of Python's Interpreter and Show in a Text Widget
Class That Acts as Mapping for **Unpacking
How to Set Window Size in Selenium Chrome Python
Cannot Install Python 3.7 on Osx-Arm64
Cannot Pass an Argument to Python with "#!/Usr/Bin/Env Python"
How to Run Python Script Without Typing 'Python ...'
Pandas Out of Bounds Nanosecond Timestamp After Offset Rollforward Plus Adding a Month Offset
Concatenate Numpy Arrays Without Copying
How to Get "Python -M Venv" to Directly Install Latest Pip Version
Use Python's String.Replace VS Re.Sub
Python Random Sample with a Generator/Iterable/Iterator