Time Complexity of Accessing a Python Dict

Time complexity of accessing a Python dict

See Time Complexity. The Python dict is a hash map, so its worst case is O(n) if the hash function is bad and results in a lot of collisions. However, that is a very rare case: every item added would have to hash to the same bucket and end up in the same chain, which for a mainstream Python implementation is extremely unlikely. The average time complexity is, of course, O(1).
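
As a rough sanity check (a timing comparison, not a proof), the average lookup cost barely changes when an int-keyed dict grows a thousandfold:

from timeit import timeit

# Membership tests against a small and a large dict; the timings should come out
# roughly the same, which is what average-case O(1) predicts.
small = {i: None for i in range(1_000)}
large = {i: None for i in range(1_000_000)}

print(timeit(lambda: 500 in small, number=1_000_000))
print(timeit(lambda: 500_000 in large, number=1_000_000))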

The best approach is to look at the hashes of the objects you are using. The CPython dict uses int PyObject_Hash(PyObject *o), which is the equivalent of hash(o).

After a quick check, I have not yet managed to find two tuples that hash to the same value, which suggests that the lookup is O(1):

l = []
for x in range(0, 50):
    for y in range(0, 50):
        if hash((x, y)) in l:
            print("Fail:", (x, y))
        l.append(hash((x, y)))
print("Test Finished")


Python dictionary keys: "in" complexity

First, key in d.keys() is guaranteed to give you the same value as key in d for any dict d.

And the in operation on a dict, or the dict_keys object you get back from calling keys() on it (in 3.x), is not O(N), it's O(1).

There's no real "optimization" going on; it's just that using the hash is the obvious way to implement __contains__ on a hash table, just as it's the obvious way to implement __getitem__.


You may ask where this is guaranteed.

Well, it's not. Mapping Types defines dict as, basically, a hash table implementation of collections.abc.Mapping. There's nothing stopping someone from creating a hash table implementation of a Mapping that still provides O(N) searches, but it would be extra work to make such a bad implementation, so why would anyone?

If you really need to prove it to yourself, you can test every implementation you care about (with a profiler, or by using some type with a custom __hash__ and __eq__ that logs calls, or…), or read the source.
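
For example, here is a minimal sketch of that logging idea (LoggingKey is a made-up helper for this illustration, not part of any library):

from timeit import default_timer  # only the class below is needed for the demo

class LoggingKey:
    # Hashable wrapper that counts __hash__ and __eq__ calls at the class level.
    hash_calls = 0
    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        LoggingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        LoggingKey.eq_calls += 1
        return isinstance(other, LoggingKey) and self.value == other.value

d = {LoggingKey(i): i for i in range(10_000)}
LoggingKey.hash_calls = LoggingKey.eq_calls = 0

LoggingKey(1234) in d
print(LoggingKey.hash_calls, LoggingKey.eq_calls)  # stays tiny even for a 10,000-key dict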


In 2.x, you do not want to call keys, because that builds a list of the keys instead of a KeysView. You could use iterkeys, but that returns an iterator, and membership testing on an iterator is O(N). So, just use the dict itself as a sequence.

Even in 3.x, you don't want to call keys, because there's no need to. Iterating a dict, checking its __contains__, and in general treating it like a sequence is always equivalent to doing the same thing to its keys, so why bother? (And of course building the trivial KeysView, and accessing through it, will add a few nanoseconds to your running time and a few keystrokes to your program.)

(It's not quite clear that using sequence operations is equivalent for d.keys()/d.iterkeys() and d in 2.x. Other than performance issues, they are equivalent in every CPython, Jython, IronPython, and PyPy implementation, but it doesn't seem to be stated anywhere the way it is in 3.x. And it doesn't matter; just use key in d.)


While we're at it, note that this:

if(dict[key] != None):

… is not going to work. If the key is not in the dict, this will raise KeyError, not return None.

Also, you should never check None with == or !=; always use is.

You can do this with a try/except, or, more simply, with if dict.get(key, None) is not None. But again, there's no reason to do so. Also, that won't handle cases where None is a perfectly valid value. If that's the case, you need something like sentinel = object(); if dict.get(key, sentinel) is not sentinel:.
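
Spelled out, the sentinel idiom looks like this (the dict contents and key names here are purely illustrative):

d = {"present_none": None}   # None is a legitimate stored value here
sentinel = object()          # a unique object no caller could have stored

if d.get("present_none", sentinel) is not sentinel:
    print("key is present, even though its value is None")

if d.get("missing", sentinel) is sentinel:
    print("key is genuinely absent")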


So, the right thing to write is:

if key in d:

More generally, this is not true:

I know the "in" keyword is generally O(n) (as this just translates to python iterating over an entire list and comparing each element

The in operator, like most other operators, is just a call to a __contains__ method (or the equivalent for a C/Java/.NET/RPython builtin). list implements it by iterating the list and comparing each element; dict implements it by hashing the value and looking up the hash; blist.blist implements it by walking a B+Tree; etc. So, it could be O(n), O(1), O(log n), or something completely different.
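
As a toy illustration of that dispatch (EvenNumbers is made up for this example), any class can define its own __contains__ with whatever complexity it likes:

class EvenNumbers:
    def __contains__(self, item):
        # O(1): no iteration at all, just arithmetic
        return isinstance(item, int) and item % 2 == 0

print(4 in EvenNumbers())   # True
print(7 in EvenNumbers())   # False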

What is the time complexity of the Python dict has_key() method?

Short answer: in the worst case it is O(n), but the average-case time complexity is O(1), and the worst case is very rare.

When you do a lookup, the key is first hashed (by the key's type, via its __hash__), and the dictionary reduces that hash to a bucket index. Based on that result, we know in which bucket we have to search, and we start searching.

It is however possible that hash collisions occur: in that case multiple keys land in the same bucket, so we have to search among several keys. In the worst case all the keys are in the same bucket, and we fall back to a linear search.

Hash collisions involving a large number of keys are, however, very rare. Usually it is safe to assume that, regardless of the size of the dictionary, the number of keys in the same bucket will be small and roughly constant.
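
You can force the pathological case yourself with a key type whose hash is constant; this is only a sketch to make the worst case visible (BadKey is invented for the demo):

from timeit import timeit

class BadKey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 42   # every key collides, so all keys share one probe chain

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

small = {BadKey(i): None for i in range(100)}
large = {BadKey(i): None for i in range(1_000)}

# Each failed lookup has to compare against every colliding key: O(n).
print(timeit(lambda: BadKey(-1) in small, number=100))
print(timeit(lambda: BadKey(-1) in large, number=100))   # roughly 10x slower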

The fact that it is O(n) in the worst case has interesting consequences for security. Say you have a server that stores and retrieves data in a dictionary; the response time then scales with the cost of such a lookup. An attacker can craft input in such a way that all keys are placed in the same bucket(s). As a result lookups slow down, and eventually the server no longer responds in a reasonable time. That's why Python has the -R flag for hash randomization (and string hash randomization is enabled by default since Python 3.3). It changes the hash function for every run, making it much harder for an attacker to craft such input.
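
You can observe the randomization by hashing the same string in two fresh interpreter processes (no flag needed on Python 3.3+, where it is on by default):

import subprocess
import sys

# Run hash('some key') in two separate interpreter processes; the printed values
# will almost certainly differ because each process gets its own hash seed.
cmd = [sys.executable, "-c", "print(hash('some key'))"]
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())
print(subprocess.run(cmd, capture_output=True, text=True).stdout.strip())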

What is the time complexity of Python's dict.values()?

Q(1): I think it is O(1).

Edit: I was wrong. It is O(n). Thanks to @Roy Cohen and @kaya3.

test code:

import timeit

# Decorator that reports how long the wrapped call takes.
def timeis(func):
    def wrap(*args, **kwargs):
        start = timeit.default_timer()
        result = func(*args, **kwargs)
        end = timeit.default_timer()
        print(func.__name__, end - start)
        return result
    return wrap

import random

@timeis
def dict_values_test(dic, value):
    return value in dic.values()

tiny_dic = {i: 10 * i for i in range(1000)}
value = random.randint(1, 1000)
dict_values_test(tiny_dic, value)

small_dic = {i: 10 * i for i in range(1000000)}
value = random.randint(1, 1000000)
dict_values_test(small_dic, value)

big_dic = {i: 10 * i for i in range(100000000)}
value = random.randint(1, 100000000)
dict_values_test(big_dic, value)

result:

dict_values_test 2.580000000002025e-05
dict_values_test 0.015847600000000073
dict_values_test 1.4836825999999999

Q(2):

code:

def find_key_by_value(dic, find_value):
    return [k for k, v in dic.items() if v == find_value]

dic = {1: 10, 2: 20, 3: 30, 4: 40, 5: 40}
print(find_key_by_value(dic, 40))

result:

[4, 5]
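
Note that find_key_by_value scans every item, so each call is O(n). If you need many value-to-key lookups, one option (a sketch, assuming the values are hashable) is to build a reverse index once and query it in average O(1):

from collections import defaultdict

def build_reverse_index(dic):
    # One O(n) pass; afterwards each value lookup is an average O(1) dict access.
    index = defaultdict(list)
    for k, v in dic.items():
        index[v].append(k)
    return index

dic = {1: 10, 2: 20, 3: 30, 4: 40, 5: 40}
reverse = build_reverse_index(dic)
print(reverse[40])   # [4, 5]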

Is the time complexity of Python dictionary get() / update() always O(1)?

If I use only strings with maximum length of 15 as keys for a dictionary in python, is it impossible to have any collisions?

No. Collisions can happen. The result of the hash function is truncated to fit the host system: hash(..) of a string is a 32-bit integer on a 32-bit system, and usually a 64-bit integer on a 64-bit system.

Now if we count the number of strings of at most 15 characters (and here we will only assume printable ASCII characters; if we consider all Unicode characters, we only make it worse), then we can generate

\sum_{i=0}^{15} (128 - 32)^i

different strings, which equals 547'792'552'280'497'574'758'284'371'041, or approximately 5.47×10^29. The number of 64-bit integers is 2^64 = 18'446'744'073'709'551'616 ≈ 1.84×10^19. So even if we only consider printable ASCII strings, we cannot map every string to a distinct hash.
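
If you want to reproduce that count, it is a one-liner (96 printable ASCII characters per position, lengths 0 through 15):

# Matches the ~5.47e29 figure above; compare it with the size of the 64-bit hash space.
print(sum(96 ** i for i in range(16)))
print(2 ** 64)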

As a result, that means that hash collisions will happen if we keep filling the dictionary with new strings (eventually). Even if the dictionary creates one bucket per hash code, multiple strings will get in the same bucket, because the "hash space" is smaller than the "string space".

Worst case seems to be O(N) for accessing or updating a value, with N being the number of keys in the dictionary. With the built-in string hash of Python it's impossible to have the same hash for two different strings with a maximum length of 15, so the worst case would be O(1), right?

It is O(1), but for another reason. Since the strings have at most 15 characters, the number of possible strings (and hence keys) is fixed. For example, the number of printable-ASCII keys is fixed at the number derived above (5.47×10^29). Yes, we could use Unicode, and this scales up the number of keys dramatically, but it is still finite (approximately 5.06×10^90).

That means that there is an upper bound on N, and therefore there is not really such a thing as asymptotic complexity here. Even if we managed to generate all these strings, and in the worst case they all mapped to the same hash code and were therefore all stored in the same bucket, it would still be constant time: the processor would have a very hard time iterating over that bucket, but it would still be constant, at most 5.06×10^90 iterations.

What is the time complexity of searching through dictionary keys while using the keyword *in*?

Dictionaries are implemented as hash tables/maps, which have average O(1) performance for lookups. Internally, they use the same hash-table machinery as sets.

To further improve performance, replace

if nums[i] in cache.keys():

with

if nums[i] in cache:

Also, you can make an improvement with enumerate (which is more Pythonic):

def twoSum(self, nums, target):
    cache = {}
    for i, x in enumerate(nums):
        b = target - x

        if x in cache:
            return [i, cache[x]]

        cache[b] = i
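
A quick way to exercise it: the function is written LeetCode-style with an unused self parameter, so if it is defined at module level exactly as printed above you can simply pass None for self:

print(twoSum(None, [2, 7, 11, 15], 9))   # [1, 0]: current index first, then the cached complement's index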

Time Complexity of Python dictionary len() method

Inspecting the C source of dictobject.c shows that a dict keeps explicit counts rather than recounting its entries: the dict object stores the number of used entries in ma_used (which is what len() returns), and the keys structure below tracks its own bookkeeping fields (dk_size, dk_usable, dk_nentries):

layout:
+---------------+
| dk_refcnt |
| dk_size |
| dk_lookup |
| dk_usable |
| dk_nentries |
+---------------+
...

Thus len() runs in O(1).
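
As a rough confirmation (timings, not a proof), len() takes about the same time regardless of how many entries the dict holds:

from timeit import timeit

small = {i: None for i in range(1_000)}
big = {i: None for i in range(1_000_000)}

print(timeit(lambda: len(small), number=1_000_000))
print(timeit(lambda: len(big), number=1_000_000))   # about the same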

What is the time complexity of searching in Dict if very long strings are used as keys?

Since a dictionary is a hash table, and looking up a key in a hash table requires computing the key's hash, the time complexity of looking up the key in the dictionary cannot be less than the time complexity of the hash function.

In current versions of CPython, computing the hash of a string of length L takes O(L) time the first time that particular string object is hashed, and O(1) time thereafter, since the hash is cached on the string object:

>>> from timeit import timeit
>>> s = 'b' * (10**9) # string of length 1 billion
>>> timeit(lambda: hash(s), number=1)
0.48574538500002973 # half a second
>>> timeit(lambda: hash(s), number=1)
5.301000044255488e-06 # 5 microseconds

So that's also how long it takes when you look up the key in a dictionary:

>>> s = 'c' * (10**9) # string of length 1 billion
>>> d = dict()
>>> timeit(lambda: s in d, number=1)
0.48521506899999167 # half a second
>>> timeit(lambda: s in d, number=1)
4.491000026973779e-06 # 5 microseconds

You also need to be aware that a key in a dictionary is not looked up only by its hash: when the hashes match, it still needs to test that the key you looked up is equal to the key used in the dictionary, in case the hash matching is a false positive. Testing equality of strings takes O(L) time in the worst case:

>>> s1 = 'a'*(10**9)
>>> s2 = 'a'*(10**9)
>>> timeit(lambda: s1 == s2, number=1)
0.2006020820001595

So for a key of length L and a dictionary of length n:

  • If the key is not present in the dictionary, and its hash has already been cached, then it takes O(1) average time to confirm it is absent.
  • If the key is not present and its hash has not been cached, then it takes O(L) average time because of computing the hash.
  • If the key is present, it takes O(L) average time to confirm it is present whether or not the hash needs to be computed, because of the equality test.
  • The worst case is always O(nL) because if every hash collides and the strings are all equal except in the last places, then a slow equality test has to be done n times.

