Python: List vs Dict for look up table
Speed
Lookups in lists are O(n), lookups in dictionaries are amortized O(1), with regard to the number of items in the data structure. If you don't need to associate values, use sets.
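As a quick illustration of the gap (a minimal sketch; exact numbers depend on your machine), a timeit comparison of a worst-case membership test in a list versus a set:

```python
import timeit

# Membership in a list scans elements one by one (O(n));
# membership in a set is a hash lookup (amortized O(1)).
n = 10_000
data_list = list(range(n))
data_set = set(data_list)

# n - 1 is the last element, so the list scan is worst-case.
t_list = timeit.timeit(lambda: n - 1 in data_list, number=1_000)
t_set = timeit.timeit(lambda: n - 1 in data_set, number=1_000)

print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

On typical hardware the list timing is orders of magnitude larger, and the ratio grows with n.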
Memory
Both dictionaries and sets use hashing, and they use much more memory than is needed for object storage alone. According to A.M. Kuchling in Beautiful Code, the implementation tries to keep the hash table no more than 2/3 full, so you might waste quite a bit of memory.
If you do not add new entries on the fly (which you do, based on your updated question), it might be worthwhile to sort the list and use binary search. This is O(log n); it is likely to be slower for strings, and impossible for objects that do not have a natural ordering.
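For the static-list case, the standard-library bisect module gives the O(log n) binary search mentioned above; a small sketch:

```python
import bisect

def contains(sorted_list, value):
    """O(log n) membership test on a sorted list via binary search."""
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

words = sorted(["apple", "banana", "cherry", "date"])
print(contains(words, "cherry"))  # True
print(contains(words, "fig"))     # False
```

This only works if the elements support ordering comparisons, which is the limitation noted above.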
Why are dict lookups always better than list lookups?
I know lists use C arrays under the hood which made me conclude that lookup in a list with just a few items would be better than in a dictionary (accessing a few elements in an array is faster than computing a hash).
Accessing a few array elements is cheap, sure, but computing == is surprisingly heavyweight in Python. See that spike in your second graph? That's the cost of computing == for two ints right there.
Your list lookups need to compute == a lot more than your dict lookups do.
Meanwhile, computing hashes might be a pretty heavyweight operation for a lot of objects, but all the ints involved here just hash to themselves. (-1 would hash to -2, and large integers (technically longs) would hash to smaller integers, but that doesn't apply here.)
Dict lookup isn't really that bad in Python, especially when your keys are just a consecutive range of ints. All ints here hash to themselves, and Python uses a custom open addressing scheme instead of chaining, so all your keys end up nearly as contiguous in memory as if you'd used a list (which is to say, the pointers to the keys end up in a contiguous range of PyDictEntrys). The lookup procedure is fast, and in your test cases, it always hits the right key on the first probe.
Okay, back to the spike in graph 2. The spike in the lookup times at 1024 entries in the second graph is because for all smaller sizes, the integers you were looking for were all <= 256, so they all fell within the range of CPython's small integer cache. The reference implementation of Python keeps canonical integer objects for all integers from -5 to 256, inclusive. For these integers, Python was able to use a quick pointer comparison to avoid going through the (surprisingly heavyweight) process of computing ==. For larger integers, the argument to in was no longer the same object as the matching integer in the dict, and Python had to go through the whole == process.
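Both effects can be observed directly. The following sketch relies on CPython implementation details (the small-int cache and ints hashing to themselves), none of which is guaranteed by the language specification:

```python
# CPython implementation details, shown for illustration only.
a = 255 + 1
b = int("256")   # built at runtime, so not subject to constant folding
print(a is b)    # True: both refer to the cached canonical 256 object

print(hash(5))   # 5  -- small ints hash to themselves
print(hash(-1))  # -2 -- -1 is reserved as an error code in the C API
```

Above 256, two equal ints are generally distinct objects, so the identity fast path fails and == must compare values, which is the spike described here.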
python: list lookup vs dict lookup
Yes, dictionary lookups take constant time. Your if k not in Lst may have to scan the whole list to see if the number is not yet in the list before appending. It is this scan that makes list containment tests take O(n) time, and it is what is killing your algorithm.
A Python dictionary, on the other hand, uses a hash table to test for membership. Each key is hashed (reduced to a number), and that number is then turned into an index into the table. If the key found at that location is equal to the key you are testing with, a match is found. Hashing can lead to collisions (two keys hashing to the same table index), but the Python dictionary implementation has an algorithm to look for the next slot in an efficient manner. If an empty slot is found, the containment test has failed: the key is not present.
So, to test if k is in your dictionary, most tests need just one calculation. For some, a few more probes may be required. But on average, the lookup time is constant.
If you are curious, and understand C well enough, take a look at the C implementation for all the (well documented) details. You could also watch this PyCon 2010 presentation by Brandon Rhodes about how the CPython dict works, or pick up a copy of Beautiful Code, which includes a chapter on the implementation written by Andrew Kuchling.
You may want to look at the set() type; this is, like a dictionary, an unordered collection with O(1) membership tests, but it holds just values, no keys:
some_set = set()

def do_something(k):
    some_set.add(k)

for k in range(100000):
    if k not in some_set:  # O(1) membership test
        do_something(k)
Internally, set() objects use a hash table as well.
Large list of dictionaries as a lookup table on disk
Your data seems to be regular, i.e. there is no variation in the dicts' keys, right? One could simply use a document-based solution like MongoDB, but I think a simple SQL-based database might be more efficient and is easier to implement.
Alternatives would be the pickle module (not recommended for really large objects, as they are loaded into memory) or shelve, which builds on top of pickle but is more efficient with large files, as far as I know (entries aren't all loaded into memory at once). The benefit of shelve is its syntax, which mimics Python's dict syntax and should be easy to use (see the link). And there is no need to set up a MongoDB or MySQL database (which might get complicated, at least on Windows). Both pickle and shelve are part of the standard library.
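A minimal shelve sketch (the filename and keys here are made up for illustration): the shelf persists to a file on disk but is used like a dict, and entries are only unpickled when accessed:

```python
import shelve

# Writing: the shelf file is created on disk (filename is arbitrary).
with shelve.open("lookup_table") as db:
    db["row-1"] = {"name": "Alice", "score": 10}
    db["row-2"] = {"name": "Bob", "score": 20}

# Reading later, possibly in another process: dict-style access.
with shelve.open("lookup_table") as db:
    print(db["row-2"]["name"])  # Bob
    print("row-1" in db)        # True
```

Keys must be strings, and by default mutating a stored value in place is not written back unless you open the shelf with writeback=True.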
You might also check out datasets and its easy-to-use interface. It uses an SQLite database under the hood.
If you're dealing with huge files (say, > 2 GB), I'd not stick with datasets or shelve, but use more mature solutions like sqlalchemy (plus a MySQL database) or MongoDB and its Python interface (PyMongo).
Should I use dict or list?
You really want a set. Sets are faster than lists because they can only contain unique elements, which allows them to be implemented as hash tables. Hash tables allow membership testing (if element in my_set) in O(1) time. This contrasts with lists, where the only way to check whether an element is in the list is to check every element of the list in turn (in O(n) time).
A dict is similar to a set in that both allow unique keys only, and both are implemented as hash tables. Both allow O(1) membership testing. The difference is that a set only has keys, while a dict has both keys and values (which is extra overhead you don't need in this application).
Using a set, and replacing the nested for loop with itertools.chain() to flatten the 2D list to a 1D list:
import itertools

seen = set()
for author in itertools.chain(*authors):
    seen.add(author)
Or shorter:
import itertools
seen = set(itertools.chain(*authors))
Edit (thanks, @jamylak): this is more memory efficient for large lists:
import itertools
seen = set(itertools.chain.from_iterable(authors))
Example on a list of lists:
>>> a = [[1,2],[1,2],[1,2],[3,4]]
>>> set(itertools.chain(*a))
set([1, 2, 3, 4])
P.S.: If, instead of finding all the unique authors, you want to count the number of times you see each author, use a collections.Counter, a special kind of dictionary optimised for counting things.
Here's an example of counting characters in a string:
>>> a = "DEADBEEF CAFEBABE"
>>> import collections
>>> collections.Counter(a)
Counter({'E': 5, 'A': 3, 'B': 3, 'D': 2, 'F': 2, ' ': 1, 'C': 1})
Time complexity for lookup in dictionary.values() lists vs sets
Let x in s be the operation that searches for x in s, where x is the item and s is the list.
The average case for such an operation, assuming parameters generated uniformly at random, is O(n).
For more information about time complexity, see the official Python wiki page on time complexity.
Why isn't my dict lookup faster than my list lookup in Python?
The above result seems to imply that dictionaries are not as efficient as lists for lookup tables, even though list lookups are O(n) while dict lookups are O(1). I've tested the following to see if the O(n)/O(1) performance was true... turns out it isn't...
It's not true that list lookups are O(n) in the sense of "getting an item", which is the sense your code seems to test. Determining where (if at all) an element exists could be O(n): e.g. somelist.index(someval_not_in_the_list) or someval_not_in_the_list in somelist will both have to scan over each element. Try comparing x in somelist with x in somedict to see a major difference.
But simply accessing somelist[index] is O(1) (see the Time Complexity page). And the coefficient is probably going to be smaller than in the case of a dictionary (also O(1)), because you don't have to hash the key.
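A rough sketch of this distinction (numbers vary by machine): direct access by position or key is O(1) for both structures, while containment testing is where they diverge:

```python
import timeit

n = 10_000
somelist = list(range(n))
somedict = {i: i for i in range(n)}

# Both O(1): direct access by index / key.
t_index = timeit.timeit(lambda: somelist[n - 1], number=100_000)
t_key = timeit.timeit(lambda: somedict[n - 1], number=100_000)

# Containment: the list scans every element, the dict hashes once.
t_in_list = timeit.timeit(lambda: (n - 1) in somelist, number=1_000)
t_in_dict = timeit.timeit(lambda: (n - 1) in somedict, number=1_000)

print(f"access:      list {t_index:.4f}s  dict {t_key:.4f}s")
print(f"containment: list {t_in_list:.4f}s  dict {t_in_dict:.4f}s")
```

The access timings are comparable; the containment timings differ by orders of magnitude, which is the comparison the question's benchmark missed.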
Use dictionary as lookup table
Assuming data is a country code (like "PL", for example):
def get_path(data):
    return countries.get(data)
The get() method checks if the dictionary has the key and, if so, returns the corresponding value; otherwise it returns None.
If you want a default value other than None when the key is not present, you can specify it as the second argument, like this:
def get_path(data):
    return countries.get(data, "default/path/x/y/z")
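Usage with a toy countries mapping (the codes and paths below are placeholders, not the asker's actual data):

```python
# Hypothetical lookup table; the real one maps country codes to real paths.
countries = {
    "PL": "paths/poland",
    "DE": "paths/germany",
}

def get_path(data):
    return countries.get(data, "default/path/x/y/z")

print(get_path("PL"))  # paths/poland
print(get_path("XX"))  # default/path/x/y/z  (unknown code falls back)
```

This avoids both a KeyError and an explicit if data in countries check.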