How Is Tuple Implemented in Cpython

How is tuple implemented in CPython?

As a caveat, everything in this answer is based on what I've gleaned from looking over the implementation you linked.

It seems that the standard implementation of a tuple is simply as an array. However, there are a bunch of optimizations in place to speed things up.

First, if you try to make an empty tuple, CPython instead will hand back a canonical object representing the empty tuple. As a result, it can save on a bunch of allocations that are just allocating a single object.

Next, to avoid allocating a bunch of small objects, CPython recycles memory for many small lists. There is a fixed constant (PyTuple_MAXSAVESIZE) such that all tuples less than this length are eligible to have their space reclaimed. Whenever an object of length less than this constant is deallocated, there is a chance that the memory associated with it will not be freed and instead will be stored in a "free list" (more on that in the next paragraph) based on its size. That way, if you ever need to allocate a tuple of size n and one has previously been allocated and is no longer in use, CPython can just recycle the old array.

The free list itself is implemented as an array of size PyTuple_MAXSAVESIZE storing pointers to unused tuples, where the nth element of the array points either to NULL (if no extra tuples of size n are available) or to a reclaimed tuple of size n. If there are multiple different tuples of size n that could be reused, they are chained together in a sort of linked list by having each tuple's zeroth entry point to the next tuple that can be reused. (Since there is only one tuple of length zero ever allocated, there is never a risk of reading a nonexistent zeroth element). In this way, the allocator can store some number of tuples of each size for reuse. To ensure that this doesn't use too much memory, there is a second constant PyTuple_MAXFREELIST that controls the maximum length of any of these linked lists within any bucket. There is then a secondary array of length PyTuple_MAXSAVESIZE that stores the length of the linked lists for tuples of each given length so that this upper limit isn't exceeded.

All in all, it's a very clever implementation!

What makes python tuples immutable. How are they implemented in the memory?

There's nothing special about the implementation or memory layout of tuples that makes them immutable. They're immutable because they just don't have any mutator operations.

Lists are mutable because the list class implements operations like append and __setitem__ that mutate lists. Such operations have to be deliberately included in the implementation; they don't come into existence automatically. If list didn't have mutator operations, lists would be immutable too.

At the level of the C implementation, the C code that implements tuples has to be able to write to a tuple's memory, and at that level, the data structures are mutable. The interface the implementation presents to Python code is immutable, though.

You can't do the same with a class you implement in Python because you can't get a firm separation between implementation and interface. Such a separation arises automatically when implementing a Python class in C due to the design of the Python C API, but without anything like a private access modifier, you can't do the same in Python.

Underlying datastructure of list, tuple, dict

The best place to look is the CPython implementation source code:

  • dict - Hash map targeting fast resolution of keys
  • list - Looks like an array of PyObjects
  • tuple - Same as list but with optimisations that a tuple can allow (fixed size, objects)
  • set - Hash map with optimisations for cache locality

The source code is heavily commented and well written C. This would be the best place to understand the data structures used in detail.

How named tuples are implemented internally in python?

Actually, it's very easy to find out how a given namedtuple is implemented: if you pass the keyword argument verbose=True when creating it, its class definition is printed:

>>> Point = namedtuple('Point', "x y", verbose=True)
from builtins import property as _property, tuple as _tuple
from operator import itemgetter as _itemgetter
from collections import OrderedDict

class Point(tuple):
'Point(x, y)'

__slots__ = ()

_fields = ('x', 'y')

def __new__(_cls, x, y):
'Create new instance of Point(x, y)'
return _tuple.__new__(_cls, (x, y))

@classmethod
def _make(cls, iterable, new=tuple.__new__, len=len):
'Make a new Point object from a sequence or iterable'
result = new(cls, iterable)
if len(result) != 2:
raise TypeError('Expected 2 arguments, got %d' % len(result))
return result

def _replace(_self, **kwds):
'Return a new Point object replacing specified fields with new values'
result = _self._make(map(kwds.pop, ('x', 'y'), _self))
if kwds:
raise ValueError('Got unexpected field names: %r' % list(kwds))
return result

def __repr__(self):
'Return a nicely formatted representation string'
return self.__class__.__name__ + '(x=%r, y=%r)' % self

@property
def __dict__(self):
'A new OrderedDict mapping field names to their values'
return OrderedDict(zip(self._fields, self))

def _asdict(self):
'''Return a new OrderedDict which maps field names to their values.
This method is obsolete. Use vars(nt) or nt.__dict__ instead.
'''
return self.__dict__

def __getnewargs__(self):
'Return self as a plain tuple. Used by copy and pickle.'
return tuple(self)

def __getstate__(self):
'Exclude the OrderedDict from pickling'
return None

x = _property(_itemgetter(0), doc='Alias for field number 0')

y = _property(_itemgetter(1), doc='Alias for field number 1')

So, it's a subclass of tuple with some extra methods to give it the required behaviour, a _fields class-level constant containing the field names, and property methods for attribute access to the tuple's members.

As for the code behind actually building this class definition, that's deep magic.

List vs tuple, when to use each?

There's a strong culture of tuples being for heterogeneous collections, similar to what you'd use structs for in C, and lists being for homogeneous collections, similar to what you'd use arrays for. But I've never quite squared this with the mutability issue mentioned in the other answers. Mutability has teeth to it (you actually can't change a tuple), while homogeneity is not enforced, and so seems to be a much less interesting distinction.

How is std::tuple implemented?

One approach to implementing tuples is using multiple-inheritance. The tuple-elements are held by leaf-classes, and the tuple class itself inherits from multiple leafs. In pseudo-code:

template<typename T0, typename T1, ..., typename Tn>
class PseudoTuple : TupleLeaf<0, T0>, TupleLeaf<1, T1>, ..., TupleLeaf<n, Tn> {
...
};

Each leaf has an index, so that each base-class becomes unique even if the types they contain are identical, so we can access the nth element with a simple static_cast:

static_cast<TupleLeaf<0, T0>*>(this);
// ...
static_cast<TupleLeaf<n, Tn>*>(this);

I've written up a detailed explanation about this "flat" tuple implementation here: C++11 tuple implementation details (Part 1)

How does python compute the hash of a tuple

If I have a tuple with many elements, is its hash calculated from its elements' ids or its elements' content?

Neither. It is calculated on the basis of the hashes of these elements, not their "contents" (values/attributes), nor IDs.



The basics: why hashes are used the way they are

Take a look at this paragraph in python's documentation glossary.

Whether something is hashable or not, and how it is hashed, depends on the implementation of its __hash__() method. By itself, Python has no idea about mutability of an object. It's the implementation that is expected to provide appropriate mechanisms and avoid pitfalls.

A hash is useful in identification of objects. For example, it speeds up data retrieval from a dict, identifying the arbitrary value of a key by a single numerical value from a finite interval - the key's hash.

A hash should remain unchanged throughout the lifetime of the object. Otherwise, one object could map to two different values in a dict, or be included into a set twice, as soon as its hash changes.

It's not enough to compare two objects by their hashes: at the end of the day, you may still need to perform equality checks, because there may be a collision between the hashes of two different objects. That's why hashable objects are required to have __eq__() implemented.

This ties back to the mutability: if a hashable object mutates such that it changes equality comparisons with hashables, especially the ones with the same hash - it breaks the contract, and may result in the same weirdness a mutating hash would. Hashable objects should not mutate comparisons between themselves.

Hashable objects that are equal to each other should have the same hash. This is a general contract that makes everything else simpler - it's natural to assume x == y implies that both x and y map to the same value in a dict.



Hash of a tuple

Consider your first example. The tuple hashes itself on the basis of its elements, while its second element, the list, doesn't have a hash at all - the __hash__ method is not implemented for it. And so the tuple.__hash__ method fails.

That's why a tuple with a list object inside of it is not hashable. As you can see, it is therefore also incorrect to say, that a tuple hash is based on the IDs of its elements.

Notice, that if the list was hashable here, and the hash was based on its elements, changing them would change the hash of the outer tuple, breaking the contract.



Why my custom class doesn't require a __hash__()?

Let's have a look at python data model documentation, and what it has to say on the topic:

User-defined classes have __eq__() and __hash__() methods by default; with them, all objects compare unequal (except with themselves) and x.__hash__() returns an appropriate value such that x == y implies both that x is y and hash(x) == hash(y).

Put simply, the default implementation compares objects identity, which has nothing to do with object attributes. That's why you can change the values "inside" the object of your custom class without changing its hash.

That's also why you don't have to define __hash__() for your classes - python does it for you in this case.

In this regard you're right - the default (CPython's) implementation of the hashing function for custom classes relies on the id() of an object (and not on the values "inside" of it). It is an implementation detail, and it differs between Python versions.

In more recent versions of Python, the relation between hash() and id() involves randomization. This prevents some forms of denial of service attacks, where creating arbitrary hash collisions could significantly slow down web applications. See PEP-456.



How does it actually hash itself?

While the details are quite complicated and probably involve some advanced math, the implementation of the hash function for tuple objects is written in C, and can be seen here (see static Py_hash_t tuplehash(PyTupleObject *v)).

The calculation involves XORing a constant with the hashes of each of the tuple's elements. The line responsible for hashing of the elements is this one:

y = PyObject_Hash(*p++);

So, to answer your original question: it does a bunch of XOR hokus-pocus with the hashes of each of its elements. Whether or not the contents and attributes of these elements are considered depends on their specific hash functions.

Why implement two so similar data structures like List and Tuple

Immutable types are hashable, and can be used as dictionary keys. This works:

key = (1, 2, 3)
d = {key: 1}

But this doesn't:

key = [1, 2, 3]
d = {key: 1}

If it did, what would you expect this to do?

key[0] = 2
print d[key] # id(key) hasn't changed, so surely the lookup should still work
print d[[1, 2, 3]] # but also, we stored a piece of data at [1, 2, 3], didn't we?
print d[[2, 2, 3]] # but if d[key] works, surely we can expand key to its value

What's the difference between lists and tuples?

Apart from tuples being immutable there is also a semantic distinction that should guide their usage. Tuples are heterogeneous data structures (i.e., their entries have different meanings), while lists are homogeneous sequences. Tuples have structure, lists have order.

Using this distinction makes code more explicit and understandable.

One example would be pairs of page and line number to reference locations in a book, e.g.:

my_location = (42, 11)  # page number, line number

You can then use this as a key in a dictionary to store notes on locations. A list on the other hand could be used to store multiple locations. Naturally one might want to add or remove locations from the list, so it makes sense that lists are mutable. On the other hand it doesn't make sense to add or remove items from an existing location - hence tuples are immutable.

There might be situations where you want to change items within an existing location tuple, for example when iterating through the lines of a page. But tuple immutability forces you to create a new location tuple for each new value. This seems inconvenient on the face of it, but using immutable data like this is a cornerstone of value types and functional programming techniques, which can have substantial advantages.

There are some interesting articles on this issue, e.g. "Python Tuples are Not Just Constant Lists" or "Understanding tuples vs. lists in Python". The official Python documentation also mentions this

"Tuples are immutable, and usually contain an heterogeneous sequence ...".

In a statically typed language like Haskell the values in a tuple generally have different types and the length of the tuple must be fixed. In a list the values all have the same type and the length is not fixed. So the difference is very obvious.

Finally there is the namedtuple in Python, which makes sense because a tuple is already supposed to have structure. This underlines the idea that tuples are a light-weight alternative to classes and instances.

Implementing Tuples and Lists in the isinstance Function in Python 2.7

If you want to be able to add lists or tuples, change your __add__ method:

def __add__(self, other):
if not isinstance(other, (Point, list, tuple)):
raise TypeError("Must be of type Point, list, or tuple")
if isinstance(other, (list, tuple)):
other = Point(other[0], other[1])
x = self.X + other.X
y = self.Y + other.Y
return Point(x, y)

Otherwise, you'd have to add another Point object, not a list. In that case, just tweak your last line:

print p1 + Point(3.5, 6)


Related Topics



Leave a reply



Submit