In-Memory Size of a Python Structure

In-memory size of a Python structure

The recommendation from an earlier question on this was to use sys.getsizeof(), quoting:

>>> import sys
>>> x = 2
>>> sys.getsizeof(x)
14
>>> sys.getsizeof(sys.getsizeof)
32
>>> sys.getsizeof('this')
38
>>> sys.getsizeof('this also')
48

You could take this approach:

>>> import sys
>>> import decimal
>>>
>>> d = {
... "int": 0,
... "float": 0.0,
... "dict": dict(),
... "set": set(),
... "tuple": tuple(),
... "list": list(),
... "str": "a",
... "unicode": u"a",
... "decimal": decimal.Decimal(0),
... "object": object(),
... }
>>> for k, v in sorted(d.iteritems()):
... print k, sys.getsizeof(v)
...
decimal 40
dict 140
float 16
int 12
list 36
object 8
set 116
str 25
tuple 28
unicode 28

2012-09-30

python 2.7 (linux, 32-bit):

decimal 36
dict 136
float 16
int 12
list 32
object 8
set 112
str 22
tuple 24
unicode 32

python 3.3 (linux, 32-bit):

decimal 52
dict 144
float 16
int 14
list 32
object 8
set 112
str 26
tuple 24
unicode 26

2016-08-01

OSX, Python 2.7.10 (default, Oct 23 2015, 19:19:21) [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin

decimal 80
dict 280
float 24
int 24
list 72
object 16
set 232
str 38
tuple 56
unicode 52
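
The loop above is Python 2 code (dict.iteritems() and the print statement no longer exist in Python 3). For reference, a minimal Python 3 equivalent looks like this; u"a" is just a str in Python 3, so a bytes literal is shown instead, and the exact numbers will again depend on the platform and interpreter version:

import sys
import decimal

d = {
    "int": 0,
    "float": 0.0,
    "dict": dict(),
    "set": set(),
    "tuple": tuple(),
    "list": list(),
    "str": "a",
    "bytes": b"a",
    "decimal": decimal.Decimal(0),
    "object": object(),
}
for k, v in sorted(d.items()):
    print(k, sys.getsizeof(v))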

Find the memory size of a set of strings vs. set of bytestrings

sys.getsizeof does not measure the size of the full target data structure. It only measures the memory taken by the set object, which contains references to strings/bytes objects. The referenced objects are not included in the returned memory consumption (i.e. it does not walk recursively through each object of the target data structure). A reference typically takes 8 bytes on a 64-bit platform, and a CPython set is not as compact as a list: it is implemented as a hash table with many buckets, some of which are unused. In fact, this is mandatory for this data structure to be fast (in general, the occupancy should be 50%-90%). Moreover, each bucket contains a hash, which usually takes 8 bytes.

The strings themselves take much more space than a bucket (at least on my machine):

sys.getsizeof(randomstring(50))           # 99
sys.getsizeof(randomstring(50).encode()) # 83

On my machine, it turns out that CPython str objects are 16 bytes bigger than the equivalent bytes objects.
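
To illustrate the difference, here is a minimal sketch that estimates the total footprint of a set by adding the elements' own sizes to the container size; random_string() is a hypothetical stand-in for the randomstring() helper used above:

import sys
import secrets

def random_string(n):
    # hypothetical helper: returns a random ASCII string of length n
    return secrets.token_hex(n)[:n]

strings = {random_string(50) for _ in range(10000)}

# Only the set object itself: header + hash-table buckets, not the strings.
container_only = sys.getsizeof(strings)

# Rough total: container plus each referenced string's own size.
total = container_only + sum(sys.getsizeof(s) for s in strings)

print(container_only, total)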

memory size of Python data structure

Have a look at the sys.getsizeof function. According to the documentation, it returns the size of an object in bytes: it calls the object's __sizeof__ method and adds an additional garbage-collector overhead if the object is managed by the garbage collector.

As Daniel pointed out in a comment, it's not recursive; it only counts bytes occupied by the object itself, not other objects it refers to. This recipe for a recursive computation is linked to by the Python 3 documentation.
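
In case that link goes stale, here is a minimal sketch of such a recursive computation (a simplified variant, not the linked recipe itself); it only follows the common built-in containers and does not account for __dict__, __slots__ or shared internals:

import sys

def total_size(obj, seen=None):
    """Rough recursive estimate of an object's memory footprint."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # avoid counting shared objects twice (and cycles)
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

print(total_size({"a": [1, 2, 3], "b": ("x", "y")}))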

Memory-efficient data structure for a set of short bytes-strings

Here are the methods I have tested so far, thanks to the comments, and that seem to work.

Sorted list + bisection search (+ bloom filter)

  • Insert everything in a standard list L, in sorted order. This takes a lot less memory than a set.

  • (optional) Create a Bloom filter; here is a very small piece of code to do it.

  • (optional) First test membership with Bloom filter (fast).

  • Check if it really is a match (and not a false positive) with the fast in_sorted_list() from this answer using bisect, which is much faster than a standard lookup b"hello" in L.

If the bisection search is fast enough, we can even bypass the Bloom filter (steps 2 and 3). The lookup is then O(log n); a minimal sketch of this check is given below.

In my test with 100M strings, even without bloom filter, the lookup took 2 µs on average.
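
A minimal sketch of step 4, the sorted list + bisect membership test (assuming the whole list fits in RAM):

import bisect

def in_sorted_list(sorted_list, value):
    # Binary search: O(log n), instead of the O(n) scan done by `value in sorted_list`.
    i = bisect.bisect_left(sorted_list, value)
    return i != len(sorted_list) and sorted_list[i] == value

L = sorted([b"hello", b"world", b"spam", b"eggs"])
print(in_sorted_list(L, b"hello"))   # True
print(in_sorted_list(L, b"python"))  # False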

Sqlite3

As suggested by @tomalak's comment, inserting all the data into a Sqlite3 database works very well. Querying whether a string exists in the database took 50 µs on average on my 8 GB database, even without any index.

Adding an index made the DB grow to 11 GB, but the queries still took ~50 µs on average, so no gain there.

Edit: as mentioned in a comment, using CREATE TABLE t(s TEXT PRIMARY KEY) WITHOUT ROWID; even made the DB smaller: 3.3 GB, and the queries are still done in ~50 µs on average. Sqlite3 is (as always) really amazing.

In this case, it's even possible to load it totally in RAM with the method from How to load existing db file to memory in Python sqlite3?, and then it's ~9 µs per query!
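
A minimal sketch of this approach (the file name, table name and column name are only illustrative):

import sqlite3

con = sqlite3.connect("strings.db")
con.execute("CREATE TABLE IF NOT EXISTS t(s TEXT PRIMARY KEY) WITHOUT ROWID")

# INSERT OR IGNORE skips duplicates, mimicking set semantics.
words = ["hello", "world", "spam"]
con.executemany("INSERT OR IGNORE INTO t(s) VALUES (?)", ((w,) for w in words))
con.commit()

def contains(s):
    # The PRIMARY KEY lookup is an index search, not a full table scan.
    return con.execute("SELECT 1 FROM t WHERE s = ?", (s,)).fetchone() is not None

print(contains("hello"), contains("python"))  # True False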

Bisection in file with sorted lines

Working, and with very fast queries (~35 µs per query), without loading the file into memory! See
Bisection search in the sorted lines of an opened file (not loaded in memory)
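
A minimal sketch of the idea (not the code from the linked answer): bisect on byte offsets with seek(), re-synchronising on line boundaries after each jump. The file name is only illustrative, and the lines are assumed to be sorted and newline-terminated:

def line_at(f, pos):
    """Return the first complete line starting at or after byte offset pos."""
    f.seek(pos)
    if pos > 0:
        f.readline()              # skip the partial line we landed in
    return f.readline()

def in_sorted_file(f, target):
    """Membership test on a file with sorted lines, without loading it in memory."""
    f.seek(0, 2)                  # jump to the end to get the file size
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        line = line_at(f, mid)
        if line.rstrip(b"\n") == target:
            return True
        if line == b"" or line.rstrip(b"\n") > target:
            hi = mid              # we are past the target (or past the last line)
        else:
            lo = mid + 1          # the line we read is still smaller than the target
    return False

with open("sorted_lines.txt", "rb") as f:
    print(in_sorted_file(f, b"hello"))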

Dict with prefixes as keys and concatenation of suffixes as values

This is the solution described here: Set of 10-char strings in Python is 10 times bigger in RAM than expected.

The idea is: we have a dict D (for example a collections.defaultdict(bytes), so that += works on missing keys) and, for a given word,

prefix, suffix = word[:4], word[4:]
D[prefix] += suffix + b' '

With this method, the RAM used is even smaller than the actual data (I tested with 30M strings of average length 14, and it used 349 MB); the queries seem very fast (2 µs), but the initial creation time of the dict is a bit high.

I also tried storing a list of suffixes as the dict values, but that consumes much more RAM.
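
A minimal sketch of this layout (the prefix length, separator and names are only illustrative; it assumes the suffixes never contain the separator byte):

from collections import defaultdict

D = defaultdict(bytes)

def add(word):
    prefix, suffix = word[:4], word[4:]
    D[prefix] += suffix + b' '    # suffixes are concatenated, space-separated

def contains(word):
    prefix, suffix = word[:4], word[4:]
    # Surrounding the suffix with separators avoids matching inside another suffix.
    return (b' ' + suffix + b' ') in (b' ' + D.get(prefix, b''))

add(b"helloworld")
add(b"hellothere")
print(contains(b"helloworld"))   # True
print(contains(b"helloplanet"))  # False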

Size of list in memory

Here's a fuller interactive session that will help me explain what's going on (Python 2.6 on Windows XP 32-bit, but it doesn't really matter):

>>> import sys
>>> sys.getsizeof([])
36
>>> sys.getsizeof([1])
40
>>> lst = []
>>> lst.append(1)
>>> sys.getsizeof(lst)
52
>>>

Note that the empty list is a bit smaller than the one with [1] in it. When an element is appended, however, it grows much larger.

The reason for this lies in the implementation details in Objects/listobject.c, in the CPython source.

Empty list

When an empty list [] is created, no space for elements is allocated - this can be seen in PyList_New. 36 bytes is the amount of space required for the list data structure itself on a 32-bit machine.

List with one element

When a list with a single element [1] is created, space for one element is allocated in addition to the memory required by the list data structure itself. Again, this can be found in PyList_New. Given size as argument, it computes:

nbytes = size * sizeof(PyObject *);

And then has:

if (size <= 0)
    op->ob_item = NULL;
else {
    op->ob_item = (PyObject **) PyMem_MALLOC(nbytes);
    if (op->ob_item == NULL) {
        Py_DECREF(op);
        return PyErr_NoMemory();
    }
    memset(op->ob_item, 0, nbytes);
}
Py_SIZE(op) = size;
op->allocated = size;

So we see that with size = 1, space for one pointer is allocated. 4 bytes (on my 32-bit box).

Appending to an empty list

When calling append on an empty list, here's what happens:

  • PyList_Append calls app1
  • app1 asks for the list's size (and gets 0 as an answer)
  • app1 then calls list_resize with size+1 (1 in our case)
  • list_resize has an interesting allocation strategy, summarized in this comment from its source.

Here it is:

/* This over-allocates proportional to the list size, making room
 * for additional growth. The over-allocation is mild, but is
 * enough to give linear-time amortized behavior over a long
 * sequence of appends() in the presence of a poorly-performing
 * system realloc().
 * The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
 */
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
    PyErr_NoMemory();
    return -1;
} else {
    new_allocated += newsize;
}

Let's do some math

Let's see how the numbers I quoted in the session at the beginning of my article are reached.

So 36 bytes is the size required by the list data structure itself on 32-bit. With a single element, space is allocated for one pointer, so that's 4 extra bytes - total 40 bytes. OK so far.

When app1 is called on an empty list, it calls list_resize with size=1. According to the over-allocation algorithm of list_resize, the next largest available size after 1 is 4 (new_allocated = (1 >> 3) + 3 = 3, then new_allocated += newsize gives 4), so space for 4 pointers will be allocated. 4 * 4 = 16 bytes, and 36 + 16 = 52.

Indeed, everything makes sense :-)
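
The over-allocation steps are easy to observe from Python itself; this small sketch prints the lengths at which getsizeof jumps (the exact sizes depend on platform and interpreter version):

import sys

lst = []
last = sys.getsizeof(lst)
print(0, last)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:              # the size only jumps when the list over-allocates
        print(len(lst), size)
        last = size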

Variable's memory size in Python

Use sys.getsizeof to get the size of an object, in bytes.

>>> from sys import getsizeof
>>> a = 42
>>> getsizeof(a)
12
>>> a = 2**1000
>>> getsizeof(a)
146
>>>

Note that the size and layout of an object is purely implementation-specific. CPython, for example, may use totally different internal data structures than IronPython. So the size of an object may vary from implementation to implementation.

Find out how much memory is being used by an object in Python

There's no easy way to find out the memory size of a Python object. One of the problems you may find is that Python objects - like lists and dicts - may hold references to other Python objects (in that case, what should the size be? The size including the size of each referenced object, or not?). There is some pointer overhead, and there are internal structures related to object types and garbage collection. Finally, some Python objects have non-obvious behaviors. For instance, lists reserve space for more objects than they contain, most of the time; dicts are even more complicated, since they can operate in different ways (they have a different implementation for a small number of keys, and sometimes they over-allocate entries).

There is a big chunk of code (and an updated big chunk of code) out there that tries to approximate the size of a Python object in memory as closely as possible.

You may also want to check an old description of PyObject (the internal C struct that represents virtually all Python objects).


