Difference between len() and sys.getsizeof() Methods in Python

What is the difference between the len() and sys.getsizeof() methods in Python?

They are not the same thing at all.

len() queries the number of items contained in a container. For a string, that's the number of characters:

Return the length (the number of items) of an object. The argument may be a sequence (string, tuple or list) or a mapping (dictionary).

sys.getsizeof(), on the other hand, returns the memory size of the object:

Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.

Python string objects are not simple sequences of characters, 1 byte per character.
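For example, in an interactive session (the getsizeof figure below is from the same 64-bit Python 2.7 Mac build quoted later in this section; your number will differ by build and version):

>>> import sys
>>> len('hello')
5
>>> sys.getsizeof('hello')
42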

Specifically, the sys.getsizeof() function includes the garbage collector overhead if any:

getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.

String objects do not need to be tracked (they cannot create circular references), but string objects do need more memory than just the bytes per character. In Python 2, the string __sizeof__ method returns (in C code):

Py_ssize_t res;
res = PyStringObject_SIZE + PyString_GET_SIZE(v) * Py_TYPE(v)->tp_itemsize;
return PyInt_FromSsize_t(res);

where PyStringObject_SIZE is the C struct header size for the type, PyString_GET_SIZE is basically the same as len(), and Py_TYPE(v)->tp_itemsize is the per-character size. In Python 2.7, for byte strings, the size per character is 1, but it's PyStringObject_SIZE that is confusing you; on my Mac that size is 37 bytes:

>>> sys.getsizeof('')
37

For unicode strings in Python 2, the per-character size goes up to 2 or 4 bytes (depending on compilation options). On Python 3.3 and newer, Unicode strings take up between 1 and 4 bytes per character, depending on the contents of the string.
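You can observe the per-character cost on a CPython 3.3+ build directly by differencing the sizes of two strings that are one character apart in length (the deltas below should hold on any 64-bit build, even though the header sizes vary):

>>> import sys
>>> sys.getsizeof('a' * 1000) - sys.getsizeof('a' * 999)            # ASCII: 1 byte per char
1
>>> sys.getsizeof('\u0394' * 1000) - sys.getsizeof('\u0394' * 999)  # BMP char: 2 bytes per char
2
>>> sys.getsizeof('\U0001F600' * 1000) - sys.getsizeof('\U0001F600' * 999)  # astral: 4 bytes per char
4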

Why is there a different result for getsizeof between list() and []

TL;DR: Lists over-allocate so they can provide amortized constant-time (O(1)) append operations. The amount of over-allocation depends on how the list is created and on the append/delete history of the instance. A list literal always knows the size beforehand and simply doesn't over-allocate (or over-allocates only slightly). The list function doesn't always know the length of the result, because it has to iterate over its argument, so the final over-allocation depends on the (implementation-dependent) over-allocation scheme used.

To understand what we're looking at, it's important to know that sys.getsizeof reports only the size of the instance itself. It doesn't look at the contents of the instance, so the size of the contents (in this case ints) isn't accounted for.

What actually contributes to the size of the list is (a 64-bit system is assumed):

  • 8 bytes: Reference count.
  • 8 bytes: Pointer to the class (type).
  • 8 bytes: Stores the number of elements in the list (equivalent to len(your_list)).
  • 8 bytes: Stores the size of the array that holds the elements (this is len(your_list) + over_allocation).
  • 8 bytes: Pointer to the array that stores pointers to the contents.
  • 8 bytes per slot of the list: To hold a pointer (or NULL) to each element in the list.
  • 24 bytes: Needed for other stuff (I think garbage collection).

That explanation is probably a bit hard to understand, so maybe it becomes clearer if I add some images (neglecting the extra 24 bytes that are used for garbage collection). I created them based on my findings on CPython 3.7.2, Windows 64-bit, using a 64-bit Python from Anaconda.

No over-allocation, e.g. for mylist = [1,2,3]:

[Image: memory layout of a list literal, with no over-allocated slots]

With over-allocation, e.g. for mylist = list([1,2,3]):

[Image: memory layout of a list built via list(), with over-allocated slots]

Or for manual appends:

mylist = []
mylist.append(1)
mylist.append(2)
mylist.append(3)

[Image: memory layout after repeated appends, with over-allocated slots]
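You can reproduce the three cases without the pictures; only the concrete numbers are version-specific (on the CPython 3.7 x64 build used for the images, the literal came out at 88 bytes, i.e. 64 + 3*8, while the other two were larger because of the spare slots):

import sys

literal = [1, 2, 3]            # size known at parse time: no spare slots
from_call = list([1, 2, 3])    # built by iterating: may over-allocate
appended = []
for i in (1, 2, 3):            # grown by append: over-allocates in steps
    appended.append(i)

print(sys.getsizeof(literal), sys.getsizeof(from_call), sys.getsizeof(appended))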

That means an empty list already takes 64 bytes, assuming empty lists have no over-allocation. For each element added, another reference to a Python object has to be stored (a pointer is 8 bytes).

So the minimum size of a list is:

size_min = 64 + 8 * n_items
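A quick sanity check of that formula (the constant 64 matches the CPython 3.7 x64 build described above; CPython 3.8+ shrank the garbage-collector header to 16 bytes, so the constant there is 56):

import sys

for n in range(5):
    lst = [None] * n                  # repetition allocates exactly n slots
    print(n, sys.getsizeof(lst))      # expect 64 + 8*n on this build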

Python lists are variable-sized, and if a list allocated only as much space as needed to hold its current items, the whole array would have to be copied whenever a new item is added (making append O(n)). However, if you over-allocate, meaning you actually take more memory than is needed to store the elements, then you can support amortized O(1) appends, because a resize is only needed occasionally. See for example the Wikipedia article "Amortized analysis".
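You can watch this in action by appending in a loop and printing the size whenever it changes: the size stays flat while spare slots are used up, then jumps in increasingly large steps (a sketch; the exact step pattern is an implementation detail):

import sys

lst = []
last = sys.getsizeof(lst)
print(f"len={len(lst):2d}: {last} bytes")
for i in range(64):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:                   # capacity exhausted: a resize happened
        print(f"len={len(lst):2d}: {size} bytes")
        last = size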

The next bit is that a literal always knows its size: you put x items in the literal, and at source-code parsing time it's already known how big the list has to be. So the required memory can simply be allocated for something like this:

l = [1, 2, 3]

However, since list is a callable and Python doesn't optimize the call away even if the argument is simply a literal (after all, you could assign something different to the name list), it really has to call list.

list itself just iterates over the argument and appends the items to its internal array, resizing when needed and over-allocating to keep appends amortized O(1). list can check what size the input claims to have, but since (theoretically) anything could happen while iterating over an object, it treats that length estimate as a rough guideline, not as a guarantee. So while it avoids re-allocations when it can predict the number of items in the argument, it still over-allocates (just in case).
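That estimate is exposed to Python code as operator.length_hint (PEP 424): an object that is not a sequence can still advertise an expected length, explicitly as a hint rather than a promise. A small sketch (Countdown is a made-up class for illustration):

from operator import length_hint

class Countdown:
    """Iterable that advertises its expected length without defining __len__."""
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return iter(range(self.n, 0, -1))
    def __length_hint__(self):
        return self.n                  # an estimate, not a guarantee

print(length_hint(Countdown(5)))       # 5 -- falls back to __length_hint__
print(list(Countdown(5)))              # [5, 4, 3, 2, 1]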

Note that all of this is an implementation detail; it could be totally different in other Python implementations, or even in different CPython versions. The only thing Python guarantees (I think it does, I'm not 100% sure) is that append is amortized O(1), not how that is achieved or how much memory a list instance needs.

Why does a two-byte snowman return 76 bytes per sys.getsizeof()?

First off, Python being a dynamically typed language, all objects carry type information with them. In C, an int just needs the bytes to represent the int; the knowledge that it is an int is implicit in the code. Not so in Python, not with numbers, and especially not with strings, which have become more complicated since PEP 393 was adopted.

As you can see, both in the PEP and in the CPython source, Python can represent strings in one of several representations. Unfortunately, the header information is rather large: a string not only has the header that every Python object has, it also has a flag structure, and it carries its hash, its byte length, its true length, and more, in an onion-like structure that has more layers the more complex the representation. Thus, ASCII-only strings are quite short in comparison (PyASCIIObject), but a string that contains a character outside the ASCII range gets a new layer (PyCompactUnicodeObject).

sys.getsizeof doesn't just give you the string length; it gives you the entirety of the memory allocated for the object, including the header information, and it adds up.
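You can see those layers directly on a 64-bit CPython 3.x (the exact header sizes below are from one build and may drift between versions, but the pattern holds): an ASCII-only string pays only the small PyASCIIObject header, while a single non-ASCII character triggers the larger compact layout:

>>> import sys
>>> sys.getsizeof('')         # bare PyASCIIObject header (plus NUL terminator)
49
>>> sys.getsizeof('a')        # one more byte: 1 byte per ASCII character
50
>>> sys.getsizeof('\u2603')   # snowman: PyCompactUnicodeObject header + 2 bytes/char
76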

Difference between len() and .__len__()?

len is a function for getting the length of a collection. It works by calling an object's __len__ method. __something__ attributes are special, are usually more than meets the eye, and generally should not be called directly.

It was decided at some point long ago that getting the length of something should be a function and not a method call, the reasoning being that len(a)'s meaning would be clear to beginners but a.len() would not be as clear. When Python started, __len__ didn't even exist and len was a special thing that worked with a few types of objects. Whether or not the situation this leaves us with makes total sense, it's here to stay.
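A small illustrative class (Playlist is made up here) shows the relationship; note that len() also validates the result, rejecting negative or non-integer returns, which a direct __len__ call would not:

class Playlist:
    def __init__(self, tracks):
        self._tracks = list(tracks)
    def __len__(self):
        return len(self._tracks)       # len(playlist) delegates to this

p = Playlist(["intro", "outro"])
print(len(p))                          # 2 -- calls type(p).__len__(p)
print(p.__len__())                     # 2 -- works, but direct dunder calls are discouraged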

Why is sys.getsizeof not equal to os.path.getsize?

The two measure different things. os.path.getsize gives the size of a file on disk, whereas sys.getsizeof gives the in-memory size of a Python object.

The file is 6 bytes, not 5, because of a line ending (on Windows it might be 7 bytes). If you were using C, then "hello" would be 6 bytes, because a binary zero '\0' marks the end of the string. If you were using another language, it too would have its own red-tape memory overhead.

The memory occupied by the data is (generally) less than that occupied by an object. An object includes other information about the data, such as its size and location. That is the price you pay for using a high-level language.
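A short script contrasting the two (it writes a scratch file named hello.txt; newline='' is passed so that '\n' is written as a single byte even on Windows):

import os
import sys

data = "hello\n"                          # five letters plus a line ending
with open("hello.txt", "w", newline="") as f:
    f.write(data)

print(os.path.getsize("hello.txt"))       # 6 -- bytes of data on disk
print(sys.getsizeof(data))                # larger: object header + payload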


