Python memory usage of numpy arrays
You can use array.nbytes
for numpy arrays, for example:
>>> import numpy as np
>>> from sys import getsizeof
>>> a = [0] * 1024
>>> b = np.array(a)
>>> getsizeof(a)
8264
>>> b.nbytes
8192
Does NumPy array really take less memory than python list?
The calculation using:
size = narray.size * narray.itemsize
does not include the memory consumed by non-element attributes of the array object. This can be verified by the documentation of ndarray.nbytes
:>>> x = np.zeros((3,5,2), dtype=np.complex128)
>>> x.nbytes
480
>>> np.prod(x.shape) * x.itemsize
480
In the above link, it can be read that ndarray.nbytes
:Does not include memory consumed by non-element attributes of theNote that from the code above you can conclude that your calculation excludes non-element attributes given that the value is equal to the one from
array object.
ndarray.nbytes
.A list of the non-element attributes can be found in the section Array Attributes, including here for completeness:
ndarray.flags Information about the memory layout of the array.With regards to
ndarray.shape Tuple of array dimensions.
ndarray.strides Tuple of
bytes to step in each dimension when traversing an array.
ndarray.ndim Number of array dimensions.
ndarray.data Python buffer object pointing to the start of the array’s data.
ndarray.size Number of elements in the array.
ndarray.itemsize Length of one array element in bytes.
ndarray.nbytes Total bytes consumed by the elements of the array.
ndarray.base Base object if memory is from some other object.
sys.getsizeof
it can be read in the documentation (emphasis mine) that:Only the memory consumption directly attributed to the object is
accounted for, not the memory consumption of objects it refers to.
How much memory is used by a numpy ndarray?
The array is simply stored in one consecutive block in memory. Assuming by "float" you mean standard double precision floating point numbers, then the array will need 8 bytes per element.
In general, you can simply query the nbytes
attribute for the total memory requirement of an array, and itemsize
for the size of a single element in bytes:
>>> a = numpy.arange(1000.0)
>>> a.nbytes
8000
>>> a.itemsize
8
In addtion to the actual array data, there will also be a small data structure containing the meta-information on the array. Especially for large arrays, the size of this data structure is negligible. RAM usage in dealing with numpy arrays and Python lists
I've got the solution.
First of all, as Dan Mašek pointed out, I'm measuring the memory used by the array, which is a collection of pointers (roughly said). To measure the real memory usage:
(database_list[0].nbytes * len(database_list) / 1000000, "MB")
where database_list[0].nbytes
is reliable as all the elements in database_list
have the same size. To be more precise, I should add the array metadata and eventually all data linked to it (if, for example, I'm storing in the array other structures).To have less impact on memory, I should know the type of data that I'm reading, that is values in range 0-65535, so:
database_list = list()
for filename in glob.glob('*.npy'):
temp_img = np.load(filename)
temp_img = temp_img.reshape((-1, 64)).astype(np.uint16)
database_list.append(temp_img)
Moreover, if I do some calculations on the data stored in database_list, for example, normalization of values in the range 0-1 like database_list = database_list/ 65535.0
(NB: database_list, as a list, does not support that operation), I should do another cast, because Python cast the type to something like float64. Memory used by numpy arrays larger than RAM?
I guess you use npTDMS.
The used numpy.array
type is not just a simple array where all array elements are always stored in memory.
While the data type and number of elements is known (by reading meta data from the TDMS file, in this case), the elements are not read until requested.
That is: If you want the last element of a 20GB record, npTDMS knows where it is stored in the file, reads and returns it - without reading the first 20GB.
Difference between list and NumPy array memory size
getsizeof
is not a good measure of memory use, especially with lists. As you note the list has a buffer of pointers to objects elsewhere in memory. getsizeof
notes the size of the buffer, but tells us nothing about the objects.
With
In [66]: list(range(4))
Out[66]: [0, 1, 2, 3]
the list has its basic object storage, plus the buffer with 4 pointers (plus some growth room). The numbers are stored else where. In this case the numbers are small, and already created and cached by the interpreter. So their storage doesn't add anything. But larger numbers (and floats) are created with each use, and take up space. Also a list can contain anything, such as pointers to other lists, or strings or dicts, or what ever.In [67]: arr = np.array([i for i in range(4)]) # via list
In [68]: arr
Out[68]: array([0, 1, 2, 3])
In [69]: np.array(range(4)) # more direct
Out[69]: array([0, 1, 2, 3])
In [70]: np.arange(4)
Out[70]: array([0, 1, 2, 3]) # faster
arr
too has a basic object storage with attributes like shape and dtype. It too has a databuffer, but for a numeric dtype like this, that buffer has actual numeric values (8 byte integers), not pointers to Python integer objects.In [71]: arr.nbytes
Out[71]: 32
That data buffer only takes 32 bytes - 4*8.For this small example it's not surprising that getsizeof
returns the same thing. The basic object storage is more significant than where the 4 values are stored. It's when working with 1000's of values, and multidimensional arrays that memory use is significantly different.
But more important is the calculation speeds. With an array you can do things like arr+1
or arr.sum()
. These operate in compiled code, and are quite fast. Similar list operations have to iterate, at slow Python speeds, though the pointers, fetching values etc. But doing the same sort of iteration on arrays is even slower.
As a general rule, if you start with lists, and do list operations such as append
and list comprehensions, it's best to stick with them.
But if you can create the arrays once, or from other arrays, and then use numpy
methods, you'll get 10x speed improvements. Arrays are indeed faster, but only if you use them in the right way. They aren't a simple drop in substitute for lists.
Memory usage increases when building a large NumPy array
Using the h5py
package, I can create an hdf5 file that contains a dataset that represents the a
array. The dset
variable is similar to the a
variable discussed in the question. This allows the array to reside on disk, not in memory. The generated hdf5 file is 8 GB on disk which is the size of the array containing np.float32
values. The elapsed time for this approach is similar to the examples discussed in the question; therefore, writing to the hdf5 file seems to have a negligible performance impact.
import numpy as np
import h5py
import time
def main():
rng = np.random.default_rng()
tic = time.perf_counter()
z = 500 # depth
x = 2000 # rows
y = 2000 # columns
f = h5py.File('file.hdf5', 'w')
dset = f.create_dataset('data', shape=(z, x, y), dtype=np.float32)
for i in range(z):
r = rng.standard_normal((x, y), dtype=np.float32)
dset[i, :, :] = r
toc = time.perf_counter()
print('elapsed time =', round(toc - tic, 2), 'sec')
s = np.float32().nbytes * (z * x * y) / 1e9 # where 1 GB = 1000 MB
print('calculated storage =', s, 'GB')
if __name__ == '__main__':
main()
Output from running this example on a MacBook Pro with 2.6 GHz 6-Core Intel Core i7 and 32 GB of RAM:elapsed time = 22.97 sec
calculated storage = 8.0 GB
Running the memory profiler for this example gives the plot shown below. The peak memory usage is about 100 MiB which is drastically lower than the examples demonstrated in the question.
Related Topics
Loading Initial Data with Django 1.7 and Data Migrations
Django: How to Manage Development and Production Settings
Getting Gradient of Model Output W.R.T Weights Using Keras
How to Get Tweets Older Than a Week (Using Tweepy or Other Python Libraries)
Pandas Read_CSV and Filter Columns with Usecols
Round Up to Second Decimal Place in Python
Python-Pandas: the Truth Value of a Series Is Ambiguous
Asyncio: How to Cancel a Future Been Run by an Executor
How to Avoid Infinite Recursion with Super()
Why "Numpy.Any" Has No Short-Circuit Mechanism
Is There an Platform Independent Equivalent of Os.Startfile()