How to create a large matrix of matrices in python?
In numpy there are two relevant structures. One is a 4-dimensional array, e.g. np.zeros((100,100,5,5), int). The other is a 2-dimensional array of objects, np.zeros((100,100), dtype=object). With an object array the elements can be anything - strings, numbers, lists, your 5x5 arrays, a 7x3 array, None, etc.
It is easiest to do math on the 4d array, for example taking the mean across all the 5x5 subarrays, or finding the [:,:,0,0] corner of each.
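As a quick sketch of what that 4d math looks like (with a 10x10 grid of 5x5 sub-matrices instead of 100x100, for readability):

```python
import numpy as np

# a 10x10 grid of 5x5 sub-matrices, stored as one 4d array
M = np.arange(10 * 10 * 5 * 5).reshape(10, 10, 5, 5)

# the mean of each 5x5 subarray - one value per sub-matrix
means = M.mean(axis=(2, 3))
print(means.shape)   # (10, 10)

# the [0,0] corner of every sub-matrix, gathered in one indexing step
corners = M[:, :, 0, 0]
print(corners.shape)  # (10, 10)
```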
If your subarrays are all 5x5, it can be tricky to create and fill that object array, because np.array(...) tries to create the 4-dimensional array if possible.
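One common workaround (a sketch, not the only way) is to allocate the object array first with np.empty and fill it element by element, which sidesteps np.array's attempt to build a 4d numeric array:

```python
import numpy as np

subs = [np.full((5, 5), i) for i in range(4)]  # four 5x5 subarrays

# np.array(subs) would produce a (4,5,5) numeric array here, not an
# object array, so allocate the object array explicitly and fill it:
obj = np.empty((2, 2), dtype=object)
for k, s in enumerate(subs):
    obj[k // 2, k % 2] = s

print(obj.shape)        # (2, 2)
print(obj[1, 1].shape)  # (5, 5)
```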
With h5py you can chunk the file and access portions of the larger array, but you still need a workable numpy representation in memory to do anything with them.
How to handle large matrices and matrix multiplication in Python
If B is diagonal, you don't need sparse to save memory. You can do the calculation with a 1d array holding just the diagonal values. I'll demonstrate with small dimensions, where making a full B doesn't cause problems. Others can test this with large dimensions.
In [5]: A = np.arange(24).reshape(3,8)
In [7]: B = np.diag(np.arange(8))
In [8]: B.shape
Out[8]: (8, 8)
The double matmul:
In [10]: A@B@A.T
Out[10]:
array([[ 784, 1904, 3024],
[ 1904, 4816, 7728],
[ 3024, 7728, 12432]])
The equivalent einsum:
In [12]: np.einsum('ij,jk,lk',A,B,A)
Out[12]:
array([[ 784, 1904, 3024],
[ 1904, 4816, 7728],
[ 3024, 7728, 12432]])
The 1d diagonal of B:
In [15]: b = np.diag(B)
A broadcasted multiplication does the same thing as the matmul:
In [17]: np.allclose(A*b,A@B)
Out[17]: True
In [18]: np.allclose(A*b@A.T,A@B@A.T)
Out[18]: True
Expressing that with einsum:
In [19]: np.einsum('ij,j,lj',A,b,A)
Out[19]:
array([[ 784, 1904, 3024],
[ 1904, 4816, 7728],
[ 3024, 7728, 12432]])
Some comparative times:
In [20]: timeit np.einsum('ij,j,lj',A,b,A)
10.6 µs ± 22.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [21]: timeit np.einsum('ij,jk,lk',A,B,A)
15.2 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [22]: timeit A@B@A.T
7.26 µs ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [23]: timeit A*b@A.T
8.17 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
For einsum the 'condensed' version is faster. With matmul the full diagonal is marginally faster. But with large arrays, where creating a full B might be a problem, using b might be faster. It's also been observed in other SO answers that iterating on smaller arrays can be faster, due to better memory handling.
np.array([A[i,:]*b@A.T for i in range(3)])
Python NUMPY HUGE Matrices multiplication
One thing you can do to speed things up is compile numpy against an optimized BLAS library such as ATLAS, GotoBLAS, or Intel's proprietary MKL.
To calculate the memory needed, you need to monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).
> ipython
In [1]: import numpy as np
In [2]: a = np.random.rand(1000, 1000)
In [3]: a.dtype
Out[3]: dtype('float64')
In [4]: del(a)
To get the RSS I ran:
ps -xao comm,rss | grep python
[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]
The results are:
- After starting the interpreter: 24880 kiB
- After importing numpy: 34364 kiB
- After creating a: 42200 kiB
- After deleting a: 34368 kiB
Calculating the size:
In [4]: (42200 - 34364) * 1024
Out[4]: 8024064
In [5]: 8024064/(1000*1000)
Out[5]: 8.024064
As you can see, the calculated size per element matches the 8 bytes of the default float64 datatype quite well. The difference is internal overhead.
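You can also sanity-check the per-array cost from inside Python, without ps: ndarray.nbytes reports the exact size of the data buffer (a sketch; it excludes the small per-object overhead the RSS measurement picks up):

```python
import numpy as np

a = np.random.rand(1000, 1000)
print(a.dtype)             # float64
print(a.nbytes)            # 8000000: 1000*1000 elements * 8 bytes each
print(a.nbytes / 1024**2)  # ~7.63 MiB
```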
The size of your original arrays in MiB will be approximately:
In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125
In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375
That's not too bad. However, the dot product will be way too large:
In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484
That's 2235 GiB!
What you can do is split up the problem and perform the dot operation in pieces:
- load b as an ndarray
- load every row from a as an ndarray in turn
- multiply the row by every column of b and write the result to a file; del() the row and load the next row
This will not make it faster, but it will use less memory!
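The row-by-row recipe above might look like this (a sketch with tiny stand-in shapes; real code would load b from disk and process blocks of rows rather than single rows):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((6, 4))   # stand-in for the big (1000000, 100) array
b = rng.random((4, 5))   # stand-in for the (100, 300000) array

# write the result to a disk-backed array so the full product never
# has to fit in memory at once
out = np.memmap("dot_out.dat", dtype="float64", mode="w+",
                shape=(a.shape[0], b.shape[1]))
for i in range(a.shape[0]):   # one row of a at a time
    out[i] = a[i] @ b         # row times every column of b
out.flush()

print(np.allclose(out, a @ b))  # True
```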
Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column from the file with e.g. numpy.memmap.
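Reading a single column back from such a binary file is then cheap (a sketch, assuming the array was written with ndarray.tofile and you know its shape and dtype, since the raw file stores neither):

```python
import numpy as np

result = np.arange(12, dtype="float64").reshape(3, 4)
result.tofile("result.dat")   # raw binary, row-major, no header

# map the file without loading it all; shape/dtype must be supplied
mm = np.memmap("result.dat", dtype="float64", mode="r", shape=(3, 4))
col = np.array(mm[:, 2])      # only the needed pages are read

print(col)  # [ 2.  6. 10.]
```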
Handling large dense matrices in python
PyTables can handle tables of arbitrary size (Millions of columns!), through using memmap and some clever compression.
Ostensibly, it provides SQL-like performance to Python. It will, however, require significant code modifications.
I'm not going to accept this answer until I've done a more thorough vetting, to ensure it can actually do what I want. Or someone provides a better solution.