Very Large Matrices Using Python and NumPy

How to create a large matrix of matrices in Python?

In NumPy there are two relevant structures.

One is a 4-dimensional array, e.g. np.zeros((100,100,5,5), int). The other is a 2-dimensional array of objects: np.zeros((100,100), dtype=object). With an object array, the elements can be anything: strings, numbers, lists, your 5x5 arrays, other 7x3 arrays, None, etc.

It is easiest to do math on the 4d array, for example taking the mean across all the 5x5 subarrays, or extracting the [:,:,0,0] corner of them all.
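
For example, a minimal sketch using the shapes above:

import numpy as np

# a 4-d array holding a 100x100 grid of 5x5 subarrays
arr = np.arange(100 * 100 * 5 * 5).reshape(100, 100, 5, 5)

block_means = arr.mean(axis=(2, 3))  # mean of every 5x5 subarray -> (100, 100)
corners = arr[:, :, 0, 0]            # the [0, 0] corner of every subarray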

If your subarrays are all 5x5, it can be tricky to create and fill that object array, because np.array(...) will build the 4-dimensional array instead whenever it can.
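
One way around that is to allocate the empty object array first and fill it explicitly (a minimal sketch, using the shapes from above):

import numpy as np

# np.array() on nested equal-sized arrays would build a 4-d array instead,
# so create the object array up front and assign element by element
obj = np.empty((100, 100), dtype=object)
for i in range(100):
    for j in range(100):
        obj[i, j] = np.zeros((5, 5), int)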

With h5py you can chunk the file and access portions of the larger array. But you still have to have a workable numpy representation to do anything with them.
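
A hedged sketch of that h5py pattern (the file name, dataset name, and chunk shape here are illustrative choices):

import h5py
import numpy as np

with h5py.File('blocks.h5', 'w') as f:
    # chunked storage: only the chunks you touch are read from or written to disk
    dset = f.create_dataset('blocks', shape=(100, 100, 5, 5),
                            dtype='f8', chunks=(10, 10, 5, 5))
    dset[0, 0] = np.eye(5)  # write a single 5x5 block

with h5py.File('blocks.h5', 'r') as f:
    block = f['blocks'][0, 0]  # slicing returns an ordinary in-memory numpy array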

How to handle large matrices and matrix multiplication in Python

If B is diagonal, you don't need to use sparse to save memory. You can do the calculation with a 1d array, just the diagonal values.

I'll demonstrate with small dimensions, where making a full B doesn't cause problems. Others can test this with large dimensions.

In [5]: A = np.arange(24).reshape(3,8)
In [7]: B = np.diag(np.arange(8))
In [8]: B.shape
Out[8]: (8, 8)

The double matmul:

In [10]: A@B@A.T
Out[10]:
array([[  784,  1904,  3024],
       [ 1904,  4816,  7728],
       [ 3024,  7728, 12432]])

The equivalent einsum:

In [12]: np.einsum('ij,jk,lk',A,B,A)
Out[12]:
array([[  784,  1904,  3024],
       [ 1904,  4816,  7728],
       [ 3024,  7728, 12432]])

The 1d diagonal of B:

In [15]: b = np.diag(B)

A broadcasted multiplication does the same thing as the matmul:

In [17]: np.allclose(A*b,A@B)
Out[17]: True
In [18]: np.allclose(A*b@A.T,A@B@A.T)
Out[18]: True

Expressing that with einsum:

In [19]: np.einsum('ij,j,lj',A,b,A)
Out[19]:
array([[  784,  1904,  3024],
       [ 1904,  4816,  7728],
       [ 3024,  7728, 12432]])

Some comparative times:

In [20]: timeit np.einsum('ij,j,lj',A,b,A)
10.6 µs ± 22.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [21]: timeit np.einsum('ij,jk,lk',A,B,A)
15.2 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [22]: timeit A@B@A.T
7.26 µs ± 20.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [23]: timeit A*b@A.T
8.17 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

For einsum, the 'condensed' version is faster. With matmul, the version using the full B is marginally faster.

But with large arrays, where creating a full B might be a problem, using b could be faster. It has also been observed in other SO answers that iterating on smaller arrays can be faster, due to better memory handling, for example:

np.array([A[i,:]*b@A.T for i in range(3)])

Python NUMPY HUGE Matrices multiplication

One thing that you can do to speed things up is compile numpy with an optimized BLAS library such as ATLAS, GotoBLAS, or Intel's proprietary MKL.
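
You can check which BLAS your numpy build is linked against (the exact output varies by installation):

import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries numpy was built with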

To work out the memory needed, you can monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).

> ipython

In [1]: import numpy as np

In [2]: a = np.random.rand(1000, 1000)

In [3]: a.dtype
Out[3]: dtype('float64')

In [4]: del(a)

To get the RSS I ran:

ps -xao comm,rss | grep python

[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]
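
If you would rather read it from inside Python, the standard library's resource module reports a related number (a sketch; note that ru_maxrss is the peak RSS, reported in KiB on Linux and FreeBSD but in bytes on macOS):

import resource

# peak resident set size of the current process
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)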

The results are:

  • After starting the interpreter: 24880 KiB
  • After importing numpy: 34364 KiB
  • After creating a: 42200 KiB
  • After deleting a: 34368 KiB

Calculating the size:

In [4]: (42200 - 34364) * 1024
Out[4]: 8024064

In [5]: 8024064/(1000*1000)
Out[5]: 8.024064

As you can see, the calculated size matches the 8 bytes per element of the default dtype float64 quite well. The difference is internal overhead.
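
You can also ask the array itself for the size of its data buffer, which confirms the arithmetic:

a = np.random.rand(1000, 1000)
a.nbytes  # 8000000: 1000 * 1000 elements * 8 bytes each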

The size of your original arrays in MiB will be approximately:

In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125

In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375

That's not too bad. However, the dot product will be way too large:

In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484

That's 2235 GiB!

What you can do is split up the problem and perform the dot operation in pieces:

  • load b as an ndarray
  • load every row from a as an ndarray in turn.
  • multiply the row by every column of b and write the result to a file.
  • del() the row and load the next row.

This will not make it faster, but it will make it use less memory!

Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column back from the file with e.g. a numpy.memmap, as sketched below.
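
Putting the pieces together, a minimal sketch of the chunked dot product (the file names 'a.npy' and 'b.npy' and the chunk size are illustrative assumptions, not part of the original recipe):

import numpy as np

# assumed on-disk inputs: a.npy is (1000000, 100), b.npy is (300000, 100)
a = np.load('a.npy', mmap_mode='r')  # memory-mapped: rows are read lazily
b = np.load('b.npy')                 # ~229 MiB, fits in memory

# memory-map the (1000000, 300000) result so only one chunk is in RAM at a time
out = np.lib.format.open_memmap('dot.npy', mode='w+', dtype=np.float64,
                                shape=(a.shape[0], b.shape[0]))

chunk = 100  # rows of a per iteration
for start in range(0, a.shape[0], chunk):
    stop = min(start + chunk, a.shape[0])
    out[start:stop] = a[start:stop] @ b.T  # (chunk, 100) @ (100, 300000)
out.flush()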

Handling large dense matrices in Python

PyTables can handle tables of arbitrary size (millions of columns!) through chunked on-disk storage and some clever compression.

Ostensibly, it provides SQL-like performance to Python. It will, however, require significant code modifications.
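
A minimal sketch of what that looks like (the file, node name, and compression settings are illustrative assumptions):

import numpy as np
import tables

# a compressed, chunked, on-disk array; only the chunks you access are
# read and decompressed
filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file('big.h5', 'w') as f:
    carr = f.create_carray(f.root, 'data', tables.Float64Atom(),
                           shape=(1000000, 100), filters=filters)
    carr[0, :] = np.random.rand(100)  # write one row

with tables.open_file('big.h5', 'r') as f:
    row = f.root.data[0, :]  # read it back as a numpy array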

I'm not going to accept this answer until I've done a more thorough vetting to ensure it can actually do what I want, or until someone provides a better solution.


