Save/Load Scipy Sparse Csr_Matrix in Portable Data Format

Save / load scipy sparse csr_matrix in portable data format

edit: scipy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.

Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:

new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

How to preserve order of insertion in SciPy Sparse Matrix CSR_Matrix?

The CSR format stores data in a row-wise format (by marking out the places in the memory-contiguous data array where each row begins and ends). The information that you want does not exist in that format - part of the compression is to remove it.

If you need that ordering information you could leave it in COO format with the caveat that there are operations which result in COO matrices being sorted without warning. It may be best to store that information explicitly instead of implicitly (do scipy sparse matrices let you use structs in the data matrix?).

save_npz method missing from scipy.sparse

Yes, scipy.sparse.save_npz / load_npz are new in version 0.19.0 http://scipy.github.io/devdocs/release.0.19.0.html

Efficiently populating a 2D SciPy sparse matrix

Scipy sparse matrices are pretty slow. This is especially true for the DOK matrices using inefficient hash-tables internally. However, reading/Setting a specific cell of a sparse matrices in a loop is insanely slow (eg. each access takes 10~15 us on my machine). You should avoid that like the plague.

One solution to solve this problem is to create an array of indices and values and write the values to the space matrix in vectorized way. The computation of the indices/values can be optimized with Numba. Here is an example:

@nb.njit
def gen_indexed_values(psi_gr, timeste, b):
    two_lambd = 2 * gl.lambd
    oneover4_times_1over_deltazsq = 0.25 / (gl.delta_z**2)
    oneover2_times_1over_deltazsq = 0.5 / (gl.delta_z**2)

    res_size = (gl.N_z_divs-1) * (gl.N_epsilon_divs-1)
    A_idx = np.empty(res_size, dtype=np.int32)
    A_values = np.empty((3, res_size), dtype=np.complex64)
    cur = 0

    for i in range(1, gl.N_z_divs):
        for j in range(1, gl.N_epsilon_divs):
            k = j*(gl.N_z_divs+1) + i
            ratio = 1 / ( (2*gl.lambd**2) * (j*gl.delta_epsilon)**two_lambd )

            A_idx[cur] = k
            A_values[0, cur] = (-1/gl.delta_time) - ratio * (j**2 - (gl.lambd-0.5)**2 * 0.5) - oneover2_times_1over_deltazsq - \
                                0.5 * Voft_OFF_imag(i, j, gl.m)
            A_values[1, cur] = ratio * ( j**2*0.5 - (gl.lambd-1)*j*0.5 )
            A_values[2, cur] = ratio * ( j**2*0.5 + (gl.lambd-1)*j*0.5 )

            b[k] = psi_gr[i,  j, timeste] * ( (-1/gl.delta_time) + ratio*j**2 - (gl.lambd-0.5)**2*0.5 + oneover2_times_1over_deltazsq + \
                                0.5 * Voft_OFF_imag(i, j, gl.m)  ) + \
                   psi_gr[i+1, j, timeste] * (-oneover4_times_1over_deltazsq) + \
                   psi_gr[i-1, j, timeste] * (-oneover4_times_1over_deltazsq) + \
                   psi_gr[i,   j+1, timeste] * ( -ratio * (j**2 * 0.5 - (gl.lambd-1) * j * 0.5)   ) + \
                   psi_gr[i,   j-1, timeste] * ( -ratio * (j**2 * 0.5 + (gl.lambd-1) * j * 0.5)   )

            cur += 1

    return A_idx, A_values

def populate_A_matrix_b_vector(psi_gr, timeste, A, b, sys_float_info_min):
    # Utilities:
    two_lambd = 2 * gl.lambd
    oneover4_times_1over_deltazsq = 0.25 / (gl.delta_z**2)
    oneover2_times_1over_deltazsq = 0.5 / (gl.delta_z**2)

    # For the interior nodes:
    A_idx, A_values = gen_indexed_values(psi_gr, timeste, b)
    A[A_idx, A_idx] = A_values[0,:]
    A[A_idx, A_idx+1] = oneover4_times_1over_deltazsq
    A[A_idx, A_idx-1] = oneover4_times_1over_deltazsq
    A[A_idx, A_idx + (gl.N_z_divs+1)] = A_values[1,:]
    A[A_idx, A_idx - (gl.N_z_divs+1)] = A_values[2,:]

    # [...]

This is about 22 times faster on my machine. Using another kind of sparse matrix may help to improve performance.

scipy `SparseEfficiencyWarning` when division on rows of csr_matrix

Your matrix:

In [208]: indptr = np.array([0, 2, 3, 6])
     ...: indices = np.array([0, 2, 2, 0, 1, 2])
     ...: data = np.array([1., 2., 3., 4., 5., 6.])
     ...: mat = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [209]: mat
Out[209]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [210]: mat.A
Out[210]: 
array([[1., 0., 2.],
       [0., 0., 3.],
       [4., 5., 6.]])

Simple division just changes the mat.data values, in-place:

In [211]: mat/= 3
In [212]: mat
Out[212]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [213]: mat *= 3
In [214]: mat.A
Out[214]: 
array([[1., 0., 2.],
       [0., 0., 3.],
       [4., 5., 6.]])

The RHS of your case produces a np.matrix object:

In [215]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[215]: 
matrix([[1. , 0. , 2. ],
        [0. , 0. , 1.5]])

Assigning that to the mat subset produces the warning:

In [216]: mat[np.array([0,1])] = _
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)

Your warning and mine occurs in the set step:

self._set_arrayXarray(i, j, x)

If I divide again I don't get the warning:

In [217]: mat[np.array([0,1])]/=np.array([[1],[2]])
In [218]: mat
Out[218]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>

Why? Because after the first assignment, mat has 9 non-zero terms, not the original six. So [217] doesn't change sparsity.

Convert mat back to 6 zeros, and we get the warning again:

In [219]: mat.eliminate_zeros()
In [220]: mat
Out[220]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [221]: mat[np.array([0,1])]/=np.array([[1],[2]])
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)

and a change in sparsity:

In [222]: mat
Out[222]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>

Assigning a sparse matrix of [215] doesn't trigger the warning:

In [223]: mat.eliminate_zeros()
In [224]: m1=sparse.csr_matrix(mat[np.array([0,1])]/np.array([[1],[2]]))
In [225]: m1
Out[225]: 
<2x3 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
In [226]: mat[np.array([0,1])]=m1
In [227]: mat
Out[227]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

===

The [215] division is best seen as a ndarray action, not a sparse one:

In [232]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[232]: 
matrix([[1.     , 0.     , 2.     ],
        [0.     , 0.     , 0.09375]])
In [233]: mat[np.array([0,1])].todense()/np.array([[1],[2]])
Out[233]: 
matrix([[1.     , 0.     , 2.     ],
        [0.     , 0.     , 0.09375]])

The details of this division are found in sparse._base.py, mat._divide, with different actions depending whether the other is scalar, dense array, or sparse matrix. Sparse matrix division does not implement broadcasting.

As a general rule, matrix multiplication is the most efficient sparse calculation. In fact actions like row or column sum are implemented with it. And so are some forms of indexing. Element-wise calculations are ok if they can be applied to the M.data array without regard to row or column indices (e.g. square, power, scalar multiplication). M.multiply is element-wise, but without the full broadcasting power of dense arrays. Sparse division is even more limited.

edit

sklearn has some utilities to perform certain kinds of sparse actions that it needs, like scaling and normalizing.

In [274]: from sklearn.utils import sparsefuncs

https://scikit-learn.org/stable/modules/classes.html#module-sklearn.utils

https://scikit-learn.org/stable/modules/generated/sklearn.utils.sparsefuncs.inplace_row_scale.html#sklearn.utils.sparsefuncs.inplace_row_scale

With the sample mat:

In [275]: mat
Out[275]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [276]: mat.A
Out[276]: 
array([[1.    , 0.    , 2.    ],
       [0.    , 0.    , 0.1875],
       [4.    , 5.    , 6.    ]])

Applying the row scaling:

In [277]: sparsefuncs.inplace_row_scale(mat,np.array([10,20,1]))
In [278]: mat
Out[278]: 
<3x3 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [279]: mat.A
Out[279]: 
array([[10.  ,  0.  , 20.  ],
       [ 0.  ,  0.  ,  3.75],
       [ 4.  ,  5.  ,  6.  ]])

The scaling array has to match in length. In your case you'd need to take the inverse of your [[1],[2]], and pad it with 1 to act on all rows.

Looking at the source I see it uses sparsefuncs.inplace_csr_row_scale. That in turn does:

X.data *= np.repeat(scale, np.diff(X.indptr))

The details of this action are:

In [283]: mat.indptr
Out[283]: array([0, 2, 3, 6], dtype=int32)
In [284]: np.diff(mat.indptr)
Out[284]: array([2, 1, 3], dtype=int32)
In [285]: np.repeat(np.array([10,20,1]), _)
Out[285]: array([10, 10, 20,  1,  1,  1])
In [286]: mat.data
Out[286]: array([100., 200.,  75.,   4.,   5.,   6.])

So it converts the scale array into an array that matches the data in shape. Then the inplace *= array multiplication is easy.

numpy.ndarray sparse matrix to dense

It seems that the data should have been saved using SciPy's sparse as mentioned here Save / load scipy sparse csr_matrix in portable data format. When using NumPy's save/load more data should have been saved.

RandomForestClassifier can run using data in this format.
The code has been running for 1:30h now, so hopefully it will actually finish :-)

Save/Load Scipy Sparse Csr_Matrix in Portable Data Format