Save / load scipy sparse csr_matrix in portable data format
edit: scipy 0.19 now has scipy.sparse.save_npz
and scipy.sparse.load_npz
.
from scipy import sparse
sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")
For both functions, the file
argument may also be a file-like object (i.e. the result of open
) instead of a filename.
Got an answer from the Scipy user group:
A csr_matrix has 3 data attributes that matter:
.data
,.indices
, and.indptr
. All are simple ndarrays, sonumpy.save
will work on them. Save the three arrays withnumpy.save
ornumpy.savez
, load them back withnumpy.load
, and then recreate the sparse matrix object with:
new_csr = csr_matrix((data, indices, indptr), shape=(M, N))
So for example:
def save_sparse_csr(filename, array):
np.savez(filename, data=array.data, indices=array.indices,
indptr=array.indptr, shape=array.shape)
def load_sparse_csr(filename):
loader = np.load(filename)
return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
shape=loader['shape'])
How to preserve order of insertion in SciPy Sparse Matrix CSR_Matrix?
The CSR format stores data in a row-wise format (by marking out the places in the memory-contiguous data array where each row begins and ends). The information that you want does not exist in that format - part of the compression is to remove it.
If you need that ordering information you could leave it in COO format with the caveat that there are operations which result in COO matrices being sorted without warning. It may be best to store that information explicitly instead of implicitly (do scipy sparse matrices let you use structs in the data matrix?).
save_npz method missing from scipy.sparse
Yes, scipy.sparse.save_npz / load_npz
are new in version 0.19.0 http://scipy.github.io/devdocs/release.0.19.0.html
Efficiently populating a 2D SciPy sparse matrix
Scipy sparse matrices are pretty slow. This is especially true for the DOK matrices using inefficient hash-tables internally. However, reading/Setting a specific cell of a sparse matrices in a loop is insanely slow (eg. each access takes 10~15 us on my machine). You should avoid that like the plague.
One solution to solve this problem is to create an array of indices and values and write the values to the space matrix in vectorized way. The computation of the indices/values can be optimized with Numba. Here is an example:
@nb.njit
def gen_indexed_values(psi_gr, timeste, b):
two_lambd = 2 * gl.lambd
oneover4_times_1over_deltazsq = 0.25 / (gl.delta_z**2)
oneover2_times_1over_deltazsq = 0.5 / (gl.delta_z**2)
res_size = (gl.N_z_divs-1) * (gl.N_epsilon_divs-1)
A_idx = np.empty(res_size, dtype=np.int32)
A_values = np.empty((3, res_size), dtype=np.complex64)
cur = 0
for i in range(1, gl.N_z_divs):
for j in range(1, gl.N_epsilon_divs):
k = j*(gl.N_z_divs+1) + i
ratio = 1 / ( (2*gl.lambd**2) * (j*gl.delta_epsilon)**two_lambd )
A_idx[cur] = k
A_values[0, cur] = (-1/gl.delta_time) - ratio * (j**2 - (gl.lambd-0.5)**2 * 0.5) - oneover2_times_1over_deltazsq - \
0.5 * Voft_OFF_imag(i, j, gl.m)
A_values[1, cur] = ratio * ( j**2*0.5 - (gl.lambd-1)*j*0.5 )
A_values[2, cur] = ratio * ( j**2*0.5 + (gl.lambd-1)*j*0.5 )
b[k] = psi_gr[i, j, timeste] * ( (-1/gl.delta_time) + ratio*j**2 - (gl.lambd-0.5)**2*0.5 + oneover2_times_1over_deltazsq + \
0.5 * Voft_OFF_imag(i, j, gl.m) ) + \
psi_gr[i+1, j, timeste] * (-oneover4_times_1over_deltazsq) + \
psi_gr[i-1, j, timeste] * (-oneover4_times_1over_deltazsq) + \
psi_gr[i, j+1, timeste] * ( -ratio * (j**2 * 0.5 - (gl.lambd-1) * j * 0.5) ) + \
psi_gr[i, j-1, timeste] * ( -ratio * (j**2 * 0.5 + (gl.lambd-1) * j * 0.5) )
cur += 1
return A_idx, A_values
def populate_A_matrix_b_vector(psi_gr, timeste, A, b, sys_float_info_min):
# Utilities:
two_lambd = 2 * gl.lambd
oneover4_times_1over_deltazsq = 0.25 / (gl.delta_z**2)
oneover2_times_1over_deltazsq = 0.5 / (gl.delta_z**2)
# For the interior nodes:
A_idx, A_values = gen_indexed_values(psi_gr, timeste, b)
A[A_idx, A_idx] = A_values[0,:]
A[A_idx, A_idx+1] = oneover4_times_1over_deltazsq
A[A_idx, A_idx-1] = oneover4_times_1over_deltazsq
A[A_idx, A_idx + (gl.N_z_divs+1)] = A_values[1,:]
A[A_idx, A_idx - (gl.N_z_divs+1)] = A_values[2,:]
# [...]
This is about 22 times faster on my machine. Using another kind of sparse matrix may help to improve performance.
scipy `SparseEfficiencyWarning` when division on rows of csr_matrix
Your matrix:
In [208]: indptr = np.array([0, 2, 3, 6])
...: indices = np.array([0, 2, 2, 0, 1, 2])
...: data = np.array([1., 2., 3., 4., 5., 6.])
...: mat = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
In [209]: mat
Out[209]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [210]: mat.A
Out[210]:
array([[1., 0., 2.],
[0., 0., 3.],
[4., 5., 6.]])
Simple division just changes the mat.data
values, in-place:
In [211]: mat/= 3
In [212]: mat
Out[212]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [213]: mat *= 3
In [214]: mat.A
Out[214]:
array([[1., 0., 2.],
[0., 0., 3.],
[4., 5., 6.]])
The RHS of your case produces a np.matrix
object:
In [215]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[215]:
matrix([[1. , 0. , 2. ],
[0. , 0. , 1.5]])
Assigning that to the mat
subset produces the warning:
In [216]: mat[np.array([0,1])] = _
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
Your warning and mine occurs in the set step:
self._set_arrayXarray(i, j, x)
If I divide again I don't get the warning:
In [217]: mat[np.array([0,1])]/=np.array([[1],[2]])
In [218]: mat
Out[218]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>
Why? Because after the first assignment, mat
has 9 non-zero terms, not the original six. So [217] doesn't change sparsity.
Convert mat
back to 6 zeros, and we get the warning again:
In [219]: mat.eliminate_zeros()
In [220]: mat
Out[220]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [221]: mat[np.array([0,1])]/=np.array([[1],[2]])
/usr/local/lib/python3.8/dist-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
self._set_arrayXarray(i, j, x)
and a change in sparsity:
In [222]: mat
Out[222]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 9 stored elements in Compressed Sparse Row format>
Assigning a sparse matrix of [215] doesn't trigger the warning:
In [223]: mat.eliminate_zeros()
In [224]: m1=sparse.csr_matrix(mat[np.array([0,1])]/np.array([[1],[2]]))
In [225]: m1
Out[225]:
<2x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
In [226]: mat[np.array([0,1])]=m1
In [227]: mat
Out[227]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
===
The [215] division is best seen as a ndarray
action, not a sparse one:
In [232]: mat[np.array([0,1])]/np.array([[1],[2]])
Out[232]:
matrix([[1. , 0. , 2. ],
[0. , 0. , 0.09375]])
In [233]: mat[np.array([0,1])].todense()/np.array([[1],[2]])
Out[233]:
matrix([[1. , 0. , 2. ],
[0. , 0. , 0.09375]])
The details of this division are found in sparse._base.py
, mat._divide
, with different actions depending whether the other
is scalar, dense array, or sparse matrix. Sparse matrix division does not implement broadcasting
.
As a general rule, matrix multiplication is the most efficient sparse calculation. In fact actions like row or column sum are implemented with it. And so are some forms of indexing. Element-wise calculations are ok if they can be applied to the M.data
array without regard to row or column indices (e.g. square, power, scalar multiplication). M.multiply
is element-wise, but without the full broadcasting power of dense arrays. Sparse division is even more limited.
edit
sklearn
has some utilities to perform certain kinds of sparse
actions that it needs, like scaling and normalizing.
In [274]: from sklearn.utils import sparsefuncs
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.utils
https://scikit-learn.org/stable/modules/generated/sklearn.utils.sparsefuncs.inplace_row_scale.html#sklearn.utils.sparsefuncs.inplace_row_scale
With the sample mat
:
In [275]: mat
Out[275]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [276]: mat.A
Out[276]:
array([[1. , 0. , 2. ],
[0. , 0. , 0.1875],
[4. , 5. , 6. ]])
Applying the row scaling:
In [277]: sparsefuncs.inplace_row_scale(mat,np.array([10,20,1]))
In [278]: mat
Out[278]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
In [279]: mat.A
Out[279]:
array([[10. , 0. , 20. ],
[ 0. , 0. , 3.75],
[ 4. , 5. , 6. ]])
The scaling array has to match in length. In your case you'd need to take the inverse of your [[1],[2]]
, and pad it with 1 to act on all rows.
Looking at the source I see it uses sparsefuncs.inplace_csr_row_scale
. That in turn does:
X.data *= np.repeat(scale, np.diff(X.indptr))
The details of this action are:
In [283]: mat.indptr
Out[283]: array([0, 2, 3, 6], dtype=int32)
In [284]: np.diff(mat.indptr)
Out[284]: array([2, 1, 3], dtype=int32)
In [285]: np.repeat(np.array([10,20,1]), _)
Out[285]: array([10, 10, 20, 1, 1, 1])
In [286]: mat.data
Out[286]: array([100., 200., 75., 4., 5., 6.])
So it converts the scale
array into an array that matches the data
in shape. Then the inplace *=
array multiplication is easy.
numpy.ndarray sparse matrix to dense
It seems that the data should have been saved using SciPy's sparse as mentioned here Save / load scipy sparse csr_matrix in portable data format. When using NumPy's save/load more data should have been saved.
RandomForestClassifier
can run using data in this format.
The code has been running for 1:30h now, so hopefully it will actually finish :-)
Related Topics
Set Markers for Individual Points on a Line in Matplotlib
Pandas Group by and Find First Non Null Value for All Columns
Getting Started with the Python Debugger, Pdb
How to Use Groupby to Concatenate Strings in Python Pandas
Python - Using Pandas Structures with Large CSV(Iterate and Chunksize)
Syntaxerror Inconsistency in Python
Is It Still Necessary to Install Cuda Before Using the Conda Tensorflow-Gpu Package
Subprocess Readline Hangs Waiting for Eof
Python's JSON Module, Converts Int Dictionary Keys to Strings
Pandas Fill Missing Values in Dataframe from Another Dataframe
Difference Between Static Static_Url and Static_Root on Django
Group by Pandas Dataframe and Select Latest in Each Group
Multiple Models in a Single Django Modelform
The Problem with Installing Pil Using Virtualenv or Buildout