Sparse Matrix Slicing Using List of Int

Sparse matrix slicing using list of int

I think I've recreated the csr row indexing with:

def extractor(indices, N):
   indptr=np.arange(len(indices)+1)
   data=np.ones(len(indices))
   shape=(len(indices),N)
   return sparse.csr_matrix((data,indices,indptr), shape=shape)

Testing on a csr I had hanging around:

In [185]: M
Out[185]: 
<30x40 sparse matrix of type '<class 'numpy.float64'>'
    with 76 stored elements in Compressed Sparse Row format>

In [186]: indices=np.r_[0:20]

In [187]: M[indices,:]
Out[187]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

In [188]: extractor(indices, M.shape[0])*M
Out[188]: 
<20x40 sparse matrix of type '<class 'numpy.float64'>'
    with 57 stored elements in Compressed Sparse Row format>

As with a number of other csr methods, it uses matrix multiplication to produce the final value. In this case with a sparse matrix with 1 in selected rows. Time is actually a bit better.

In [189]: timeit M[indices,:]
1000 loops, best of 3: 515 µs per loop
In [190]: timeit extractor(indices, M.shape[0])*M
1000 loops, best of 3: 399 µs per loop

In your case the extractor matrix is (len(training_indices),347) in shape, with only len(training_indices) values. So it is not big.

But if the matrix is so large (or at least the 2nd dimension so big) that it produces some error in the matrix multiplication routines, it could give rise to segmentation fault without python/numpy trapping it.

Does matrix.sum(axis=1) work. That too uses a matrix multiplication, though with a dense matrix of 1s. Or sparse.eye(347)*M, a similar size matrix multiplication?

Scipy sparse array from list of integers

Seems rather straightforward. Did I misunderstand the question?

from scipy import sparse

def cat_to_mat(c, mat_length):
    return sparse.csc_matrix((c, (len(c), c % mat_length)), shape=(len(c), mat_length))

How to slice a scipy sparse matrix and keep the original indexing?

Depending on the indexing, it might be easier to construct the extractor/indexing matrix with the coo style of inputs:

In [129]: from scipy import sparse
In [130]: M = sparse.csr_matrix(np.arange(16).reshape(4,4))
In [131]: M
Out[131]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 15 stored elements in Compressed Sparse Row format>
In [132]: M.A
Out[132]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

A square extractor matrix with the desired "diagonal" values:

In [133]: extractor = sparse.csr_matrix(([1,1],([0,3],[0,3])))
In [134]: extractor
Out[134]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>

Matrix multiplication in one direction selects columns:

In [135]: M@extractor
Out[135]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>
In [136]: _.A
Out[136]: 
array([[ 0,  0,  0,  3],
       [ 4,  0,  0,  7],
       [ 8,  0,  0, 11],
       [12,  0,  0, 15]])

and in the other, rows:

In [137]: extractor@M
Out[137]: 
<4x4 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>
In [138]: _.A
Out[138]: 
array([[ 0,  1,  2,  3],
       [ 0,  0,  0,  0],
       [ 0,  0,  0,  0],
       [12, 13, 14, 15]])
In [139]: extractor.A
Out[139]: 
array([[1, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 1]])

M[[0,3],:] does the same thing, but with:

In [140]: extractor = sparse.csr_matrix(([1,1],([0,1],[0,3])))
In [142]: (extractor@M).A
Out[142]: 
array([[ 0,  1,  2,  3],
       [12, 13, 14, 15]])

Row and column sums are also performed with matrix multiplication:

In [149]: M@np.ones(4,int)
Out[149]: array([ 6, 22, 38, 54])

Create an sparse matrix from a list of tuples having the indexes of the column where is a 1

Your values are the indices that are required for the compressed sparse column format. You'll also need the indptr array, which for your data is the cumulative sum of the lengths of the tuples (prepended with 0). The data array would be an array of ones with the same length as the sum of the lengths of the tuples, which you can get from the last element of the cumulative sum. Here's how that looks with your example:

In [45]: from scipy.sparse import csc_matrix

In [46]: list_tuples = [
    ...:  (0, 2, 4),
    ...:  (0, 2, 3),
    ...:  (1, 3, 4)
    ...: ]

In [47]: indices = sum(list_tuples, ())  # Flatten the tuples into one sequence.

In [48]: indptr = np.cumsum([0] + [len(t) for t in list_tuples])

In [49]: a = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr))

In [50]: a
Out[50]: 
<5x3 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Column format>

In [51]: a.A
Out[51]: 
array([[1, 1, 0],
       [0, 0, 1],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]])

Note that csc_matrix inferred the number of rows from the maximum that it found in the indices. You can use the shape parameter to override that, e.g.

In [52]: b = csc_matrix((np.ones(indptr[-1], dtype=int), indices, indptr), shape=(7, len(list_tuples)))

In [53]: b
Out[53]: 
<7x3 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Column format>

In [54]: b.A
Out[54]: 
array([[1, 1, 0],
       [0, 0, 1],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1],
       [0, 0, 0],
       [0, 0, 0]])

You can also generate a coo_matrix pretty easily. The flattened list_tuples gives the row indices, and np.repeat can be used to create the column indices:

In [63]: from scipy.sparse import coo_matrix

In [64]: i = sum(list_tuples, ())  # row indices

In [65]: j = np.repeat(range(len(list_tuples)), [len(t) for t in list_tuples])

In [66]: c = coo_matrix((np.ones(len(i), dtype=int), (i, j)))

In [67]: c
Out[67]: 
<5x3 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in COOrdinate format>

In [68]: c.A
Out[68]: 
array([[1, 1, 0],
       [0, 0, 1],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]])

Scipy sparse matrix from list of list with integers

I assume you want to have a 5 by 5 matrix at the end. also indices start from 0.

In [18]:import scipy.sparse as sp

In [20]: a = [[0,1,2],[0],[0,3,4]]
In [31]: m = sp.lil_matrix((5,5), dtype=int)

In [32]: for row_index, col_indices in enumerate(a):
    m[row_index, col_indices] = 1
   ....:     

In [33]: m.toarray()
Out[33]: 
array([[1, 1, 1, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 1, 1],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]])

Fast slicing and multiplication of scipy sparse CSR matrix

I explained in Sparse matrix slicing using list of int that this kind of row indexing is actually performed with matrix multiplication. In effect it constructs a sparse vector with 1's for the desired rows, and does the appropriate dot.

So I'm not surprised that the order of the operations doesn't matter much.

In general sparse matrices are not designed for efficient indexing. They don't, for example, return views. The csr matrix multiplication is one its most efficient operations. Even row or columns sums are performed with matrix multiplication.

Scipy create sparse row matrix from a list of indices and a list of list data

In [131]: data = np.array([[1,2,3],[4,5,6],[7,8,9]])
     ...: indices = np.array([0,0,3]) # row number, sum when duplicated

I corrected the indices for 0 based indexing.

We don't need sparse to sum the duplicates. There's a np.add.at that does this nicely:

In [135]: res = np.zeros((4,3),int)
In [136]: np.add.at(res, indices, data)
In [137]: res
Out[137]: 
array([[5, 7, 9],
       [0, 0, 0],
       [0, 0, 0],
       [7, 8, 9]])

If we make a csr from that:

In [141]: M = sparse.csr_matrix(res)
In [142]: M
Out[142]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [143]: M.data
Out[143]: array([5, 7, 9, 7, 8, 9])
In [144]: M.indices
Out[144]: array([0, 1, 2, 0, 1, 2], dtype=int32)
In [145]: M.indptr
Out[145]: array([0, 3, 3, 3, 6], dtype=int32)

To make a csr directly, it's often easier to use the coo style of inputs. They are easier to understand.

Those inputs are 3 1d arrays of the same size:

In [160]: data.ravel()
Out[160]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])
In [161]: row = np.repeat(indices,3)
In [162]: row
Out[162]: array([0, 0, 0, 0, 0, 0, 3, 3, 3])
In [163]: col = np.tile(np.arange(3),3)
In [164]: col
Out[164]: array([0, 1, 2, 0, 1, 2, 0, 1, 2])
In [165]: M1 = sparse.coo_matrix((data.ravel(),(rows, cols)))
In [166]: M1.data
Out[166]: array([1, 2, 3, 4, 5, 6, 7, 8, 9])

The coo format leaves the inputs as given; but on conversion to csr duplicates are summed.

In [168]: M2 = M1.tocsr()
In [169]: M2
Out[169]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
In [170]: M2.data
Out[170]: array([5, 7, 9, 7, 8, 9])
In [171]: M2.indices
Out[171]: array([0, 1, 2, 0, 1, 2], dtype=int32)
In [172]: M2.indptr
Out[172]: array([0, 3, 3, 3, 6], dtype=int32)

In [173]: M2.A
Out[173]: 
array([[5, 7, 9],
       [0, 0, 0],
       [0, 0, 0],
       [7, 8, 9]])

@Erik shows how to use the csr format directly:

In [174]: M3 =sparse.csr_matrix((data.ravel(), col, [0,6,6,6,9]))
In [175]: M3
Out[175]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Row format>
In [176]: M3.A
Out[176]: 
array([[5, 7, 9],
       [0, 0, 0],
       [0, 0, 0],
       [7, 8, 9]])
In [177]: M3.indices
Out[177]: array([0, 1, 2, 0, 1, 2, 0, 1, 2], dtype=int32)

Note this has 9 nonzero elements; it hasn't summed the duplicates for storage (though the .A display shows them summed). To sum, we need an extra step:

In [179]: M3.sum_duplicates()
In [180]: M3.data
Out[180]: array([5, 7, 9, 7, 8, 9])

Slicing Sparse Matrices in Scipy -- Which Types Work Best?

Ok, so I'm pretty sure the "right" way to do this is:
if you are slicing columns, use tocsc() and slice using a list/array of integers. Boolean vectors does not seem to do the trick with sparse matrices -- the way it does with ndarrays in numpy. Which means the answer is.

indices = np.where(bool_vect)[0]
out1 = M.tocsc()[:,indices]
out2 = M.tocsr()[indices,:]

But question: is this the best way? Is this in place?

In practice this does seem to be happening in place -- and it is much faster than prior attempts (using lil_matrix).

python: can I have a sparse matrix representation without (explicitly) using integer indices?

Assuming that no library does exactly what you want, you can create your own class SparseMatrix and overload the operator []. Heres is one way to do it (the constructor might be different to what you want to have):

class SparseMatrix():
    def __init__(self, x_label, y_label):
        self.data = {}
        for x,y in zip(x_label,y_label):
            print x,y
            self.data[x] = {}
            for attr in y:
                self.data[x][attr] = 1
        return

    def __getitem__(self, index):
        x,y = index
        if type(x) is str:
            if type(y) is str:
                return 1 if y in self.data[x] else 0
            if type(y) is slice:
                return self.data[x].keys()
        if type(x) is slice:
            if type(y) is str:
                res = []
                for key in self.data.keys():
                    if y in self.data[key]:
                        res.append(key)
                return res
            if type(y) is list:
                res = []
                for attr in y:
                    res += self.__getitem__((x,attr))
                return res

And in the REPL, I get:

> data = SparseMatrix(['john','jane','mike','joe'],[['likes_coffee','has_curly_hair'],['has_dark_hair'],['drives_car'],['man_u_fan']])

> data['john',:]
['has_curly_hair', 'likes_coffee']

> data[:,['has_curly_hair','drives_car']]
['john', 'mike']

Sparse Matrix Slicing Using List of Int