Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix

A direct conversion is not supported at the moment. Contributions are welcome!

Try this; it should be OK on memory, as a SparseSeries is much like a csc_matrix (for one column)
and is pretty space-efficient.

In [36]: row = np.array([0,2,2,0,1,2])

In [37]: col = np.array([0,0,1,2,2,2])

In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')

In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )

In [40]: m
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Column format>

In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
                              for i in np.arange(m.shape[0]) ])
   0  1  2
0  1  0  4
1  0  0  5
2  2  3  6

In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
                                   for i in np.arange(m.shape[0]) ])

In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame
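
SparseDataFrame and SparseSeries were removed in pandas 1.0, so the session above only runs on old versions. As a rough sketch of the same idea with current pandas, one could densify a single column at a time and wrap it in a pd.arrays.SparseArray. The row array here is an assumption chosen so that the matrix matches the output shown above, and the loop goes over columns rather than rows.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# `row` is an assumption picked to reproduce the 3x3 matrix shown above.
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6], dtype="float64")
m = csc_matrix((data, (row, col)), shape=(3, 3))

# Densify one column at a time, so only a single column is ever held dense.
columns = {
    j: pd.arrays.SparseArray(m[:, [j]].toarray().ravel(), fill_value=0)
    for j in range(m.shape[1])
}
df = pd.DataFrame(columns)
```

The result is an ordinary DataFrame whose columns have a sparse dtype; df.sparse.to_dense() gives back the dense values.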

Populate a Pandas SparseDataFrame from a SciPy Sparse Coo Matrix


A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.

scipy.sparse has methods that convert between formats (.tocoo, .tocsc, etc.), so you can use whichever format is best for a particular operation.

For going the other way, I've answered

Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory

Your linked answer from 2013 iterates by row - using toarray to make the row dense. I haven't looked at what the pandas from_coo does.

A more recent SO question on pandas sparse

non-NDFFrame object error using pandas.SparseSeries.from_coo() function

From https://github.com/pydata/pandas/blob/master/pandas/sparse/scipy_sparse.py

def _coo_to_sparse_series(A, dense_index=False):
    """ Convert a scipy.sparse.coo_matrix to a SparseSeries.
    Use the defaults given in the SparseSeries constructor. """
    s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
    s = s.sort_index()
    s = s.to_sparse()  # TODO: specify kind?
    # ...
    return s

In effect it takes the same data, i, j used to build a coo matrix, makes a series, sorts it, and turns it into a sparse series.
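
The same steps can be sketched with current pandas, where to_sparse() no longer exists. This is only an illustration: the small matrix A is made up, the "manual" version mirrors the steps described above using astype with a SparseDtype, and pd.Series.sparse.from_coo is the accessor-based constructor I'd expect to give the same result.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Made-up 3x3 matrix with three stored values.
A = coo_matrix(
    (np.array([3.0, 1.0, 2.0]), (np.array([0, 2, 1]), np.array([1, 0, 2]))),
    shape=(3, 3),
)

# Manual version: series indexed by (row, col), sorted, then made sparse.
manual = pd.Series(A.data, index=pd.MultiIndex.from_arrays((A.row, A.col)))
manual = manual.sort_index()
manual = manual.astype(pd.SparseDtype(manual.dtype))

# Accessor-based constructor in current pandas.
built = pd.Series.sparse.from_coo(A)
```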

How do I create a scipy sparse matrix from a pandas dataframe?

I don't have pandas installed, so I can't start with a dataframe. But let's assume I have extracted a numpy array from the dataframe (doesn't a method or attribute like values do that?):

In [40]: D
array([[4109, 2093],   # could be other columns
       [6633, 2093],
       [6634, 2094],
       [6635, 2095]])

Making a sparse matrix from that is straightforward - I just need to extract or construct the 3 arrays:

In [41]: M=sparse.coo_matrix((D[:,1], (D[:,0], np.zeros(D.shape[0]))),
   ....:    shape=(7000,1))

In [42]: M
<7000x1 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in COOrdinate format>

In [43]: print(M)
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
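
For completeness, here is a minimal sketch that starts from an actual DataFrame. The column names row_id and value are hypothetical, and to_numpy() is used to pull out the array; the column index is built as an integer array of zeros.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical column names; the values match the array D shown above.
df = pd.DataFrame({"row_id": [4109, 6633, 6634, 6635],
                   "value": [2093, 2093, 2094, 2095]})

D = df.to_numpy()                        # plain 2-D numpy array
rows = D[:, 0]                           # row index of each stored value
cols = np.zeros(D.shape[0], dtype=int)   # everything goes into column 0
M = sparse.coo_matrix((D[:, 1], (rows, cols)), shape=(7000, 1))
```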


Generalized to two 'data' columns

In [70]: D
array([[4109, 2093,  128],
       [6633, 2093,  129],
       [6634, 2094,  127],
       [6635, 2095,  126]])

In [76]: i,j,data=[],[],[]

In [77]: for col in range(1,D.shape[1]):
    i.extend(D[:,0])
    j.extend(np.zeros(D.shape[0],int)+(col-1))
    data.extend(D[:,col])

In [78]: i
Out[78]: [4109, 6633, 6634, 6635, 4109, 6633, 6634, 6635]

In [79]: j
Out[79]: [0, 0, 0, 0, 1, 1, 1, 1]

In [80]: data
Out[80]: [2093, 2093, 2094, 2095, 128, 129, 127, 126]

In [83]: M=sparse.coo_matrix((data,(i,j)),shape=(7000,D.shape[1]-1))

In [84]: M
<7000x2 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in COOrdinate format>

In [85]: print(M)
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
(4109, 1) 128
(6633, 1) 129
(6634, 1) 127
(6635, 1) 126

I suspect you could also make separate matrices for each column, and combine them with the sparse.bmat (block) mechanism, but I'm most familiar with the coo format.

See

Compiling n submatrices into an NxN matrix in numpy

for another example of building a large sparse matrix from submatrices (here they overlap). There I found a way of joining the blocks with a faster array operation. It might be possible to do that here, but I suspect that iterating over a few columns (and extending over many rows) is OK speed-wise.

With bmat I could construct the same thing as:

In [98]: I, J = D[:,0], np.zeros(D.shape[0],int)

In [99]: M1=sparse.coo_matrix((D[:,1],(I, J)), shape=(7000,1))
In [100]: M2=sparse.coo_matrix((D[:,2],(I, J)), shape=(7000,1))

In [101]: print(sparse.bmat([[M1,M2]]))
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
(4109, 1) 128
(6633, 1) 129
(6634, 1) 127
(6635, 1) 126
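
If the Python loop over columns ever did become a bottleneck, the three coo arrays could probably be built with plain array operations instead. This is a sketch, not something I've timed: np.tile repeats the row ids once per data column, np.repeat produces the matching column numbers, and the result is cross-checked against the bmat construction.

```python
import numpy as np
from scipy import sparse

D = np.array([[4109, 2093, 128],
              [6633, 2093, 129],
              [6634, 2094, 127],
              [6635, 2095, 126]])
n_rows, n_data = D.shape[0], D.shape[1] - 1

i = np.tile(D[:, 0], n_data)               # row ids, repeated per data column
j = np.repeat(np.arange(n_data), n_rows)   # 0,0,0,0,1,1,1,1
data = D[:, 1:].T.ravel()                  # data columns laid end to end
M = sparse.coo_matrix((data, (i, j)), shape=(7000, n_data))

# Cross-check against the block construction.
I, J = D[:, 0], np.zeros(n_rows, dtype=int)
blocks = [sparse.coo_matrix((D[:, c], (I, J)), shape=(7000, 1))
          for c in range(1, D.shape[1])]
B = sparse.bmat([blocks])
```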

Creating a sparse matrix from pandas data frame using scipy.sparse

Try get_dummies with sparse=True and maybe use dtype='i8' (optional) for less memory use

out = pd.get_dummies(df.set_index("X")['Y'],sparse=True,dtype='i8').max(level=0)


      10  14  15
1256   1   1   1
3087   0   1   1
2199   1   1   0
1056   1   0   0
408    0   0   1
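
A different route, if the goal is a scipy matrix rather than a DataFrame, is to turn X and Y into integer codes with pd.factorize and hand those to coo_matrix. This is a swapped-in technique, not the get_dummies approach above, and the input frame is an assumption reconstructed from the output table.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Assumed input: one (X, Y) pair per row, consistent with the table above.
df = pd.DataFrame({
    "X": [1256, 1256, 1256, 3087, 3087, 2199, 2199, 1056, 408],
    "Y": [10, 14, 15, 14, 15, 10, 14, 10, 15],
})

pairs = df.drop_duplicates()                              # keep entries at 0/1
x_codes, x_labels = pd.factorize(pairs["X"])              # row positions
y_codes, y_labels = pd.factorize(pairs["Y"], sort=True)   # column positions

M = sparse.coo_matrix(
    (np.ones(len(pairs), dtype="i8"), (x_codes, y_codes)),
    shape=(len(x_labels), len(y_labels)),
).tocsr()
```

x_labels and y_labels map the matrix rows and columns back to the original X and Y values.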

transform scipy sparse csr to pandas?

If A is a csr_matrix, you can use .toarray() (there's also .todense(), which produces a numpy matrix and also works with the DataFrame constructor):

df = pd.DataFrame(A.toarray())

You can then use this with pd.concat().

A = csr_matrix([[1, 0, 2], [0, 3, 0]])
print(A)

  (0, 0)    1
  (0, 2)    2
  (1, 1)    3

print(type(A))

<class 'scipy.sparse.csr.csr_matrix'>

df = pd.DataFrame(A.toarray())
print(df)

   0  1  2
0  1  0  2
1  0  3  0

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0    2 non-null int64
1    2 non-null int64
2    2 non-null int64
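
The pd.concat() step might look like the sketch below. The other frame and the column names f0, f1, f2 are hypothetical; note that this route makes the whole matrix dense first, so it only suits matrices that fit in memory.

```python
import pandas as pd
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 0, 2], [0, 3, 0]])

# Hypothetical column names for the converted matrix.
features = pd.DataFrame(A.toarray(), columns=["f0", "f1", "f2"])

# Hypothetical existing frame with the same number of rows.
other = pd.DataFrame({"label": ["a", "b"]})

# Put the two frames side by side.
combined = pd.concat([other, features], axis=1)
```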

In version 0.20, pandas introduced sparse data structures, including the SparseDataFrame.

In pandas 1.0, SparseDataFrame was removed:

In older versions of pandas, the SparseSeries and SparseDataFrame classes were the preferred way to work with sparse data. With the advent of extension arrays, these subclasses are no longer needed. Their purpose is better served by using a regular Series or DataFrame with sparse values instead.

The migration guide shows how to use these new data structures.

For instance, to create a DataFrame from a sparse matrix:

import pandas as pd
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 0, 2], [0, 3, 0]])

df = pd.DataFrame.sparse.from_spmatrix(A, columns=['A', 'B', 'C'])
df

   A  B  C
0  1  0  2
1  0  3  0

df.dtypes

A    Sparse[float64, 0]
B    Sparse[float64, 0]
C    Sparse[float64, 0]
dtype: object
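
As a quick sanity check, the .sparse accessor on the resulting frame can report how dense the data is and convert it back to scipy. A small sketch, reusing the same matrix:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 0, 2], [0, 3, 0]])
df = pd.DataFrame.sparse.from_spmatrix(A, columns=["A", "B", "C"])

density = df.sparse.density          # fraction of values actually stored
back = df.sparse.to_coo().tocsr()    # back to scipy without densifying
```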

Alternatively, you can pass sparse matrices to sklearn to avoid running out of memory when converting back to pandas. Just convert your other data to sparse format by passing a numpy array to the scipy.sparse.csr_matrix constructor and use scipy.sparse.hstack to combine (see docs).
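
A minimal sketch of that hstack step, with made-up feature arrays (the names text_features and dense_features are hypothetical). The combined matrix stays sparse and can be passed to estimators that accept scipy sparse input.

```python
import numpy as np
from scipy import sparse

# Made-up sparse features, e.g. the output of a vectorizer.
text_features = sparse.csr_matrix([[1, 0, 2], [0, 3, 0]])

# Made-up dense features for the same two rows.
dense_features = np.array([[0.5], [1.5]])

# Convert the dense part to sparse and stack the two column-wise.
X = sparse.hstack(
    [text_features, sparse.csr_matrix(dense_features)], format="csr"
)
```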

Scipy sparse matrix as DataFrame column

There is a sparse dataframe or dataseries feature. It is still experimental. I've answered a few SO questions about converting back and forth between that and scipy sparse matrices.

From the sidebar:

Populate a Pandas SparseDataFrame from a SciPy Sparse Coo Matrix

Without such a specialized pandas structure I don't see how a sparse matrix could be added to a pandas frame. The internal structure of a sparse matrix is too different. For a start it is not a subclass of numpy array.

A csr matrix is an object with its data contained in 3 arrays. ma.data and ma.indices are 1d arrays with one value for each non-zero element of the matrix. ma.indptr has one value for each row of the matrix, plus one.

list(ma) is meaningless. ma.toarray() produces a 2d array with the same data, and with all those zeros filled in as well.

Other sparse matrix formats store their data in other structures - 3 equal length arrays for coo, two lists of lists for lil, and a dictionary for dok.
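
To make those internals concrete, here is a small sketch that pulls the underlying arrays out of a 2x3 matrix in csr, coo and lil form. The matrix itself is just an example.

```python
from scipy.sparse import csr_matrix

ma = csr_matrix([[1, 0, 2], [0, 3, 0]])

# csr: values, their column numbers, and row start offsets (n_rows + 1 entries).
csr_parts = (ma.data.tolist(), ma.indices.tolist(), ma.indptr.tolist())

# coo: three equal-length arrays of row, column and value.
coo = ma.tocoo()
coo_parts = (coo.row.tolist(), coo.col.tolist(), coo.data.tolist())

# lil: per-row lists of column numbers and per-row lists of values.
lil = ma.tolil()
lil_parts = (lil.rows.tolist(), lil.data.tolist())

# The dense equivalent, zeros included.
dense = ma.toarray()
```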

How to convert pandas dataframe to a sparse matrix using scipy's csr_matrix?

IIUC and using the third link you shared, you can convert your df data to sparse data using pd.SparseDtype, like this

df_sparsed = df.astype(pd.SparseDtype("float", np.nan))

You can read more about pd.SparseDtype here to choose the right parameters for your data, and then use it in your command above like this:

csr_matrix(df_sparsed.sparse.to_coo()) # Note you need .sparse accessor to access .to_coo()

A simple one-liner would be

csr_matrix(df.astype(pd.SparseDtype("float", np.nan)).sparse.to_coo())
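
Here is a self-contained sketch of that conversion on a made-up frame. It uses 0 rather than np.nan as the fill value, on the assumption that zeros are the entries to leave out of the matrix; as far as I know, recent pandas versions only accept a fill value of 0 in .sparse.to_coo().

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Made-up data: mostly zeros.
df = pd.DataFrame({"a": [1.0, 0.0, 0.0], "b": [0.0, 0.0, 2.0]})

# Assumption: zeros are the "missing" entries, so the fill value is 0.
df_sparsed = df.astype(pd.SparseDtype("float", 0))

# The .sparse accessor gives a coo matrix, which csr_matrix converts.
M = csr_matrix(df_sparsed.sparse.to_coo())
```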

How to create a sparse DataFrame from a list of dicts

I suggest using dtype='Sparse' for this.

If all elements are numbers, you can use dtype='Sparse', dtype='Sparse[int]' or dtype='Sparse[float]'.

data = [{"id":'a',"v0":3,"v2":6},
index = [item.pop('id') for item in data]
pd.DataFrame(data, index=index, dtype="Sparse")

If any value is a string, you have to use dtype='Sparse[str]'.

data = [{"id":'a',"v0":3,"v2":'foo'},
df = pd.DataFrame(data, dtype="Sparse[str]").set_index("id",verify_integrity=True)
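
A complete sketch of the numeric case, with hypothetical records where each dict has only some of the keys. Instead of passing dtype to the constructor, it builds a regular frame and converts it with astype, which is a variation on the approach above; missing keys become NaN, which is also the fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical records; each one has only some of the v* keys.
records = [
    {"id": "a", "v0": 3, "v2": 6},
    {"id": "b", "v1": 4},
    {"id": "c", "v0": 5},
]

# Split off the ids without mutating the input dicts.
index = [r["id"] for r in records]
rows = [{k: v for k, v in r.items() if k != "id"} for r in records]

# Missing keys become NaN, which the sparse dtype does not store.
df = pd.DataFrame(rows, index=index).astype(pd.SparseDtype("float", np.nan))
```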
