Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix
A direct conversion is not supported at the moment. Contributions are welcome!
Try this. It should be fine on memory, since a SparseSeries is much like a csc_matrix (for one column) and is pretty space efficient.
In [36]: row = np.array([0,2,2,0,1,2])
In [37]: col = np.array([0,0,1,2,2,2])
In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')
In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )
In [40]: m
Out[40]:
<3x3 sparse matrix of type '<type 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Column format>
In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
   ....:                       for i in np.arange(m.shape[0]) ])
Out[46]:
0 1 2
0 1 0 4
1 0 0 5
2 2 3 6
In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel())
   ....:                            for i in np.arange(m.shape[0]) ])
In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame
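SparseSeries and SparseDataFrame no longer exist in pandas 1.0 and later, so the snippet above only runs on old versions. Below is a minimal sketch of the same idea for current pandas. It assumes pd.arrays.SparseArray as the per-column container, a fill value of 0 (my choice, not something stated in the answer), and row indices reconstructed from the printed output above. It goes column by column, which matches the csc layout, rather than row by row.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# Same matrix as in the transcript; `row` is inferred from the printed frame.
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6], dtype="float64")
m = csc_matrix((data, (row, col)), shape=(3, 3))

# Densify one column at a time, so only a single column is dense at any moment.
columns = {
    j: pd.arrays.SparseArray(m[:, j].toarray().ravel(), fill_value=0)
    for j in range(m.shape[1])
}
df = pd.DataFrame(columns)
```

Every column of df then has a sparse dtype, and df.sparse.to_dense() gives back the dense values.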
Populate a Pandas SparseDataFrame from a SciPy Sparse Coo Matrix
http://pandas.pydata.org/pandas-docs/stable/sparse.html#interaction-with-scipy-sparse
A convenience method SparseSeries.from_coo() is implemented for creating a SparseSeries from a scipy.sparse.coo_matrix.
Within scipy.sparse there are methods that convert the data forms to each other: .tocoo, .tocsc, etc. So you can use whichever form is best for a particular operation.
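As a small illustration of those conversion methods, the sketch below builds one matrix and converts it between formats. The sample values are arbitrary.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))

A_coo = A.tocoo()  # exposes row / col / data triplets
A_csc = A.tocsc()  # column-oriented storage, cheap column slicing

# The storage format changes; the values do not.
formats = (A.format, A_coo.format, A_csc.format)
same_values = bool((A_coo.toarray() == A_csc.toarray()).all())
```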
For going the other way, I've answered
Pandas sparse dataFrame to sparse matrix, without generating a dense matrix in memory
Your linked answer from 2013 iterates by row, using toarray to make the row dense. I haven't looked at what the pandas from_coo does.
A more recent SO question on pandas sparse
non-NDFFrame object error using pandas.SparseSeries.from_coo() function
From https://github.com/pydata/pandas/blob/master/pandas/sparse/scipy_sparse.py
def _coo_to_sparse_series(A, dense_index=False):
    """ Convert a scipy.sparse.coo_matrix to a SparseSeries.
    Use the defaults given in the SparseSeries constructor. """
    s = Series(A.data, MultiIndex.from_arrays((A.row, A.col)))
    s = s.sort_index()
    s = s.to_sparse()  # TODO: specify kind?
    # ...
    return s
In effect it takes the same data, i, j used to build a coo matrix, makes a series, sorts it, and turns it into a sparse series.
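Series.to_sparse() was removed along with SparseSeries, so here is a rough sketch of the same three steps for current pandas. It uses astype with a SparseDtype as the replacement for to_sparse(), and compares the result with the Series.sparse.from_coo accessor. The sample matrix is made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical 3x4 matrix with three stored values.
A = coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 1], [0, 0, 2])), shape=(3, 4))

# Manual version of the steps: series on a (row, col) MultiIndex, sort, sparsify.
s_manual = pd.Series(A.data, index=pd.MultiIndex.from_arrays((A.row, A.col)))
s_manual = s_manual.sort_index()
s_manual = s_manual.astype(pd.SparseDtype(s_manual.dtype))

# Accessor-based equivalent in current pandas.
s_builtin = pd.Series.sparse.from_coo(A)
```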
How do I create a scipy sparse matrix from a pandas dataframe?
I don't have pandas installed, so I can't start with a dataframe. But let's assume I have extracted a numpy array from the dataframe (doesn't a method or attribute like values do that?):
In [40]: D
Out[40]:
array([[4109, 2093], # could be other columns
[6633, 2093],
[6634, 2094],
[6635, 2095]])
Making a sparse matrix from that is straightforward. I just need to extract or construct the 3 arrays:
In [41]: M=sparse.coo_matrix((D[:,1], (D[:,0], np.zeros(D.shape[0]))),
   ....:                      shape=(7000,1))
In [42]: M
Out[42]:
<7000x1 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in COOrdinate format>
In [43]: print(M)
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
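For completeness, here is a sketch of the same construction starting from an actual DataFrame. The column names row_id and value are hypothetical, and to_numpy() stands in for the values attribute mentioned above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical frame holding the same numbers as D above.
df = pd.DataFrame({
    "row_id": [4109, 6633, 6634, 6635],
    "value": [2093, 2093, 2094, 2095],
})

D = df.to_numpy()                        # plain 2-D integer array
rows = D[:, 0]                           # row coordinates
cols = np.zeros(D.shape[0], dtype=int)   # everything goes into column 0
M = sparse.coo_matrix((D[:, 1], (rows, cols)), shape=(7000, 1))
```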
=======================
Generalized to two 'data' columns
In [70]: D
Out[70]:
array([[4109, 2093, 128],
[6633, 2093, 129],
[6634, 2094, 127],
[6635, 2095, 126]])
In [76]: i,j,data=[],[],[]
In [77]: for col in range(1,D.shape[1]):
   ....:     i.extend(D[:,0])
   ....:     j.extend(np.zeros(D.shape[0],int)+(col-1))
   ....:     data.extend(D[:,col])
   ....:
In [78]: i
Out[78]: [4109, 6633, 6634, 6635, 4109, 6633, 6634, 6635]
In [79]: j
Out[79]: [0, 0, 0, 0, 1, 1, 1, 1]
In [80]: data
Out[80]: [2093, 2093, 2094, 2095, 128, 129, 127, 126]
In [83]: M=sparse.coo_matrix((data,(i,j)),shape=(7000,D.shape[1]-1))
In [84]: M
Out[84]:
<7000x2 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in COOrdinate format>
In [85]: print(M)
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
(4109, 1) 128
(6633, 1) 129
(6634, 1) 127
(6635, 1) 126
I suspect you could also make separate matrices for each column and combine them with the sparse.bmat (block) mechanism, but I'm most familiar with the coo format.
See "Compiling n submatrices into an NxN matrix in numpy" for another example of building a large sparse matrix from submatrices (in that case they overlap). There I found a way of joining the blocks with a faster array operation. It might be possible to do that here, but I suspect that iterating over a few columns (and extending over many rows) is OK speed-wise.
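One possible loop-free variant, offered as a sketch rather than as the approach from the linked answer: np.tile repeats the row coordinates once per data column, np.repeat generates the matching column numbers, and a column-major ravel lines the values up in the same order.

```python
import numpy as np
from scipy import sparse

D = np.array([[4109, 2093, 128],
              [6633, 2093, 129],
              [6634, 2094, 127],
              [6635, 2095, 126]])

n_rows, n_vals = D.shape[0], D.shape[1] - 1

i = np.tile(D[:, 0], n_vals)              # row ids, repeated for each data column
j = np.repeat(np.arange(n_vals), n_rows)  # 0,0,0,0,1,1,1,1
data = D[:, 1:].ravel(order="F")          # column by column, matching i and j

M = sparse.coo_matrix((data, (i, j)), shape=(7000, n_vals))
```

The arrays i, j and data come out identical to the lists built by the loop above.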
With bmat I could construct the same thing as:
In [98]: I, J = D[:,0], np.zeros(D.shape[0],int)
In [99]: M1=sparse.coo_matrix((D[:,1],(I, J)), shape=(7000,1))
In [100]: M2=sparse.coo_matrix((D[:,2],(I, J)), shape=(7000,1))
In [101]: print(sparse.bmat([[M1,M2]]))
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
(4109, 1) 128
(6633, 1) 129
(6634, 1) 127
(6635, 1) 126
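For a single row of blocks like this, sparse.hstack gives the same result with a slightly shorter call. A quick sketch comparing the two:

```python
import numpy as np
from scipy import sparse

D = np.array([[4109, 2093, 128],
              [6633, 2093, 129],
              [6634, 2094, 127],
              [6635, 2095, 126]])

I, J = D[:, 0], np.zeros(D.shape[0], int)
M1 = sparse.coo_matrix((D[:, 1], (I, J)), shape=(7000, 1))
M2 = sparse.coo_matrix((D[:, 2], (I, J)), shape=(7000, 1))

M_bmat = sparse.bmat([[M1, M2]])      # general block layout
M_hstack = sparse.hstack([M1, M2])    # side-by-side blocks only
```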
Creating a sparse matrix from pandas data frame using scipy.sparse
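Try get_dummies with sparse=True, and optionally pass a small integer dtype for less memory use. Note that 'i8' is NumPy shorthand for 8-byte int64; the 1-byte type is 'int8' (or 'i1'). The level argument of max() was removed in pandas 2.0, where .groupby(level=0).max() is the documented replacement.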
out = pd.get_dummies(df.set_index("X")['Y'],sparse=True,dtype='i8').max(level=0)
print(out)
10 14 15
X
1256 1 1 1
3087 0 1 1
2199 1 1 0
1056 1 0 0
408 0 0 1
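If the goal is a scipy matrix rather than a frame, one alternative is to skip get_dummies and build the coordinates directly. This is a sketch under assumptions: the input frame below is hypothetical, reconstructed so that it reproduces the table above, and pd.factorize is my choice for turning X and Y labels into integer positions.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical long-format input: one (X, Y) pair per row.
df = pd.DataFrame({
    "X": [1256, 1256, 1256, 3087, 3087, 2199, 2199, 1056, 408],
    "Y": [10, 14, 15, 14, 15, 10, 14, 10, 15],
})

# Integer codes for each label; row labels keep order of first appearance.
rows, row_labels = pd.factorize(df["X"])
cols, col_labels = pd.factorize(df["Y"], sort=True)

ones = np.ones(len(df), dtype=np.int8)
M = sparse.coo_matrix(
    (ones, (rows, cols)), shape=(len(row_labels), len(col_labels))
).tocsr()
```

row_labels and col_labels map matrix positions back to the original X and Y values. Repeated (X, Y) pairs would be summed on conversion to csr, so this sketch assumes each pair occurs once.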
transform scipy sparse csr to pandas?
If A is a csr_matrix, you can use .toarray() (there's also .todense(), which produces a numpy matrix and also works with the DataFrame constructor):
df = pd.DataFrame(A.toarray())
You can then use this with pd.concat().
A = csr_matrix([[1, 0, 2], [0, 3, 0]])
(0, 0) 1
(0, 2) 2
(1, 1) 3
<class 'scipy.sparse.csr.csr_matrix'>
pd.DataFrame(A.todense())
0 1 2
0 1 0 2
1 0 3 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
0 2 non-null int64
1 2 non-null int64
2 2 non-null int64
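A short sketch of the pd.concat() step mentioned above. The extra label column and the f0..f2 column names are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 0, 2], [0, 3, 0]])

# Dense copy of the sparse matrix; fine for small data, costly for large data.
features = pd.DataFrame(A.toarray(), columns=["f0", "f1", "f2"])

# Hypothetical existing frame with the same number of rows.
other = pd.DataFrame({"label": ["x", "y"]})

combined = pd.concat([other, features], axis=1)
```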
In version 0.20, pandas added support for building a SparseDataFrame directly from a scipy.sparse matrix.
In pandas 1.0, SparseDataFrame was removed:
In older versions of pandas, the SparseSeries and SparseDataFrame classes were the preferred way to work with sparse data. With the advent of extension arrays, these subclasses are no longer needed. Their purpose is better served by using a regular Series or DataFrame with sparse values instead.
The migration guide shows how to use these new data structures.
For instance, to create a DataFrame from a sparse matrix:
import pandas as pd
from scipy.sparse import csr_matrix
A = csr_matrix([[1, 0, 2], [0, 3, 0]])
df = pd.DataFrame.sparse.from_spmatrix(A, columns=['A', 'B', 'C'])
df
   A  B  C
0  1  0  2
1  0  3  0
df.dtypes
A    Sparse[int64, 0]
B    Sparse[int64, 0]
C    Sparse[int64, 0]
dtype: object
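The .sparse accessor also goes the other way. A small sketch showing the density of the frame and the round trip back to a scipy matrix:

```python
import pandas as pd
from scipy.sparse import csr_matrix

A = csr_matrix([[1, 0, 2], [0, 3, 0]])
df = pd.DataFrame.sparse.from_spmatrix(A, columns=["A", "B", "C"])

density = df.sparse.density  # fraction of cells that are actually stored
back = df.sparse.to_coo()    # scipy COO matrix, no dense intermediate
```

Here 3 of the 6 cells are stored, so the density is 0.5.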
Alternatively, you can pass sparse matrices to sklearn to avoid running out of memory when converting back to pandas. Just convert your other data to sparse format by passing a numpy array to the scipy.sparse.csr_matrix constructor and use scipy.sparse.hstack to combine (see docs).
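A minimal sketch of that combination step. The dense array stands in for "your other data" and its values are made up; the sklearn call itself is left out.

```python
import numpy as np
from scipy import sparse

X_sparse = sparse.csr_matrix([[1, 0, 2], [0, 3, 0]])

# Hypothetical dense features with the same number of rows.
X_dense = np.array([[0.5, 1.5],
                    [2.5, 3.5]])

# Wrap the dense block as csr, then stack the two blocks side by side.
X_all = sparse.hstack([X_sparse, sparse.csr_matrix(X_dense)], format="csr")
```

X_all stays sparse throughout and can be passed to estimators that accept scipy sparse input.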
Scipy sparse matrix as DataFrame column
There is a sparse dataframe or dataseries feature. It is still experimental. I've answered a few SO questions about converting back and forth between that and scipy sparse matrices.
From the sidebar:
Populate a Pandas SparseDataFrame from a SciPy Sparse Coo Matrix
Without such a specialized pandas structure I don't see how a sparse matrix could be added to a pandas frame. The internal structure of a sparse matrix is too different. For a start it is not a subclass of numpy array.
A csr matrix is an object with data contained in 3 arrays. ma.data and ma.indices are 1d arrays with one value for each non-zero element of the array. ma.indptr has one value for each row of the matrix, plus one.
list(ma) is meaningless. ma.toarray() produces a 2d array with the same data, with all those zeros filled in as well.
Other sparse matrix formats store their data in other structures: 3 equal-length arrays for coo, two lists of lists for lil, and a dictionary for dok.
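To make those storage layouts concrete, here is a small sketch that inspects the attributes of each format for one tiny matrix. The matrix values are arbitrary.

```python
from scipy import sparse

ma = sparse.csr_matrix([[1, 0, 2], [0, 3, 0]])

# csr: values, their column indices, and per-row offsets into both arrays.
csr_data = ma.data.tolist()        # [1, 2, 3]
csr_indices = ma.indices.tolist()  # [0, 2, 1]
csr_indptr = ma.indptr.tolist()    # [0, 2, 3] (number of rows + 1 entries)

# coo: three equal-length arrays.
coo = ma.tocoo()
coo_triplets = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))

# lil: one list of column indices and one list of values per row.
lil = ma.tolil()
lil_rows = [list(r) for r in lil.rows]
lil_data = [list(d) for d in lil.data]

# dok: a mapping from (row, col) to value.
dok = ma.todok()
dok_keys = sorted((int(r), int(c)) for r, c in dok.keys())
```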
How to convert pandas dataframe to a sparse matrix using scipy's csr_matrix?
IIUC and using the third link you shared, you can convert your df data to sparse data using pd.SparseDtype, like this:
df_sparsed = df.astype(pd.SparseDtype("float", np.nan))
You can read more about pd.SparseDtype here to choose the right parameters for your data, and then use it in your above command like this:
csr_matrix(df_sparsed.sparse.to_coo()) # Note you need .sparse accessor to access .to_coo()
A simple one-liner would be:
csr_matrix(df.astype(pd.SparseDtype("float", np.nan)).sparse.to_coo())
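A self-contained sketch of that conversion on a made-up frame. One assumption to flag: it uses a fill value of 0 rather than np.nan, because the zeros are what should be left out of the scipy matrix, and recent pandas releases may reject a non-zero fill value in to_coo().

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical mostly-zero frame.
df = pd.DataFrame({
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 0.0, 2.0],
})

# Mark 0 as the value that is not stored, then convert via the accessor.
df_sparsed = df.astype(pd.SparseDtype("float", 0))
M = csr_matrix(df_sparsed.sparse.to_coo())
```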
How to create a sparse DataFrame from a list of dicts
I suggest using dtype='Sparse' for this.
If all elements are numbers you can use dtype='Sparse', dtype='Sparse[int]' or dtype='Sparse[float]'.
data = [{"id":'a',"v0":3,"v2":6},
{"id":'b',"v1":1,"v4":7}]
index = [item.pop('id') for item in data]
pd.DataFrame(data, index=index, dtype="Sparse")
If any value is a string you have to use dtype='Sparse[str]'.
data = [{"id":'a',"v0":3,"v2":'foo'},
{"id":'b',"v1":1,"v4":'ouch'}]
df = pd.DataFrame(data, dtype="Sparse[str]").set_index("id",verify_integrity=True)
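A variant that may be more robust across pandas versions, offered as a sketch: build the dense frame first and convert afterwards with astype. The explicit pd.SparseDtype("float", np.nan) is my assumption about what the numeric case needs, since keys missing from a dict become NaN.

```python
import numpy as np
import pandas as pd

data = [{"id": "a", "v0": 3, "v2": 6},
        {"id": "b", "v1": 1, "v4": 7}]

# Keys missing from a dict become NaN in the dense frame.
dense = pd.DataFrame(data).set_index("id", verify_integrity=True)

# Store only the non-NaN cells.
sparse_df = dense.astype(pd.SparseDtype("float", np.nan))
density = sparse_df.sparse.density
```

Four of the eight cells hold a value here, so the density is 0.5.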