Create Sparse Matrix from a Data Frame

Convert Pandas dataframe to Sparse Numpy Matrix directly

df.values already returns a NumPy array, so there is usually no need to wrap the frame in np.array first.

scipy.sparse.csr_matrix(df.values)

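You might need to take the transpose first, like df.values.T, if the columns of the frame should become the rows of the matrix. In DataFrames, the rows (the index) are axis 0 and the columns are axis 1.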
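As a minimal sketch of the conversion above, assuming every column is numeric (the frame and its column names are made up for illustration):

```python
import pandas as pd
from scipy import sparse

# Hypothetical, mostly-zero numeric frame: 3 rows, 4 columns.
df = pd.DataFrame(
    [[0, 0, 3, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 2]],
    columns=["a", "b", "c", "d"],
)

# Rows of the frame become rows of the matrix.
m = sparse.csr_matrix(df.values)

# Transpose first if the frame's columns should become the matrix rows.
m_t = sparse.csr_matrix(df.values.T)

print(m.shape, m_t.shape, m.nnz)
```

Note that df.values builds the full dense array before the sparse matrix is created, so this route only helps once the data already fits in memory.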

Creating a sparse matrix from pandas data frame using scipy.sparse

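Try get_dummies with sparse=True, and optionally pass a small integer dtype such as dtype='i1' (int8) for less memory use; 'i8' is int64 and takes 8 bytes per stored value. The level argument of max was removed in pandas 2.0, so group on the index level instead:

out = pd.get_dummies(df.set_index("X")['Y'], sparse=True, dtype='i1').groupby(level=0, sort=False).max()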


print(out)

      10  14  15
X
1256   1   1   1
3087   0   1   1
2199   1   1   0
1056   1   0   0
408    0   0   1
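If the goal is a scipy matrix rather than a DataFrame, a different route is to skip get_dummies and build the indicator matrix from integer codes. This is only a sketch of that alternative, not the answer's method; the input frame below is hypothetical, reconstructed so that it reproduces the table above:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical input, chosen to reproduce the table printed above.
df = pd.DataFrame({
    "X": [1256, 1256, 1256, 3087, 3087, 2199, 2199, 1056, 408],
    "Y": [10, 14, 15, 14, 15, 10, 14, 10, 15],
})

# Keep one stored entry per distinct (X, Y) pair.
pairs = df[["X", "Y"]].drop_duplicates()

# Map labels to integer positions; the labels are kept for later lookup.
row_codes, row_labels = pd.factorize(pairs["X"])             # order of appearance
col_codes, col_labels = pd.factorize(pairs["Y"], sort=True)  # sorted column labels

indicator = sparse.coo_matrix(
    (np.ones(len(pairs), dtype=np.int8), (row_codes, col_codes)),
    shape=(len(row_labels), len(col_labels)),
).tocsr()

print(list(row_labels), list(col_labels))
print(indicator.toarray())
```

row_labels and col_labels play the role of the index and the column headers, so row k of the matrix belongs to row_labels[k].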

Want to create a sparse matrix like dataframe from a dataframe in pandas/python

Update: pd.get_dummies now accepts sparse=True to create a SparseArray output.

pd.get_dummies can be applied to a Series to create a one-hot encoding, like so:

import pandas as pd

header = ["ds", "buyer_id", "email_address"]
data = [[23, 305, "fatin1bd@gmail.com"],
        [22, 307, "shovonbad@gmail.com"],
        [25, 411, "raisulk@gmail.com"],
        [22, 588, "saiful.sdp@hotmail.com"],
        [24, 664, "osman.dhk@gmail.com"]]
df = pd.DataFrame(data, columns=header)
df.join(pd.get_dummies(df["ds"]))  # one indicator column per distinct ds value

output:

   ds  buyer_id           email_address  22  23  24  25
0  23       305      fatin1bd@gmail.com   0   1   0   0
1  22       307     shovonbad@gmail.com   1   0   0   0
2  25       411       raisulk@gmail.com   0   0   0   1
3  22       588  saiful.sdp@hotmail.com   1   0   0   0
4  24       664     osman.dhk@gmail.com   0   0   1   0

Just for added clarification: The resulting dataframe is still stored in a dense format. You could use scipy.sparse matrix formats to store it in a true sparse format.
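To make that last point concrete, here is a small sketch that asks get_dummies for sparse columns and then hands them to scipy. It reuses only the ds values from the example; the int8 dtype is my own choice, made so that the fill value is 0:

```python
import pandas as pd

# The "ds" column from the example above.
ds = pd.Series([23, 22, 25, 22, 24], name="ds")

# sparse=True gives Sparse-typed columns instead of dense ones.
dummies = pd.get_dummies(ds, sparse=True, dtype="int8")

# The .sparse accessor converts the whole frame to a scipy COO matrix.
coo = dummies.sparse.to_coo()

print(dummies.dtypes)
print(coo.shape, coo.nnz)
```

Only the five ones are stored; the fifteen zeros are implied by the fill value.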

How to convert pandas dataframe to a sparse matrix using scipy's csr_matrix?

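IIUC and using the third link you shared, you can convert your df to sparse columns using pd.SparseDtype, like this:

df_sparsed = df.astype(pd.SparseDtype("float", 0.0))

The second argument is the fill value, i.e. the value that is left out of storage. You can read more about pd.SparseDtype in the pandas documentation to choose the right parameters for your data. Note that recent pandas versions raise a ValueError in .sparse.to_coo() unless the fill value is 0, so np.nan will not work for this step. Then use it in your command above like this:

csr_matrix(df_sparsed.sparse.to_coo())  # note: you need the .sparse accessor to reach .to_coo()

A simple one-liner would be:

csr_matrix(df.astype(pd.SparseDtype("float", 0.0)).sparse.to_coo())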
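A self-contained sketch of that route, with a made-up frame and a fill value of 0 so that zeros are the entries left out of storage:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical mostly-zero float frame.
df = pd.DataFrame({
    "a": [0.0, 1.5, 0.0, 0.0],
    "b": [0.0, 0.0, 0.0, 2.0],
    "c": [3.0, 0.0, 0.0, 0.0],
})

# Fill value 0.0: zeros become implicit, everything else is stored.
df_sparsed = df.astype(pd.SparseDtype("float", 0.0))

m = csr_matrix(df_sparsed.sparse.to_coo())

print(m.shape, m.nnz)
print(df_sparsed.sparse.density)  # fraction of explicitly stored values
```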

How do I create a scipy sparse matrix from a pandas dataframe?

I don't have pandas installed, so I can't start with a dataframe. But let's assume I have extracted a numpy array from the dataframe (doesn't a method or attribute like values do that?):

In [40]: D
Out[40]:
array([[4109, 2093],   # could be other columns
       [6633, 2093],
       [6634, 2094],
       [6635, 2095]])

Making a sparse matrix from that is straightforward - I just need to extract or construct the 3 arrays (data, row indices, column indices):

In [41]: M=sparse.coo_matrix((D[:,1], (D[:,0], np.zeros(D.shape[0],int))),
                             shape=(7000,1))

In [42]: M
Out[42]:
<7000x1 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in COOrdinate format>

In [43]: print(M)
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
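For readers who do have pandas, the same construction can be sketched starting from a frame. The column names row_id and value are hypothetical; the numbers are the ones from D above:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical frame holding the same numbers as D.
df = pd.DataFrame({
    "row_id": [4109, 6633, 6634, 6635],
    "value": [2093, 2093, 2094, 2095],
})

D = df.to_numpy()

rows = D[:, 0]                           # first column gives the row positions
cols = np.zeros(D.shape[0], dtype=int)   # everything lands in column 0
M = sparse.coo_matrix((D[:, 1], (rows, cols)), shape=(7000, 1))

print(M.shape, M.nnz)
```

The shape has to be at least rows.max() + 1 tall; 7000 is simply the size used in the session above.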

=======================

Generalized to two 'data' columns

In [70]: D
Out[70]:
array([[4109, 2093,  128],
       [6633, 2093,  129],
       [6634, 2094,  127],
       [6635, 2095,  126]])

In [76]: i,j,data=[],[],[]

In [77]: for col in range(1,D.shape[1]):
   ....:     i.extend(D[:,0])
   ....:     j.extend(np.zeros(D.shape[0],int)+(col-1))
   ....:     data.extend(D[:,col])
   ....:

In [78]: i
Out[78]: [4109, 6633, 6634, 6635, 4109, 6633, 6634, 6635]

In [79]: j
Out[79]: [0, 0, 0, 0, 1, 1, 1, 1]

In [80]: data
Out[80]: [2093, 2093, 2094, 2095, 128, 129, 127, 126]

In [83]: M=sparse.coo_matrix((data,(i,j)),shape=(7000,D.shape[1]-1))

In [84]: M
Out[84]:
<7000x2 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in COOrdinate format>

In [85]: print(M)
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
(4109, 1) 128
(6633, 1) 129
(6634, 1) 127
(6635, 1) 126

I suspect you could also make separate matrices for each column, and combine them with the sparse.bmat (block) mechanism, but I'm most familiar with the coo format.

See "Compiling n submatrices into an NxN matrix in numpy" for another example of building a large sparse matrix from submatrices (there they overlap). In that case I found a way of joining the blocks with a faster array operation. It might be possible to do that here, but I suspect that iterating over a few columns (and extending over many rows) is fine speed-wise.
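One way the loop could be replaced by array operations, offered as a sketch rather than a benchmarked improvement, is to build i, j and data with np.tile, np.repeat and a column-major ravel:

```python
import numpy as np
from scipy import sparse

D = np.array([[4109, 2093, 128],
              [6633, 2093, 129],
              [6634, 2094, 127],
              [6635, 2095, 126]])

n_rows, n_data_cols = D.shape[0], D.shape[1] - 1

i = np.tile(D[:, 0], n_data_cols)               # row ids, once per data column
j = np.repeat(np.arange(n_data_cols), n_rows)   # 0,0,0,0,1,1,1,1
data = D[:, 1:].ravel(order="F")                # column 1 values, then column 2

M = sparse.coo_matrix((data, (i, j)), shape=(7000, n_data_cols))

print(M.shape, M.nnz)
```

The three arrays come out in the same order as the lists i, j and data built by the loop.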

With bmat I could construct the same thing as:

In [98]: I, J = D[:,0], np.zeros(D.shape[0],int)

In [99]: M1=sparse.coo_matrix((D[:,1],(I, J)), shape=(7000,1))
In [100]: M2=sparse.coo_matrix((D[:,2],(I, J)), shape=(7000,1))

In [101]: print(sparse.bmat([[M1,M2]]))
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
(4109, 1) 128
(6633, 1) 129
(6634, 1) 127
(6635, 1) 126
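For a single row of blocks, sparse.hstack can be used in place of bmat. A small sketch that repeats the construction above and checks that both give the same matrix:

```python
import numpy as np
from scipy import sparse

D = np.array([[4109, 2093, 128],
              [6633, 2093, 129],
              [6634, 2094, 127],
              [6635, 2095, 126]])

I, J = D[:, 0], np.zeros(D.shape[0], int)

# One 7000x1 matrix per data column.
M1 = sparse.coo_matrix((D[:, 1], (I, J)), shape=(7000, 1))
M2 = sparse.coo_matrix((D[:, 2], (I, J)), shape=(7000, 1))

side_by_side = sparse.hstack([M1, M2])   # blocks placed left to right
blocks = sparse.bmat([[M1, M2]])         # same layout via the block mechanism

print(side_by_side.shape, side_by_side.nnz)
print(np.array_equal(side_by_side.toarray(), blocks.toarray()))
```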

create row, column, data pandas dataframe from sparse matrix

The values you want to put in the dataframe are available as

a_coo.row, a_coo.col, a_coo.data
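A minimal sketch of putting those three attributes into a frame. The small matrix is made up, and the column names row, col and data are just the obvious choice:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical small sparse matrix in CSR format.
a = sparse.csr_matrix(np.array([[0, 5, 0],
                                [7, 0, 0],
                                [0, 0, 9]]))

# COO format exposes the coordinates and values as plain arrays.
a_coo = a.tocoo()

triples = pd.DataFrame({
    "row": a_coo.row,
    "col": a_coo.col,
    "data": a_coo.data,
})

print(triples)
```

Each row of the resulting frame describes one stored entry of the matrix.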

