Convert Pandas dataframe to Sparse Numpy Matrix directly
df.values is a NumPy array, and reading it directly is generally faster than calling np.array(df).
scipy.sparse.csr_matrix(df.values)
You might need to take the transpose first, like df.values.T. Note that df.values has shape (rows, columns), so DataFrame rows are axis 0 and columns are axis 1.
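As a minimal sketch of the conversion described above: the frame below is a hypothetical, mostly-zero example (its column names and values are not from the original question), and to_numpy() is used as the current spelling of .values.

```python
import pandas as pd
from scipy import sparse

# Hypothetical mostly-zero frame, for illustration only: 4 rows, 3 columns.
df = pd.DataFrame({"a": [0, 0, 3, 0], "b": [1, 0, 0, 0], "c": [0, 0, 0, 0]})

dense = df.to_numpy()               # same data as df.values, shape (4, 3)
mat = sparse.csr_matrix(dense)      # rows x columns
mat_t = sparse.csr_matrix(dense.T)  # columns x rows, if that orientation is needed

print(mat.shape, mat_t.shape, mat.nnz)
```

This route materialises the dense array first, so it only helps when the dense data already fits in memory.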
Creating a sparse matrix from pandas data frame using scipy.sparse
Try get_dummies with sparse=True, and optionally set dtype. Note that 'i8' means 64-bit integers; a narrower type such as 'i1' (int8) would use less memory.
out = pd.get_dummies(df.set_index("X")['Y'], sparse=True, dtype='i8').groupby(level=0, sort=False).max()
print(out)
      10  14  15
X
1256   1   1   1
3087   0   1   1
2199   1   1   0
1056   1   0   0
408    0   0   1
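If the goal is a scipy matrix rather than a DataFrame, the same X-by-Y indicator table can be sketched directly with coo_matrix. The input rows below are an assumption, reconstructed from the output shown above since the original frame is not given; pd.factorize turns the labels into integer positions.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Assumed long-format input, reconstructed from the printed output above.
df = pd.DataFrame({
    "X": [1256, 1256, 1256, 3087, 3087, 2199, 2199, 1056, 408],
    "Y": [10, 14, 15, 14, 15, 10, 14, 10, 15],
})

# Map labels to 0-based positions; keep the label arrays to interpret the matrix.
rows, row_labels = pd.factorize(df["X"])             # order of first appearance
cols, col_labels = pd.factorize(df["Y"], sort=True)  # sorted column labels

ones = np.ones(len(df), dtype="int8")
m = sparse.coo_matrix((ones, (rows, cols)),
                      shape=(len(row_labels), len(col_labels))).tocsr()

print(list(row_labels), list(col_labels))
print(m.toarray())
```

Repeated (X, Y) pairs would be summed when converting to CSR, so a strict 0/1 indicator would need the duplicates dropped first.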
Want to create a sparse matrix like dataframe from a dataframe in pandas/python
Update: pd.get_dummies now accepts sparse=True to create a SparseArray output.
pd.get_dummies(s: pd.Series) can be used to create a one-hot encoding like this:
import pandas as pd

header = ["ds", "buyer_id", "email_address"]
data = [[23, 305, "fatin1bd@gmail.com"],
        [22, 307, "shovonbad@gmail.com"],
        [25, 411, "raisulk@gmail.com"],
        [22, 588, "saiful.sdp@hotmail.com"],
        [24, 664, "osman.dhk@gmail.com"]]
df = pd.DataFrame(data, columns=header)
# Append one indicator column per distinct value of "ds"
df.join(pd.get_dummies(df["ds"]))
output:
   ds  buyer_id           email_address  22  23  24  25
0  23       305      fatin1bd@gmail.com   0   1   0   0
1  22       307     shovonbad@gmail.com   1   0   0   0
2  25       411       raisulk@gmail.com   0   0   0   1
3  22       588  saiful.sdp@hotmail.com   1   0   0   0
4  24       664     osman.dhk@gmail.com   0   0   1   0
Just for added clarification: the resulting DataFrame is still stored in a dense format. You could use scipy.sparse matrix formats to store it in a true sparse format.
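One possible way to get that true sparse storage, sketched under the assumption of a reasonably recent pandas where DataFrames with all-sparse columns expose a .sparse accessor: build the dummies with sparse=True and hand them to scipy through to_coo(). The dtype="int8" choice is mine, picked so the fill value is 0.

```python
import pandas as pd

# Only the "ds" column from the example above is needed here.
ds = pd.Series([23, 22, 25, 22, 24], name="ds")

# Sparse one-hot columns (22, 23, 24, 25) with fill value 0.
dummies = pd.get_dummies(ds, sparse=True, dtype="int8")

# Convert to a scipy COO matrix, then to CSR for row slicing and arithmetic.
onehot = dummies.sparse.to_coo().tocsr()

print(onehot.shape, onehot.nnz)
```

The column labels are not carried over to the scipy matrix, so keep dummies.columns around if they are needed later.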
How to convert pandas dataframe to a sparse matrix using scipy's csr_matrix?
IIUC and using the third link you shared, you can convert your df data to sparse data using pd.SparseDtype, like this:
df_sparsed = df.astype(pd.SparseDtype("float", np.nan))
You can read more about pd.SparseDtype in the pandas documentation to choose the right parameters for your data, and then use it in your command above like this:
csr_matrix(df_sparsed.sparse.to_coo()) # Note: you need the .sparse accessor to reach .to_coo()
A simple one-liner would be:
csr_matrix(df.astype(pd.SparseDtype("float", np.nan)).sparse.to_coo())
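A self-contained sketch of the same idea follows. It departs from the answer in one respect, which is an assumption on my part: the fill value is 0.0 rather than np.nan, because zeros are what a scipy matrix leaves unstored and, as far as I know, recent pandas releases reject to_coo() for a non-zero fill value. The frame itself is hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical numeric frame where most entries are zero.
df = pd.DataFrame({"a": [0.0, 0.0, 3.0], "b": [1.0, 0.0, 0.0]})

# Store each column as a SparseArray whose implicit (unstored) value is 0.0.
df_sparsed = df.astype(pd.SparseDtype("float", 0.0))

# The .sparse accessor exposes to_coo(); CSR is built from the COO result.
mat = csr_matrix(df_sparsed.sparse.to_coo())

print(mat.shape, mat.nnz)
```

If missing values should be the implicit entries instead, np.nan is the natural fill value for the pandas side, but the scipy conversion may then need a fillna(0) step first.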
How do I create a scipy sparse matrix from a pandas dataframe?
I don't have pandas installed, so I can't start with a dataframe. But let's assume I have extracted a numpy array from the dataframe (doesn't a method or attribute like values do that?):
In [40]: D
Out[40]:
array([[4109, 2093],   # could be other columns
       [6633, 2093],
       [6634, 2094],
       [6635, 2095]])
Making a sparse matrix from that is straightforward. I just need to extract or construct the 3 arrays (data, row indices, column indices):
In [41]: M=sparse.coo_matrix((D[:,1], (D[:,0], np.zeros(D.shape[0]))),
   ....:     shape=(7000,1))
In [42]: M
Out[42]:
<7000x1 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in COOrdinate format>
In [43]: print(M)
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
=======================
Generalized to two 'data' columns
In [70]: D
Out[70]:
array([[4109, 2093,  128],
       [6633, 2093,  129],
       [6634, 2094,  127],
       [6635, 2095,  126]])
In [76]: i,j,data=[],[],[]
In [77]: for col in range(1,D.shape[1]):
   ....:     i.extend(D[:,0])
   ....:     j.extend(np.zeros(D.shape[0],int)+(col-1))
   ....:     data.extend(D[:,col])
   ....:
In [78]: i
Out[78]: [4109, 6633, 6634, 6635, 4109, 6633, 6634, 6635]
In [79]: j
Out[79]: [0, 0, 0, 0, 1, 1, 1, 1]
In [80]: data
Out[80]: [2093, 2093, 2094, 2095, 128, 129, 127, 126]
In [83]: M=sparse.coo_matrix((data,(i,j)),shape=(7000,D.shape[1]-1))
In [84]: M
Out[84]:
<7000x2 sparse matrix of type '<class 'numpy.int32'>'
with 8 stored elements in COOrdinate format>
In [85]: print(M)
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
(4109, 1) 128
(6633, 1) 129
(6634, 1) 127
(6635, 1) 126
I suspect you could also make separate matrices for each column, and combine them with the sparse.bmat (block) mechanism, but I'm most familiar with the coo format.
See "Compiling n submatrices into an NxN matrix in numpy" for another example of building a large sparse matrix from submatrices (there they overlap). There I found a way of joining the blocks with a faster array operation. It might be possible to do that here. But I suspect that the iteration over a few columns (and extend over many rows) is OK speed-wise.
With bmat I could construct the same thing as:
In [98]: I, J = D[:,0], np.zeros(D.shape[0],int)
In [99]: M1=sparse.coo_matrix((D[:,1],(I, J)), shape=(7000,1))
In [100]: M2=sparse.coo_matrix((D[:,2],(I, J)), shape=(7000,1))
In [101]: print(sparse.bmat([[M1,M2]]))
(4109, 0) 2093
(6633, 0) 2093
(6634, 0) 2094
(6635, 0) 2095
(4109, 1) 128
(6633, 1) 129
(6634, 1) 127
(6635, 1) 126
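The loop over columns can probably be replaced by plain array operations. The following is a sketch of one way to do that, not the approach from the linked answer: np.tile and np.repeat build the same i and j lists as above, and the transposed slice of D gives the data in the same order.

```python
import numpy as np
from scipy import sparse

D = np.array([[4109, 2093, 128],
              [6633, 2093, 129],
              [6634, 2094, 127],
              [6635, 2095, 126]])

n_rows = D.shape[0]
n_data_cols = D.shape[1] - 1   # every column after the first holds data

i = np.tile(D[:, 0], n_data_cols)               # row ids, repeated once per data column
j = np.repeat(np.arange(n_data_cols), n_rows)   # 0,0,0,0,1,1,1,1
data = D[:, 1:].T.ravel()                       # column 1 values, then column 2 values

M = sparse.coo_matrix((data, (i, j)), shape=(7000, n_data_cols))
print(M.shape, M.nnz)
```

For a handful of columns the difference from the loop is unlikely to matter; this mainly avoids growing Python lists over many rows.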
create row, column, data pandas dataframe from sparse matrix
The values you want to put in the dataframe are available as a_coo.row, a_coo.col, a_coo.data
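A short sketch of that, assuming a_coo is a scipy matrix already in COO format (the small matrix here is made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Made-up matrix; tocoo() guarantees the .row/.col/.data attributes exist.
a = sparse.csr_matrix(np.array([[0, 5, 0],
                                [7, 0, 0]]))
a_coo = a.tocoo()

# One DataFrame row per stored element.
triples = pd.DataFrame({"row": a_coo.row, "col": a_coo.col, "data": a_coo.data})
print(triples)
```

Other sparse formats such as CSR do not expose row and col directly, which is why the conversion with tocoo() comes first.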