How to convert a pandas dataframe to NumPy array
It seems that you want to convert the DataFrame into a 1D array (this should be made clear in the post).
First, convert the DataFrame to a 2D NumPy array using DataFrame.to_numpy (using DataFrame.values is discouraged), and then use ndarray.ravel or ndarray.flatten to flatten the array.
arr = df.to_numpy().ravel()
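As a rough illustration of the two flattening methods, here is a minimal sketch. The DataFrame contents and the column names a and b are made up for the example and are not from the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the original question)
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

arr2d = df.to_numpy()        # 2D array, one row per DataFrame row

# ravel() may return a view of the data when the memory layout allows it;
# flatten() always returns a new copy. Both use row-major order by default.
flat_ravel = arr2d.ravel()
flat_copy = arr2d.flatten()

print(arr2d.shape)
print(flat_ravel)
```

If you intend to modify the flattened array without touching the source data, flatten is the safer choice because it never shares memory with the 2D array.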
Convert pandas dataframe to numpy array - which approach to prefer?
The functions you mention serve different purposes.
- pd.to_numeric: Use this to convert types in your dataframe if your data is not currently stored in numeric form, or if you wish to cast to an optimal type via downcast='float' or downcast='integer'.
- pd.DataFrame.to_numpy() (v0.24+) or pd.DataFrame.values: Use this to retrieve the NumPy array representation of your dataframe.
- pd.DataFrame.as_matrix: Do not use this. It was kept only for backwards compatibility and has been removed as of pandas 1.0.
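To make the distinction concrete, the sketch below first uses pd.to_numeric to fix and shrink column types, and only then calls to_numpy. The column names x and y and their values are hypothetical.

```python
import pandas as pd

# Hypothetical data: "x" holds numbers stored as strings, "y" is float64
df = pd.DataFrame({
    "x": pd.Series(["1", "2", "3"], dtype=object),
    "y": [1.5, 2.5, 3.5],
})

# Step 1: convert/downcast the column types with pd.to_numeric
df["x"] = pd.to_numeric(df["x"], downcast="integer")  # strings -> smallest int type
df["y"] = pd.to_numeric(df["y"], downcast="float")    # float64 -> float32

# Step 2: retrieve the NumPy representation of the whole frame
arr = df.to_numpy()

print(df.dtypes)
print(arr.shape)
```

Note that to_numpy on a frame with mixed column types returns a single array with one common dtype, so per-column downcasting mainly pays off while the data still lives in the DataFrame.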
How to convert pandas data frame to NumPy array?
Use a list comprehension in which each row is turned into a dictionary (built from a nested generator expression), and pass the result to the DataFrame constructor:
df = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print (df)
1 2 3 4
0 0.0033524514 -0.021896651 0.05087798 0
1 0.02134219 -0.007388343 0.06835007 0
2 0.030515702 -0.0037591448 0.066626 0
3 0.0069114454 -0.0149497045 0.020777626 0
4 0.003118149 -0.015105667 0.040879637 0.4
And then convert to floats and to numpy array:
print (df.astype(float).to_numpy())
[[ 0.00335245 -0.02189665 0.05087798 0. ]
[ 0.02134219 -0.00738834 0.06835007 0. ]
[ 0.0305157 -0.00375914 0.066626 0. ]
[ 0.00691145 -0.0149497 0.02077763 0. ]
[ 0.00311815 -0.01510567 0.04087964 0.4 ]]
Convert pandas DataFrame columns to NumPy array (extension question)
You mean you want the transposed array?
df.T.values
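A small sketch of the same idea using to_numpy instead of .values. The example frame is hypothetical; it only shows that each original column becomes one row of the result.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Transpose first, then convert: one row per original column
cols_as_rows = df.T.to_numpy()

# Converting first and transposing the array gives the same values
same_values = df.to_numpy().T

print(cols_as_rows)
```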
How to convert a pandas dataframe into a numpy array with the column names
- To do a quick search for a val by its "item" and "color", use one of the following options:
  - Use pandas Boolean indexing.
  - Convert the dataframe into a numpy.recarray using pandas.DataFrame.to_records, and also use Boolean indexing.
- .item is a method for both pandas and numpy, so don't use 'item' as a column name. It has been changed to '_item'.
- As an FYI, numpy is a pandas dependency, and much of the pandas vectorized functionality directly corresponds to numpy.
import pandas as pd
import numpy as np
# test data
df = pd.DataFrame({'_item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
# Use pandas Boolean indexing to select the matching row
selected = df[(df._item == 'book') & (df.color == 'blue')]
# print(selected)
_item color val
book blue -109.6
# Alternatively, create a recarray
v = df.to_records(index=False)
# display(v)
rec.array([('book', 'green', -22.7 ), ('book', 'blue', -109.6 ),
('car', 'red', -57.19), ('car', 'green', -11.2 ),
('bike', 'blue', -25.6 ), ('bike', 'red', -33.61)],
dtype=[('_item', 'O'), ('color', 'O'), ('val', '<f8')])
# search the recarray
selected = v[(v._item == 'book') & (v.color == 'blue')]
# print(selected)
[('book', 'blue', -109.6)]
Update in response to OP edit
- You must first reshape the dataframe using pandas.DataFrame.pivot, and then use the previously mentioned methods.
dfp = df.pivot(index='_item', columns='color', values='val')
# display(dfp)
color blue green red
_item
bike -25.6 NaN -33.61
book -109.6 -22.7 NaN
car NaN -11.2 -57.19
# create a numpy recarray
v = dfp.to_records(index=True)
# display(v)
rec.array([('bike', -25.6, nan, -33.61),
('book', -109.6, -22.7, nan),
('car', nan, -11.2, -57.19)],
dtype=[('_item', 'O'), ('blue', '<f8'), ('green', '<f8'), ('red', '<f8')])
# select data
selected = v.blue[(v._item == 'book')]
# print(selected)
array([-109.6])
Trying to convert pandas df to np array, dtaidistance computes list instead
(one of the dtaidistance authors here)
The dtaidistance package expects one of three formats:
- A 2D numpy array (where all sequences have the same length by definition)
- A Python list of 1D numpy.array or array.array.
- A Python list of Python lists
In your case you could do:
series = move_df['movement'].to_list()
dtw.distance_matrix(series)
which then works on a list of lists.
To use the fast C implementation, an array is required (either a NumPy array or a standard library array.array). If you want to keep different lengths, you can do:
series = move_df['movement'].apply(lambda a: np.array(a, dtype=np.double)).to_list()
dtw.distance_matrix_fast(series)
Note that it might make sense to apply the conversion in place on your move_df data structure, so that you only have to do it once and do not have to keep track of two nearly identical data structures. After you do this, the to_list call is sufficient. Thus:
move_df['movement'] = move_df['movement'].apply(lambda a: np.array(a, dtype=np.double))
series = move_df['movement'].to_list()
dtw.distance_matrix_fast(series)
If you want to use a 2D numpy matrix, you would need to truncate or pad all series to be the same length as is explained in other answers (for dtw padding is more common to not lose information).
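For the 2D case, padding could look roughly like the sketch below. This is only an illustration with made-up data in a hypothetical move_df; padding with zeros is an assumption made for the example, and the appropriate pad value depends on your data. It uses plain NumPy and does not call dtaidistance.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one variable-length sequence per row
move_df = pd.DataFrame({"movement": [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]})

series = move_df["movement"].to_list()
max_len = max(len(s) for s in series)

# Assumption: pad the tail of shorter sequences with zeros
padded = np.zeros((len(series), max_len), dtype=np.double)
for i, s in enumerate(series):
    padded[i, :len(s)] = s

print(padded.shape)
```

The resulting array has one row per sequence and can be passed wherever a 2D array of equal-length series is expected.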
PS: This assumes you want to do univariate DTW; the ndim subpackage for multivariate time series expects a different data structure.