Convert Pandas Dataframe to Numpy Array

How to convert a pandas dataframe to NumPy array

It seems that you want to convert the DataFrame into a 1D array (this should be made clear in the question).

First, convert the DataFrame to a 2D numpy array using DataFrame.to_numpy (using DataFrame.values is discouraged) and then use ndarray.ravel or ndarray.flatten to flatten the array.

arr = df.to_numpy().ravel()
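As a small illustration of the difference between the two flattening methods, here is a sketch on a made-up two-column frame (the column names and values are only for demonstration). ravel returns a view when the memory layout allows it, while flatten always returns a copy.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; any numeric DataFrame behaves the same way.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

arr2d = df.to_numpy()        # 2D array, shape (3, 2), one row per DataFrame row
flat_view = arr2d.ravel()    # row-major 1D array; a view only when the layout allows it
flat_copy = arr2d.flatten()  # row-major 1D array; always a fresh copy

print(flat_view)  # [1 4 2 5 3 6]
```

If the flattened array will be modified independently of the 2D array, flatten is the safer choice.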

Convert pandas dataframe to numpy array - which approach to prefer?

The functions you mention serve different purposes.

  1. pd.to_numeric: Use this to convert data to a numeric dtype when it is not currently stored in numeric form, or to downcast to a smaller numeric type via downcast='float' or downcast='integer'.

  2. pd.DataFrame.to_numpy() (v0.24+) or pd.DataFrame.values: Use this to retrieve a NumPy array representation of your dataframe.

  3. pd.DataFrame.as_matrix: Do not use this. It was deprecated in pandas 0.23 and removed in pandas 1.0; use to_numpy() instead.
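The first two options can be combined, as in the following sketch. The frame and its column names are invented for illustration; the point is only that pd.to_numeric changes the dtypes while to_numpy extracts the array.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose numbers arrived as strings.
df = pd.DataFrame({"x": ["1", "2", "3"], "y": ["0.5", "1.5", "2.5"]})

# pd.to_numeric works on one Series at a time, so apply it column by column.
numeric = df.apply(pd.to_numeric)

# Optionally downcast each column to the smallest float dtype that can hold it
# (float32 is the minimum pandas will choose).
small = numeric.apply(pd.to_numeric, downcast="float")

# Extract the NumPy representation; mixed int/float columns are upcast to float.
arr = numeric.to_numpy()
print(numeric.dtypes)
print(arr.dtype, arr.shape)
```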

How to convert pandas data frame to NumPy array?

Use a list comprehension that builds one dictionary per row (each "key:value" token is split on ':'), and pass the result to DataFrame:

df = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])

df = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print(df)
              1              2            3    4
0  0.0033524514   -0.021896651   0.05087798    0
1    0.02134219   -0.007388343   0.06835007    0
2   0.030515702  -0.0037591448     0.066626    0
3  0.0069114454  -0.0149497045  0.020777626    0
4   0.003118149   -0.015105667  0.040879637  0.4

Then convert to floats and to a NumPy array:

print(df.astype(float).to_numpy())
[[ 0.00335245 -0.02189665  0.05087798  0.        ]
 [ 0.02134219 -0.00738834  0.06835007  0.        ]
 [ 0.0305157  -0.00375914  0.066626    0.        ]
 [ 0.00691145 -0.0149497   0.02077763  0.        ]
 [ 0.00311815 -0.01510567  0.04087964  0.4       ]]
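For a self-contained version of the same idea, the sketch below reads from an in-memory string instead of a file. The input layout (one record per line, whitespace-separated "key:value" pairs) is an assumption inferred from the parsing code above, and the sample values are made up.

```python
import io

import pandas as pd

# Assumed input layout: one record per line, whitespace-separated "key:value" pairs.
raw = io.StringIO(
    "1:0.5 2:-0.25 3:1.0 4:0\n"
    "1:1.5 2:-0.75 3:2.0 4:0.4\n"
)

# The lines contain no tabs, so each whole line lands in the single "vector" column.
df = pd.read_csv(raw, sep="\t", names=["vector"])

# Build one dict per line; pandas aligns the keys into columns.
records = [dict(pair.split(":") for pair in line.split()) for line in df["vector"]]

# The parsed values are still strings, so cast to float before extracting the array.
arr = pd.DataFrame(records).astype(float).to_numpy()
print(arr.shape)  # (2, 4)
```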

Convert pandas DataFrame columns to NumPy array (extension question)

You mean you want the transposed array?

df.T.values
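A minimal sketch of what the transpose gives you, using an invented two-column frame and to_numpy in place of .values: each row of the result holds one original column.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

by_row = df.to_numpy()    # shape (3, 2): one row per DataFrame row
by_col = df.T.to_numpy()  # shape (2, 3): one row per DataFrame column

# Unpacking the transposed array yields one 1D array per column.
a_values, b_values = by_col
print(a_values)  # [1 2 3]
```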

How to convert a pandas dataframe into a numpy array with the column names

  • To quickly look up a val by its "item" and "color", use one of the following options:
    1. Use pandas Boolean indexing
    2. Convert the dataframe into a numpy.recarray using pandas.DataFrame.to_records, and also use Boolean indexing
  • .item is a method for both pandas and numpy, so don't use 'item' as a column name. It has been changed to '_item'.
  • As an FYI, numpy is a pandas dependency, and much of pandas vectorized functionality directly corresponds to numpy.
import pandas as pd
import numpy as np

# test data
df = pd.DataFrame({'_item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})

# Use pandas Boolean indexing to select the matching row
selected = df[(df._item == 'book') & (df.color == 'blue')]

# print(selected)
  _item color    val
1  book  blue -109.6

# Alternatively, create a recarray
v = df.to_records(index=False)

# display(v)
rec.array([('book', 'green', -22.7 ), ('book', 'blue', -109.6 ),
('car', 'red', -57.19), ('car', 'green', -11.2 ),
('bike', 'blue', -25.6 ), ('bike', 'red', -33.61)],
dtype=[('_item', 'O'), ('color', 'O'), ('val', '<f8')])

# search the recarray
selected = v[(v._item == 'book') & (v.color == 'blue')]

# print(selected)
[('book', 'blue', -109.6)]

Update in response to OP edit

  • You must first reshape the dataframe using pandas.DataFrame.pivot, and then use the previously mentioned methods.
dfp = df.pivot(index='_item', columns='color', values='val')

# display(dfp)
color   blue green    red
_item
bike   -25.6   NaN -33.61
book  -109.6 -22.7    NaN
car      NaN -11.2 -57.19

# create a numpy recarray
v = dfp.to_records(index=True)

# display(v)
rec.array([('bike', -25.6, nan, -33.61),
('book', -109.6, -22.7, nan),
('car', nan, -11.2, -57.19)],
dtype=[('_item', 'O'), ('blue', '<f8'), ('green', '<f8'), ('red', '<f8')])

# select data
selected = v.blue[(v._item == 'book')]

# print(selected)
array([-109.6])
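If named fields are not needed, a plain 2D array combined with label-to-position lookups is another option. This is a sketch of that alternative, not part of the approach above; it rebuilds the same test data so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same test data as above.
df = pd.DataFrame({
    "_item": ["book", "book", "car", "car", "bike", "bike"],
    "color": ["green", "blue", "red", "green", "blue", "red"],
    "val": [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61],
})
dfp = df.pivot(index="_item", columns="color", values="val")

# Plain float array; the labels live in dfp.index and dfp.columns.
values = dfp.to_numpy()

# Translate labels into integer positions, then index the array directly.
row = dfp.index.get_loc("book")
col = dfp.columns.get_loc("blue")
print(values[row, col])  # -109.6
```

Missing item/color combinations stay as NaN in the array, so check with np.isnan before using a looked-up value.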

Trying to convert pandas df to np array, dtaidistance computes list instead

(one of the dtaidistance authors here)

The dtaidistance package expects one of three formats:

  • A 2D numpy array (where all sequences have the same length by definition)
  • A Python list of 1D numpy.array or array.array.
  • A Python list of Python lists

In your case you could do:

from dtaidistance import dtw
series = move_df['movement'].to_list()
dtw.distance_matrix(series)

which then works on a list of lists.

To use the fast C implementation, an array is required (either a NumPy array or a standard-library array.array). If you want to keep the different lengths, you can do:

import numpy as np
series = move_df['movement'].apply(lambda a: np.array(a, dtype=np.double)).to_list()
dtw.distance_matrix_fast(series)

Note that it might make sense to apply the conversion in place on your move_df data structure, so that you only have to do it once and do not need to keep track of two nearly identical data structures. After that, the to_list call is sufficient:

move_df['movement'] = move_df['movement'].apply(lambda a: np.array(a, dtype=np.double))
series = move_df['movement'].to_list()
dtw.distance_matrix_fast(series)

If you want to use a 2D NumPy array, you would need to truncate or pad all series to the same length, as explained in other answers (for DTW, padding is more common because it does not lose information).
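As a rough sketch of the padding step, the helper below turns a ragged list into a 2D array using NumPy only. The function name pad_to_matrix and the default pad value of 0.0 are assumptions made for illustration; which pad value is appropriate depends on your data.

```python
import numpy as np

def pad_to_matrix(series, pad_value=0.0):
    """Right-pad variable-length sequences into a 2D double array.

    Hypothetical helper; pad_value=0.0 is an arbitrary illustrative default.
    """
    longest = max(len(s) for s in series)
    out = np.full((len(series), longest), pad_value, dtype=np.double)
    for i, s in enumerate(series):
        out[i, :len(s)] = s  # copy the sequence, leave the tail as padding
    return out

ragged = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
matrix = pad_to_matrix(ragged)
print(matrix.shape)  # (3, 3)
```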

P.S. This assumes you want univariate DTW; the ndim subpackage for multivariate time series expects a different data structure.


