Creating a Pandas DataFrame from a NumPy Array: How to Specify the Index Column and Column Headers

Creating a Pandas DataFrame from a Numpy array: How do I specify the index column and column headers?

You need to pass data, index, and columns to the DataFrame constructor, as in:

>>> pd.DataFrame(data=data[1:,1:],    # values
... index=data[1:,0], # 1st column as index
... columns=data[0,1:]) # 1st row as the column names

Edit: as @joris notes in the comments, you may need to change the data argument above to np.int_(data[1:,1:]) so that the values get the correct data type.
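
To make the slicing concrete, here is a minimal sketch. The array contents and the labels Row1, Col1, and so on are made up for illustration, and astype(int) is used in place of np.int_ to restore the numeric type; both are assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical input: header labels in row 0, index labels in column 0.
# Mixing strings and numbers makes NumPy store every entry as a string.
data = np.array([["", "Col1", "Col2"],
                 ["Row1", 1, 2],
                 ["Row2", 3, 4]])

df = pd.DataFrame(data=data[1:, 1:].astype(int),  # values, converted back to integers
                  index=data[1:, 0],              # first column as the index
                  columns=data[0, 1:])            # first row as the column names
print(df)
```

Without the conversion, the values would stay strings, which is why the data type matters here.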

How to make an index column in NumPy array?

If I understood your question correctly, you want to add a column to your (presumably) 1D array.

import numpy as np
import pandas as pd

array = np.random.randint(0, 100, size=100)  # random NumPy array (1D)
index = np.arange(array.shape[0])  # index array with one entry per element
array_with_indices = np.c_[array, index]  # values and indices side by side as two columns
array_with_indices[:, 1] // 10 + 1  # take the second column, as it contains the indices
# or convert it to a DataFrame if you prefer
df = pd.DataFrame(array, index=index)
# the same arithmetic then works on the DataFrame index
df.index // 10 + 1

You can then insert the result into df1.

Creating dataframe with multi level column index from four 2d numpy arrays

One option is to reshape the data in Fortran order, before creating the dataframe:


# reusing your code
level_1_label = ['location1','location2','location3']
level_2_label = ['x1','x2','x3','x4']
header = pd.MultiIndex.from_product([level_1_label, level_2_label], names=['Location','Variable'])

# for 2D inputs, np.vstack is just a convenience wrapper around np.concatenate, axis=0
outcome = np.reshape(np.vstack([x1,x2,x3,x4]), (len(x1), -1), order = 'F')
df = pd.DataFrame(outcome, columns = header)
df.index.name = 'Time'

df

Location location1          location2          location3
Variable        x1 x2 x3 x4        x1 x2 x3 x4        x1 x2 x3 x4
Time
0                2  1  4  3         4  2  3  1         1  2  2  1
1                2  4  4  3         2  1  3  4         1  4  2  3
2                1  1  4  2         3  4  3  2         3  4  3  1
3                2  3  1  2         2  3  2  1         1  2  2  1
4                3  2  1  1         3  2  4  2         2  4  3  4
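
To see why the Fortran-order reshape lines the values up with the header, here is a small self-contained check. The input arrays are hypothetical: each xK is assumed to have shape (time, location), and the values are chosen only so that the source array is easy to recognise.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: one (time x location) array per variable, offset by
# 100, 200, 300, 400 so that every value reveals which array it came from.
n_times, n_locations = 5, 3
base = np.arange(n_times * n_locations).reshape(n_times, n_locations)
x1, x2, x3, x4 = (base + 100 * k for k in range(1, 5))

header = pd.MultiIndex.from_product(
    [["location1", "location2", "location3"], ["x1", "x2", "x3", "x4"]],
    names=["Location", "Variable"],
)

# Stack the arrays row-wise, then read and fill column by column (Fortran order).
stacked = np.vstack([x1, x2, x3, x4])                    # shape (20, 3)
outcome = np.reshape(stacked, (n_times, -1), order="F")  # shape (5, 12)
df = pd.DataFrame(outcome, columns=header)
df.index.name = "Time"

# Column (locationJ, xK) should hold column J of array xK.
print((df[("location2", "x3")].to_numpy() == x3[:, 1]).all())
```

Each column of the stacked array holds one location for all four variables in turn, so reading it in blocks of len(x1) yields the x1, x2, x3, x4 columns for that location.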

Building a DataFrame with column names in Python

You're passing DataFrame a list that contains a single element (the array) instead of the array itself. You should be passing the actual array. Therefore, remove the brackets surrounding dat:

In [9]: dat = pd.DataFrame(dat, columns = ["Var %d" % (i + 1) for i in range(10)])

In [10]: dat
Out[10]:
Var 1 Var 2 Var 3 Var 4 \
0 0.388888888889 0.388888888889 0.388888888889 0.436943311457
1 0.388888888889 0.388888888889 0.222222222222 0.445720017848
2 0.277777777778 0.277777777778 0.0555555555556 0.442623129181
3 0.111111111111 0.111111111111 0.166666666667 0.465180784545
4 0.5 0.5 0.333333333333 0.445720017848
5 0.388888888889 0.388888888889 0.222222222222 0.449433221856
6 0.388888888889 0.388888888889 0.333333333333 0.442491458743
7 0.333333333333 0.0555555555556 0.777777777778 0.438941511384
8 0.444444444444 0.444444444444 0.444444444444 0.427707051887
9 0.222222222222 0.277777777778 0.5 0.431823227653

Var 5 Var 6 Var 7 Var 8 \
0 0.790590003119 0.502046809222 0.838971773428 0.76049230908
1 0.811477946525 0.506899600792 0.836856648557 0.760617288779
2 0.788341322621 0.503717213312 0.837036254923 0.759975270403
3 0.798337900365 0.525060453789 0.846387521536 0.753358230843
4 0.787804059391 0.506899600792 0.836856648557 0.760501605832
5 0.784362288852 0.505575764415 0.83512539411 0.760417126777
6 0.787743031271 0.502995011027 0.836692391333 0.760611529526
7 0.787804059391 0.506899600792 0.836856648557 0.760501605832
8 0.79760395106 0.505723065708 0.836856648557 0.760501605832
9 0.797173287335 0.507239045809 0.845413649425 0.761341659888

Var 9 Var 10
0 0.820605442278 0
1 0.819548947891 1
2 0.81842187229 2
3 0.824154832595 3
4 0.819548947891 4
5 0.818544294533 5
6 0.819815007518 6
7 0.819548947891 7
8 0.819548947891 8
9 0.823903785101 9

Don't mind the list comprehension for the columns field. I just didn't want to type out all of those Vars :).
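
As a rough illustration of the difference, the sketch below uses a random 10 x 10 array as a stand-in for dat (the real data is not shown in the question, so this is an assumption) and compares the shape of the array with the shape of the array wrapped in a list.

```python
import numpy as np
import pandas as pd

# Hypothetical 10 x 10 array standing in for dat.
dat = np.random.rand(10, 10)
col_names = ["Var %d" % (i + 1) for i in range(dat.shape[1])]

# Wrapping the array in a list adds an extra leading dimension.
print(np.array([dat]).shape)  # (1, 10, 10)

# Passing the array itself gives one DataFrame column per array column.
df = pd.DataFrame(dat, columns=col_names)
print(df.shape)  # (10, 10)
```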

Pandas DataFrame from Numpy Array - column order

If your data is already in a dataframe, it's much easier to just pass the values of the Pitch column to savgol_filter:

data_arr_smooth = signal.savgol_filter(data.Pitch.values, window_length, polyorder)
data_fr = pd.DataFrame({'time': data.time.values,'angle': data_arr_smooth})

There's no need to explicitly convert your data to float. As long as the values are numeric, savgol_filter will do this for you:

If x is not a single or double precision floating point array, it
will be converted to type numpy.float64 before filtering.

If you want both the original and the smoothed data in your original dataframe, just assign a new column to it:

data['angle'] = signal.savgol_filter(data.Pitch.values, window_length, polyorder)
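
For a runnable version of the column assignment, here is a minimal sketch. The sample values and the choices window_length=5 and polyorder=2 are assumptions made for illustration; the question's actual data and filter settings are not shown here.

```python
import numpy as np
import pandas as pd
from scipy import signal

# Hypothetical data: a time column and an integer-valued Pitch column.
data = pd.DataFrame({"time": np.arange(11),
                     "Pitch": [0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5]})
window_length, polyorder = 5, 2  # assumed settings; polyorder must be less than window_length

# The integer input is converted to float64 by savgol_filter itself.
data["angle"] = signal.savgol_filter(data.Pitch.values, window_length, polyorder)
print(data.dtypes)
```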

convert numpy array into dataframe

My favorite way to transform numpy arrays to pandas DataFrames is to pass the columns in a dictionary:

df = pd.DataFrame({'col1':nparray[0], 'col2':nparray[1]})

However, if you have many columns, you can try:

# Create list of column names with the format "colN" (from 1 to N)
col_names = ['col' + str(i) for i in np.arange(nparray.shape[0]) + 1]
# Declare pandas.DataFrame object
df = pd.DataFrame(data=nparray.T, columns=col_names)

In the second solution, you have to restructure your array before passing it as data. That is, you have to transpose nparray so that each of its rows becomes a column of the DataFrame. NumPy has an attribute for that: you simply add .T to your array, as in nparray.T.
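
A small sketch of the second approach, assuming a made-up 2 x 4 array in which each row holds the values of one future column:

```python
import numpy as np
import pandas as pd

# Hypothetical array: each row holds the values of one future column.
nparray = np.array([[1, 2, 3, 4],
                    [10, 20, 30, 40]])

# One name per row of nparray: "col1", "col2", ...
col_names = ["col" + str(i) for i in range(1, nparray.shape[0] + 1)]

# Transposing turns the two rows into two columns of length four.
df = pd.DataFrame(data=nparray.T, columns=col_names)
print(df)
```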

Create Pandas dataframe from numpy array and use first column of the array as index

You passed the complete array as the data param. If you want just 4 columns from the array as the data, you also need to slice your array:

In [158]:
df = pd.DataFrame(a[:,1:], index=a[:,0], columns=['A', 'B','C','D'])
df

Out[158]:
A B C D
1 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2

Also, having duplicate values in the index will make filtering and indexing problematic.

Here, a[:,1:] takes all the rows but only the columns from column 1 onwards, as desired.
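
The sketch below rebuilds the example with an array whose values are taken from the output above. Converting the index labels with astype(int) is an extra step assumed here for readability, and the final lookup shows what duplicate index labels do to selection.

```python
import numpy as np
import pandas as pd

# Array reconstructed from the output above: column 0 holds the index labels.
a = np.array([[1, 5.1, 3.5, 1.4, 0.2],
              [1, 4.9, 3.0, 1.4, 0.2],
              [2, 4.7, 3.2, 1.3, 0.2],
              [2, 4.6, 3.1, 1.5, 0.2]])

df = pd.DataFrame(a[:, 1:],                    # all rows, columns 1 onwards
                  index=a[:, 0].astype(int),   # first column as the index
                  columns=["A", "B", "C", "D"])

# With duplicate labels, .loc returns every matching row rather than one row.
print(df.loc[1])
```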

Creating a pandas dataframe from a 2d numpy array (to be a column of 1d numpy arrays) and a 1d np array of labels

You should always try to normalize your data so that each column contains only scalar values, not values that have a dimension of their own.

In this case, I would do something like this:

>>> df = pd.DataFrame({'x': points[:,0], 'y': points[:, 1], 'label': labels},
...                   columns=['x', 'y', 'label'])
>>> df
     x    y  label
0    1    2      0
1    2    1      1
2  100  100      1
3   -2   -1      1
4    0    0      0
5   -1   -2      0

If you truly insist on keeping points as they are, convert them to a list of lists or a list of tuples before passing them to pandas to avoid this error.
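
A minimal sketch of the list-of-tuples variant, with points and labels reconstructed from the table above (the column name 'points' is an assumption):

```python
import numpy as np
import pandas as pd

# Inputs reconstructed from the table above.
points = np.array([[1, 2], [2, 1], [100, 100], [-2, -1], [0, 0], [-1, -2]])
labels = np.array([0, 1, 1, 1, 0, 0])

# One tuple per row, so each point is stored as a single object in the column.
df = pd.DataFrame({"points": [tuple(p) for p in points.tolist()],
                   "label": labels})
print(df)
```

The resulting points column has object dtype, so vectorized numeric operations on it are no longer available, which is the trade-off against the normalized x/y layout.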

How to convert a pandas dataframe into a numpy array with the column names

  • To do a quick search for a val by its "item" and "color", use one of the following options:
    1. Use pandas Boolean indexing
    2. Convert the dataframe into a numpy.recarray using pandas.DataFrame.to_records, and also use Boolean indexing
  • .item is a method for both pandas and numpy, so don't use 'item' as a column name. It has been changed to '_item'.
  • As an FYI, numpy is a pandas dependency, and much of pandas vectorized functionality directly corresponds to numpy.
import pandas as pd
import numpy as np

# test data
df = pd.DataFrame({'_item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})

# Use pandas Boolean indexing to select the matching row
selected = df[(df._item == 'book') & (df.color == 'blue')]

# print(selected)
_item color val
book blue -109.6

# Alternatively, create a recarray
v = df.to_records(index=False)

# display(v)
rec.array([('book', 'green', -22.7 ), ('book', 'blue', -109.6 ),
('car', 'red', -57.19), ('car', 'green', -11.2 ),
('bike', 'blue', -25.6 ), ('bike', 'red', -33.61)],
dtype=[('_item', 'O'), ('color', 'O'), ('val', '<f8')])

# search the recarray
selected = v[(v._item == 'book') & (v.color == 'blue')]

# print(selected)
[('book', 'blue', -109.6)]

Update in response to OP edit

  • You must first reshape the dataframe using pandas.DataFrame.pivot, and then use the previously mentioned methods.
dfp = df.pivot(index='_item', columns='color', values='val')

# display(dfp)
color blue green red
_item
bike -25.6 NaN -33.61
book -109.6 -22.7 NaN
car NaN -11.2 -57.19

# create a numpy recarray
v = dfp.to_records(index=True)

# display(v)
rec.array([('bike', -25.6, nan, -33.61),
('book', -109.6, -22.7, nan),
('car', nan, -11.2, -57.19)],
dtype=[('_item', 'O'), ('blue', '<f8'), ('green', '<f8'), ('red', '<f8')])

# select data
selected = v.blue[(v._item == 'book')]

# print(selected)
array([-109.6])

