Ambiguity in Pandas Dataframe/Numpy Array "Axis" Definition

It's perhaps simplest to remember it as 0=down and 1=across.

This means:

  • Use axis=0 to apply a method down each column, or to the row labels (the index).
  • Use axis=1 to apply a method across each row, or to the column labels.
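The rule can be checked on a small made-up DataFrame. This is only a sketch: the column names and values are invented for illustration and are not taken from the question.

```python
import pandas as pd

# Hypothetical data, chosen so the results are easy to check by hand.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: go down each column, giving one result per column.
col_sums = df.sum(axis=0)

# axis=1: go across each row, giving one result per row.
row_sums = df.sum(axis=1)

print(col_sums.tolist())  # [6, 60]
print(row_sums.tolist())  # [11, 22, 33]
```

Note that the result of axis=0 is labelled by the column names, while the result of axis=1 is labelled by the index.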

Here's a picture to show the parts of a DataFrame that each axis refers to:

It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:

Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
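One way to see the glossary definition in plain NumPy is to watch which dimension disappears from the shape. A minimal sketch with an arbitrary 2x3 array (the values are assumptions for illustration):

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])  # shape (2, 3): 2 rows, 3 columns

down = arr.sum(axis=0)    # collapses axis 0 (the rows): one value per column
across = arr.sum(axis=1)  # collapses axis 1 (the columns): one value per row

print(down.shape, down.tolist())      # (3,) [5, 7, 9]
print(across.shape, across.tolist())  # (2,) [6, 15]
```

The axis you pass is the one that is reduced away, which is why axis=0 leaves one value per column.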

So the method in the question, df.mean(axis=1), seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0) would be an operation acting vertically downwards across rows.

Similarly, df.drop(name, axis=1) refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0 would make the method act on rows instead.
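For label-based methods such as drop, the axis says which set of labels the name is looked up in. A small sketch, with hypothetical labels "a", "b", "x" and "y":

```python
import pandas as pd

# Hypothetical frame with string row labels so rows and columns are easy to tell apart.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

no_b = df.drop("b", axis=1)  # "b" is looked up among the column labels
no_x = df.drop("x", axis=0)  # "x" is looked up among the row labels (the index)

print(list(no_b.columns))  # ['a']
print(list(no_x.index))    # ['y']
```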

Why is the axes for the .mean() method in pandas the opposite in this scenario?

You just need to tell mean to work across the columns, with axis=1:

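import pandas as pd

df = pd.DataFrame({"height_1":[1.78,1.7,1.74,1.66],"height_2":[1.8,1.7,1.75,1.68],"height_3":[1.8,1.69,1.73,1.67]})
# Row-wise mean over every column in the frame
df = df.assign(height_mean=df.mean(axis=1))
# Same result, naming the source columns explicitly so height_mean itself is excluded
df = df.assign(height_mean=df.loc[:,['height_1','height_2','height_3']].mean(axis=1))
print(df.to_string(index=False))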

output

 height_1  height_2  height_3  height_mean
     1.78      1.80      1.80     1.793333
     1.70      1.70      1.69     1.696667
     1.74      1.75      1.73     1.740000
     1.66      1.68      1.67     1.670000

numpy maximum reduce error for pandas series and int

See the docs for ufunc.reduce

ufunc.reduce(array, axis=0, dtype=None, out=None, keepdims=False, initial=<no value>, where=True)

Reduces array’s dimension by one, by applying ufunc along one axis.

[df['a'], 2] is not an array with a well-defined 0th axis, so it is not clear how reduce could work on it. The other operations are plain element-wise max operations, which act on their arguments after broadcasting them against each other, but a NumPy ufunc reduction operates on a single array.
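A minimal sketch of the difference, assuming a small Series s standing in for df['a'] (the values are made up). The element-wise call broadcasts the scalar for you. To use reduce, one option is to broadcast the operands yourself and stack them into a single array first:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 5, 3])  # hypothetical stand-in for df['a']

# Element-wise maximum: the scalar 2 is broadcast against the Series.
elementwise = np.maximum(s, 2)

# reduce needs one array, so broadcast the operands to a common shape
# and stack them; axis 0 then runs over the stacked operands.
stacked = np.stack(np.broadcast_arrays(s.to_numpy(), 2))
reduced = np.maximum.reduce(stacked, axis=0)

print(stacked.shape)         # (2, 3)
print(elementwise.tolist())  # [2, 5, 3]
print(reduced.tolist())      # [2, 5, 3]
```

For just two operands the plain np.maximum(s, 2) call is simpler; stacking is mainly useful when there are many operands to fold together.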

Filtering byte stream efficiently before converting to numpy array / pandas dataframe

You can specify an offset for each field during dtype construction:

struct_dtypes = np.dtype({'names': ['n1', 'n2'], 'formats': ['d', 'd'], 'offsets': [0, 16]})

or

struct_dtypes = np.dtype({'n1': ('d', 0), 'n2': ('d', 16)})
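As a sketch of how such a dtype behaves when reading raw bytes, the example below builds a hypothetical buffer of 24-byte records (three little-endian doubles each, an assumed layout) and reads only the first and third double of every record with np.frombuffer:

```python
import struct
import numpy as np

# Two hypothetical 24-byte records, each holding three little-endian doubles.
buf = struct.pack("<3d", 1.0, 2.0, 3.0) + struct.pack("<3d", 4.0, 5.0, 6.0)

# Keep the doubles at byte offsets 0 and 16; bytes 8..15 are skipped.
dt = np.dtype({"names": ["n1", "n2"], "formats": ["<f8", "<f8"], "offsets": [0, 16]})

records = np.frombuffer(buf, dtype=dt)
print(dt.itemsize)             # 24
print(records["n1"].tolist())  # [1.0, 4.0]
print(records["n2"].tolist())  # [3.0, 6.0]
```

Because the last field ends at byte 24, the itemsize already matches the record length and no extra setting is needed here.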

Update (see comments below):

If you don't read the last element in the record, you need to specify the itemsize:

struct_dtypes = np.dtype({'names': ['n1', 'ch'],
                          'formats': ['d', '8V'],
                          'offsets': [0, 8],
                          'itemsize': 24})
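To illustrate why the itemsize matters, here is a sketch that reads only the first double of each record from a hypothetical buffer of 24-byte records (the same assumed layout of three little-endian doubles). Without an explicit itemsize the dtype is only 8 bytes wide, so NumPy would step through the buffer 8 bytes at a time instead of 24:

```python
import struct
import numpy as np

# Two hypothetical 24-byte records, each holding three little-endian doubles.
buf = struct.pack("<3d", 1.0, 2.0, 3.0) + struct.pack("<3d", 4.0, 5.0, 6.0)

# Only the first field is read, so the record length must be stated explicitly.
padded = np.dtype({"names": ["n1"], "formats": ["<f8"], "offsets": [0], "itemsize": 24})
unpadded = np.dtype({"names": ["n1"], "formats": ["<f8"], "offsets": [0]})

print(padded.itemsize, unpadded.itemsize)  # 24 8

records = np.frombuffer(buf, dtype=padded)
print(records["n1"].tolist())  # [1.0, 4.0]

# With the 8-byte dtype every double is treated as its own record.
wrong = np.frombuffer(buf, dtype=unpadded)
print(wrong["n1"].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```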

