How to Operate on a Dataframe with a Series for Every Column

How do I operate on a DataFrame with a Series for every column?

It is helpful to create a mental model of what Series and DataFrame objects are.

Anatomy of a Series

A Series should be thought of as an enhanced dictionary. The analogy isn't perfect, but it is a good starting point, and it is the one best suited to the purpose of this post.

index

These are the keys that we can reference to get at the corresponding values. When the elements of the index are unique, the comparison to a dictionary becomes very close.

values

These are the corresponding values that are keyed by the index.
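A quick sketch of the analogy (this Series is illustrative and separate from the sample data defined below):

import pandas as pd

s = pd.Series([10, 11, 12], index=['a', 'b', 'c'])

s['b']     # 11, a dictionary-style lookup by index label
s.index    # Index(['a', 'b', 'c'], dtype='object'), the "keys"
s.values   # array([10, 11, 12]), the corresponding values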

Anatomy of a DataFrame

A DataFrame should be thought of as a dictionary of Series, or a Series of Series. In this case, the keys are the column names and the values are the columns themselves as Series objects. Each Series agrees to share the same index, which is the index of the DataFrame.

columns

These are the keys that we can reference to get at the corresponding Series.

index

This is the index that all of the Series values agree to share.

Note on columns and index objects

They are the same kind of object. One DataFrame's index can be used as another DataFrame's columns. In fact, this is exactly what happens when you take a transpose with df.T.

values

This is a two-dimensional array that contains the data in a DataFrame. The reality is that values is not what is stored inside the DataFrame object. (Well, sometimes it is, but I'm not about to try to describe the block manager). The point is, it is better to think of this as access to a two-dimensional array of the data.
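Again as an illustrative sketch (this df is hypothetical, separate from the sample data below):

import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r0', 'r1'])

df['x']       # a Series keyed by the column name; it shares df.index
df.columns    # Index(['x', 'y'], dtype='object')
df.index      # Index(['r0', 'r1'], dtype='object')
df.values     # array([[1, 3], [2, 4]]), two-dimensional access to the data
df.T.columns  # Index(['r0', 'r1'], dtype='object'), the index reused as columns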



Define Sample Data

These are sample pandas.Index objects that can be used as the index of a Series or DataFrame or can be used as the columns of a DataFrame:

import numpy as np
import pandas as pd

idx_lower = pd.Index([*'abcde'], name='lower')
idx_range = pd.RangeIndex(5, name='range')

These are sample pandas.Series objects that use the pandas.Index objects above:

s0 = pd.Series(range(10, 15), idx_lower)
s1 = pd.Series(range(30, 40, 2), idx_lower)
s2 = pd.Series(range(50, 10, -8), idx_range)

These are sample pandas.DataFrame objects that use the pandas.Index objects above:

df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower)
df1 = pd.DataFrame(
    np.arange(np.prod(df0.shape)).reshape(df0.shape),
    index=idx_range, columns=idx_lower
)
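
For reference, df0 is a 5x5 grid of 100s, and df1 counts up from 0:

df1

lower   a   b   c   d   e
range
0       0   1   2   3   4
1       5   6   7   8   9
2      10  11  12  13  14
3      15  16  17  18  19
4      20  21  22  23  24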


Series on Series

When operating on two Series, the alignment is obvious. You align the index of one Series with the index of the other.

s1 + s0

lower
a 40
b 43
c 46
d 49
e 52
dtype: int64

The result is the same even if I randomly shuffle one of the operands before I operate, because the indices still align.

s1 + s0.sample(frac=1)

lower
a 40
b 43
c 46
d 49
e 52
dtype: int64

That is not the case when I instead operate with the values of the shuffled Series. Pandas no longer has an index to align with, so it operates by position.

s1 + s0.sample(frac=1).values

lower
a 42
b 42
c 47
d 50
e 49
dtype: int64

Add a scalar

s1 + 1

lower
a 31
b 33
c 35
d 37
e 39
dtype: int64


DataFrame on DataFrame

The same is true when operating on two DataFrames. The alignment is obvious and does what we think it should do:

df0 + df1

lower    a    b    c    d    e
range
0      100  101  102  103  104
1      105  106  107  108  109
2      110  111  112  113  114
3      115  116  117  118  119
4      120  121  122  123  124

Here I shuffle the second DataFrame on both axes. The index and columns still align and give us the same result.

df0 + df1.sample(frac=1).sample(frac=1, axis=1)

lower    a    b    c    d    e
range
0      100  101  102  103  104
1      105  106  107  108  109
2      110  111  112  113  114
3      115  116  117  118  119
4      120  121  122  123  124

This is the same shuffling, but this time I add the underlying array rather than the DataFrame. The operands are no longer aligned, so we get different results.

df0 + df1.sample(frac=1).sample(frac=1, axis=1).values

lower    a    b    c    d    e
range
0      123  124  121  122  120
1      118  119  116  117  115
2      108  109  106  107  105
3      103  104  101  102  100
4      113  114  111  112  110

Add a one-dimensional array. It will align with columns and broadcast across rows.

df0 + [*range(2, df0.shape[1] + 2)]

lower    a    b    c    d    e
range
0      102  103  104  105  106
1      102  103  104  105  106
2      102  103  104  105  106
3      102  103  104  105  106
4      102  103  104  105  106

Add a scalar. There isn't anything to align with, so it broadcasts to everything:

df0 + 1

lower    a    b    c    d    e
range
0      101  101  101  101  101
1      101  101  101  101  101
2      101  101  101  101  101
3      101  101  101  101  101
4      101  101  101  101  101


DataFrame on Series

If DataFrames are to be thought of as dictionaries of Series, and Series are to be thought of as dictionaries of values, then it is natural that, when operating between a DataFrame and a Series, they should be aligned by their "keys".

s0:

lower    a    b    c    d    e
        10   11   12   13   14

df0:

lower    a    b    c    d    e
range
0      100  100  100  100  100
1      100  100  100  100  100
2      100  100  100  100  100
3      100  100  100  100  100
4      100  100  100  100  100

And when we operate, the 10 in s0['a'] gets added to the entire column of df0['a']:

df0 + s0

lower    a    b    c    d    e
range
0      110  111  112  113  114
1      110  111  112  113  114
2      110  111  112  113  114
3      110  111  112  113  114
4      110  111  112  113  114

The heart of the issue and point of the post

What if I want to operate with s2 and df0?

s2:        |  df0:
           |
range      |  lower    a    b    c    d    e
0    50    |  range
1    42    |  0      100  100  100  100  100
2    34    |  1      100  100  100  100  100
3    26    |  2      100  100  100  100  100
4    18    |  3      100  100  100  100  100
           |  4      100  100  100  100  100

When I operate, I get all np.nan, as cited in the question:

df0 + s2

        a   b   c   d   e   0   1   2   3   4
range
0     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4     NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

This does not produce what we wanted, because Pandas is aligning the index of s2 with the columns of df0. The columns of the result are the union of the index of s2 and the columns of df0.
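
You can see the union directly (using df0 and s2 as defined above; the exact ordering may vary by pandas version):

# no label appears in both df0's columns and s2's index,
# so every aligned pair has a missing side and the result is all NaN
print((df0 + s2).columns)
# Index(['a', 'b', 'c', 'd', 'e', 0, 1, 2, 3, 4], dtype='object')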

We could fake it out with a tricky transposition:

(df0.T + s2).T

lower    a    b    c    d    e
range
0      150  150  150  150  150
1      142  142  142  142  142
2      134  134  134  134  134
3      126  126  126  126  126
4      118  118  118  118  118

But it turns out Pandas has a better solution. There are operation methods that allow us to pass an axis argument to specify the axis to align with.

+   add
-   sub
*   mul
/   div
**  pow

And so the answer is simply:

df0.add(s2, axis='index')

lower    a    b    c    d    e
range
0      150  150  150  150  150
1      142  142  142  142  142
2      134  134  134  134  134
3      126  126  126  126  126
4      118  118  118  118  118

It turns out that axis='index' is synonymous with axis=0, just as axis='columns' is synonymous with axis=1:

df0.add(s2, axis=0)

lower    a    b    c    d    e
range
0      150  150  150  150  150
1      142  142  142  142  142
2      134  134  134  134  134
3      126  126  126  126  126
4      118  118  118  118  118


The rest of the operations

df0.sub(s2, axis=0)

lower   a   b   c   d   e
range
0      50  50  50  50  50
1      58  58  58  58  58
2      66  66  66  66  66
3      74  74  74  74  74
4      82  82  82  82  82


df0.mul(s2, axis=0)

lower     a     b     c     d     e
range
0      5000  5000  5000  5000  5000
1      4200  4200  4200  4200  4200
2      3400  3400  3400  3400  3400
3      2600  2600  2600  2600  2600
4      1800  1800  1800  1800  1800


df0.div(s2, axis=0)

lower         a         b         c         d         e
range
0      2.000000  2.000000  2.000000  2.000000  2.000000
1      2.380952  2.380952  2.380952  2.380952  2.380952
2      2.941176  2.941176  2.941176  2.941176  2.941176
3      3.846154  3.846154  3.846154  3.846154  3.846154
4      5.555556  5.555556  5.555556  5.555556  5.555556


df0.pow(1 / s2, axis=0)

lower         a         b         c         d         e
range
0      1.096478  1.096478  1.096478  1.096478  1.096478
1      1.115884  1.115884  1.115884  1.115884  1.115884
2      1.145048  1.145048  1.145048  1.145048  1.145048
3      1.193777  1.193777  1.193777  1.193777  1.193777
4      1.291550  1.291550  1.291550  1.291550  1.291550
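
These methods also have reflected variants (radd, rsub, rmul, rdiv, rpow) that swap the operand order while still honoring the axis argument:

df0.rsub(s2, axis=0)  # computes s2 - df0, aligned along the index

lower   a   b   c   d   e
range
0     -50 -50 -50 -50 -50
1     -58 -58 -58 -58 -58
2     -66 -66 -66 -66 -66
3     -74 -74 -74 -74 -74
4     -82 -82 -82 -82 -82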

My motivation is to share knowledge and teach, so I addressed these higher-level concepts first in the hope of making all of this as clear as possible.

Assign a Series to ALL columns of the DataFrame (columnwise)?

Use to_frame with reindex:

a = ser.to_frame().reindex(columns=df.columns, method='ffill')
print(a)
   0  1
0  1  1
1  2  2
2  3  3
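
For that snippet to run, ser and df must already exist. A minimal, hypothetical setup reconstructed from the printed output (the exact data is an assumption, not part of the original answer):

import pandas as pd

# hypothetical inputs chosen to reproduce the output shown above
ser = pd.Series([1, 2, 3])
df = pd.DataFrame(0, index=range(3), columns=range(2))

# to_frame makes ser a one-column DataFrame; reindex with method='ffill'
# forward-fills that column across all of df's columns
a = ser.to_frame().reindex(columns=df.columns, method='ffill')
print(a)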

But the solution from the comments seems easier; the columns parameter is added in case you need the same column order as the original with real data:

df = pd.DataFrame({c: ser for c in df.columns}, columns=df.columns)

How do I operate (subtract) a Series with a DataFrame for every column?

A new idea:

df.rsub(s0, axis=0)

   0  1
0  0 -1
1 -1 -2
2 -2 -3
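
The s0 and df in this question are not the sample objects defined earlier. A minimal, hypothetical setup that reproduces the output (an assumption on my part):

import pandas as pd

# hypothetical inputs chosen to reproduce the output shown above
df = pd.DataFrame([[1, 2], [3, 4], [5, 6]])
s0 = pd.Series([1, 2, 3])

# rsub computes s0 - df, aligning s0 with df's index (axis=0)
print(df.rsub(s0, axis=0))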

How to iterate over columns of pandas dataframe to run regression

for column in df:
    print(df[column])
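
The question is about running a regression per column, so here is a minimal sketch assuming an ordinary least-squares line of each column against its row position (the sample DataFrame and the choice of np.polyfit are my assumptions, not from the original answer):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 3), columns=['a', 'b', 'c'])
x = np.arange(len(df))

for column in df:
    # fit y = slope * x + intercept for this column
    slope, intercept = np.polyfit(x, df[column], deg=1)
    print(f"{column}: slope={slope:.4f}, intercept={intercept:.4f}")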

How to show all columns' names on a large pandas dataframe?

You can globally set printing options. I think this should work:

Method 1:

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Method 2:

pd.options.display.max_columns = None
pd.options.display.max_rows = None

This will allow you to see all column names and rows when you are doing .head(). None of the column names will be truncated.


If you just want to see the column names you can do:

print(df.columns.tolist())

Pretty-print an entire Pandas Series / DataFrame

You can also use the option_context, with one or more options:

with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df)

This will automatically return the options to their previous values.

If you are working in a Jupyter notebook, using display(df) instead of print(df) will use Jupyter's rich display logic.

How to set all the values of an existing Pandas DataFrame to zero?

The absolute fastest way, which also preserves dtypes, is the following:

for col in df.columns:
    df[col].values[:] = 0

This directly writes to the underlying numpy array of each column. I doubt any other method will be faster than this, as this allocates no additional storage and doesn't pass through pandas's dtype handling. You can also use np.issubdtype to only zero out numeric columns. This is probably what you want if you have a mixed dtype DataFrame, but of course it's not necessary if your DataFrame is already entirely numeric.

for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0

For small DataFrames, the subtype check is somewhat costly. However, the cost of zeroing a non-numeric column is substantial, so if you're not sure whether your DataFrame is entirely numeric, you should probably include the issubdtype check.


Timing comparisons

Setup

import pandas as pd
import numpy as np

def make_df(n, only_numeric):
    series = [
        pd.Series(range(n), name="int", dtype=int),
        pd.Series(range(n), name="float", dtype=float),
    ]
    if only_numeric:
        series.extend(
            [
                pd.Series(range(n, 2 * n), name="int2", dtype=int),
                pd.Series(range(n, 2 * n), name="float2", dtype=float),
            ]
        )
    else:
        series.extend(
            [
                pd.date_range(start="1970-1-1", freq="T", periods=n, name="dt")
                .to_series()
                .reset_index(drop=True),
                pd.Series(
                    [chr((i % 26) + 65) for i in range(n)],
                    name="string",
                    dtype="object",
                ),
            ]
        )

    return pd.concat(series, axis=1)

>>> make_df(5, True)
   int  float  int2  float2
0    0    0.0     5     5.0
1    1    1.0     6     6.0
2    2    2.0     7     7.0
3    3    3.0     8     8.0
4    4    4.0     9     9.0

>>> make_df(5, False)
   int  float                  dt string
0    0    0.0 1970-01-01 00:00:00      A
1    1    1.0 1970-01-01 00:01:00      B
2    2    2.0 1970-01-01 00:02:00      C
3    3    3.0 1970-01-01 00:03:00      D
4    4    4.0 1970-01-01 00:04:00      E

Small DataFrame

n = 10_000

# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
36.1 µs ± 510 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Numeric df, yes issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
53 µs ± 645 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
113 µs ± 391 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

# Non-numeric df, yes issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.4 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Large DataFrame

n = 10_000_000

# Numeric df, no issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    df[col].values[:] = 0
38.7 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Numeric df, yes issubdtype check
%%timeit df = make_df(n, True)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
39.1 ms ± 556 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Non-numeric df, no issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    df[col].values[:] = 0
99.5 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Non-numeric df, yes issubdtype check
%%timeit df = make_df(n, False)
for col in df.columns:
    if np.issubdtype(df[col].dtype, np.number):
        df[col].values[:] = 0
17.8 ms ± 228 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I’d previously suggested the answer below, but I now consider it harmful — it’s significantly slower than the above answers and is harder to reason about. Its only advantage is being nicer to write.

The cleanest way is to use a bare colon to reference the entire DataFrame:

df[:] = 0

Unfortunately, the dtype situation is a bit fuzzy, because every column in the resulting DataFrame will have the same dtype. If every column of df was originally float, the new dtypes will still be float. But if a single column was int or object, it seems that the new dtypes will all be int.
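
That behavior has shifted across pandas versions, so it is worth probing on yours; a quick, illustrative check:

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3, 4], 'c': ['x', 'y']})
print(df.dtypes)  # float64, int64, object

df[:] = 0
print(df.dtypes)  # compare: see whether dtypes were unified or preserved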


