What Does the Term "Broadcasting" Mean in Pandas Documentation

What does the term broadcasting mean in Pandas documentation?

The term broadcasting comes from NumPy. Simply put, it describes the rules that determine the output when you perform operations between n-dimensional arrays (these could be Panels, DataFrames or Series) or scalar values.

Broadcasting using a scalar value

So the simplest case is just multiplying by a scalar value:

In [4]:
s = pd.Series(np.arange(5))
s

Out[4]:
0 0
1 1
2 2
3 3
4 4
dtype: int32

In [5]:
s * 10

Out[5]:
0 0
1 10
2 20
3 30
4 40
dtype: int32

and we get the same expected results with a dataframe:

In [6]:    
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4)})
df

Out[6]:
a b
0 0.216920 0.652193
1 0.968969 0.033369
2 0.637784 0.856836
3 -2.303556 0.426238

In [7]:
df * 10

Out[7]:
a b
0 2.169204 6.521925
1 9.689690 0.333695
2 6.377839 8.568362
3 -23.035557 4.262381

What is technically happening here is that the scalar value has been broadcast along every dimension of the Series and DataFrame above.
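As a rough illustration of the scalar case, the sketch below stretches the scalar to the shape of the Series explicitly with `np.broadcast_to` and checks that the result matches plain scalar multiplication. The variable names are only illustrative, and this shows the idea rather than how pandas implements it internally.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5))

# Conceptually, the scalar 10 is stretched to the shape of the Series.
stretched = np.broadcast_to(10, s.shape)

via_scalar = s * 10          # scalar broadcasting
via_array = s * stretched    # explicit element-wise multiplication

print(stretched)
print((via_scalar == via_array).all())
```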

Broadcasting using a 1-D array

Say we have a 2-D dataframe of shape 4 x 3 (4 rows x 3 columns). We can perform an operation across every row by using a 1-D Series whose length equals the number of columns (the length of each row):

In [8]:
df = pd.DataFrame({'a':np.random.randn(4), 'b':np.random.randn(4), 'c':np.random.randn(4)})
df

Out[8]:
a b c
0 0.122073 -1.178127 -1.531254
1 0.011346 -0.747583 -1.967079
2 -0.019716 -0.235676 1.419547
3 0.215847 1.112350 0.659432

In [26]:
df.iloc[0]

Out[26]:
a 0.122073
b -1.178127
c -1.531254
Name: 0, dtype: float64

In [27]:
df + df.iloc[0]

Out[27]:
a b c
0 0.244146 -2.356254 -3.062507
1 0.133419 -1.925710 -3.498333
2 0.102357 -1.413803 -0.111707
3 0.337920 -0.065777 -0.871822

The above looks odd at first until you understand what is happening: I took the first row of values and added it to each row of the df. It can be visualised using this picture (sourced from SciPy):

Sample Image

The general rule is this:

In order to broadcast, the size of the trailing axes for both arrays
in an operation must either be the same size or one of them must be
one.
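The rule quoted above can be sketched as a small shape check. The helper name `can_broadcast` is hypothetical (it is not a NumPy function); it simply walks both shapes from the last axis backwards and applies the "equal or one" test.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Return True if the two shapes satisfy the trailing-axes rule."""
    # zip stops at the shorter shape, so missing leading axes are
    # treated as compatible.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

print(can_broadcast((4, 3), (3,)))   # trailing sizes 3 and 3 match
print(can_broadcast((4, 3), (4,)))   # trailing sizes 3 and 4 clash
print(can_broadcast((4, 1), (3,)))   # one of the trailing sizes is 1

# Cross-check against NumPy itself for the last case.
result = np.zeros((4, 1)) + np.zeros(3)
print(result.shape)
```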

If I try to add a 1-D Series that doesn't match in length, say one with 4 elements, NumPy would raise a ValueError. Pandas instead aligns the Series index against the column labels, finds no matches, and gives you a df full of NaN values:

In [30]:
df + pd.Series(np.arange(4))

Out[30]:
a b c 0 1 2 3
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN
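To make the alignment behaviour concrete, here is a minimal sketch with made-up data (a frame of ones rather than the random values above). It contrasts the default `+`, which matches the Series index against the column labels, with `DataFrame.add(..., axis=0)`, which matches it against the row labels instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((4, 3)), columns=list("abc"))
s = pd.Series(np.arange(4))       # index 0..3, same labels as df's rows

by_columns = df + s               # s.index vs df.columns: no overlap
by_rows = df.add(s, axis=0)       # s.index vs df.index: full overlap

print(by_columns.shape)
print(by_columns.isna().all().all())
print(by_rows)
```

With `axis=0`, each row of `df` has the matching element of `s` added to it, so no NaN values appear.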

One of the great things about pandas is that it will try to align using existing column names and row labels. However, this can get in the way when you try to perform fancier broadcasting like this:

Sample Image

In [55]:
df[['a']] + df.iloc[0]

Out[55]:
a b c
0 0.244146 NaN NaN
1 0.133419 NaN NaN
2 0.102357 NaN NaN
3 0.337920 NaN NaN

In the above I use double subscripting to force the shape to be (4,1), but we see a problem when trying to broadcast using the first row: the column alignment only aligns on the first column. To get the form of broadcasting shown in the diagram above, we have to decompose to NumPy arrays, which then become anonymous (unlabelled) data:

In [56]:
df[['a']].values + df.iloc[0].values

Out[56]:
array([[ 0.24414608, -1.05605392, -1.4091805 ],
[ 0.13341899, -1.166781 , -1.51990758],
[ 0.10235701, -1.19784299, -1.55096957],
[ 0.33792013, -0.96227987, -1.31540645]])

It's also possible to broadcast in 3 dimensions. I don't go near that stuff often, but the NumPy, SciPy and pandas books have examples that show how it works.

Generally speaking, the thing to remember is that, aside from scalar values, which are simple, for n-D arrays the lengths of the minor/trailing axes must match or one of them must be 1.

Update

It seems that the above now raises ValueError: Unable to coerce to Series, length must be 1: given 3 in the latest version of pandas (0.20.2 at the time of writing),

so you have to call .values on the df first:

In[42]:
df[['a']].values + df.iloc[0].values

Out[42]:
array([[ 0.244146, -1.056054, -1.409181],
[ 0.133419, -1.166781, -1.519908],
[ 0.102357, -1.197843, -1.55097 ],
[ 0.33792 , -0.96228 , -1.315407]])

To restore this back to the original df we can construct a df from the np array and pass the original columns in the args to the constructor:

In[43]:
pd.DataFrame(df[['a']].values + df.iloc[0].values, columns=df.columns)

Out[43]:
a b c
0 0.244146 -1.056054 -1.409181
1 0.133419 -1.166781 -1.519908
2 0.102357 -1.197843 -1.550970
3 0.337920 -0.962280 -1.315407
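The same round trip can be sketched with deterministic data so the shapes are easy to follow. This assumes a pandas version that provides `to_numpy()` (the documented successor to `.values`), and it also passes the original index so row labels survive; the data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=list("abc"))

col = df[["a"]].to_numpy()     # shape (4, 1): column 'a' as a column vector
row = df.iloc[0].to_numpy()    # shape (3,): the first row

outer_sum = col + row          # NumPy broadcasts (4, 1) + (3,) to (4, 3)

# Re-attach the original labels to the anonymous array.
result = pd.DataFrame(outer_sum, index=df.index, columns=df.columns)
print(result)
```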

Broadcasting Error Pandas

You can just use div and pass axis=0 so that the Series is aligned with the row index and each column is divided by it:

df2 = df[['col1', 'col2', 'col3']].div(df['col4'], axis=0)

Your error occurs because division using / aligns the Series index with the DataFrame's column labels, and in this case there is no direct alignment. See this example:

In [220]:
df = pd.DataFrame(columns=list('abcd'), data = np.random.randn(8,4))
df

Out[220]:
a b c d
0 1.074803 0.173520 0.211027 1.357138
1 1.418757 -1.879024 0.536826 1.006160
2 -0.029716 -1.146178 0.100900 -1.035018
3 0.314665 -0.773723 -1.170653 0.648740
4 -0.179666 1.291836 -0.009614 0.392149
5 0.264599 -0.057409 -1.425638 1.024098
6 -0.106062 1.824375 0.595974 1.167115
7 0.601544 -1.237881 0.106854 -1.276829

In [221]:
df.loc[:,['a', 'b', 'c']]/df['d']

Out[221]:
a b c 0 1 2 3 4 5 6 7
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

This isn't obvious until you understand how broadcasting works.
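A small side-by-side sketch, using made-up deterministic data instead of the random frame above, shows the difference between `/` and `div(..., axis=0)`:

```python
import numpy as np
import pandas as pd

# Rows are [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12].
df = pd.DataFrame(np.arange(1.0, 13.0).reshape(3, 4), columns=list("abcd"))

# '/' matches the index of df['d'] (0, 1, 2) against the columns a, b, c.
misaligned = df[["a", "b", "c"]] / df["d"]

# div with axis=0 matches the index of df['d'] against the row index.
aligned = df[["a", "b", "c"]].div(df["d"], axis=0)

print(misaligned.shape)
print(aligned)
```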

When does Pandas default to broadcasting Series and Dataframes?

What is happening is that pandas is using intrinsic data alignment. Pandas almost always aligns data on indexes, either the row index or the column headers. Here is a quick example:

s1 = pd.Series([1,2,3], index=['a','b','c'])
s2 = pd.Series([2,4,6], index=['a','b','c'])
s1 + s2
# Output as expected:
a 3
b 6
c 9
dtype: int64

Now, let's run a couple other examples with different indexing:

s2 = pd.Series([2,4,6], index=['a','a','c'])
s1 + s2
# Output
a 3.0
a 5.0
b NaN
c 9.0
dtype: float64

A cartesian product happens for duplicated index labels, and a label that exists in only one Series gives NaN, because NaN + value = NaN.

And, no matching indexes:

s2 = pd.Series([2,4,6], index=['e','f','g'])
s1 + s2
#Output
a NaN
b NaN
c NaN
e NaN
f NaN
g NaN
dtype: float64

So, in your first example you are creating a pd.Series and a pd.DataFrame with default range indexes that match, hence the comparison happens as expected. In your second example, you are comparing the column headers ['cell2','cell3','cell4','cell5'] with the default range index, which returns all 15 columns. With no matching labels, all values are False, because a comparison with NaN returns False.
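If label alignment is not what you want, there are a couple of ways around it. The sketch below reuses the Series from the examples above and shows two options: `Series.add` with `fill_value`, which treats a missing label as 0, and dropping to a NumPy array, which makes the operation purely positional.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([2, 4, 6], index=["e", "f", "g"])

aligned = s1 + s2                    # no shared labels, so every value is NaN
filled = s1.add(s2, fill_value=0)    # a missing side is treated as 0
positional = s1 + s2.to_numpy()      # ignores s2's labels, keeps s1's index

print(aligned.isna().all())
print(filled)
print(positional)
```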

Inconsistent results when adding a new column in Pandas DataFrame. Is it a Series or a Value?

It's because the * operator is implemented for a Series (as the mul operator), whilst upper isn't defined for a Series. You have to use str.upper, which is implemented for a Series where the dtype is str:

In[53]:
df['new_text'] = df['new_col'].str.upper()
df

Out[53]:
A new_col new_text
1 5 text TEXT
2 6 text TEXT
3 7 text TEXT

There is no magic here.

For df['new_col'] this is just assigning a scalar value and conforming to broadcasting rules, where the scalar is broadcast to the length of the df along the minor axis. For an explanation of that, see: What does the term "broadcasting" mean in Pandas documentation?
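Both steps can be reproduced in a few lines. The frame below is an assumption reconstructed from the output shown above (a column A with values 5, 6, 7 and index 1 to 3); the original question's data is not available here.

```python
import pandas as pd

df = pd.DataFrame({"A": [5, 6, 7]}, index=[1, 2, 3])

# Assigning a scalar: the value is broadcast to every row.
df["new_col"] = "text"

# A Series has no .upper method of its own; the string methods live on .str
print(hasattr(df["new_col"], "upper"))
df["new_text"] = df["new_col"].str.upper()
print(df)
```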

Add/subtract value of a column to the entire column of the dataframe pandas

df2 = df - df.iloc[0]


Explanation:

Let's work through an example.

df = pd.DataFrame(np.arange(20).reshape(4, 5))

    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
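Continuing with this frame, a short sketch of what the subtraction does: the first row is aligned on the column labels and subtracted from every row. The explicit `sub(..., axis=1)` call is shown only to make the alignment axis visible; it should give the same result as the `-` operator.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(4, 5))

# The first row is matched on column labels and subtracted from each row.
df2 = df - df.iloc[0]

# Same operation with the alignment axis spelled out.
explicit = df.sub(df.iloc[0], axis=1)

print(df2)
print(df2.equals(explicit))
```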

Numpy array broadcasting rules

Well, the meaning of trailing axes is explained on the linked documentation page.
If you have two arrays with a different number of dimensions, say one 1x2x3 and the other 2x3, then you compare only the trailing common dimensions, in this case 2x3. But if both of your arrays are two-dimensional, then their corresponding sizes have to be either equal or one of them has to be 1. Dimensions along which an array has size 1 are called singular, and the array can be broadcast along them.

In your case you have a 2x2 and a 4x2 array; 4 != 2 and neither 4 nor 2 equals 1, so this doesn't work.
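The shapes mentioned above can be checked directly in NumPy. This sketch uses arrays of ones purely as placeholders for the real data.

```python
import numpy as np

a = np.ones((2, 2))
b = np.ones((4, 2))

# 2x2 with 4x2: the leading sizes 2 and 4 differ and neither is 1.
try:
    a + b
    failed = False
except ValueError:
    failed = True
print(failed)

# 1x2 with 4x2: the size-1 axis is stretched to 4.
ok_row = np.ones((1, 2)) + b
print(ok_row.shape)

# 1x2x3 with 2x3: only the trailing 2x3 dimensions are compared.
ok_3d = np.ones((1, 2, 3)) + np.ones((2, 3))
print(ok_3d.shape)
```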

Assigning column slice with values from another column doesn't throw shape mismatch error

Pandas is clever, so you can offload the broadcasting to it and it'll only assign the values at the specified indices. This will work every time you assign a Series to a column, as long as the index labels match.

Here's another example of how it works:

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6],}, index=['a', 'b', 'c'])
df
a b
a 1 4
b 2 5
c 3 6

df.loc[['a', 'b'], 'a'] = pd.Series([4, 5, 6], index=['b', 'c', 'a'])
df

a b
a 6 4
b 4 5
c 3 6
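The same label matching applies when a whole column is assigned. In this sketch (the column name `c` and the values are made up), the Series is shorter than the frame and in a different order, yet no shape error is raised; rows without a matching label simply get NaN.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["a", "b", "c"])

# Only labels 'c' and 'a' are supplied, and in a different order.
df["c"] = pd.Series([10, 20], index=["c", "a"])

print(df)
```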

Add multiple empty columns to pandas DataFrame

I'd concat using a DataFrame:

In [23]:
df = pd.DataFrame(columns=['A'])
df

Out[23]:
Empty DataFrame
Columns: [A]
Index: []

In [24]:
pd.concat([df,pd.DataFrame(columns=list('BCD'))])

Out[24]:
Empty DataFrame
Columns: [A, B, C, D]
Index: []

So by passing a list containing your original df, and a new one with the columns you wish to add, this will return a new df with the additional columns.


Caveat: See the discussion of performance in the other answers and/or the comment discussions. reindex may be preferable where performance is critical.
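For reference, a minimal sketch of the reindex approach mentioned in the caveat. It makes no performance claim; it only shows that `reindex(columns=...)` also adds the missing columns, filled with NaN when the frame has rows.

```python
import pandas as pd

df = pd.DataFrame(columns=["A"])

# Ask for the existing columns plus the new ones.
wider = df.reindex(columns=[*df.columns, "B", "C", "D"])
print(wider.columns.tolist())

# With existing rows, the new columns are filled with NaN.
df_data = pd.DataFrame({"A": [1, 2]})
wider_data = df_data.reindex(columns=list("ABCD"))
print(wider_data)
```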


