Apply vs Transform on a Group Object

Two major differences between apply and transform

There are two major differences between the transform and apply groupby methods.

  • Input:
    • apply implicitly passes all the columns for each group as a DataFrame to the custom function.
    • transform passes each column for each group individually as a Series to the custom function.
  • Output:
    • The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list).
    • The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.
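Before digging into the details, here is a minimal sketch of the contrast (with throwaway data, not from any question):

import pandas as pd

demo = pd.DataFrame({'key': ['x', 'x', 'y'], 'val': [1, 2, 3]})

# transform: one Series per group in, a same-length sequence (or scalar)
# out, aligned back to the original index.
demo.groupby('key')['val'].transform(lambda s: s - s.mean())

# apply: the whole group DataFrame in, (almost) anything out.
demo.groupby('key').apply(lambda g: g['val'].sum())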

Inspecting the custom function

It can help quite a bit to inspect the input to your custom function passed to apply or transform.

Examples

Let's create some sample data and inspect the groups so that you can see what I am talking about:

import pandas as pd
import numpy as np

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11

Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an exception so that execution can be stopped.

def inspect(x):
    print(type(x))
    raise

Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn't worry about.

Now, let's do the same thing with transform:

df.groupby('State').transform(inspect)

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

It is passed a Series - a totally different Pandas object.

So, transform is only allowed to work with a single Series at a time; it is impossible for it to act on two columns at the same time. So, if we try to subtract column b from a inside of our custom function, we get an error with transform. See below:

def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:

df.groupby('State').apply(subtract_two)

State
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64

The output is a Series and a little confusing, as the original index is kept alongside the group keys, but we have access to all columns.
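If you want those per-row results back on the original rows, one option (a sketch building on the running example; the column name a_minus_b is just illustrative) is to drop the group level from the result's index so it aligns with the original DataFrame:

# Drop the group level so the result aligns with df's original index.
result = df.groupby('State').apply(subtract_two)
df['a_minus_b'] = result.reset_index(level=0, drop=True)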



Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames get nicely rendered as HTML in a Jupyter notebook:

from IPython.display import display

def subtract_two(x):
    display(x)
    return x['a'] - x['b']




Transform must return a single dimensional sequence the same size as the group

The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208


Returning a single scalar object also works for transform

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14
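For a simple aggregation like this you don't even need a custom function; passing the name of a built-in aggregation as a string is equivalent and typically faster:

# Equivalent built-in: each group's sum is broadcast back to its rows.
df.groupby('State').transform('sum')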

Pandas groupby apply vs transform with specific functions

First of all, I believe there is some room for intuition in using these functions, as their names can be quite meaningful.

In your first result, you are not actually trying to transform your values, but rather to aggregate them (and aggregation would work in the way you intended).

But getting into code, the transform docs are quite suggestive in saying that

Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk.

When you do

df.groupby(['a', 'b'])['type'].transform(some_func)

You are actually transforming each pd.Series object from each group into a new object using your some_func function. But the thing is, this new object should have the same size as the group OR be broadcastable to the size of the group chunk.

Therefore, if you transform your Series using tuple or list, you will basically be transforming the object

0    1
1    2
2    3
dtype: int64

into

[1,2,3]

But notice that these values are now assigned back to their respective indexes, and that is why you see no difference in the transform operation: the row that had the .iloc[0] value of the pd.Series now gets the [1,2,3][0] value of the transformed list (the same applies to tuple), and so on. Notice that ordering and size matter here, because otherwise you could mess up your groups and the transform would fail (and this is exactly why set is not a proper function to use in this case).
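The question's DataFrame isn't shown, so here is an assumed minimal stand-in that makes the point visible: a same-length list transforms into an apparently unchanged column.

import pandas as pd

# Assumed minimal stand-in for the question's df (columns a, b, type).
df = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 1, 1], 'type': [1, 2, 3]})

# A same-length list is assigned back by position, so the transformed
# column looks identical to the original.
df.groupby(['a', 'b'])['type'].transform(lambda s: s.tolist())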


The second part of the quoted text says "broadcastable to the size of the group chunk".

This means that you can also transform your pd.Series to an object that can be used in all rows. For example

df.groupby(['a', 'b'])['type'].transform(lambda k: 50)

would work. Why? Even though 50 is not iterable, it is broadcastable: that value is used repeatedly in all positions of your initial pd.Series.


Why can you apply using set?

Because the apply method doesn't have this size constraint on the result. It actually has three different result types, and it infers whether you want to expand, reduce or broadcast your results. Notice that you can't reduce when transforming.

result_type : {‘expand’, ‘reduce’, ‘broadcast’, None}, default None

These only act when axis=1 (columns):

  1. ‘expand’ : list-like results will be turned into columns.

  2. ‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.

  3. ‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.

By default (result_type=None), the final return type is inferred from the return type of the applied function.
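So, on the assumed df from the sketch above, a reducing function like set is fine with apply, because each group is allowed to collapse to a single value:

# apply has no same-length constraint, so reducing each group works:
df.groupby(['a', 'b'])['type'].apply(set)

# a  b
# 1  1    {1, 2, 3}
# Name: type, dtype: object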

Pandas transform() vs apply()

It looks like SeriesGroupBy.transform() tries to cast the result dtype to the same one as the original column has, but DataFrameGroupBy.transform() doesn't seem to do that:

In [139]: df.groupby('id')['cat'].transform(lambda x: (x == 1).any())
Out[139]:
0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    0
8    0
9    1
Name: cat, dtype: int64

#                         vv      vv
In [140]: df.groupby('id')[['cat']].transform(lambda x: (x == 1).any())
Out[140]:
     cat
0   True
1   True
2   True
3   True
4   True
5   True
6   True
7  False
8  False
9   True
In [141]: df.dtypes
Out[141]:
cat    int64
id     int64
dtype: object
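The df from the question isn't shown here, but if this casting bites you, one workaround (a sketch) is to cast explicitly after the transform:

# Don't rely on transform's dtype handling; cast the result yourself.
out = df.groupby('id')['cat'].transform(lambda x: (x == 1).any()).astype(bool)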

Is there any difference between Series.transform and Series.apply?

I ended up posting this question on the pandas GitHub. It turns out the behavior of these two methods can differ after all: Series.apply() always passes a single cell as the function argument, while Series.transform() can pass the entire Series as the function argument in some cases. I'm still not sure how this property can be put to use, since whatever can be done on a whole Series through transform could just as well be done on that Series directly. My only guess is shorter, cleaner code, but I can't say for certain.

Transform gives different results when applied on individual groups rather than specifying after groupby

Although the question might not be very clear, I still think posting an answer is better than deleting it.

As seen in the results above, when transform is applied to the whole groupby object, the function receives each group as a whole Series and the returned values are broadcast back; whereas when I applied the function to an individual Series (a single group), it ran the function on each element, like the apply function of a Series.

After searching through the documentation and looking at the output of a custom function (below), this is what I found.

The groupby transform passes each group to the function directly and checks whether the output matches the length of the passed group or is a scalar, in which case it broadcasts the output to that length.

But Series.transform first tries to apply the function element-wise, and only if that fails does it call the function on the whole object.

This is what I gathered from reading the source code; you can also see it in the output below. I created a function and called it with both transforms:

def func(val):
    print(type(val))
    return ','.join(val.tolist())

# For series transforms
<class 'str'>
<class 'str'>

# For groupby transforms
<class 'pandas.core.series.Series'>

Now if I modify the function so that it works only on a whole Series object and not on individual strings, observe how the Series transform behaves:

# Modified function (works only on a whole Series, not on single strings)
def func(val):
    print(type(val))
    return val.str.split().str[0]

# For Series transforms
<class 'str'>
<class 'pandas.core.series.Series'>
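For completeness, here is a minimal, self-contained reproduction of the two calls (the data is assumed, and the try-element-wise-then-fall-back behavior matches the pandas versions discussed here; newer releases may behave differently):

import pandas as pd

s = pd.Series(['a b', 'c d', 'e f'])
df = pd.DataFrame({'g': [1, 1, 2], 'val': ['a b', 'c d', 'e f']})

def func(val):
    print(type(val))
    return val.str.split().str[0]

s.transform(func)                       # prints str, then falls back to the whole Series
df.groupby('g')['val'].transform(func)  # prints a Series per group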

Pandas groupby + transform and multiple columns

For this particular case you could do:

g = df.groupby(['c', 'd'])

df['e'] = g.a.transform('sum') + g.b.transform('sum')

df
# outputs

   a  b  c  d   e
0  1  1  q  z  12
1  2  2  q  z  12
2  3  3  q  z  12
3  4  4  q  o   8
4  5  5  w  o  22
5  6  6  w  o  22

If you can construct the final result as a linear combination of independent transforms on the same groupby, this method works.
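You can also fold the linear combination into a single transform, since within each group sum(a) + sum(b) equals sum(a + b) (a sketch on the same df):

# One transform on the precomputed row-wise sum:
df['e'] = (df.a + df.b).groupby([df.c, df.d]).transform('sum')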

Otherwise, you'd use a groupby-apply and then merge back into the original df.

example:

_ = df.groupby(['c','d']).apply(lambda x: sum(x.a+x.b)).rename('e').reset_index()
df.merge(_, on=['c','d'])
# same output as above.

pandas groupby transform custom function

One way I like to see what is happening is by creating a small custom function and printing out what is passed and its type. Then you can see what you have to work with.

def f(x):
    print(type(x))
    print('\n')
    print(x)
    print(x.index)
    return df.loc[x.index, 'd'] * x

df['f'] = df.groupby('b')['c'].transform(f)
print(df)

# Output from print statements in the custom function
<class 'pandas.core.series.Series'>

0    55.0
1    44.2
4     0.0
Name: b1, dtype: float64
Int64Index([0, 1, 4], dtype='int64')
<class 'pandas.core.series.Series'>

2    33.3
3   -66.5
Name: b2, dtype: float64
Int64Index([2, 3], dtype='int64')
# End output from print statements in the custom function

    a   b     c       d     e         f
0  a1  b1  55.0      10  99.2     550.0
1  a2  b1  44.2     100  99.2    4420.0
2  a3  b2  33.3    1000 -33.2   33300.0
3  a4  b2 -66.5   10000 -33.2 -665000.0
4  a5  b1   0.0  100000  99.2       0.0

Here, I am transforming on column 'c', but I make an "external" call to the DataFrame object inside my custom function to get column 'd'.

You can also pass the "external" column name as an argument, like this:

def f(x, col):
    return df.loc[x.index, col] * x

df['g'] = df.groupby('b')['c'].transform(f, col='d')

print(df)

Output:

    a   b     c       d     e         f         g
0  a1  b1  55.0      10  99.2     550.0     550.0
1  a2  b1  44.2     100  99.2    4420.0    4420.0
2  a3  b2  33.3    1000 -33.2   33300.0   33300.0
3  a4  b2 -66.5   10000 -33.2 -665000.0 -665000.0
4  a5  b1   0.0  100000  99.2       0.0       0.0

Pandas group by returns NAN for apply vs transform function

I think you need dropna with apply; the lambda should be omitted:

df=df.dropna(subset=['size']).groupby('id')['size'].apply(', '.join).reset_index(name='col')

Or very similar:

df = df['size'].dropna().groupby(df['id']).apply(', '.join).reset_index(name='col')
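A small assumed reproduction (the question's data isn't shown) to illustrate what the dropna-then-join produces:

import pandas as pd
import numpy as np

# Assumed data resembling the question: one missing value in 'size'.
df = pd.DataFrame({'id': [1, 1, 2, 2],
                   'size': ['S', np.nan, 'M', 'L']})

out = (df.dropna(subset=['size'])
         .groupby('id')['size']
         .apply(', '.join)
         .reset_index(name='col'))
print(out)
#    id   col
# 0   1     S
# 1   2  M, L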

Pandas Transforming the Applied Results back to the original dataframe

I do not have much to add to the excellent reference you provided on apply vs. transform, but you can do what you want without creating a separate DataFrame. For example, you can do:

candy.groupby(['Name']).apply(
    lambda x: x.assign(
        Total_Chocolate_Spend=x[x['Candy'] == 'Chocolate']['Value'].sum()))

This uses assign on each group in the groupby to populate Total_Chocolate_Spend with the number you want.
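A transform-based alternative (a sketch, assuming candy has Name, Candy and Value columns) that avoids the per-group apply:

# Zero out non-chocolate rows, then broadcast each Name's sum to its rows.
candy['Total_Chocolate_Spend'] = (
    candy['Value']
        .where(candy['Candy'] == 'Chocolate', 0)
        .groupby(candy['Name'])
        .transform('sum')
)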


