Pandas Groupby Multiple Fields Then Diff

Pandas groupby multiple fields then diff

First, sort the DataFrame and then all you need is groupby.diff():

df = df.sort_values(by=['site', 'country', 'date'])

df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)

df
Out:
         date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

sort_values doesn't support arbitrary orderings. If you need a custom sort order (google before fb, for example), define that order as an ordered categorical on the column; sort_values will then respect it, as the sketch below shows.
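A minimal sketch of that approach, assuming the example order google before fb:

# define the custom order as an ordered categorical (example order)
df['site'] = pd.Categorical(df['site'], categories=['google', 'fb'], ordered=True)

# sort_values now respects the categorical order for 'site'
df = df.sort_values(by=['site', 'country', 'date'])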

How to groupby multiple columns and aggregate diff on different columns?

  • It's important to sort df first, because .diff subtracts the previous row within each group; the rows must therefore be in date order inside each group for the differences to be meaningful.
    • Be certain to sort df, in order, by 'state', 'county', and 'date'; note that 'date' is not one of the .groupby keys, it only fixes the row order within each group.
  • .groupby on 'state' and 'county', and aggregate .diff on the desired columns.
  • fillna with 0, and .join the result back to df on the index.
    • Specify rsuffix, or use .rename to change the column headers (a rename sketch follows the output below).
import pandas as pd

# setup the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
        'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
        'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
        'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
        'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}

df = pd.DataFrame(data)

# set the date column to a datetime format
df.date = pd.to_datetime(df.date)

# sort the values
df = df.sort_values(['state', 'county', 'date'])

# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')

# display(df)
        date  county       state  cnt_a  cnt_b  cnt_a_diff  cnt_b_diff
0 2020-06-13  Bergen  New Jersey    308     11         0.0         0.0
1 2020-06-14  Bergen  New Jersey    308     11         0.0         0.0
2 2020-06-15  Bergen  New Jersey    320     15        12.0         4.0
3 2020-06-12   Union  New Jersey    100      3         0.0         0.0
4 2020-06-13   Union  New Jersey    130      4        30.0         1.0
5 2020-06-14   Union  New Jersey    150      5        20.0         1.0
6 2020-06-12   Bronx    New York    200    100         0.0         0.0
7 2020-06-13   Bronx    New York    210    200        10.0       100.0
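As the last bullet above notes, .rename can be used instead of rsuffix; a minimal sketch of that variant:

# compute the diffs separately and rename before joining (sketch)
diffs = (df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']]
           .diff()
           .fillna(0)
           .rename(columns={'cnt_a': 'cnt_a_diff', 'cnt_b': 'cnt_b_diff'}))
df = df.join(diffs)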

Computing diffs within groups of a dataframe

Wouldn't it just be easier to do what you yourself describe, namely:

df.sort_values(['ticker', 'date'], inplace=True)
df['diffs'] = df['value'].diff()

and then correct for borders:

import numpy as np

mask = df.ticker != df.ticker.shift(1)
df.loc[mask, 'diffs'] = np.nan

To maintain the original index, you may save idx = df.index at the beginning and call df.reindex(idx) at the end; or, if it is a huge dataframe, perform the operations on

df.filter(['ticker', 'date', 'value'])

and then join the two dataframes at the end.
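Putting the pieces together, a minimal sketch of the index-preserving variant described above:

idx = df.index                           # remember the original row order
df = df.sort_values(['ticker', 'date'])
df['diffs'] = df['value'].diff()
mask = df.ticker != df.ticker.shift(1)   # True on each ticker's first row
df.loc[mask, 'diffs'] = np.nan           # don't diff across ticker borders
df = df.reindex(idx)                     # restore the original order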

Edit: alternatively (though still not using groupby):

df.set_index(['ticker', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan

for idx in df.index.levels[0]:
    # .loc avoids the chained assignment of df.diffs[idx] = ...
    df.loc[idx, 'diffs'] = df.loc[idx, 'value'].diff().values

for

   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

this will produce:

             value  diffs
ticker date
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58
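For comparison, the idiomatic groupby version of the same computation, sketched on the flat frame before set_index:

df = df.sort_values(['ticker', 'date'])
df['diffs'] = df.groupby('ticker')['value'].diff()   # NaN at each ticker's first row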

Pandas DataFrame Groupby two columns and get counts

Following up on @Andy's answer, you can do the following to solve your second question:

In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]:
      0
col2
A     3
B     2
C     1
D     3
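If the unnamed 0 column bothers you, the same can be written with a named count column; a sketch:

df.groupby(['col5', 'col2']).size().reset_index(name='count').groupby('col2')['count'].max()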

Groupby Pandas , calculate multiple columns based on date difference

You could do something like this:

def days_diff(sdf):
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index
    )
    start = sdf.at[sdf.index[0], "Date"]
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            start = day
    result.A = result.days_diff.isna().cumsum()
    return result

df[["days_diff", "A"]] = df[["CID", "Date", "MID"]].groupby("CID").apply(days_diff)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()

Result for df created by

from io import StringIO
data = StringIO(
'''
CID RefID Date Group MID
100 1 1/01/2021 A
100 2 3/01/2021 A
100 3 4/01/2021 A 101
100 4 15/01/2021 A
100 5 18/01/2021 A
200 6 3/03/2021 B
200 7 4/04/2021 B
200 8 9/04/2021 B 102
200 9 25/04/2021 B
300 10 26/04/2021 C
300 11 27/05/2021 C
300 12 28/05/2021 C 103
''')
df = pd.read_csv(data, delim_whitespace=True)
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")

is

    CID  RefID       Date Group    MID days_diff  A     B
0   100      1 2021-01-01     A    NaN       NaT  1   1.0
1   100      2 2021-01-03     A    NaN         2  1   1.0
2   100      3 2021-01-04     A  101.0         3  1   1.0
3   100      4 2021-01-15     A    NaN       NaT  2   4.0
4   100      5 2021-01-18     A    NaN         3  2   4.0
5   200      6 2021-03-03     B    NaN       NaT  1   6.0
6   200      7 2021-04-04     B    NaN       NaT  2   7.0
7   200      8 2021-04-09     B  102.0         5  2   7.0
8   200      9 2021-04-25     B    NaN       NaT  3   9.0
9   300     10 2021-04-26     C    NaN       NaT  1  10.0
10  300     11 2021-05-27     C    NaN       NaT  2  11.0
11  300     12 2021-05-28     C  103.0         1  2  11.0

A few explanations:

  • The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to the sub-dataframes of df grouped by the column CID.
  • First step: initializing the result dataframe result (column days_diff filled with NaT, column A with None), and setting the starting value start for the day differences to the first day in the group.
  • Afterwards essentially looping over the sub-dataframe from the second index onwards, thereby grabbing the index, the value in column Date, and a boolean value next_MID_is_na that is True when the MID value of the preceding row is NaN (via .shift(1).isna()).
  • In every step of the loop:
    1. Calculation of the difference of the current day to the start day.
    2. Checking the rules for the days_diff column:
      • If the difference of current and start day is <= 30 days and the shifted MID is NaN -> record the day difference.
      • Otherwise -> reset start to the current day.
  • After finishing column days_diff, calculation of column A: result.days_diff.isna() is True (== 1) when days_diff is NaN, False (== 0) otherwise. Therefore the cumulative sum (.cumsum()) gives the required result (a tiny illustration follows this list).
  • After the groupby-apply has produced the columns days_diff and A, finally the calculation of column B: selection of the RefID values where the A values change (via .where(df.A != df.A.shift(1))), then forward filling of the remaining NaNs.
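To see the .isna().cumsum() trick from the explanation above in isolation, a tiny example with made-up values:

s = pd.Series([np.nan, 2, 3, np.nan, 3])   # a days_diff-like column
s.isna().cumsum()                          # -> 1, 1, 1, 2, 2
# every NaN starts a new block, and the blocks are numbered 1, 2, ...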

Apply multiple functions to multiple groupby columns

The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

If you desire to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a dataframe similar to the one from above:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})

              a                   b         c              d
            sum       max      mean       sum  Max minus Min
group
0      0.864569  0.446069  0.466054  0.969921       0.341399
1      1.478872  0.843026  0.687672  1.754877       0.672401
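As an aside, recent pandas versions (0.25+) offer named aggregation, which produces flat, custom column names directly; a sketch with column names of my own choosing:

df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    d_range=('d', lambda x: x.max() - x.min()),
)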

Using apply and returning a Series

Now, if you have multiple columns that need to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed into the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

          a_sum     a_max    b_mean  c_d_prodsum
group
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

Groupby a column and then compare two other columns and return a value in a different column

This solution compares all rows within each group, not only the previous and next ones.

You can use a custom lambda function in GroupBy.apply to build a boolean mask; group_keys=False is passed to DataFrame.groupby to avoid adding a MultiIndex, and the new values are then set with numpy.where:

mask = md.groupby('COMB', group_keys=False).apply(lambda x: x['FROM'].isin(x['TO']))
md = md.assign(Type=np.where(mask, "R", "O"))
print(md)
    COMB FROM   TO Type
0   PNR1  MAA  BLR    R
1   PNR1  BLR  MAA    R
2  PNR11  DEL  MAA    O
3   PNR2  TRV  HYD    R
4   PNR2  HYD  TRV    R

This solution compares previous and next rows per group:

Another idea is to use DataFrameGroupBy.shift, which should be faster than groupby.apply:

mask = (md.groupby('COMB')['FROM'].shift().eq(md['TO']) |
        md.groupby('COMB')['TO'].shift(-1).eq(md['FROM']))

md = md.assign(Type=np.where(mask, "R", "O"))
print(md)
    COMB FROM   TO Type
0   PNR1  MAA  BLR    R
1   PNR1  BLR  MAA    R
2  PNR11  DEL  MAA    O
3   PNR2  TRV  HYD    R
4   PNR2  HYD  TRV    R

Pandas - dataframe groupby - how to get sum of multiple columns

By using apply

df.groupby(['col1', 'col2'])[["col3", "col4"]].apply(lambda x: x.astype(int).sum())
Out[1257]:
           col3  col4
col1 col2
a    c        2     4
     d        1     2
b    d        1     2
     e        2     4

If you want to use agg:

df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
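Since both aggregations here are plain sums, the same result can be written more simply; a sketch:

df.groupby(['col1', 'col2'])[['col3', 'col4']].sum()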

Group by using 2 columns and if the time difference between the rows of third column is less than 2 seconds python pandas

In the sample data, the difference in the last group is 5 seconds (13:27:59 - 13:27:54 = 5 seconds).

The solution is to add DURATION (in seconds) to DATE_TIME in a new column add, get the per-group differences of that column with DataFrameGroupBy.diff, compare them for greater than N, and take the cumulative sum of that mask to form a new grouping column g; last, aggregate first and sum:

N = 5
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
f = lambda x: x.diff().dt.total_seconds().gt(N).cumsum()
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].transform(f)
print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION                 add  g
0       190       390 2020-04-20 12:44:36       323 2020-04-20 12:49:59  0
1       282       811 2020-04-06 11:12:24        25 2020-04-06 11:12:49  0
2       495       414 2020-04-20 11:22:13        11 2020-04-20 11:22:24  0
3       827       158 2020-04-30 13:27:32        22 2020-04-30 13:27:54  0
4       827       158 2020-04-30 13:27:44        15 2020-04-30 13:27:59  0

dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME': 'first', 'DURATION': 'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION
0       190       390 2020-04-20 12:44:36       323
1       282       811 2020-04-06 11:12:24        25
2       495       414 2020-04-20 11:22:13        11
3       827       158 2020-04-30 13:27:32        37

If you need to compare DATE_TIME with the per-group shifted add column, the solution (with new data) is:

N = 2

dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)

dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION                 add  \
0       190       390 2020-04-20 12:44:36       323 2020-04-20 12:49:59
1       282       811 2020-04-06 11:12:24        25 2020-04-06 11:12:49
2       495       414 2020-04-20 11:22:13        11 2020-04-20 11:22:24
3       827       158 2020-04-30 13:27:22        22 2020-04-30 13:27:44
4       827       158 2020-04-30 13:27:44        15 2020-04-30 13:27:59

    diff  g
0  False  0
1  False  0
2  False  0
3  False  0
4  False  0


dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME': 'first', 'DURATION': 'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION
0       190       390 2020-04-20 12:44:36       323
1       282       811 2020-04-06 11:12:24        25
2       495       414 2020-04-20 11:22:13        11
3       827       158 2020-04-30 13:27:22        37

Tested with a 3rd sample:

N = 2

dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)

dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print(dataframe)
       A_PERSON    B_PERSON           DATE_TIME  DURATION                 add  \
0  441785807190  4299330390 2020-04-20 12:44:36       323 2020-04-20 12:49:59
1   96897940827  3139578158 2020-04-30 13:27:00        33 2020-04-30 13:27:33
2  441785808282  4238900811 2020-04-06 11:12:24        25 2020-04-06 11:12:49
3   14244012495  3104405414 2020-04-20 11:22:13        11 2020-04-20 11:22:24
4   96897940827  3139578158 2020-04-30 13:27:34        16 2020-04-30 13:27:50

    diff  g
0  False  0
1  False  0
2  False  0
3  False  0
4  False  0


dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME': 'first', 'DURATION': 'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
       A_PERSON    B_PERSON           DATE_TIME  DURATION
0   14244012495  3104405414 2020-04-20 11:22:13        11
1   96897940827  3139578158 2020-04-30 13:27:00        49
2  441785807190  4299330390 2020-04-20 12:44:36       323
3  441785808282  4238900811 2020-04-06 11:12:24        25

