Pandas Groupby Multiple Fields Then Diff

Pandas groupby multiple fields then diff

First, sort the DataFrame and then all you need is groupby.diff():

df = df.sort_values(by=['site', 'country', 'date'])

df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)

df
Out:
         date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

sort_values doesn't support arbitrary orderings. If you need a custom sort order (google before fb, for example), define that order as an ordered categorical on the column; sort_values will then respect it, as the sketch below shows.
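A minimal sketch of that approach, assuming the example order google before fb:

# define the custom order as an ordered categorical (example order)
df['site'] = pd.Categorical(df['site'], categories=['google', 'fb'], ordered=True)

# sort_values now respects the categorical order for 'site'
df = df.sort_values(by=['site', 'country', 'date'])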

How to groupby multiple columns and aggregate diff on different columns?

  • It's important to sort df first, because .diff subtracts the previous row within each group; the rows must therefore be in date order inside each group for the differences to be meaningful.
    • Be certain to sort df, in order, by 'state', 'county', and 'date'; note that 'date' is not one of the .groupby keys, it only fixes the row order within each group.
  • .groupby on 'state' and 'county', and aggregate .diff on the desired columns.
  • fillna with 0, and .join the result back to df on the index.
    • Specify rsuffix, or use .rename to change the column headers (a rename sketch follows the output below).
import pandas as pd

# setup the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
        'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
        'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
        'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
        'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}

df = pd.DataFrame(data)

# set the date column to a datetime format
df.date = pd.to_datetime(df.date)

# sort the values
df = df.sort_values(['state', 'county', 'date'])

# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')

# display(df)
        date  county       state  cnt_a  cnt_b  cnt_a_diff  cnt_b_diff
0 2020-06-13  Bergen  New Jersey    308     11         0.0         0.0
1 2020-06-14  Bergen  New Jersey    308     11         0.0         0.0
2 2020-06-15  Bergen  New Jersey    320     15        12.0         4.0
3 2020-06-12   Union  New Jersey    100      3         0.0         0.0
4 2020-06-13   Union  New Jersey    130      4        30.0         1.0
5 2020-06-14   Union  New Jersey    150      5        20.0         1.0
6 2020-06-12   Bronx    New York    200    100         0.0         0.0
7 2020-06-13   Bronx    New York    210    200        10.0       100.0
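As the last bullet above notes, .rename can be used instead of rsuffix; a minimal sketch of that variant:

# compute the diffs separately and rename before joining (sketch)
diffs = (df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']]
           .diff()
           .fillna(0)
           .rename(columns={'cnt_a': 'cnt_a_diff', 'cnt_b': 'cnt_b_diff'}))
df = df.join(diffs)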

Computing diffs within groups of a dataframe

Wouldn't it just be easier to do what you yourself describe, namely:

df.sort_values(['ticker', 'date'], inplace=True)
df['diffs'] = df['value'].diff()

and then correct for borders:

import numpy as np

mask = df.ticker != df.ticker.shift(1)
df.loc[mask, 'diffs'] = np.nan

To maintain the original index, you may save idx = df.index at the beginning and call df.reindex(idx) at the end; or, if it is a huge dataframe, perform the operations on

df.filter(['ticker', 'date', 'value'])

and then join the two dataframes at the end.
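Putting the pieces together, a minimal sketch of the index-preserving variant described above:

idx = df.index                           # remember the original row order
df = df.sort_values(['ticker', 'date'])
df['diffs'] = df['value'].diff()
mask = df.ticker != df.ticker.shift(1)   # True on each ticker's first row
df.loc[mask, 'diffs'] = np.nan           # don't diff across ticker borders
df = df.reindex(idx)                     # restore the original order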

Edit: alternatively (though still not using groupby):

df.set_index(['ticker', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan

for idx in df.index.levels[0]:
    # .loc avoids the chained assignment of df.diffs[idx] = ...
    df.loc[idx, 'diffs'] = df.loc[idx, 'value'].diff().values

for

   date ticker  value
0    63      C   1.65
1    88      C  -1.93
2    22      C  -1.29
3    76      A  -0.79
4    72      B  -1.24
5    34      A  -0.23
6    92      B   2.43
7    22      A   0.55
8    32      A  -2.50
9    59      B  -1.01

this will produce:

             value  diffs
ticker date
A      22     0.55    NaN
       32    -2.50  -3.05
       34    -0.23   2.27
       76    -0.79  -0.56
B      59    -1.01    NaN
       72    -1.24  -0.23
       92     2.43   3.67
C      22    -1.29    NaN
       63     1.65   2.94
       88    -1.93  -3.58
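For comparison, the idiomatic groupby version of the same computation, sketched on the flat frame before set_index:

df = df.sort_values(['ticker', 'date'])
df['diffs'] = df.groupby('ticker')['value'].diff()   # NaN at each ticker's first row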

Pandas DataFrame Groupby two columns and get counts

Following up on @Andy's answer, you can do the following to solve your second question:

In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]:
      0
col2
A     3
B     2
C     1
D     3
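If the unnamed 0 column bothers you, the same can be written with a named count column; a sketch:

df.groupby(['col5', 'col2']).size().reset_index(name='count').groupby('col2')['count'].max()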

Groupby Pandas , calculate multiple columns based on date difference

You could do something like this:

def days_diff(sdf):
    result = pd.DataFrame(
        {"days_diff": pd.NaT, "A": None}, index=sdf.index
    )
    start = sdf.at[sdf.index[0], "Date"]
    for index, day, next_MID_is_na in zip(
        sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
    ):
        diff = (day - start).days
        if diff <= 30 and next_MID_is_na:
            result.at[index, "days_diff"] = diff
        else:
            start = day
    result.A = result.days_diff.isna().cumsum()
    return result

df[["days_diff", "A"]] = df[["CID", "Date", "MID"]].groupby("CID").apply(days_diff)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()

Result for df created by

from io import StringIO
data = StringIO(
'''
CID RefID Date Group MID
100 1 1/01/2021 A
100 2 3/01/2021 A
100 3 4/01/2021 A 101
100 4 15/01/2021 A
100 5 18/01/2021 A
200 6 3/03/2021 B
200 7 4/04/2021 B
200 8 9/04/2021 B 102
200 9 25/04/2021 B
300 10 26/04/2021 C
300 11 27/05/2021 C
300 12 28/05/2021 C 103
''')
df = pd.read_csv(data, delim_whitespace=True)
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")

is

    CID  RefID       Date Group    MID days_diff  A     B
0   100      1 2021-01-01     A    NaN       NaT  1   1.0
1   100      2 2021-01-03     A    NaN         2  1   1.0
2   100      3 2021-01-04     A  101.0         3  1   1.0
3   100      4 2021-01-15     A    NaN       NaT  2   4.0
4   100      5 2021-01-18     A    NaN         3  2   4.0
5   200      6 2021-03-03     B    NaN       NaT  1   6.0
6   200      7 2021-04-04     B    NaN       NaT  2   7.0
7   200      8 2021-04-09     B  102.0         5  2   7.0
8   200      9 2021-04-25     B    NaN       NaT  3   9.0
9   300     10 2021-04-26     C    NaN       NaT  1  10.0
10  300     11 2021-05-27     C    NaN       NaT  2  11.0
11  300     12 2021-05-28     C  103.0         1  2  11.0

A few explanations:

  • The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to the sub-dataframes of df grouped by the column CID.
  • First step: initializing the result dataframe result (column days_diff filled with NaT, column A with None), and setting the starting value start for the day differences to the first day in the group.
  • Afterwards essentially looping over the sub-dataframe from the second index onwards, thereby grabbing the index, the value in column Date, and a boolean value next_MID_is_na that is True when the MID value of the preceding row is NaN (via .shift(1).isna()).
  • In every step of the loop:
    1. Calculation of the difference of the current day to the start day.
    2. Checking the rules for the days_diff column:
      • If the difference of current and start day is <= 30 days and the shifted MID is NaN -> record the day difference.
      • Otherwise -> reset start to the current day.
  • After finishing column days_diff, calculation of column A: result.days_diff.isna() is True (== 1) when days_diff is NaN, False (== 0) otherwise. Therefore the cumulative sum (.cumsum()) gives the required result (a tiny illustration follows this list).
  • After the groupby-apply has produced the columns days_diff and A, finally the calculation of column B: selection of the RefID values where the A values change (via .where(df.A != df.A.shift(1))), then forward filling of the remaining NaNs.
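To see the .isna().cumsum() trick from the explanation above in isolation, a tiny example with made-up values:

s = pd.Series([np.nan, 2, 3, np.nan, 3])   # a days_diff-like column
s.isna().cumsum()                          # -> 1, 1, 1, 2, 2
# every NaN starts a new block, and the blocks are numbered 1, 2, ...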

Apply multiple functions to multiple groupby columns

The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the agg groupby method. Second, never use .ix.

If you desire to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a dataframe similar to the one from above:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(4, 4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df

          a         b         c         d  group
0  0.418500  0.030955  0.874869  0.145641      0
1  0.446069  0.901153  0.095052  0.487040      0
2  0.843026  0.936169  0.926090  0.041722      1
3  0.635846  0.439175  0.828787  0.714123      1

A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': lambda x: x.max() - x.min()})

              a                   b         c         d
            sum       max      mean       sum  <lambda>
group
0      0.864569  0.446069  0.466054  0.969921  0.341399
1      1.478872  0.843026  0.687672  1.754877  0.672401

If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__ attribute like this:

def max_min(x):
    return x.max() - x.min()

max_min.__name__ = 'Max minus Min'

df.groupby('group').agg({'a': ['sum', 'max'],
                         'b': 'mean',
                         'c': 'sum',
                         'd': max_min})

              a                   b         c              d
            sum       max      mean       sum  Max minus Min
group
0      0.864569  0.446069  0.466054  0.969921       0.341399
1      1.478872  0.843026  0.687672  1.754877       0.672401
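As an aside, recent pandas versions (0.25+) offer named aggregation, which produces flat, custom column names directly; a sketch with column names of my own choosing:

df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    d_range=('d', lambda x: x.max() - x.min()),
)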

Using apply and returning a Series

Now, if you have multiple columns that need to interact together, then you cannot use agg, which implicitly passes a Series to the aggregating function. When using apply, the entire group is passed into the function as a DataFrame.

I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:

def f(x):
    d = {}
    d['a_sum'] = x['a'].sum()
    d['a_max'] = x['a'].max()
    d['b_mean'] = x['b'].mean()
    d['c_d_prodsum'] = (x['c'] * x['d']).sum()
    return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])

df.groupby('group').apply(f)

          a_sum     a_max    b_mean  c_d_prodsum
group
0      0.864569  0.446069  0.466054     0.173711
1      1.478872  0.843026  0.687672     0.630494

If you are in love with MultiIndexes, you can still return a Series with one like this:

def f_mi(x):
    d = []
    d.append(x['a'].sum())
    d.append(x['a'].max())
    d.append(x['b'].mean())
    d.append((x['c'] * x['d']).sum())
    return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
                               ['sum', 'max', 'mean', 'prodsum']])

df.groupby('group').apply(f_mi)

              a                   b       c_d
            sum       max      mean   prodsum
group
0      0.864569  0.446069  0.466054  0.173711
1      1.478872  0.843026  0.687672  0.630494

Groupby a column and then compare two other columns and return a value in a different column

This solution compares all rows within each group, not only the previous and next ones.

You can use a custom lambda function in GroupBy.apply to build a boolean mask; group_keys=False is passed to DataFrame.groupby to avoid adding a MultiIndex, and the new values are then set with numpy.where:

mask = md.groupby('COMB', group_keys=False).apply(lambda x: x['FROM'].isin(x['TO']))
md = md.assign(Type=np.where(mask, "R", "O"))
print(md)
    COMB FROM   TO Type
0   PNR1  MAA  BLR    R
1   PNR1  BLR  MAA    R
2  PNR11  DEL  MAA    O
3   PNR2  TRV  HYD    R
4   PNR2  HYD  TRV    R

This solution compares previous and next rows per group:

Another idea is to use DataFrameGroupBy.shift, which should be faster than groupby.apply:

mask = (md.groupby('COMB')['FROM'].shift().eq(md['TO']) |
        md.groupby('COMB')['TO'].shift(-1).eq(md['FROM']))

md = md.assign(Type=np.where(mask, "R", "O"))
print(md)
    COMB FROM   TO Type
0   PNR1  MAA  BLR    R
1   PNR1  BLR  MAA    R
2  PNR11  DEL  MAA    O
3   PNR2  TRV  HYD    R
4   PNR2  HYD  TRV    R

Pandas - dataframe groupby - how to get sum of multiple columns

By using apply

df.groupby(['col1', 'col2'])[["col3", "col4"]].apply(lambda x: x.astype(int).sum())
Out[1257]:
           col3  col4
col1 col2
a    c        2     4
     d        1     2
b    d        1     2
     e        2     4

If you want to use agg:

df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
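Since both aggregations here are plain sums, the same result can be written more simply; a sketch:

df.groupby(['col1', 'col2'])[['col3', 'col4']].sum()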

Group by using 2 columns and if the time difference between the rows of third column is less than 2 seconds python pandas

In the sample data, the difference in the last group is 5 seconds (13:27:59 - 13:27:54 = 5 seconds).

The solution is to add DURATION (in seconds) to DATE_TIME in a new column add, get the per-group differences of that column with DataFrameGroupBy.diff, compare them for greater than N, and take the cumulative sum of that mask to form a new grouping column g; last, aggregate first and sum:

N = 5
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
f = lambda x: x.diff().dt.total_seconds().gt(N).cumsum()
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].transform(f)
print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION                 add  g
0       190       390 2020-04-20 12:44:36       323 2020-04-20 12:49:59  0
1       282       811 2020-04-06 11:12:24        25 2020-04-06 11:12:49  0
2       495       414 2020-04-20 11:22:13        11 2020-04-20 11:22:24  0
3       827       158 2020-04-30 13:27:32        22 2020-04-30 13:27:54  0
4       827       158 2020-04-30 13:27:44        15 2020-04-30 13:27:59  0

dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME': 'first', 'DURATION': 'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION
0       190       390 2020-04-20 12:44:36       323
1       282       811 2020-04-06 11:12:24        25
2       495       414 2020-04-20 11:22:13        11
3       827       158 2020-04-30 13:27:32        37

If you need to compare DATE_TIME with the per-group shifted add column, the solution (with new data) is:

N = 2

dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)

dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION                 add  \
0       190       390 2020-04-20 12:44:36       323 2020-04-20 12:49:59
1       282       811 2020-04-06 11:12:24        25 2020-04-06 11:12:49
2       495       414 2020-04-20 11:22:13        11 2020-04-20 11:22:24
3       827       158 2020-04-30 13:27:22        22 2020-04-30 13:27:44
4       827       158 2020-04-30 13:27:44        15 2020-04-30 13:27:59

    diff  g
0  False  0
1  False  0
2  False  0
3  False  0
4  False  0


dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME': 'first', 'DURATION': 'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
   A_PERSON  B_PERSON           DATE_TIME  DURATION
0       190       390 2020-04-20 12:44:36       323
1       282       811 2020-04-06 11:12:24        25
2       495       414 2020-04-20 11:22:13        11
3       827       158 2020-04-30 13:27:22        37

Tested with a 3rd sample:

N = 2

dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])

dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)

dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print(dataframe)
       A_PERSON    B_PERSON           DATE_TIME  DURATION                 add  \
0  441785807190  4299330390 2020-04-20 12:44:36       323 2020-04-20 12:49:59
1   96897940827  3139578158 2020-04-30 13:27:00        33 2020-04-30 13:27:33
2  441785808282  4238900811 2020-04-06 11:12:24        25 2020-04-06 11:12:49
3   14244012495  3104405414 2020-04-20 11:22:13        11 2020-04-20 11:22:24
4   96897940827  3139578158 2020-04-30 13:27:34        16 2020-04-30 13:27:50

    diff  g
0  False  0
1  False  0
2  False  0
3  False  0
4  False  0


dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
                      .agg({'DATE_TIME': 'first', 'DURATION': 'sum'})
                      .droplevel(-1)
                      .reset_index())

print(dataframe)
       A_PERSON    B_PERSON           DATE_TIME  DURATION
0   14244012495  3104405414 2020-04-20 11:22:13        11
1   96897940827  3139578158 2020-04-30 13:27:00        49
2  441785807190  4299330390 2020-04-20 12:44:36       323
3  441785808282  4238900811 2020-04-06 11:12:24        25

