Pandas groupby multiple fields then diff
First, sort the DataFrame, and then all you need is groupby.diff():
df = df.sort_values(by=['site', 'country', 'date'])
df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)
df
Out:
date site country score diff
8 2018-01-01 fb es 100 0.0
9 2018-01-02 fb gb 100 0.0
5 2018-01-01 fb us 50 0.0
6 2018-01-02 fb us 55 5.0
7 2018-01-03 fb us 100 45.0
1 2018-01-01 google ch 50 0.0
4 2018-01-02 google ch 10 -40.0
0 2018-01-01 google us 100 0.0
2 2018-01-02 google us 70 -30.0
3 2018-01-03 google us 60 -10.0
sort_values doesn't support arbitrary orderings. If you need an arbitrary sort order (for example, google before fb), store the order in a collection and convert your column to a categorical dtype; sort_values will then respect the ordering you provided.
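For example, a minimal sketch of that categorical trick (the google-before-fb ordering is the hypothetical from above):
import pandas as pd
# hypothetical custom order: google should sort before fb
site_order = ['google', 'fb']
df['site'] = pd.Categorical(df['site'], categories=site_order, ordered=True)
# sort_values now follows site_order instead of alphabetical order
df = df.sort_values(by=['site', 'country', 'date'])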
How to groupby multiple columns and aggregate diff on different columns?
- It's important to sort df, because df.groupby returns its groups in sorted order. If df isn't sorted first, the columns joined back from .groupby will not be in the same order as df.
- Be certain to sort df, in order, by 'state', 'county', and 'date'; note, however, that the 'date' column is not one of the .groupby keys.
- Be certain to .groupby on 'state' and 'county', aggregate .diff on the desired columns, .fillna with 0, and .join the result back to df, based on the index.
- Specify rsuffix, and/or use .rename to change the column headers.
import pandas as pd
# setup the test dataframe
data = {'date': ['2020-06-13', '2020-06-14', '2020-06-15', '2020-06-12', '2020-06-13', '2020-06-14', '2020-06-12', '2020-06-13'],
'county': ['Bergen', 'Bergen', 'Bergen', 'Union', 'Union', 'Union', 'Bronx', 'Bronx'],
'state': ['New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New Jersey', 'New York', 'New York'],
'cnt_a': [308, 308, 320, 100, 130, 150, 200, 210],
'cnt_b': [11, 11, 15, 3, 4, 5, 100, 200]}
df = pd.DataFrame(data)
# set the date column to a datetime format
df.date = pd.to_datetime(df.date)
# sort the values
df = df.sort_values(['state', 'county', 'date'])
# groupby and join back to dataframe df
df = df.join(df.groupby(['state', 'county'])[['cnt_a', 'cnt_b']].diff().fillna(0), rsuffix='_diff')
# display(df)
date county state cnt_a cnt_b cnt_a_diff cnt_b_diff
0 2020-06-13 Bergen New Jersey 308 11 0.0 0.0
1 2020-06-14 Bergen New Jersey 308 11 0.0 0.0
2 2020-06-15 Bergen New Jersey 320 15 12.0 4.0
3 2020-06-12 Union New Jersey 100 3 0.0 0.0
4 2020-06-13 Union New Jersey 130 4 30.0 1.0
5 2020-06-14 Union New Jersey 150 5 20.0 1.0
6 2020-06-12 Bronx New York 200 100 0.0 0.0
7 2020-06-13 Bronx New York 210 200 10.0 100.0
Computing diffs within groups of a dataframe
Wouldn't it just be easier to do what you yourself describe, namely
df.sort_values(['ticker', 'date'], inplace=True)
df['diffs'] = df['value'].diff()
and then correct for borders:
mask = df.ticker != df.ticker.shift(1)
df.loc[mask, 'diffs'] = np.nan  # .loc avoids chained assignment
To maintain the original index, you may save idx = df.index at the beginning and call df.reindex(idx) at the end; or, if it is a huge dataframe, perform the operations on df.filter(['ticker', 'date', 'value']) and then join the two dataframes at the end.
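A minimal sketch of that index round-trip, under the same column assumptions:
import numpy as np

idx = df.index                          # remember the original row order
df = df.sort_values(['ticker', 'date'])
df['diffs'] = df['value'].diff()
mask = df.ticker != df.ticker.shift(1)  # first row of each ticker
df.loc[mask, 'diffs'] = np.nan
df = df.reindex(idx)                    # restore the original order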
edit: alternatively (though still not using groupby):
df.set_index(['ticker', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for ticker in df.index.levels[0]:
    # diff within one ticker; .loc avoids chained assignment
    df.loc[ticker, 'diffs'] = df.loc[ticker, 'value'].diff().values
For the input
date ticker value
0 63 C 1.65
1 88 C -1.93
2 22 C -1.29
3 76 A -0.79
4 72 B -1.24
5 34 A -0.23
6 92 B 2.43
7 22 A 0.55
8 32 A -2.50
9 59 B -1.01
this will produce:
value diffs
ticker date
A 22 0.55 NaN
32 -2.50 -3.05
34 -0.23 2.27
76 -0.79 -0.56
B 59 -1.01 NaN
72 -1.24 -0.23
92 2.43 3.67
C 22 -1.29 NaN
63 1.65 2.94
88 -1.93 -3.58
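For comparison, the groupby approach from the first section above collapses all of this into a two-line sketch (NaN appears at the start of each ticker automatically, with no manual border correction):
df = df.sort_values(['ticker', 'date'])
df['diffs'] = df.groupby('ticker')['value'].diff()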
Pandas DataFrame Groupby two columns and get counts
Following @Andy's answer, you can do the following to solve your second question:
In [56]: df.groupby(['col5','col2']).size().reset_index().groupby('col2')[[0]].max()
Out[56]:
0
col2
A 3
B 2
C 1
D 3
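A slightly more readable variant of the same idea (a sketch, not from the original answer) names the size column instead of relying on the 0 label:
counts = df.groupby(['col5', 'col2']).size().reset_index(name='count')
counts.groupby('col2')['count'].max()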
Groupby Pandas, calculate multiple columns based on date difference
You could do something like this:
def days_diff(sdf):
result = pd.DataFrame(
{"days_diff": pd.NaT, "A": None}, index=sdf.index
)
start = sdf.at[sdf.index[0], "Date"]
for index, day, next_MID_is_na in zip(
sdf.index[1:], sdf.Date[1:], sdf.MID.shift(1).isna()[1:]
):
diff = (day - start).days
if diff <= 30 and next_MID_is_na:
result.at[index, "days_diff"] = diff
else:
start = day
result.A = result.days_diff.isna().cumsum()
return result
df[["days_diff", "A"]] = df[["CID", "Date", "MID"]].groupby("CID").apply(days_diff)
df["B"] = df.RefID.where(df.A != df.A.shift(1)).ffill()
Result for the df created by
from io import StringIO
data = StringIO(
'''
CID RefID Date Group MID
100 1 1/01/2021 A
100 2 3/01/2021 A
100 3 4/01/2021 A 101
100 4 15/01/2021 A
100 5 18/01/2021 A
200 6 3/03/2021 B
200 7 4/04/2021 B
200 8 9/04/2021 B 102
200 9 25/04/2021 B
300 10 26/04/2021 C
300 11 27/05/2021 C
300 12 28/05/2021 C 103
''')
df = pd.read_csv(data, delim_whitespace=True)
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")
is:
CID RefID Date Group MID days_diff A B
0 100 1 2021-01-01 A NaN NaT 1 1.0
1 100 2 2021-01-03 A NaN 2 1 1.0
2 100 3 2021-01-04 A 101.0 3 1 1.0
3 100 4 2021-01-15 A NaN NaT 2 4.0
4 100 5 2021-01-18 A NaN 3 2 4.0
5 200 6 2021-03-03 B NaN NaT 1 6.0
6 200 7 2021-04-04 B NaN NaT 2 7.0
7 200 8 2021-04-09 B 102.0 5 2 7.0
8 200 9 2021-04-25 B NaN NaT 3 9.0
9 300 10 2021-04-26 C NaN NaT 1 10.0
10 300 11 2021-05-27 C NaN NaT 2 11.0
11 300 12 2021-05-28 C 103.0 1 2 11.0
A few explanations:
- The function days_diff produces a dataframe with the two columns days_diff and A. It is applied to the sub-dataframes of df grouped by the column CID.
- First step: initializing the result dataframe result (column days_diff filled with NaT, column A with None), and setting the starting value start for the day differences to the first day in the group.
- Afterwards, essentially looping over the sub-dataframe after the first index, thereby grabbing the index, the value in the Date column, and a boolean value next_MID_is_na that signifies whether the MID value of the preceding row is NaN (via .shift(1).isna()).
- In every step of the loop:
  - Calculation of the difference between the current day and the start day.
  - Checking the rules for the days_diff column:
    - If the difference between the current and start day is <= 30 days and the shifted MID value is NaN -> store the day difference.
    - Otherwise -> reset start to the current day.
- After finishing the days_diff column, calculation of column A: result.days_diff.isna() is True (== 1) when days_diff is NaN and False (== 0) otherwise, so the cumulative sum (.cumsum()) gives the required result.
- After the groupby-apply that produces the columns days_diff and A, finally the calculation of column B: selecting the RefID values where the values in A change (via .where(df.A != df.A.shift(1))), then forward filling the remaining NaNs.
Apply multiple functions to multiple groupby columns
The second half of the currently accepted answer is outdated and has two deprecations. First and most important, you can no longer pass a dictionary of dictionaries to the groupby agg method. Second, never use .ix.
If you desire to work with two separate columns at the same time, I would suggest using the apply method, which implicitly passes a DataFrame to the applied function. Let's use a dataframe similar to the one from above:
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['group'] = [0, 0, 1, 1]
df
a b c d group
0 0.418500 0.030955 0.874869 0.145641 0
1 0.446069 0.901153 0.095052 0.487040 0
2 0.843026 0.936169 0.926090 0.041722 1
3 0.635846 0.439175 0.828787 0.714123 1
A dictionary mapped from column names to aggregation functions is still a perfectly good way to perform an aggregation.
df.groupby('group').agg({'a':['sum', 'max'],
'b':'mean',
'c':'sum',
'd': lambda x: x.max() - x.min()})
a b c d
sum max mean sum <lambda>
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
If you don't like that ugly lambda column name, you can use a normal function and supply a custom name to the special __name__
attribute like this:
def max_min(x):
return x.max() - x.min()
max_min.__name__ = 'Max minus Min'
df.groupby('group').agg({'a':['sum', 'max'],
'b':'mean',
'c':'sum',
'd': max_min})
a b c d
sum max mean sum Max minus Min
group
0 0.864569 0.446069 0.466054 0.969921 0.341399
1 1.478872 0.843026 0.687672 1.754877 0.672401
Using apply
and returning a Series
Now, if you had multiple columns that needed to interact together then you cannot use agg
, which implicitly passes a Series to the aggregating function. When using apply
the entire group as a DataFrame gets passed into the function.
I recommend making a single custom function that returns a Series of all the aggregations. Use the Series index as labels for the new columns:
def f(x):
d = {}
d['a_sum'] = x['a'].sum()
d['a_max'] = x['a'].max()
d['b_mean'] = x['b'].mean()
d['c_d_prodsum'] = (x['c'] * x['d']).sum()
return pd.Series(d, index=['a_sum', 'a_max', 'b_mean', 'c_d_prodsum'])
df.groupby('group').apply(f)
a_sum a_max b_mean c_d_prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
If you are in love with MultiIndexes, you can still return a Series with one like this:
def f_mi(x):
d = []
d.append(x['a'].sum())
d.append(x['a'].max())
d.append(x['b'].mean())
d.append((x['c'] * x['d']).sum())
return pd.Series(d, index=[['a', 'a', 'b', 'c_d'],
['sum', 'max', 'mean', 'prodsum']])
df.groupby('group').apply(f_mi)
a b c_d
sum max mean prodsum
group
0 0.864569 0.446069 0.466054 0.173711
1 1.478872 0.843026 0.687672 0.630494
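Worth noting (an addition, not part of the original answer): since pandas 0.25, named aggregation produces flat, custom-named columns for the single-column cases without the __name__ trick:
df.groupby('group').agg(
    a_sum=('a', 'sum'),
    a_max=('a', 'max'),
    b_mean=('b', 'mean'),
    d_max_minus_min=('d', lambda x: x.max() - x.min()),
)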
Groupby a column and then compare two other columns and return a value in a different column
This solution compares all values within each group, not only the previous and next rows.
You can use a custom lambda function in GroupBy.apply to build a boolean mask; to avoid a MultiIndex, group_keys=False is passed to DataFrame.groupby; finally, the new values are set with numpy.where:
mask = md.groupby('COMB', group_keys=False).apply(lambda x: x['FROM'].isin(x['TO']))
md = md.assign(Type=np.where(mask,"R","O"))
print (md)
COMB FROM TO Type
0 PNR1 MAA BLR R
1 PNR1 BLR MAA R
2 PNR11 DEL MAA O
3 PNR2 TRV HYD R
4 PNR2 HYD TRV R
This solution compares previous and next rows per group.
Another idea is to use DataFrameGroupBy.shift, which should be faster than groupby.apply:
mask = (md.groupby('COMB')['FROM'].shift().eq(md['TO']) |
md.groupby('COMB')['TO'].shift(-1).eq(md['FROM']))
md = md.assign(Type=np.where(mask,"R","O"))
print (md)
COMB FROM TO Type
0 PNR1 MAA BLR R
1 PNR1 BLR MAA R
2 PNR11 DEL MAA O
3 PNR2 TRV HYD R
4 PNR2 HYD TRV R
Pandas - dataframe groupby - how to get sum of multiple columns
By using apply:
df.groupby(['col1', 'col2'])[['col3', 'col4']].apply(lambda x: x.astype(int).sum())
Out[1257]:
col3 col4
col1 col2
a c 2 4
d 1 2
b d 1 2
e 2 4
If you want to use agg:
df.groupby(['col1', 'col2']).agg({'col3':'sum','col4':'sum'})
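If col3 and col4 are already numeric, the astype/lambda detour is unnecessary and a plain sum gives the same result (a sketch under that assumption):
df.groupby(['col1', 'col2'])[['col3', 'col4']].sum()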
Group by using 2 columns and if the time difference between the rows of third column is less than 2 seconds python pandas
In the sample data, the difference in the last group is 5 seconds (13:27:59 - 13:27:54 = 5 seconds).
The solution is to add DURATION in seconds to a new column add, compute per-group differences with DataFrameGroupBy.diff, compare them for greater than N, take the cumulative sum of that boolean mask to form a new group column, and finally aggregate with first and sum:
N = 5
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
f = lambda x: x.diff().dt.total_seconds().gt(N).cumsum()
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].transform(f)
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION add g
0 190 390 2020-04-20 12:44:36 323 2020-04-20 12:49:59 0
1 282 811 2020-04-06 11:12:24 25 2020-04-06 11:12:49 0
2 495 414 2020-04-20 11:22:13 11 2020-04-20 11:22:24 0
3 827 158 2020-04-30 13:27:32 22 2020-04-30 13:27:54 0
4 827 158 2020-04-30 13:27:44 15 2020-04-30 13:27:59 0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
.agg({'DATE_TIME':'first', 'DURATION':'sum'})
.droplevel(-1)
.reset_index())
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION
0 190 390 2020-04-20 12:44:36 323
1 282 811 2020-04-06 11:12:24 25
2 495 414 2020-04-20 11:22:13 11
3 827 158 2020-04-30 13:27:32 37
If you need to compare DATE_TIME against add shifted per group, the solution (with new data) is:
N = 2
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION add \
0 190 390 2020-04-20 12:44:36 323 2020-04-20 12:49:59
1 282 811 2020-04-06 11:12:24 25 2020-04-06 11:12:49
2 495 414 2020-04-20 11:22:13 11 2020-04-20 11:22:24
3 827 158 2020-04-30 13:27:22 22 2020-04-30 13:27:44
4 827 158 2020-04-30 13:27:44 15 2020-04-30 13:27:59
diff g
0 False 0
1 False 0
2 False 0
3 False 0
4 False 0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
.agg({'DATE_TIME':'first', 'DURATION':'sum'})
.droplevel(-1)
.reset_index())
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION
0 190 390 2020-04-20 12:44:36 323
1 282 811 2020-04-06 11:12:24 25
2 495 414 2020-04-20 11:22:13 11
3 827 158 2020-04-30 13:27:22 37
Tested with a third sample:
N = 2
dataframe['DATE_TIME'] = pd.to_datetime(dataframe['DATE_TIME'])
dataframe['add'] = dataframe['DATE_TIME'] + pd.to_timedelta(dataframe['DURATION'],unit='s')
dataframe['diff'] = dataframe['DATE_TIME'].sub(dataframe.groupby(["A_PERSON", "B_PERSON"])['add'].shift()).dt.total_seconds().gt(N)
dataframe['g'] = dataframe.groupby(["A_PERSON", "B_PERSON"])['diff'].cumsum()
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION add \
0 441785807190 4299330390 2020-04-20 12:44:36 323 2020-04-20 12:49:59
1 96897940827 3139578158 2020-04-30 13:27:00 33 2020-04-30 13:27:33
2 441785808282 4238900811 2020-04-06 11:12:24 25 2020-04-06 11:12:49
3 14244012495 3104405414 2020-04-20 11:22:13 11 2020-04-20 11:22:24
4 96897940827 3139578158 2020-04-30 13:27:34 16 2020-04-30 13:27:50
diff g
0 False 0
1 False 0
2 False 0
3 False 0
4 False 0
dataframe = (dataframe.groupby(["A_PERSON", "B_PERSON", 'g'])
.agg({'DATE_TIME':'first', 'DURATION':'sum'})
.droplevel(-1)
.reset_index())
print (dataframe)
A_PERSON B_PERSON DATE_TIME DURATION
0 14244012495 3104405414 2020-04-20 11:22:13 11
1 96897940827 3139578158 2020-04-30 13:27:00 49
2 441785807190 4299330390 2020-04-20 12:44:36 323
3 441785808282 4238900811 2020-04-06 11:12:24 25