Pandas: Conditional Rolling Count

Pandas Conditional Rolling Count

Actually, your code to set up xmast and Lxmast can be simplified considerably if you use the solution with the highest upvotes in the referenced question.

Renaming your dataframe cowmast to df, you can set up xmast as follows:

df['xmast'] = df.groupby((df['Cow'] != df['Cow'].shift(1)).cumsum()).cumcount()+1
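
To see why this works, here is what the grouping key evaluates to on the sample data shown under Data Input below (just an illustration; m is a throwaway name):

m = (df['Cow'] != df['Cow'].shift(1)).cumsum()
# m is 1 1 1 2 2 2 2 2 for the sample Cow column 1 1 1 2 2 2 2 2:
# it starts a new run id each time Cow changes, so cumcount()+1 restarts for every consecutive run of the same cow.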

Similarly, to set up Lxmast, you can use:

df['Lxmast'] = (df.groupby([(df['Cow'] != df['Cow'].shift(1)).cumsum(),
                            (df['Lact'] != df['Lact'].shift()).cumsum()])
                  .cumcount() + 1
               )

Data Input

l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))

cowmast.columns =['Cow', 'Lact', 'DIM']

df = cowmast

Output

print(df)

   Cow  Lact  DIM  xmast  Lxmast
0    1     1   45      1       1
1    1     2   25      2       1
2    1     2   28      3       2
3    2     2   70      1       1
4    2     2   95      2       2
5    2     2   98      3       3
6    2     2  120      4       4
7    2     3   80      5       1

Now, let's continue with the last part of your requirement, highlighted in bold below:

What I would like to do is restart the count for each cow (cow)
lactation (Lact) and only increment the count when the number of days
(DIM) between rows is more than 7.

We can do it as follows.

To make the code more readable, let's first define the two grouping sequences used so far:

m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()

Then we can rewrite the code that sets up Lxmast in a more readable form:

df['Lxmast'] = df.groupby([m_Cow, m_Lact]).cumcount()+1

Now let's turn to the main work here. Say we create a new column Adjusted for it:

df['Adjusted'] = (df.groupby([m_Cow, m_Lact])['DIM']
                    .diff().abs().gt(7)
                    .groupby([m_Cow, m_Lact])
                    .cumsum() + 1
                 )

Result:

print(df)

   Cow  Lact  DIM  xmast  Lxmast  Adjusted
0    1     1   45      1       1         1
1    1     2   25      2       1         1
2    1     2   28      3       2         1
3    2     2   70      1       1         1
4    2     2   95      2       2         2
5    2     2   98      3       3         2
6    2     2  120      4       4         3
7    2     3   80      5       1         1

Here, after df.groupby([m_Cow, m_Lact]), we take the column DIM, compute each row's difference from the previous row with .diff(), take the absolute value with .abs(), and check whether it is greater than 7 with .gt(7); that is the fragment ['DIM'].diff().abs().gt(7). We then group by the same keys again with .groupby([m_Cow, m_Lact]), since this third condition applies within the grouping of the first two. Finally, we apply .cumsum() to the Boolean result (and add 1), so the count increments only on rows where the third condition is True.
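
To make the intermediate step concrete, here is what the Boolean mask looks like on the sample data (an illustration only; mask is a throwaway name):

mask = df.groupby([m_Cow, m_Lact])['DIM'].diff().abs().gt(7)
# mask by row: False False False False True False True False
# the group-wise cumsum() + 1 then turns this into 1 1 1 1 2 2 3 1,
# which is exactly the Adjusted column shown above.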

In case you want to increment the count only when DIM has increased by more than 7 (e.g. from 70 to 78), and not when it has decreased by more than 7 (e.g. from 78 to 70), just remove the .abs() part from the code above:

df['Adjusted'] = (df.groupby([m_Cow, m_Lact])['DIM']
                    .diff().gt(7)
                    .groupby([m_Cow, m_Lact])
                    .cumsum() + 1
                 )

Edit (Possible simplification depending on your data sequence)

As your sample data already has the main grouping keys Cow and Lact in sorted order, there is an opportunity to simplify the code further.

This is different from the sample data in the referenced question, where:

  col  count
0   B      1
1   B      2
2   A      1   # Value does not match previous row => reset counter to 1
3   A      2
4   A      3
5   B      1   # Value does not match previous row => reset counter to 1

Here, the B in the last row is separated from the other B's, and its count must be reset to 1 rather than continuing from the previous B's last count of 2 (which would give 3). Hence, the grouping needs to compare each row with the previous row to build the correct groups. Otherwise, if we simply use .groupby() and all the B values are lumped into one group during processing, the count would not be reset to 1 for the last entry.

If your data is already naturally sorted on the main grouping keys Cow and Lact when constructed, or has been sorted with an instruction such as:

df = df.sort_values(['Cow', 'Lact'])

Then we can simplify the code as follows (when the data is already sorted by [Cow, Lact]):

df['xmast'] = df.groupby('Cow').cumcount()+1
df['Lxmast'] = df.groupby(['Cow', 'Lact']).cumcount()+1

df['Adjusted'] = (df.groupby(['Cow', 'Lact'])['DIM']
                    .diff().abs().gt(7)
                    .groupby([df['Cow'], df['Lact']])
                    .cumsum() + 1
                 )

This gives the same result and output values in the three columns xmast, Lxmast and Adjusted.

Pandas conditional rolling counter between 2 columns of boolean values

You can try:

df['count'] = (df.groupby(df.A.cumsum())['B'].cumsum() + df.groupby(df.B.cumsum())['A'].cumsum())

OUTPUT:

        A      B  count
0   False  False      0
1    True  False      1
2   False  False      1
3    True  False      2
4   False   True      1
5   False  False      1
6    True  False      1
7   False   True      1
8   False   True      2
9   False  False      2
10  False   True      3
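
For reference, a minimal frame that reproduces the output above can be built like this (a sketch; the A/B values are read off the table, and your real df presumably already has these Boolean columns):

import pandas as pd

df = pd.DataFrame({
    'A': [False, True, False, True, False, False, True, False, False, False, False],
    'B': [False, False, False, False, True, False, False, True, True, False, True],
})
df['count'] = (df.groupby(df.A.cumsum())['B'].cumsum()
               + df.groupby(df.B.cumsum())['A'].cumsum())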

Pandas: conditional rolling count by group, counting the number of times current observation appeared in another column

I have a solution below, but I am looking for a better one, as column 'B' could potentially have many different observations, making it quite slow.

for i in df['B'].unique():
    df.loc[df['B'] == i, 'count'] = df.where(df['B'].eq(i)).groupby(df['group'])['B'].transform(
        lambda x: x.rolling(3, min_periods=1).count().shift(fill_value=0))
df

    B group  count
0   X    IT    0.0
1   X    IT    1.0
2   Y    IT    0.0
3   X    MV    0.0
4   Y    MV    0.0
5   Y    MV    1.0
6   X    IT    2.0
7   X    MV    1.0
8   Y    MV    2.0
9   Y    IT    1.0
10  X    IT    1.0
11  Y    MV    2.0

Conditional running count in pandas based on conditions in 2 columns (counting number of people in a queue based on timestamps)

import pandas as pd
import time

df = pd.DataFrame([[23239, '1/1/2020 0:00', '1/1/2020 0:40'],
                   [51042, '1/1/2020 0:11', '1/1/2020 0:42'],
                   [73373, '1/1/2020 0:15', '1/1/2020 0:56'],
                   [14222, '1/1/2020 0:22', '1/1/2020 1:00'],
                   [27116, '1/1/2020 0:55', '1/1/2020 1:15']],
                  columns=['ID', 'BOOKING_TIME', 'ENTRY_TIME'])
df = df.sort_values(by='ENTRY_TIME')
# Copy 1000 times
df = pd.concat([df for i in range(1000)])
df['BOOKING_TIME'] = pd.to_datetime(df['BOOKING_TIME'], format='%d/%m/%Y %H:%M')
df['ENTRY_TIME'] = pd.to_datetime(df['ENTRY_TIME'], format='%d/%m/%Y %H:%M')

# improvement code
start = time.time()
df['IN_QUEUE'] = df.index.map(lambda index_value: (df['ENTRY_TIME'].values[:index_value+1] > df['BOOKING_TIME'].values[index_value]).sum())
end = time.time()
print('Running time: %s Seconds' % (end - start))
# Running time: 0.04886770248413086 Seconds

NumPy is the foundation of pandas, so using NumPy directly will be much faster.
Does this meet your requirements?
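
If you want to remove the remaining Python-level loop as well, here is a sketch of a fully vectorized NumPy variant. It assumes a plain 0..n-1 RangeIndex (so that the label-based slice in the answer coincides with row positions) and it builds an n x n comparison matrix, so it trades memory for speed; IN_QUEUE_vec is just an illustrative column name:

import numpy as np

entry = df['ENTRY_TIME'].to_numpy()
booking = df['BOOKING_TIME'].to_numpy()
pos = np.arange(len(df))
# element [i, j] is True when row j entered after row i booked and j comes no later than i
in_view = (entry[None, :] > booking[:, None]) & (pos[None, :] <= pos[:, None])
df['IN_QUEUE_vec'] = in_view.sum(axis=1)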

Conditional Running Count Pandas

It seems you need to group by ID, then use cumsum to count the occurrences of B:

cond = df.before == 'B'
df['time_on_b'] = cond.groupby(df.ID).cumsum().where(cond, 0).astype(int)
df
#    ID before after  time_on_b
# 0   1      A     A          0
# 1   1      B     B          1
# 2   1      B     B          2
# 3   2      A     A          0
# 4   2      A     A          0
# 5   3      B     B          1
# 6   4      A     A          0
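
To see what each step contributes (an illustration; in the sample above the where() call happens to be a no-op because no B row is followed by an A row within the same ID):

cond = df.before == 'B'
running = cond.groupby(df.ID).cumsum()   # per-ID running count of B rows: 0 1 2 0 0 1 0
# .where(cond, 0) then zeroes the count on non-B rows; e.g. for a hypothetical ID with
# before = ['B', 'A'] the running count would be [1, 1], but time_on_b should be [1, 0].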

Pandas: using too much memory with conditional rolling count

I developed another solution to your question, based on groupby and one-hot encoding (get_dummies).

Here's the code:

df = pd.DataFrame({'B': ['X', 'X', 'Y', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'Y'],
                   'group': ["IT", "IT", "IT", "MV", "MV", "MV", "IT", "MV", "MV", "IT", "IT", "MV"]})

# add a one-hot encoding to the dataframe.
t = pd.concat([df, pd.get_dummies(df.B)], axis=1)

t.index.name = "inx"

# do a rolling sum of 4. It's the past 3, plus 1.
t = t.groupby("group").rolling(4, min_periods = 1).sum()
t = t.reset_index().set_index("inx").sort_index()

# remove the extra '1' from the rolling result.
t.loc[:, ["X", "Y"]] = t.loc[:, ["X", "Y"]] - 1

# merge back the results with the original dataframe.
t = pd.concat([df, t[["X", "Y"]]], axis=1)

# create a 'count' column which is based on the values of 'B'.
t["count"] = t.lookup(t.index, t.B )

The output is:

      B group    X    Y  count
inx
0     X    IT  0.0 -1.0    0.0
1     X    IT  1.0 -1.0    1.0
2     Y    IT  1.0  0.0    0.0
3     X    MV  0.0 -1.0    0.0
4     Y    MV  0.0  0.0    0.0
5     Y    MV  0.0  1.0    1.0
6     X    IT  2.0  0.0    2.0
7     X    MV  1.0  1.0    1.0
8     Y    MV  0.0  2.0    2.0
9     Y    IT  1.0  1.0    1.0
10    X    IT  1.0  1.0    1.0
11    Y    MV  0.0  2.0    2.0

All in one line:

df['count'] = (pd.concat([df, df['B'].str.get_dummies()], axis=1)
                 .groupby('group')
                 .rolling(4, min_periods=1)
                 .sum()
                 .sort_index(level=1)
                 .reset_index(drop=True)
                 .lookup(df.index, df['B']) - 1)
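
Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on recent versions the .lookup(...) step needs a replacement. A rough equivalent for the step-by-step version above, following the factorize/reindex pattern suggested in the pandas deprecation notes, would be:

import numpy as np

idx, cols = pd.factorize(t['B'])
t['count'] = t.reindex(cols, axis=1).to_numpy()[np.arange(len(t)), idx]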

Python Pandas : Conditional rolling count

You can take the cumulative sum of a Boolean series indicating where your series equals a value:

df['id'] = df['type'].eq('a').cumsum()
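
For instance, on a small hypothetical frame, each 'a' starts a new id:

df = pd.DataFrame({'type': ['a', 'b', 'b', 'a', 'a', 'c']})
df['id'] = df['type'].eq('a').cumsum()
# type: a b b a a c  ->  id: 1 1 1 2 3 3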

Pandas: Conditional Rolling Block Count

Let's try:

s = df.Step.where(df.Step.eq(2))
df['Run_count'] = s.dropna().groupby(s.isna().cumsum()).ngroup()+1

Output:

    Time  Step  Run_count
0      0     0        NaN
1      1     1        NaN
2      2     2        1.0
3      3     2        1.0
4      4     2        1.0
5      5     3        NaN
6      6     0        NaN
7      7     1        NaN
8      8     2        2.0
9      9     2        2.0
10    10     2        2.0
11    11     3        NaN
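
To unpack the trick (intermediate values computed on the sample above; key is just a throwaway name for the grouping expression):

s = df.Step.where(df.Step.eq(2))        # 2.0 where Step == 2, NaN elsewhere
key = s.isna().cumsum()                 # 1 2 2 2 2 3 4 5 5 5 5 6 on the sample
df['Run_count'] = s.dropna().groupby(key).ngroup() + 1   # numbers each block of 2's: 1, 2, ...
# rows where Step != 2 are absent from s.dropna(), so they remain NaN in Run_count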

