Pandas Conditional Rolling Count
Actually, your code to set up xmast and Lxmast can be much simplified if you use the solution with the highest upvotes in the referenced question. Renaming your dataframe cowmast to df, you can set up xmast as follows:
df['xmast'] = df.groupby((df['Cow'] != df['Cow'].shift(1)).cumsum()).cumcount()+1
Similarly, to set up Lxmast, you can use:
df['Lxmast'] = (df.groupby([(df['Cow'] != df['Cow'].shift(1)).cumsum(),
(df['Lact'] != df['Lact'].shift()).cumsum()])
.cumcount()+1
)
Data Input
import pandas as pd

l1 = ["1", "1", "1", "2", "2", "2", "2", "2"]
l2 = [1, 2, 2, 2, 2, 2, 2, 3]
l3 = [45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns = ['Cow', 'Lact', 'DIM']
df = cowmast
Output
print(df)
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
Now, continue with the last part of your requirement highlighted in bold below:

What I would like to do is restart the count for each cow (Cow) lactation (Lact) and only increment the count when the number of days (DIM) between rows is more than 7.

We can do it as follows. To make the code more readable, let's define 2 grouping sequences for the code we have so far:
m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()
Then, we can rewrite the code to set up Lxmast in a more readable format, as follows:
df['Lxmast'] = df.groupby([m_Cow, m_Lact]).cumcount()+1
Now, turn to the main work here. Let's say we create another new column Adjusted for it:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().abs().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Result:
print(df)
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
Here, after df.groupby([m_Cow, m_Lact]), we take the column DIM and check each row's difference from the previous row with .diff(), take the absolute value with .abs(), then check whether it is greater than 7 with .gt(7), giving the code fragment ['DIM'].diff().abs().gt(7). We then group by the same keys again with .groupby([m_Cow, m_Lact]), since this third condition applies within the grouping of the first two conditions. As the final step, we apply .cumsum() to this boolean series, so that the count is incremented only where the third condition is True.
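To see these intermediate steps in action, here is a self-contained sketch that rebuilds the sample data above and exposes the boolean flag series before it is cumulatively summed:

```python
import pandas as pd

df = pd.DataFrame({'Cow':  ['1', '1', '1', '2', '2', '2', '2', '2'],
                   'Lact': [1, 2, 2, 2, 2, 2, 2, 3],
                   'DIM':  [45, 25, 28, 70, 95, 98, 120, 80]})

m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()

# Boolean flag: True where the within-group gap in DIM exceeds 7 days
flag = df.groupby([m_Cow, m_Lact])['DIM'].diff().abs().gt(7)
print(flag.tolist())
# [False, False, False, False, True, False, True, False]

# Cumulative sum of the flags within each group, plus 1
df['Adjusted'] = flag.groupby([m_Cow, m_Lact]).cumsum() + 1
print(df['Adjusted'].tolist())
# [1, 1, 1, 1, 2, 2, 3, 1]
```

The two True flags (rows 4 and 6, where DIM jumps 70→95 and 98→120) are exactly where the Adjusted count steps up within the (Cow=2, Lact=2) group.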
In case you want to increment the count only when DIM has increased by more than 7 (e.g. from 70 to 78), and exclude the case where it decreased by more than 7 (e.g. from 78 to 70), you can remove the .abs() part from the code above:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Edit (Possible simplification depending on your data sequence)
As your sample data already have the main grouping keys Cow and Lact in sorted sequence, there is an opportunity to simplify the code further. This differs from the sample data in the referenced question, where:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
Here, the B in the last row is separated from the other B's, and the count must be reset to 1 rather than continuing from the previous B's last count of 2 (to become 3). Hence, the grouping needs to compare the current row with the previous row to form the correct groups. Otherwise, when we use a plain .groupby() and all the B values are grouped together during processing, the count may not be correctly reset to 1 for the last entry.
If your data for the main grouping keys Cow and Lact are already naturally sorted during data construction, or have been sorted with an instruction such as:
df = df.sort_values(['Cow', 'Lact'])
then we can simplify the code as follows (when the data are already sorted by [Cow, Lact]):
df['xmast'] = df.groupby('Cow').cumcount()+1
df['Lxmast'] = df.groupby(['Cow', 'Lact']).cumcount()+1
df['Adjusted'] = (df.groupby(['Cow', 'Lact'])
['DIM'].diff().abs().gt(7)
.groupby([df['Cow'], df['Lact']])
.cumsum()+1
)
Same result and output values in the 3 columns xmast, Lxmast and Adjusted.
Pandas conditional rolling counter between 2 columns of boolean values
You can try:
df['count'] = (df.groupby(df.A.cumsum())['B'].cumsum() + df.groupby(df.B.cumsum())['A'].cumsum())
OUTPUT:
A B count
0 False False 0
1 True False 1
2 False False 1
3 True False 2
4 False True 1
5 False False 1
6 True False 1
7 False True 1
8 False True 2
9 False False 2
10 False True 3
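For completeness, a self-contained sketch reconstructing the sample input from the output table above: each cumulative sum of A (or B) defines blocks, and the other column's True values are counted within those blocks:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [False, True, False, True, False, False, True, False, False, False, False],
    'B': [False, False, False, False, True, False, False, True, True, False, True],
})

# First term: running count of True B's within blocks keyed by A.cumsum();
# second term: running count of True A's within blocks keyed by B.cumsum()
df['count'] = (df.groupby(df.A.cumsum())['B'].cumsum()
               + df.groupby(df.B.cumsum())['A'].cumsum())
print(df['count'].tolist())
# [0, 1, 1, 2, 1, 1, 1, 1, 2, 2, 3]
```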
Pandas: conditional rolling count by group, counting the number of times current observation appeared in another column
I have a solution below but I am looking for a better one as column 'B' could potentially have many different observations making it quite slow.
for i in df['B'].unique():
    df.loc[df['B'] == i, 'count'] = (df.where(df['B'].eq(i))
                                       .groupby(df['group'])['B']
                                       .transform(lambda x: x.rolling(3, min_periods=1)
                                                             .count()
                                                             .shift(fill_value=0)))
df
B group count
0 X IT 0.0
1 X IT 1.0
2 Y IT 0.0
3 X MV 0.0
4 Y MV 0.0
5 Y MV 1.0
6 X IT 2.0
7 X MV 1.0
8 Y MV 2.0
9 Y IT 1.0
10 X IT 1.0
11 Y MV 2.0
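One possible vectorized alternative (a sketch, verified only against the sample above; whether it is actually faster depends on the number of distinct B values and groups): one-hot encode B once, compute the rolling counts for every value simultaneously per group, then pick, for each row, the column matching that row's own B value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['X', 'X', 'Y', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'Y'],
                   'group': ['IT', 'IT', 'IT', 'MV', 'MV', 'MV',
                             'IT', 'MV', 'MV', 'IT', 'IT', 'MV']})

# One-hot encode B; cast to int so the rolling sum works across pandas versions
d = pd.get_dummies(df['B']).astype(int)

# Per group and per value: count in the previous 3 rows (shift excludes the current row)
roll = d.groupby(df['group']).transform(
    lambda s: s.rolling(3, min_periods=1).sum().shift(fill_value=0))

# For each row, pick the column that matches its own B value
df['count'] = roll.to_numpy()[np.arange(len(df)), d.columns.get_indexer(df['B'])]
print(df['count'].tolist())
# [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0, 2.0]
```

This mirrors the loop's rolling(3)-then-shift logic but replaces the per-value loop with a single grouped transform over all the one-hot columns.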
Conditional running count in pandas based on conditions in 2 columns (counting number of people in a queue based on timestamps)
import pandas as pd
import time
df = pd.DataFrame([[23239,'1/1/2020 0:00','1/1/2020 0:40'],[51042,'1/1/2020 0:11','1/1/2020 0:42'],
[73373,'1/1/2020 0:15','1/1/2020 0:56'],[14222,'1/1/2020 0:22','1/1/2020 1:00'],
[27116,'1/1/2020 0:55','1/1/2020 1:15']],columns = ['ID','BOOKING_TIME','ENTRY_TIME'])
df = df.sort_values(by='ENTRY_TIME')
# Copy 1000 times
df = pd.concat([df for i in range(1000)]).reset_index(drop=True)  # reset index so positions match index values
df['BOOKING_TIME'] = pd.to_datetime(df['BOOKING_TIME'], format='%d/%m/%Y %H:%M')
df['ENTRY_TIME'] = pd.to_datetime(df['ENTRY_TIME'], format='%d/%m/%Y %H:%M')
# improvement code
start =time.time()
df['IN_QUEUE'] = df.index.map(lambda index_value: (df['ENTRY_TIME'].values[:index_value+1] > df['BOOKING_TIME'].values[index_value]).sum())
end = time.time()
print('Running time: %s Seconds'%(end-start))
# Running time: 0.04886770248413086 Seconds
NumPy is the foundation of pandas; using NumPy directly will be much faster. Does this meet your requirements?
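As a sketch of the same idea without the index.map lambda, the whole comparison can be expressed as one NumPy broadcast over the original five rows (O(n²) memory, so only suitable for moderate sizes):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [23239, 51042, 73373, 14222, 27116],
                   'BOOKING_TIME': ['1/1/2020 0:00', '1/1/2020 0:11', '1/1/2020 0:15',
                                    '1/1/2020 0:22', '1/1/2020 0:55'],
                   'ENTRY_TIME': ['1/1/2020 0:40', '1/1/2020 0:42', '1/1/2020 0:56',
                                  '1/1/2020 1:00', '1/1/2020 1:15']})
df['BOOKING_TIME'] = pd.to_datetime(df['BOOKING_TIME'], format='%d/%m/%Y %H:%M')
df['ENTRY_TIME'] = pd.to_datetime(df['ENTRY_TIME'], format='%d/%m/%Y %H:%M')

entry = df['ENTRY_TIME'].to_numpy()
booking = df['BOOKING_TIME'].to_numpy()
n = len(df)

# Row i counts rows j <= i whose ENTRY_TIME is after row i's BOOKING_TIME
mask = np.tril(np.ones((n, n), dtype=bool))   # keep only j <= i
df['IN_QUEUE'] = ((entry[None, :] > booking[:, None]) & mask).sum(axis=1)
print(df['IN_QUEUE'].tolist())
# [1, 2, 3, 4, 3]
```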
Conditional Running Count Pandas
Seems you need to group by ID, then use cumsum to count the occurrences of B:
cond = df.before == 'B'
df['time_on_b'] = cond.groupby(df.ID).cumsum().where(cond, 0).astype(int)
df
# ID before after time_on_b
#0 1 A A 0
#1 1 B B 1
#2 1 B B 2
#3 2 A A 0
#4 2 A A 0
#5 3 B B 1
#6 4 A A 0
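Reconstructing the sample from the commented output above, a runnable sketch: the cumulative sum runs within each ID, and .where(cond, 0) zeroes the count on rows where before is not 'B':

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 3, 4],
                   'before': ['A', 'B', 'B', 'A', 'A', 'B', 'A'],
                   'after':  ['A', 'B', 'B', 'A', 'A', 'B', 'A']})

cond = df.before == 'B'
df['time_on_b'] = cond.groupby(df.ID).cumsum().where(cond, 0).astype(int)
print(df['time_on_b'].tolist())
# [0, 1, 2, 0, 0, 1, 0]
```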
Pandas: using too much memory with conditional rolling count
I developed another solution to your question, based on group-by and one-hot encoding (get_dummies).
Here's the code:
df = pd.DataFrame({'B': ['X', 'X' , 'Y', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'Y'],
'group': ["IT", "IT", "IT", "MV", "MV", "MV", "IT", "MV", "MV", "IT", "IT", "MV"]})
# add a one-hot encoding to the dataframe.
t = pd.concat([df, pd.get_dummies(df.B)], axis=1)
t.index.name = "inx"
# do a rolling sum of 4. It's the past 3, plus 1.
t = t.groupby("group").rolling(4, min_periods = 1).sum()
t = t.reset_index().set_index("inx").sort_index()
# remove the extra '1' from the rolling result.
t.loc[:, ["X", "Y"]] = t.loc[:, ["X", "Y"]] - 1
# merge back the results with the original dataframe.
t = pd.concat([df, t[["X", "Y"]]], axis=1)
# create a 'count' column which is based on the values of 'B'.
t["count"] = t.lookup(t.index, t.B)  # note: DataFrame.lookup was removed in pandas 2.0
The output is:
B group X Y count
inx
0 X IT 0.0 -1.0 0.0
1 X IT 1.0 -1.0 1.0
2 Y IT 1.0 0.0 0.0
3 X MV 0.0 -1.0 0.0
4 Y MV 0.0 0.0 0.0
5 Y MV 0.0 1.0 1.0
6 X IT 2.0 0.0 2.0
7 X MV 1.0 1.0 1.0
8 Y MV 0.0 2.0 2.0
9 Y IT 1.0 1.0 1.0
10 X IT 1.0 1.0 1.0
11 Y MV 0.0 2.0 2.0
All in one line:
df['count'] = (pd.concat([df, df['B'].str.get_dummies()], axis=1)
.groupby('group')
.rolling(4, min_periods=1)
.sum()
.sort_index(level=1)
.reset_index(drop=True)
.lookup(df.index, df['B']) - 1)
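Since DataFrame.lookup was removed in pandas 2.0, here is an assumed-equivalent variant that replaces it with plain NumPy indexing (verified only against the sample output above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': ['X', 'X', 'Y', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y', 'X', 'Y'],
                   'group': ['IT', 'IT', 'IT', 'MV', 'MV', 'MV',
                             'IT', 'MV', 'MV', 'IT', 'IT', 'MV']})

d = df['B'].str.get_dummies()                 # one-hot columns, int dtype
roll = (d.groupby(df['group'])
          .rolling(4, min_periods=1)          # past 3 rows plus the current one
          .sum()
          .sort_index(level=1)                # restore original row order
          .reset_index(drop=True))

# Pick each row's own column and subtract the current row's contribution
df['count'] = roll.to_numpy()[np.arange(len(df)), d.columns.get_indexer(df['B'])] - 1
print(df['count'].tolist())
# [0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0, 2.0]
```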
Python Pandas : Conditional rolling count
You can take the cumulative sum of a Boolean series indicating where your series equals the value:
df['id'] = df['type'].eq('a').cumsum()
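A minimal sketch with assumed sample data (the original question's frame isn't shown here): every 'a' starts a new id block, because only rows equal to 'a' contribute to the cumulative sum:

```python
import pandas as pd

# hypothetical sample data, for illustration only
df = pd.DataFrame({'type': ['a', 'b', 'b', 'a', 'c', 'a']})
df['id'] = df['type'].eq('a').cumsum()
print(df['id'].tolist())
# [1, 1, 1, 2, 2, 3]
```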
Pandas: Conditional Rolling Block Count
Let's try:
s = df.Step.where(df.Step.eq(2))
df['Run_count'] = s.dropna().groupby(s.isna().cumsum()).ngroup()+1
Output:
Time Step Run_count
0 0 0 NaN
1 1 1 NaN
2 2 2 1.0
3 3 2 1.0
4 4 2 1.0
5 5 3 NaN
6 6 0 NaN
7 7 1 NaN
8 8 2 2.0
9 9 2 2.0
10 10 2 2.0
11 11 3 NaN
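A runnable reconstruction of the sample: the mask keeps only the Step == 2 rows, and grouping the kept rows by the cumulative count of masked-out rows gives each contiguous block of 2's its own ngroup number:

```python
import pandas as pd

df = pd.DataFrame({'Time': range(12),
                   'Step': [0, 1, 2, 2, 2, 3, 0, 1, 2, 2, 2, 3]})

s = df.Step.where(df.Step.eq(2))              # NaN everywhere Step != 2
df['Run_count'] = s.dropna().groupby(s.isna().cumsum()).ngroup() + 1
print(df['Run_count'].tolist())
# [nan, nan, 1.0, 1.0, 1.0, nan, nan, nan, 2.0, 2.0, 2.0, nan]
```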