How to do a conditional count after groupby on a Pandas Dataframe?
I think you need to add the condition first:
# this also includes category c, which has no 'one' values
df11=df.groupby('key1')['key2'].apply(lambda x: (x=='one').sum()).reset_index(name='count')
print (df11)
key1 count
0 a 2
1 b 1
2 c 0
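As a self-contained sketch, the same approach on hypothetical data reconstructed to match the counts above:

```python
import pandas as pd

# Hypothetical data shaped to reproduce the counts shown above
df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'key2': ['one', 'one', 'two', 'one', 'two', 'two']})

# Boolean comparison inside the lambda; .sum() counts the True values
df11 = (df.groupby('key1')['key2']
          .apply(lambda x: (x == 'one').sum())
          .reset_index(name='count'))
print(df11)
```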
Or convert key1 to a categorical; then the missing category is added by size:
df['key1'] = df['key1'].astype('category')
df1 = df[df['key2'] == 'one'].groupby(['key1']).size().reset_index(name='count')
print (df1)
key1 count
0 a 2
1 b 1
2 c 0
If need all combinations:
df2 = df.groupby(['key1', 'key2']).size().reset_index(name='count')
print (df2)
key1 key2 count
0 a one 2
1 a two 1
2 b one 1
3 b two 1
4 c two 1
df3 = df.groupby(['key1', 'key2']).size().unstack(fill_value=0)
print (df3)
key2 one two
key1
a 2 1
b 1 1
c 0 1
Conditional counts in pandas group by
You can try changing the .count() in your 2 lines to .sum(), as follows:
d['Zero_Balance_days'] = (x['Balance'] < 0).sum()
d['Over_Credit_days'] = (x['Balance'] > x['Max Credit']).sum()
.count() returns the number of non-NA/null observations in the boolean Series; both True and False are non-null, so all of them are counted.
.sum() returns the number of True entries, since True is interpreted as 1 and False as 0 in the summation.
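A minimal illustration of the difference on a boolean mask:

```python
import pandas as pd

s = pd.Series([-5, 3, -1, 7])
mask = s < 0           # boolean Series: True, False, True, False

print(mask.count())    # counts every non-null entry, True or False alike
print(mask.sum())      # counts only the True entries
```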
Group by and conditionally count
If you want to add it as a column you can do:
DDcomplete %>% group_by(ST) %>% mutate(count = sum(dist.km == 0))
Or if you just want the counts per state:
DDcomplete %>% group_by(ST) %>% summarise(count = sum(dist.km == 0))
Actually, you were very close to the solution. Your code
state= DDcomplete %>%
group_by(ST) %>%
summarize(zero = sum(DDcomplete$dist.km==0, na.rm = TRUE))
is almost correct. You can remove the DDcomplete$
from within the call to sum
because within dplyr chains, you can access variables directly.
Also note that by using summarise, you will condense your data frame to one row per group, with only the grouping column(s) and whatever you computed inside the summarise. If you just want to add a column with the counts, you can use mutate as I did in my answer.
If you're only interested in positive counts, you could also use dplyr's count
function together with filter
to first subset the data:
filter(DDcomplete, dist.km == 0) %>% count(ST)
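For pandas users, the summarise vs. mutate distinction maps to agg vs. transform; a minimal sketch on hypothetical data mirroring DDcomplete's shape:

```python
import pandas as pd

# Hypothetical data: state and distance columns
dd = pd.DataFrame({'ST': ['CA', 'CA', 'NY'],
                   'dist_km': [0.0, 3.2, 0.0]})

# summarise-style: one row per group
per_state = dd.groupby('ST')['dist_km'].agg(count=lambda s: (s == 0).sum())

# mutate-style: the per-group count broadcast back onto every row
dd['count'] = dd.groupby('ST')['dist_km'].transform(lambda s: (s == 0).sum())
print(per_state)
print(dd)
```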
Conditional Count on a field
I think you may be after:
select
jobID, JobName,
sum(case when Priority = 1 then 1 else 0 end) as priority1,
sum(case when Priority = 2 then 1 else 0 end) as priority2,
sum(case when Priority = 3 then 1 else 0 end) as priority3,
sum(case when Priority = 4 then 1 else 0 end) as priority4,
sum(case when Priority = 5 then 1 else 0 end) as priority5
from
Jobs
group by
jobID, JobName
However, I am uncertain whether you need jobID and JobName in your results; if not, remove them and also remove the GROUP BY.
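For comparison, the same per-priority conditional counts can be sketched in pandas with pd.crosstab (the Jobs data below is hypothetical):

```python
import pandas as pd

# Hypothetical Jobs data
jobs = pd.DataFrame({'jobID': [1, 1, 1, 2, 2],
                     'JobName': ['A', 'A', 'A', 'B', 'B'],
                     'Priority': [1, 1, 3, 2, 5]})

# One column per Priority value, one row per (jobID, JobName);
# missing combinations are filled with 0
counts = pd.crosstab([jobs['jobID'], jobs['JobName']], jobs['Priority'])
print(counts)
```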
Conditionally count values in a pandas groupby object
I think you need:
np.random.seed(6)
N = 15
master_lso = pd.DataFrame({'lsoa11': np.random.randint(4, size=N),
'TOTAL_FLOOR_AREA': np.random.choice([0,30,40,50], size=N)})
master_lso['lsoa11'] = 'a' + master_lso['lsoa11'].astype(str)
print (master_lso)
TOTAL_FLOOR_AREA lsoa11
0 40 a2
1 50 a1
2 30 a3
3 0 a0
4 40 a2
5 0 a1
6 30 a3
7 0 a2
8 40 a0
9 0 a2
10 0 a1
11 50 a1
12 50 a3
13 40 a1
14 30 a1
First, filter the rows by the condition using boolean indexing - doing this before grouping is faster, because there are fewer rows.
df = master_lso[master_lso['TOTAL_FLOOR_AREA'] > 30]
print (df)
TOTAL_FLOOR_AREA lsoa11
0 40 a2
1 50 a1
4 40 a2
8 40 a0
11 50 a1
12 50 a3
13 40 a1
Then groupby and aggregate with size:
df1 = df.groupby('lsoa11')['TOTAL_FLOOR_AREA'].size().reset_index(name='Count')
print (df1)
lsoa11 Count
0 a0 1
1 a1 3
2 a2 2
3 a3 1
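An equivalent one-liner is value_counts on the filtered column; a sketch on hypothetical data in the same shape:

```python
import pandas as pd

# Hypothetical data shaped like master_lso
master_lso = pd.DataFrame({'lsoa11': ['a0', 'a1', 'a1', 'a1', 'a2', 'a3'],
                           'TOTAL_FLOOR_AREA': [40, 50, 40, 0, 30, 50]})

# Filter first, then count occurrences per group
df1 = (master_lso.loc[master_lso['TOTAL_FLOOR_AREA'] > 30, 'lsoa11']
                 .value_counts()
                 .sort_index()
                 .rename_axis('lsoa11')
                 .reset_index(name='Count'))
print(df1)
```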
Conditional Counting in Groupby Pandas
Using groupby.agg
with a dictionary of calculations:
from collections import OrderedDict
df.columns=['ticker', 'date', 'accuracy']
groupers = OrderedDict([('mean', np.mean),
('>_0.20_pct', lambda x: (x > 0.20).sum()/len(x)),
('>_0.50_pct', lambda x: (x > 0.50).sum()/len(x)),
('>_0.70_pct', lambda x: (x > 0.70).sum()/len(x))])
res = df.groupby('ticker')['accuracy'].agg(groupers)
print(res)
mean >_0.20_pct >_0.50_pct >_0.70_pct
ticker
AAAP 0.806244 1.000000 0.666667 0.666667
AAL 0.298683 0.666667 0.000000 0.000000
ZAYO 0.164886 0.333333 0.000000 0.000000
ZBH 0.103811 0.000000 0.000000 0.000000
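Note that passing a dict of name-to-function pairs to SeriesGroupBy.agg was deprecated and removed in pandas 1.0; a sketch of the same idea using (name, function) tuples, on hypothetical data:

```python
import pandas as pd

# Hypothetical accuracy data
acc = pd.DataFrame({'ticker': ['AAAP', 'AAAP', 'AAL', 'AAL'],
                    'accuracy': [0.9, 0.7, 0.3, 0.1]})

# (name, func) tuples replace the removed dict-renaming syntax;
# (x > t).mean() is the share of values above the threshold t
res = acc.groupby('ticker')['accuracy'].agg([
    ('mean', 'mean'),
    ('>_0.20_pct', lambda x: (x > 0.20).mean()),
    ('>_0.50_pct', lambda x: (x > 0.50).mean()),
])
print(res)
```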
How to get a conditional count while using GROUP BY in MySQL?
I think you just want conditional aggregation:
SELECT DATE_FORMAT(aua.request_date,'%b') as month ,
YEAR(aua.request_date) as year,
DATE_FORMAT(aua.request_date, '%Y-%m-%d') as date,
COUNT(aua.audit_id) as total_trans ,
SUM(aua.response_status <> 'P') as total_failure,
SUM(aua.response_status = 'P') as total_successful,
aua.device_code as deviceCode
FROM audit_webservice_aua aua
WHERE DATE_FORMAT(aua.request_date, '%Y-%m-%d') between '2020-04-16' and '2020-07-17'
GROUP BY month, year, date, deviceCode ;
I would also advise you to change the WHERE clause to:
WHERE aua.request_date >= '2020-04-16' AND
      aua.request_date < '2020-07-18'
This avoids applying a function to the column, so the comparison can use an index on request_date.
Python pandas - Groupby + Conditional count of cell values
This works:
df['number_of_parcels'] = df.groupby('type').apply(lambda x: x.apply(lambda y:(
(x['departure_time'] >= y['departure_time']) & (x['departure_time'] < y['arrival_time'])
).sum(), axis=1)).droplevel(level=0)
df
Out:
Parcel_id departure_time arrival_time type number_of_parcels
0 id_1 07:00 07:30 TV 4
1 id_2 07:00 07:15 PC 4
2 id_3 07:05 07:22 PC 4
3 id_4 07:10 07:45 TV 4
4 id_5 07:15 07:50 TV 2
5 id_6 07:10 07:26 PC 3
6 id_7 07:40 08:10 TV 1
7 id_8 07:14 07:46 TV 3
8 id_9 07:14 07:32 PC 2
9 id_10 07:15 07:30 PC 1
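The nested apply above makes O(n²) Python-level calls; within a single group, the same count can be sketched with numpy broadcasting (toy data below):

```python
import pandas as pd

# Hypothetical parcels of a single type
df = pd.DataFrame({'departure_time': pd.to_datetime(['07:00', '07:10', '07:15']),
                   'arrival_time':   pd.to_datetime(['07:30', '07:45', '07:50'])})

dep = df['departure_time'].to_numpy()
arr = df['arrival_time'].to_numpy()

# Element (i, j) is True when departure j falls inside interval i
inside = (dep[None, :] >= dep[:, None]) & (dep[None, :] < arr[:, None])
df['number_of_parcels'] = inside.sum(axis=1)
print(df)
```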
Conditionally count rows between two datetime columns by group in R
Here is a data.table approach using shift:
DF <- structure(list(UserId = c("0buhGq", "0buhGq", "0buhGq", "0buhGq",
"0buhGq", "0buhGq", "0buhGq", "0buhGq", "0buhGq", "0ulN53", "0ulN53",
"0ulN53", "0ulN53", "0ulN53", "0ulN53", "0ulN53", "0ulN53", "0wzMkn",
"0wzMkn", "0wzMkn", "2diLwk", "2diLwk", "2diLwk", "2diLwk", "2diLwk",
"2diLwk", "2diLwk", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"3942Wm", "3OUdi4", "4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY",
"4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY", "52LXbG",
"52LXbG", "52LXbG", "52LXbG", "52LXbG", "52LXbG", "52LXbG", "64gfuI",
"64gfuI", "6KpHap", "6KpHap", "6KpHap", "6vHa6q", "72MKAc", "72MKAc",
"72MKAc", "72MKAc", "8RcC8m", "8RcC8m", "8RcC8m", "8RcC8m", "98vV9L",
"98vV9L", "98vV9L", "98vV9L", "98vV9L", "98vV9L", "98vV9L", "98vV9L",
"98vV9L", "98vV9L", "9PF5pW", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0",
"aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0",
"aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0",
"aGsBU0", "aGsBU0", "aGsBU0", "AUQYey", "AUQYey", "AUQYey", "AUQYey",
"AUQYey", "B0s81w", "B0s81w", "BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw",
"BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw",
"BWY8Fi", "c3qDX8", "c3qDX8", "c3qDX8", "c3qDX8", "c3qDX8", "c3qDX8",
"c3qDX8", "c3qDX8", "cIS2B3", "CNTp7f", "CNTp7f", "CNTp7f", "dHBGWg",
"dHBGWg", "dHBGWg", "dHBGWg", "dQ1kCz", "dQ1kCz", "e2aP4W", "EEnp7x",
"EEnp7x", "EEnp7x", "eJL9eB", "eJL9eB", "eJL9eB", "eXKdph", "exxLGA",
"FE1Gg2", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw",
"fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw",
"fPqjLw", "GfLwK1", "GfLwK1", "GfLwK1", "GQMw74", "GQMw74", "GSjzKw",
"hA3kcA", "hA3kcA", "hA3kcA", "hA3kcA", "hA3kcA", "hA3kcA", "hA3kcA"
), starttimes = structure(c(1612225495, 1612232905, 1612239592,
1612239786, 1612240802, 1612244089, 1612245854, 1612247323, 1612247337,
1612224159, 1612225263, 1612226339, 1612227886, 1612232803, 1612233071,
1612240903, 1612245572, 1612224249, 1612230907, 1612242993, 1612224038,
1612225511, 1612226912, 1612232868, 1612235417, 1612246900, 1612247329,
1612224511, 1612224732, 1612224741, 1612225124, 1612225673, 1612226602,
1612226633, 1612228148, 1612228574, 1612229782, 1612230617, 1612231030,
1612231085, 1612231122, 1612231156, 1612231939, 1612234987, 1612236491,
1612238398, 1612240919, 1612240941, 1612241997, 1612242078, 1612242957,
1612243896, 1612243977, 1612244150, 1612245811, 1612246720, 1612246923,
1612247034, 1612247123, 1612247570, 1612227309, 1612243829, 1612225263,
1612225797, 1612225814, 1612226255, 1612227017, 1612229235, 1612239150,
1612241630, 1612247047, 1612247085, 1612247217, 1612224887, 1612225648,
1612227079, 1612230380, 1612245565, 1612246181, 1612246806, 1612241890,
1612245854, 1612225689, 1612229282, 1612242957, 1612239156, 1612224511,
1612224522, 1612225166, 1612247820, 1612224358, 1612231137, 1612235046,
1612247226, 1612224076, 1612224232, 1612225673, 1612226912, 1612228533,
1612229434, 1612231023, 1612232915, 1612235398, 1612243431, 1612239692,
1612224286, 1612224309, 1612225511, 1612225845, 1612225919, 1612226267,
1612228574, 1612228937, 1612230201, 1612230395, 1612230838, 1612231030,
1612231657, 1612232159, 1612232761, 1612235182, 1612239699, 1612241642,
1612243227, 1612245164, 1612247272, 1612224065, 1612235409, 1612241800,
1612246929, 1612246937, 1612226922, 1612228937, 1612224085, 1612224662,
1612224856, 1612225316, 1612226938, 1612227404, 1612227447, 1612228412,
1612228597, 1612235046, 1612241997, 1612227060, 1612224348, 1612227235,
1612234517, 1612240919, 1612245430, 1612245749, 1612247129, 1612247337,
1612228148, 1612226677, 1612231105, 1612243991, 1612225685, 1612225878,
1612227191, 1612229187, 1612224085, 1612230981, 1612224286, 1612224067,
1612224427, 1612241971, 1612227060, 1612228351, 1612231023, 1612241649,
1612243036, 1612226655, 1612224398, 1612225668, 1612226677, 1612227337,
1612227390, 1612231017, 1612231085, 1612232375, 1612232761, 1612240941,
1612241749, 1612245893, 1612247016, 1612247031, 1612224695, 1612230535,
1612240808, 1612225648, 1612229235, 1612244130, 1612224443, 1612224480,
1612224567, 1612225154, 1612227317, 1612227547, 1612229762), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), endtimes = structure(c(1612227885,
1612232909, 1612239690, 1612239791, 1612240807, 1612244114, 1612245870,
1612247329, 1612247346, 1612225198, 1612225495, 1612226391, 1612227899,
1612243673, 1612234333, 1612240908, 1612245586, 1612224286, 1612230913,
1612243008, 1612224046, 1612225550, 1612226918, 1612234967, 1612235432,
1612246917, 1612247337, 1612224520, 1612224741, 1612224878, 1612225246,
1612225680, 1612226607, 1612226656, 1612228412, 1612228586, 1612230487,
1612230631, 1612231203, 1612231105, 1612231137, 1612231171, 1612232151,
1612235005, 1612236571, 1612245037, 1612240935, 1612240947, 1612242008,
1612242118, 1612243144, 1612244074, 1612243986, 1612244278, 1612245846,
1612247815, 1612246929, 1612247041, 1612247127, 1612247739, 1612227317,
1612243866, 1612225495, 1612225813, 1612225845, 1612226266, 1612227024,
1612229282, 1612239156, 1612241680, 1612247059, 1612247094, 1612247222,
1612224906, 1612225661, 1612227235, 1612230405, 1612245572, 1612247552,
1612246888, 1612241961, 1612245870, 1612225695, 1612229761, 1612243144,
1612239162, 1612224533, 1612224537, 1612225185, 1612247833, 1612224374,
1612231143, 1612246048, 1612247226, 1612224152, 1612225636, 1612225680,
1612226918, 1612230888, 1612229722, 1612231029, 1612232966, 1612235409,
1612243995, 1612239696, 1612224377, 1612224673, 1612225550, 1612230662,
1612225925, 1612226277, 1612228586, 1612228955, 1612230212, 1612230609,
1612231489, 1612231203, 1612231672, 1612232375, 1612232803, 1612235187,
1612239704, 1612241649, 1612243227, 1612245199, 1612247272, 1612224156,
1612235413, 1612241818, 1612246933, 1612246943, 1612226930, 1612228955,
1612224621, 1612225487, 1612224892, 1612225423, 1612226965, 1612227417,
1612227461, 1612228444, 1612228607, 1612246048, 1612242008, 1612227178,
1612224353, 1612227245, 1612238334, 1612240935, 1612245439, 1612245785,
1612247135, 1612247346, 1612228163, 1612227404, 1612231121, 1612244002,
1612225727, 1612227245, 1612227204, 1612229226, 1612224203, 1612230998,
1612224377, 1612224081, 1612225814, 1612241976, 1612227178, 1612228356,
1612231029, 1612241740, 1612243041, 1612227231, 1612224426, 1612225683,
1612227404, 1612227346, 1612227397, 1612231022, 1612231105, 1612232381,
1612232803, 1612240947, 1612241755, 1612245899, 1612247021, 1612247220,
1612225301, 1612230673, 1612240884, 1612225661, 1612229282, 1612244136,
1612224582, 1612224662, 1612233387, 1612226208, 1612227329, 1612228048,
1612229782), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
200L), class = "data.frame")
library(data.table)
setDT(DF)
DF <- unique(DF[, session_id := cumsum(starttimes-shift(endtimes, n=1L, fill=first(starttimes), type="lag") >= 3600), by = UserId][
, c("count", "starttimes", "endtimes") := .(.N, min(starttimes), max(endtimes)), by = .(UserId, session_id)][
, session_id := NULL])
head(DF, 10)
UserId starttimes endtimes count
1: 0buhGq 2021-02-02 00:24:55 2021-02-02 01:04:45 1
2: 0buhGq 2021-02-02 02:28:25 2021-02-02 02:28:29 1
3: 0buhGq 2021-02-02 04:19:52 2021-02-02 06:29:06 7
4: 0ulN53 2021-02-02 00:02:39 2021-02-02 01:04:59 4
5: 0ulN53 2021-02-02 02:26:43 2021-02-02 05:27:53 2
6: 0ulN53 2021-02-02 04:41:43 2021-02-02 04:41:48 1
7: 0ulN53 2021-02-02 05:59:32 2021-02-02 05:59:46 1
8: 0wzMkn 2021-02-02 00:04:09 2021-02-02 00:04:46 1
9: 0wzMkn 2021-02-02 01:55:07 2021-02-02 01:55:13 1
10: 0wzMkn 2021-02-02 05:16:33 2021-02-02 05:16:48 1
Benchmark:
Unit: milliseconds
expr min lq mean median uq max neval
shs 73.507502 73.507502 73.507502 73.507502 73.507502 73.507502 1
ismirsehregal 9.535402 9.535402 9.535402 9.535402 9.535402 9.535402 1
Benchmark code:
library(lubridate)
library(dplyr)
library(data.table)
library(microbenchmark)
DFdata.table <- DFdplyr <- DF
microbenchmark(shs = {
DFdplyr %>%
group_by(UserId) %>%
mutate(grp = {starttimes - lag(endtimes)} %>%
{. >= hours(1)} %>%
ifelse(is.na(.), 0, .) %>%
cumsum()
) %>%
group_by(UserId, grp) %>%
summarize(starttimes = first(starttimes),
endtimes = last(endtimes),
count = n())
},
ismirsehregal = {
setDT(DFdata.table)
DFdata.table <- unique(DFdata.table[, session_id := cumsum(starttimes-shift(endtimes, n=1L, fill=first(starttimes), type="lag") >= 3600), by = UserId][
, c("count", "starttimes", "endtimes") := .(.N, min(starttimes), max(endtimes)), by = .(UserId, session_id)][
, session_id := NULL])
}, times = 1L)
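For readers coming from pandas, the same sessionization idea (a gap of at least one hour from the previous end time starts a new session, then each session is collapsed to its span and row count) can be sketched on toy data:

```python
import pandas as pd

# Hypothetical session data; a gap of >= 1 hour starts a new session
df = pd.DataFrame({
    'UserId': ['u1', 'u1', 'u1', 'u2'],
    'starttimes': pd.to_datetime(['2021-02-02 00:00', '2021-02-02 00:30',
                                  '2021-02-02 02:00', '2021-02-02 00:10']),
    'endtimes':   pd.to_datetime(['2021-02-02 00:20', '2021-02-02 00:40',
                                  '2021-02-02 02:05', '2021-02-02 00:15']),
})

# Gap between this row's start and the previous row's end, per user
gap = df['starttimes'] - df.groupby('UserId')['endtimes'].shift()
new_session = (gap >= pd.Timedelta(hours=1)).astype(int)
df['session_id'] = new_session.groupby(df['UserId']).cumsum()

# Collapse each session to its time span and row count
out = (df.groupby(['UserId', 'session_id'])
         .agg(starttimes=('starttimes', 'min'),
              endtimes=('endtimes', 'max'),
              count=('starttimes', 'size'))
         .reset_index())
print(out)
```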