How to do a conditional count after groupby on a Pandas Dataframe?
I think you need to add the condition first:
# this also includes category c, which has no 'one' values
df11=df.groupby('key1')['key2'].apply(lambda x: (x=='one').sum()).reset_index(name='count')
print (df11)
key1 count
0 a 2
1 b 1
2 c 0
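As a self-contained sketch, the same approach on hypothetical data reconstructed to match the counts above:

```python
import pandas as pd

# Hypothetical data shaped to reproduce the counts shown above
df = pd.DataFrame({'key1': ['a', 'a', 'a', 'b', 'b', 'c'],
                   'key2': ['one', 'one', 'two', 'one', 'two', 'two']})

# Boolean comparison inside the lambda; .sum() counts the True values
df11 = (df.groupby('key1')['key2']
          .apply(lambda x: (x == 'one').sum())
          .reset_index(name='count'))
print(df11)
```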
Or convert key1 to a categorical; then the missing category is added by size:
df['key1'] = df['key1'].astype('category')
df1 = df[df['key2'] == 'one'].groupby(['key1']).size().reset_index(name='count')
print (df1)
key1 count
0 a 2
1 b 1
2 c 0
If need all combinations:
df2 = df.groupby(['key1', 'key2']).size().reset_index(name='count')
print (df2)
key1 key2 count
0 a one 2
1 a two 1
2 b one 1
3 b two 1
4 c two 1
df3 = df.groupby(['key1', 'key2']).size().unstack(fill_value=0)
print (df3)
key2 one two
key1
a 2 1
b 1 1
c 0 1
Conditional counts in pandas group by
You can try changing the .count() in your 2 lines to .sum(), as follows:
d['Zero_Balance_days'] = (x['Balance'] < 0).sum()
d['Over_Credit_days'] = (x['Balance'] > x['Max Credit']).sum()
.count() returns the number of non-NA/null observations in the boolean Series; both True and False are non-null, so all of them are counted.
.sum() returns the number of True entries, since True is interpreted as 1 and False as 0 in the summation.
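A minimal illustration of the difference on a boolean mask:

```python
import pandas as pd

s = pd.Series([-5, 3, -1, 7])
mask = s < 0           # boolean Series: True, False, True, False

print(mask.count())    # counts every non-null entry, True or False alike
print(mask.sum())      # counts only the True entries
```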
Group by and conditionally count
If you want to add it as a column you can do:
DDcomplete %>% group_by(ST) %>% mutate(count = sum(dist.km == 0))
Or if you just want the counts per state:
DDcomplete %>% group_by(ST) %>% summarise(count = sum(dist.km == 0))
Actually, you were very close to the solution. Your code
state= DDcomplete %>%
group_by(ST) %>%
summarize(zero = sum(DDcomplete$dist.km==0, na.rm = TRUE))
is almost correct. You can remove the DDcomplete$
from within the call to sum
because within dplyr chains, you can access variables directly.
Also note that by using summarise, you will condense your data frame to one row per group, with only the grouping column(s) and whatever you computed inside the summarise. If you just want to add a column with the counts, you can use mutate as I did in my answer.
If you're only interested in positive counts, you could also use dplyr's count
function together with filter
to first subset the data:
filter(DDcomplete, dist.km == 0) %>% count(ST)
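For pandas users, the summarise vs. mutate distinction maps to agg vs. transform; a minimal sketch on hypothetical data mirroring DDcomplete's shape:

```python
import pandas as pd

# Hypothetical data: state and distance columns
dd = pd.DataFrame({'ST': ['CA', 'CA', 'NY'],
                   'dist_km': [0.0, 3.2, 0.0]})

# summarise-style: one row per group
per_state = dd.groupby('ST')['dist_km'].agg(count=lambda s: (s == 0).sum())

# mutate-style: the per-group count broadcast back onto every row
dd['count'] = dd.groupby('ST')['dist_km'].transform(lambda s: (s == 0).sum())
print(per_state)
print(dd)
```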
Conditional Count on a field
I think you may be after:
select
jobID, JobName,
sum(case when Priority = 1 then 1 else 0 end) as priority1,
sum(case when Priority = 2 then 1 else 0 end) as priority2,
sum(case when Priority = 3 then 1 else 0 end) as priority3,
sum(case when Priority = 4 then 1 else 0 end) as priority4,
sum(case when Priority = 5 then 1 else 0 end) as priority5
from
Jobs
group by
jobID, JobName
However, I am uncertain whether you need jobID and JobName in your results; if not, remove them and also remove the GROUP BY.
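For comparison, the same per-priority conditional counts can be sketched in pandas with pd.crosstab (the Jobs data below is hypothetical):

```python
import pandas as pd

# Hypothetical Jobs data
jobs = pd.DataFrame({'jobID': [1, 1, 1, 2, 2],
                     'JobName': ['A', 'A', 'A', 'B', 'B'],
                     'Priority': [1, 1, 3, 2, 5]})

# One column per Priority value, one row per (jobID, JobName);
# missing combinations are filled with 0
counts = pd.crosstab([jobs['jobID'], jobs['JobName']], jobs['Priority'])
print(counts)
```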
Conditionally count values in a pandas groupby object
I think you need:
np.random.seed(6)
N = 15
master_lso = pd.DataFrame({'lsoa11': np.random.randint(4, size=N),
'TOTAL_FLOOR_AREA': np.random.choice([0,30,40,50], size=N)})
master_lso['lsoa11'] = 'a' + master_lso['lsoa11'].astype(str)
print (master_lso)
TOTAL_FLOOR_AREA lsoa11
0 40 a2
1 50 a1
2 30 a3
3 0 a0
4 40 a2
5 0 a1
6 30 a3
7 0 a2
8 40 a0
9 0 a2
10 0 a1
11 50 a1
12 50 a3
13 40 a1
14 30 a1
First, filter the rows by the condition using boolean indexing - doing this before grouping is faster, because there are fewer rows.
df = master_lso[master_lso['TOTAL_FLOOR_AREA'] > 30]
print (df)
TOTAL_FLOOR_AREA lsoa11
0 40 a2
1 50 a1
4 40 a2
8 40 a0
11 50 a1
12 50 a3
13 40 a1
Then groupby and aggregate with size:
df1 = df.groupby('lsoa11')['TOTAL_FLOOR_AREA'].size().reset_index(name='Count')
print (df1)
lsoa11 Count
0 a0 1
1 a1 3
2 a2 2
3 a3 1
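An equivalent one-liner is value_counts on the filtered column; a sketch on hypothetical data in the same shape:

```python
import pandas as pd

# Hypothetical data shaped like master_lso
master_lso = pd.DataFrame({'lsoa11': ['a0', 'a1', 'a1', 'a1', 'a2', 'a3'],
                           'TOTAL_FLOOR_AREA': [40, 50, 40, 0, 30, 50]})

# Filter first, then count occurrences per group
df1 = (master_lso.loc[master_lso['TOTAL_FLOOR_AREA'] > 30, 'lsoa11']
                 .value_counts()
                 .sort_index()
                 .rename_axis('lsoa11')
                 .reset_index(name='Count'))
print(df1)
```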
Conditional Counting in Groupby Pandas
Using groupby.agg
with a dictionary of calculations:
from collections import OrderedDict
df.columns=['ticker', 'date', 'accuracy']
groupers = OrderedDict([('mean', np.mean),
('>_0.20_pct', lambda x: (x > 0.20).sum()/len(x)),
('>_0.50_pct', lambda x: (x > 0.50).sum()/len(x)),
('>_0.70_pct', lambda x: (x > 0.70).sum()/len(x))])
res = df.groupby('ticker')['accuracy'].agg(groupers)
print(res)
mean >_0.20_pct >_0.50_pct >_0.70_pct
ticker
AAAP 0.806244 1.000000 0.666667 0.666667
AAL 0.298683 0.666667 0.000000 0.000000
ZAYO 0.164886 0.333333 0.000000 0.000000
ZBH 0.103811 0.000000 0.000000 0.000000
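Note that passing a dict of name-to-function pairs to SeriesGroupBy.agg was deprecated and removed in pandas 1.0; a sketch of the same idea using (name, function) tuples, on hypothetical data:

```python
import pandas as pd

# Hypothetical accuracy data
acc = pd.DataFrame({'ticker': ['AAAP', 'AAAP', 'AAL', 'AAL'],
                    'accuracy': [0.9, 0.7, 0.3, 0.1]})

# (name, func) tuples replace the removed dict-renaming syntax;
# (x > t).mean() is the share of values above the threshold t
res = acc.groupby('ticker')['accuracy'].agg([
    ('mean', 'mean'),
    ('>_0.20_pct', lambda x: (x > 0.20).mean()),
    ('>_0.50_pct', lambda x: (x > 0.50).mean()),
])
print(res)
```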
How to get a conditional count while using GROUP BY in MySQL?
I think you just want conditional aggregation:
SELECT DATE_FORMAT(aua.request_date,'%b') as month ,
YEAR(aua.request_date) as year,
DATE_FORMAT(aua.request_date, '%Y-%m-%d') as date,
COUNT(aua.audit_id) as total_trans ,
SUM(aua.response_status <> 'P') as total_failure,
SUM(aua.response_status = 'P') as total_successful,
aua.device_code as deviceCode
FROM audit_webservice_aua aua
WHERE DATE_FORMAT(aua.request_date, '%Y-%m-%d') between '2020-04-16' and '2020-07-17'
GROUP BY month, year, date, deviceCode ;
I would also advise you to change the WHERE clause to:
WHERE aua.request_date >= '2020-04-16' AND
      aua.request_date < '2020-07-18'
This avoids applying a function to the column, so the comparison can use an index on request_date.
Python pandas - Groupby + Conditional count of cell values
This works:
df['number_of_parcels'] = df.groupby('type').apply(lambda x: x.apply(lambda y:(
(x['departure_time'] >= y['departure_time']) & (x['departure_time'] < y['arrival_time'])
).sum(), axis=1)).droplevel(level=0)
df
Out:
Parcel_id departure_time arrival_time type number_of_parcels
0 id_1 07:00 07:30 TV 4
1 id_2 07:00 07:15 PC 4
2 id_3 07:05 07:22 PC 4
3 id_4 07:10 07:45 TV 4
4 id_5 07:15 07:50 TV 2
5 id_6 07:10 07:26 PC 3
6 id_7 07:40 08:10 TV 1
7 id_8 07:14 07:46 TV 3
8 id_9 07:14 07:32 PC 2
9 id_10 07:15 07:30 PC 1
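The nested apply above makes O(n²) Python-level calls; within a single group, the same count can be sketched with numpy broadcasting (toy data below):

```python
import pandas as pd

# Hypothetical parcels of a single type
df = pd.DataFrame({'departure_time': pd.to_datetime(['07:00', '07:10', '07:15']),
                   'arrival_time':   pd.to_datetime(['07:30', '07:45', '07:50'])})

dep = df['departure_time'].to_numpy()
arr = df['arrival_time'].to_numpy()

# Element (i, j) is True when departure j falls inside interval i
inside = (dep[None, :] >= dep[:, None]) & (dep[None, :] < arr[:, None])
df['number_of_parcels'] = inside.sum(axis=1)
print(df)
```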
Conditionally count rows between two datetime columns by group in R
Here is a data.table approach using shift:
DF <- structure(list(UserId = c("0buhGq", "0buhGq", "0buhGq", "0buhGq",
"0buhGq", "0buhGq", "0buhGq", "0buhGq", "0buhGq", "0ulN53", "0ulN53",
"0ulN53", "0ulN53", "0ulN53", "0ulN53", "0ulN53", "0ulN53", "0wzMkn",
"0wzMkn", "0wzMkn", "2diLwk", "2diLwk", "2diLwk", "2diLwk", "2diLwk",
"2diLwk", "2diLwk", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB", "36NAIB",
"3942Wm", "3OUdi4", "4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY",
"4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY", "4hhSRY", "52LXbG",
"52LXbG", "52LXbG", "52LXbG", "52LXbG", "52LXbG", "52LXbG", "64gfuI",
"64gfuI", "6KpHap", "6KpHap", "6KpHap", "6vHa6q", "72MKAc", "72MKAc",
"72MKAc", "72MKAc", "8RcC8m", "8RcC8m", "8RcC8m", "8RcC8m", "98vV9L",
"98vV9L", "98vV9L", "98vV9L", "98vV9L", "98vV9L", "98vV9L", "98vV9L",
"98vV9L", "98vV9L", "9PF5pW", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0",
"aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0",
"aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0", "aGsBU0",
"aGsBU0", "aGsBU0", "aGsBU0", "AUQYey", "AUQYey", "AUQYey", "AUQYey",
"AUQYey", "B0s81w", "B0s81w", "BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw",
"BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw", "BPaJTw",
"BWY8Fi", "c3qDX8", "c3qDX8", "c3qDX8", "c3qDX8", "c3qDX8", "c3qDX8",
"c3qDX8", "c3qDX8", "cIS2B3", "CNTp7f", "CNTp7f", "CNTp7f", "dHBGWg",
"dHBGWg", "dHBGWg", "dHBGWg", "dQ1kCz", "dQ1kCz", "e2aP4W", "EEnp7x",
"EEnp7x", "EEnp7x", "eJL9eB", "eJL9eB", "eJL9eB", "eXKdph", "exxLGA",
"FE1Gg2", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw",
"fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw", "fPqjLw",
"fPqjLw", "GfLwK1", "GfLwK1", "GfLwK1", "GQMw74", "GQMw74", "GSjzKw",
"hA3kcA", "hA3kcA", "hA3kcA", "hA3kcA", "hA3kcA", "hA3kcA", "hA3kcA"
), starttimes = structure(c(1612225495, 1612232905, 1612239592,
1612239786, 1612240802, 1612244089, 1612245854, 1612247323, 1612247337,
1612224159, 1612225263, 1612226339, 1612227886, 1612232803, 1612233071,
1612240903, 1612245572, 1612224249, 1612230907, 1612242993, 1612224038,
1612225511, 1612226912, 1612232868, 1612235417, 1612246900, 1612247329,
1612224511, 1612224732, 1612224741, 1612225124, 1612225673, 1612226602,
1612226633, 1612228148, 1612228574, 1612229782, 1612230617, 1612231030,
1612231085, 1612231122, 1612231156, 1612231939, 1612234987, 1612236491,
1612238398, 1612240919, 1612240941, 1612241997, 1612242078, 1612242957,
1612243896, 1612243977, 1612244150, 1612245811, 1612246720, 1612246923,
1612247034, 1612247123, 1612247570, 1612227309, 1612243829, 1612225263,
1612225797, 1612225814, 1612226255, 1612227017, 1612229235, 1612239150,
1612241630, 1612247047, 1612247085, 1612247217, 1612224887, 1612225648,
1612227079, 1612230380, 1612245565, 1612246181, 1612246806, 1612241890,
1612245854, 1612225689, 1612229282, 1612242957, 1612239156, 1612224511,
1612224522, 1612225166, 1612247820, 1612224358, 1612231137, 1612235046,
1612247226, 1612224076, 1612224232, 1612225673, 1612226912, 1612228533,
1612229434, 1612231023, 1612232915, 1612235398, 1612243431, 1612239692,
1612224286, 1612224309, 1612225511, 1612225845, 1612225919, 1612226267,
1612228574, 1612228937, 1612230201, 1612230395, 1612230838, 1612231030,
1612231657, 1612232159, 1612232761, 1612235182, 1612239699, 1612241642,
1612243227, 1612245164, 1612247272, 1612224065, 1612235409, 1612241800,
1612246929, 1612246937, 1612226922, 1612228937, 1612224085, 1612224662,
1612224856, 1612225316, 1612226938, 1612227404, 1612227447, 1612228412,
1612228597, 1612235046, 1612241997, 1612227060, 1612224348, 1612227235,
1612234517, 1612240919, 1612245430, 1612245749, 1612247129, 1612247337,
1612228148, 1612226677, 1612231105, 1612243991, 1612225685, 1612225878,
1612227191, 1612229187, 1612224085, 1612230981, 1612224286, 1612224067,
1612224427, 1612241971, 1612227060, 1612228351, 1612231023, 1612241649,
1612243036, 1612226655, 1612224398, 1612225668, 1612226677, 1612227337,
1612227390, 1612231017, 1612231085, 1612232375, 1612232761, 1612240941,
1612241749, 1612245893, 1612247016, 1612247031, 1612224695, 1612230535,
1612240808, 1612225648, 1612229235, 1612244130, 1612224443, 1612224480,
1612224567, 1612225154, 1612227317, 1612227547, 1612229762), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), endtimes = structure(c(1612227885,
1612232909, 1612239690, 1612239791, 1612240807, 1612244114, 1612245870,
1612247329, 1612247346, 1612225198, 1612225495, 1612226391, 1612227899,
1612243673, 1612234333, 1612240908, 1612245586, 1612224286, 1612230913,
1612243008, 1612224046, 1612225550, 1612226918, 1612234967, 1612235432,
1612246917, 1612247337, 1612224520, 1612224741, 1612224878, 1612225246,
1612225680, 1612226607, 1612226656, 1612228412, 1612228586, 1612230487,
1612230631, 1612231203, 1612231105, 1612231137, 1612231171, 1612232151,
1612235005, 1612236571, 1612245037, 1612240935, 1612240947, 1612242008,
1612242118, 1612243144, 1612244074, 1612243986, 1612244278, 1612245846,
1612247815, 1612246929, 1612247041, 1612247127, 1612247739, 1612227317,
1612243866, 1612225495, 1612225813, 1612225845, 1612226266, 1612227024,
1612229282, 1612239156, 1612241680, 1612247059, 1612247094, 1612247222,
1612224906, 1612225661, 1612227235, 1612230405, 1612245572, 1612247552,
1612246888, 1612241961, 1612245870, 1612225695, 1612229761, 1612243144,
1612239162, 1612224533, 1612224537, 1612225185, 1612247833, 1612224374,
1612231143, 1612246048, 1612247226, 1612224152, 1612225636, 1612225680,
1612226918, 1612230888, 1612229722, 1612231029, 1612232966, 1612235409,
1612243995, 1612239696, 1612224377, 1612224673, 1612225550, 1612230662,
1612225925, 1612226277, 1612228586, 1612228955, 1612230212, 1612230609,
1612231489, 1612231203, 1612231672, 1612232375, 1612232803, 1612235187,
1612239704, 1612241649, 1612243227, 1612245199, 1612247272, 1612224156,
1612235413, 1612241818, 1612246933, 1612246943, 1612226930, 1612228955,
1612224621, 1612225487, 1612224892, 1612225423, 1612226965, 1612227417,
1612227461, 1612228444, 1612228607, 1612246048, 1612242008, 1612227178,
1612224353, 1612227245, 1612238334, 1612240935, 1612245439, 1612245785,
1612247135, 1612247346, 1612228163, 1612227404, 1612231121, 1612244002,
1612225727, 1612227245, 1612227204, 1612229226, 1612224203, 1612230998,
1612224377, 1612224081, 1612225814, 1612241976, 1612227178, 1612228356,
1612231029, 1612241740, 1612243041, 1612227231, 1612224426, 1612225683,
1612227404, 1612227346, 1612227397, 1612231022, 1612231105, 1612232381,
1612232803, 1612240947, 1612241755, 1612245899, 1612247021, 1612247220,
1612225301, 1612230673, 1612240884, 1612225661, 1612229282, 1612244136,
1612224582, 1612224662, 1612233387, 1612226208, 1612227329, 1612228048,
1612229782), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
200L), class = "data.frame")
library(data.table)
setDT(DF)
DF <- unique(DF[, session_id := cumsum(starttimes-shift(endtimes, n=1L, fill=first(starttimes), type="lag") >= 3600), by = UserId][
, c("count", "starttimes", "endtimes") := .(.N, min(starttimes), max(endtimes)), by = .(UserId, session_id)][
, session_id := NULL])
head(DF, 10)
UserId starttimes endtimes count
1: 0buhGq 2021-02-02 00:24:55 2021-02-02 01:04:45 1
2: 0buhGq 2021-02-02 02:28:25 2021-02-02 02:28:29 1
3: 0buhGq 2021-02-02 04:19:52 2021-02-02 06:29:06 7
4: 0ulN53 2021-02-02 00:02:39 2021-02-02 01:04:59 4
5: 0ulN53 2021-02-02 02:26:43 2021-02-02 05:27:53 2
6: 0ulN53 2021-02-02 04:41:43 2021-02-02 04:41:48 1
7: 0ulN53 2021-02-02 05:59:32 2021-02-02 05:59:46 1
8: 0wzMkn 2021-02-02 00:04:09 2021-02-02 00:04:46 1
9: 0wzMkn 2021-02-02 01:55:07 2021-02-02 01:55:13 1
10: 0wzMkn 2021-02-02 05:16:33 2021-02-02 05:16:48 1
Benchmark:
Unit: milliseconds
expr min lq mean median uq max neval
shs 73.507502 73.507502 73.507502 73.507502 73.507502 73.507502 1
ismirsehregal 9.535402 9.535402 9.535402 9.535402 9.535402 9.535402 1
Benchmark code:
library(lubridate)
library(dplyr)
library(data.table)
library(microbenchmark)
DFdata.table <- DFdplyr <- DF
microbenchmark(shs = {
DFdplyr %>%
group_by(UserId) %>%
mutate(grp = {starttimes - lag(endtimes)} %>%
{. >= hours(1)} %>%
ifelse(is.na(.), 0, .) %>%
cumsum()
) %>%
group_by(UserId, grp) %>%
summarize(starttimes = first(starttimes),
endtimes = last(endtimes),
count = n())
},
ismirsehregal = {
setDT(DFdata.table)
DFdata.table <- unique(DFdata.table[, session_id := cumsum(starttimes-shift(endtimes, n=1L, fill=first(starttimes), type="lag") >= 3600), by = UserId][
, c("count", "starttimes", "endtimes") := .(.N, min(starttimes), max(endtimes)), by = .(UserId, session_id)][
, session_id := NULL])
}, times = 1L)
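For readers coming from pandas, the same sessionization idea (a gap of at least one hour from the previous end time starts a new session, then each session is collapsed to its span and row count) can be sketched on toy data:

```python
import pandas as pd

# Hypothetical session data; a gap of >= 1 hour starts a new session
df = pd.DataFrame({
    'UserId': ['u1', 'u1', 'u1', 'u2'],
    'starttimes': pd.to_datetime(['2021-02-02 00:00', '2021-02-02 00:30',
                                  '2021-02-02 02:00', '2021-02-02 00:10']),
    'endtimes':   pd.to_datetime(['2021-02-02 00:20', '2021-02-02 00:40',
                                  '2021-02-02 02:05', '2021-02-02 00:15']),
})

# Gap between this row's start and the previous row's end, per user
gap = df['starttimes'] - df.groupby('UserId')['endtimes'].shift()
new_session = (gap >= pd.Timedelta(hours=1)).astype(int)
df['session_id'] = new_session.groupby(df['UserId']).cumsum()

# Collapse each session to its time span and row count
out = (df.groupby(['UserId', 'session_id'])
         .agg(starttimes=('starttimes', 'min'),
              endtimes=('endtimes', 'max'),
              count=('starttimes', 'size'))
         .reset_index())
print(out)
```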