Count total missing values by group?
A data.table solution:
library(data.table)
setDT(df1)
df1[, .(sumNA = sum(is.na(.SD))), by = Z]
# Z sumNA
# 1: A 559
# 2: C 661
# 3: E 596
# 4: B 597
# 5: D 560
A dplyr solution using rowSums(.[-1]), i.e. row sums over all columns except the first:
library(dplyr)
df1 %>%
  group_by(Z) %>%
  summarise_all(~ sum(is.na(.))) %>%
  transmute(Z, sumNA = rowSums(.[-1]))
# # A tibble: 5 x 2
# Z sumNA
# <fct> <dbl>
# 1 A 559
# 2 B 597
# 3 C 661
# 4 D 560
# 5 E 596
How to group by and count the number of non-missing values for each column per group in pandas
You can use DataFrameGroupBy.count, which includes only non-NaN entries when counting. Let var be the grouping key, then aggregate the counts for the two selected columns of the DataFrame, as shown:
cols = ['qualified_date', 'loyal_date']
df.groupby('var')[cols].agg('count').add_suffix("_count").reset_index()
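For concreteness, here is a minimal sketch with hypothetical data (the original question's frame isn't reproduced here), showing that count skips NaN entries:
import pandas as pd
import numpy as np
# Hypothetical data: 'var' is the grouping key; the date columns contain NaNs
df = pd.DataFrame({'var': ['x', 'x', 'y'],
                   'qualified_date': ['2021-01-01', np.nan, '2021-02-01'],
                   'loyal_date': [np.nan, np.nan, '2021-03-01']})
cols = ['qualified_date', 'loyal_date']
df.groupby('var')[cols].agg('count').add_suffix('_count').reset_index()
#   var  qualified_date_count  loyal_date_count
# 0   x                     1                 0
# 1   y                     1                 1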
Count missing values of several variables by group with many variables
You cannot use a DataFrame as the grouper, since it is a two-dimensional object; groupby requires one-dimensional arrays or Series as groupers. To fix the error, build a list of one-dimensional Series (grp_cols) and use it as the grouper to count the missing values, as shown after the sample data below.
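For concreteness, assume a small hypothetical frame (not from the original question) chosen to reproduce the output shown further down:
import pandas as pd
import numpy as np
# Hypothetical data: per-group NaN counts match the output below
df = pd.DataFrame({'type':  ['A', 'A', 'B'],
                   'type2': ['Z', 'Y', 'Y'],
                   'var1':  [1.0, np.nan, 1.0],
                   'var2':  [np.nan, np.nan, 1.0]})
Now build the list of Series and group on it: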
keys = ['type', 'type2']
grp_cols = [df[k] for k in keys]
df.drop(keys, axis=1).isna().groupby(grp_cols, sort=False).sum().reset_index()
Alternatively, a simpler approach: first set the grouping columns as the index, then use isna followed by groupby + sum to count the missing values:
keys = ['type', 'type2']
df.set_index(keys).isna().groupby(keys, sort=False).sum().reset_index()
type type2 var1 var2
0 A Z 0 1
1 A Y 1 1
2 B Y 0 0
Groupby class and count missing values in features
Compute a mask with isna, then group and take the sum:
df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum().reset_index()
CLASS FEATURE1 FEATURE2 FEATURE3
0 X 1.0 1.0 2.0
1 B 0.0 0.0 0.0
Another option is to subtract the count from the size using rsub along the 0th axis for index-aligned subtraction:
df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)
Or, reusing the grouped object:
g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)
FEATURE1 FEATURE2 FEATURE3
CLASS
B 0 0 0
X 1 1 2
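To make the timing snippet below self-contained, assume a hypothetical frame consistent with the counts shown above (the original question's data isn't reproduced here):
import pandas as pd
import numpy as np
# Hypothetical data: X has 1/1/2 missing values across the features, B has none
df = pd.DataFrame({'CLASS':    ['X', 'X', 'B'],
                   'FEATURE1': [np.nan, 1.0, 1.0],
                   'FEATURE2': [np.nan, 1.0, 1.0],
                   'FEATURE3': [np.nan, np.nan, 1.0]})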
There are quite a few good answers, so here are some timeits for your perusal:
df_ = df
df = pd.concat([df_] * 10000)
%timeit df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum()
%timeit df.set_index('CLASS').isna().groupby(level=0).sum()
%%timeit
g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)
11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Actual performance depends on your data and setup, so your mileage may vary.
How to fill missing values in a dataframe based on group value counts?
I did a little extra transformation to get stat as a dictionary mapping each year to its highest-frequency name (credit to this answer).
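Here, stat comes from the original question; a hypothetical equivalent (assumed for illustration) is a per-year value_counts of condition, built from a frame X whose condition column has gaps:
import pandas as pd
import numpy as np
# Hypothetical input; the NaNs in 'condition' are what we want to fill
X = pd.DataFrame({'year': [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015],
                  'condition': ['good', np.nan, np.nan, 'good', 'excellent', 'excellent',
                                np.nan, 'good', 'excellent', 'good']})
# stat: count of each condition value within each year (MultiIndex Series)
stat = X.groupby('year')['condition'].value_counts()
With that in place, the transformation is: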
In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict
Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}
Then use fillna with map based on this dictionary (credit to this answer):
In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X
Out[0]:
year condition
0 2015 good
1 2016 good
2 2017 excellent
3 2016 good
4 2016 excellent
5 2017 excellent
6 2015 good
7 2016 good
8 2015 excellent
9 2015 good
Counting grouped missing values in R
Here's the answer:
missing_data %>%
  group_by(some_group) %>%
  summarise(count = sum(some_column == 0)) %>%
  arrange(desc(count))
Note that sum(some_column == 0) counts zeros, which assumes missing values are coded as 0 in this data; if they are coded as NA, use sum(is.na(some_column)) instead.
Pandas count NAs with a groupby for all columns
It's always best to avoid groupby.apply in favor of the basic functions, which are cythonized and scale better with many groups; this leads to a large performance gain. In this case, first call isnull() on the entire DataFrame, then groupby + sum.
df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
# b c
#a
#1 1 1
#2 1 0
To illustrate the performance gain:
import pandas as pd
import numpy as np
N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
                   'b': np.random.choice([1, np.nan], N),
                   'c': np.random.choice([1, np.nan], N)})
%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)