Count Total Missing Values by Group

How do you count the total number of missing values by group?

A data.table solution:

library(data.table)
setDT(df1)

df1[, .(sumNA = sum(is.na(.SD))), by = Z]

#    Z sumNA
# 1: A   559
# 2: C   661
# 3: E   596
# 4: B   597
# 5: D   560

A dplyr solution using rowSums(.[-1]), i.e., row sums over all columns except the first (the grouping column Z):

library(dplyr)

df1 %>%
  group_by(Z) %>%
  summarise_all(~ sum(is.na(.))) %>%
  transmute(Z, sumNA = rowSums(.[-1]))

# # A tibble: 5 x 2
#   Z     sumNA
#   <fct> <dbl>
# 1 A       559
# 2 B       597
# 3 C       661
# 4 D       560
# 5 E       596
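
For comparison, the same total-NA-per-group count can be written in pandas. A minimal sketch with assumed data (this small df1 is a stand-in for illustration, not the frame used above):

import pandas as pd
import numpy as np

# Stand-in for df1: a grouping column Z plus a few value columns.
df1 = pd.DataFrame({
    'Z':  ['A', 'A', 'B', 'B'],
    'X1': [1.0, np.nan, 2.0, np.nan],
    'X2': [np.nan, np.nan, 1.0, 2.0],
})

# Mask the NAs, group by Z, sum per column, then sum across columns.
out = df1.drop(columns='Z').isna().groupby(df1['Z']).sum().sum(axis=1)
print(out)
# Z
# A    3
# B    1
# dtype: int64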

How to group by and count the number of non-missing values for each column per group in pandas

You can use DataFrame.groupby followed by GroupBy.count, which includes only non-NaN entries. So you can group on the var column and then aggregate the counts for the two selected columns of the DataFrame as shown:

cols = ['qualified_date', 'loyal_date']
df.groupby('var')[cols].agg('count').add_suffix("_count").reset_index()

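The original answer shows the result only as an image; here is a runnable sketch on assumed data (the values below are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'var': ['A', 'A', 'B', 'B'],
    'qualified_date': ['2021-01-01', None, '2021-03-01', None],
    'loyal_date': [None, '2021-02-01', '2021-03-15', '2021-04-01'],
})

cols = ['qualified_date', 'loyal_date']
print(df.groupby('var')[cols].agg('count').add_suffix('_count').reset_index())
#   var  qualified_date_count  loyal_date_count
# 0   A                     1                 1
# 1   B                     1                 2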

Count missing values of several variables by group with many variables

You cannot use a DataFrame as the grouper, since it is a two-dimensional object; groupby requires one-dimensional arrays/Series as groupers. To fix the error, build a list of one-dimensional Series (grp_cols) and use it as the grouper to count the missing values:

keys = ['type', 'type2']
grp_cols = [df[k] for k in keys]
df.drop(keys, axis=1).isna().groupby(grp_cols, sort=False).sum().reset_index()

Alternatively, there is a simpler approach: first set the grouping columns as the index, then use isna followed by groupby + sum to count the missing values:

keys = ['type', 'type2']
df.set_index(keys).isna().groupby(keys, sort=False).sum().reset_index()


  type type2  var1  var2
0    A     Z     0     1
1    A     Y     1     1
2    B     Y     0     0
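
For reference, a minimal frame consistent with the output above (the question's data isn't shown, so this is assumed):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'type':  ['A', 'A', 'A', 'B'],
    'type2': ['Z', 'Z', 'Y', 'Y'],
    'var1':  [1.0, 2.0, np.nan, 3.0],
    'var2':  [np.nan, 1.0, np.nan, 2.0],
})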

Groupby class and count missing values in features

Compute a mask with isna, then group and find the sum:

df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum().reset_index()

  CLASS  FEATURE1  FEATURE2  FEATURE3
0     X       1.0       1.0       2.0
1     B       0.0       0.0       0.0
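
A minimal frame consistent with this output, assumed for illustration:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'CLASS':    ['X', 'X', 'B', 'B'],
    'FEATURE1': [np.nan, 1.0, 2.0, 3.0],
    'FEATURE2': [1.0, np.nan, 2.0, 3.0],
    'FEATURE3': [np.nan, np.nan, 1.0, 2.0],
})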

Another option is to subtract the count from the size using rsub along the 0th axis for index-aligned subtraction: size is the number of rows per group and count is the number of non-null rows per column, so their difference is the NaN count per column:

df.groupby('CLASS').count().rsub(df.groupby('CLASS').size(), axis=0)

Or,

g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)

       FEATURE1  FEATURE2  FEATURE3
CLASS
B             0         0         0
X             1         1         2

There are quite a few good answers, so here are some timeits for your perusal:

df_ = df
df = pd.concat([df_] * 10000)

%timeit df.drop(columns='CLASS').isna().groupby(df.CLASS, sort=False).sum()
%timeit df.set_index('CLASS').isna().groupby(level=0).sum()
%%timeit
g = df.groupby('CLASS')
g.count().rsub(g.size(), axis=0)

11.8 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.47 ms ± 379 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.54 ms ± 81.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Actual performance depends on your data and setup, so your mileage may vary.

How to fill missing values in a dataframe based on group value counts?

I did a little extra transformation to get stat as a dictionary mapping each year to its most frequent condition:
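
Neither X nor stat is shown in the original answer; here is an assumed setup, consistent with the output below, with stat as the per-year frequency table of condition:

import pandas as pd
import numpy as np

# Assumed input: `condition` has NaNs to be filled per year (hypothetical data).
X = pd.DataFrame({
    'year':      [2015, 2016, 2017, 2016, 2016, 2017, 2015, 2016, 2015, 2015],
    'condition': [np.nan, 'good', np.nan, np.nan, 'excellent', 'excellent',
                  'good', 'good', 'excellent', 'good'],
})

# Assumed construction of `stat`: frequency of each condition within each year.
stat = X.groupby('year')['condition'].value_counts()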

In[0]:
fill_dict = stat.unstack().idxmax(axis=1).to_dict()
fill_dict

Out[0]:
{2015: 'good', 2016: 'good', 2017: 'excellent'}

Then use fillna with map based on this dictionary:

In[0]:
X['condition'] = X['condition'].fillna(X['year'].map(fill_dict))
X

Out[0]:
   year  condition
0  2015       good
1  2016       good
2  2017  excellent
3  2016       good
4  2016  excellent
5  2017  excellent
6  2015       good
7  2016       good
8  2015  excellent
9  2015       good

Counting grouped missing values in R

Here is the answer (note that in this dataset missing values are coded as 0, hence the some_column == 0 test):

missing_data %>%
  group_by(some_group) %>%
  summarise(count = sum(some_column == 0)) %>%
  arrange(desc(count))

Pandas count NAs with a groupby for all columns

It's always best to avoid groupby.apply in favor of the basic functions, which are cythonized and scale better with many groups; this leads to a large performance gain. In this case, first call isnull() on the entire DataFrame, then groupby + sum:

df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#    b  c
# a
# 1  1  1
# 2  1  0
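
A small frame consistent with the output above, assumed for reproducibility:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [np.nan, 1, np.nan, 1],
                   'c': [np.nan, 1, 1, 1]})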

To illustrate the performance gain:

import pandas as pd
import numpy as np

N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
                   'b': np.random.choice([1, np.nan], N),
                   'c': np.random.choice([1, np.nan], N)})

%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

