
# Counting Non Zero Values in Each Column of a Dataframe in Python

## counting the number of non-zero numbers in a column of a df in pandas/python

Use a double sum:

```python
print(df)
   a  b  c  d  e
0  0  1  2  3  5
1  1  4  0  5  2
2  5  8  9  6  0
3  4  5  0  0  0

print((df != 0).sum(1))
0    4
1    4
2    4
3    2
dtype: int64

print((df != 0).sum(1).sum())
14
```
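If instead you want the count per column, as the question's title asks, sum the boolean frame down each column (axis=0 is the default). A minimal sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 5, 4],
                   'b': [1, 4, 8, 5],
                   'c': [2, 0, 9, 0],
                   'd': [3, 5, 6, 0],
                   'e': [5, 2, 0, 0]})

# (df != 0) gives a boolean frame; summing counts the True values per column
per_column = (df != 0).sum()
print(per_column)
```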

If you need to count only column c or d:

```python
print((df['c'] != 0).sum())
2

print((df['d'] != 0).sum())
3
```

EDIT: Solution with numpy.sum:

```python
print((df != 0).values.sum())
14
```

## Running Count of Non-Zero Values in pd Dataframe Column

You can still use cumsum:

```python
df['running_total'] = df['payment'].ne(0).groupby(df['key_id']).cumsum()
```
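As a minimal sketch with made-up key_id/payment data: ne(0) marks the non-zero payments, and the grouped cumsum turns that boolean series into a running count within each key:

```python
import pandas as pd

# hypothetical data: two groups, some zero payments
df = pd.DataFrame({'key_id': ['a', 'a', 'a', 'b', 'b'],
                   'payment': [0, 5, 3, 0, 2]})

# running count of non-zero payments, restarting for each key_id
df['running_total'] = df['payment'].ne(0).groupby(df['key_id']).cumsum()
print(df)
```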

## Get count of non zero values per row in Pandas DataFrame

Compare with gt (>), lt (<), le, ge, ne, or eq first, and then sum the result: the True values are counted as 1:

Bad - each new count also includes the previously added columns:

```python
df['> zero'] = df.gt(0).sum(axis=1)
df['< zero'] = df.lt(0).sum(axis=1)
df['== zero'] = df.eq(0).sum(axis=1)
print(df)
            GOOG    AAPL  XOM     IBM    Value  > zero  < zero  == zero
2011-01-10   0.0     0.0  0.0     0.0      0.0       0       0        7
2011-01-13   0.0 -1500.0  0.0  4000.0 -61900.0       1       2        2
```

Correct - select the columns to check first:

```python
cols = df.columns
df['> zero'] = df[cols].gt(0).sum(axis=1)
df['< zero'] = df[cols].lt(0).sum(axis=1)
df['== zero'] = df[cols].eq(0).sum(axis=1)
print(df)
            GOOG    AAPL  XOM     IBM    Value  > zero  < zero  == zero
2011-01-10   0.0     0.0  0.0     0.0      0.0       0       0        5
2011-01-13   0.0 -1500.0  0.0  4000.0 -61900.0       1       2        2
```

Detail:

```python
print(df.gt(0))
             GOOG   AAPL    XOM    IBM  Value
2011-01-10  False  False  False  False  False
2011-01-13  False  False  False   True  False
```

EDIT:

To remove some columns from cols, use difference:

```python
cols = df.columns.difference(['Value'])
print(cols)
Index(['AAPL', 'GOOG', 'IBM', 'XOM'], dtype='object')
```

```python
df['> zero'] = df[cols].gt(0).sum(axis=1)
df['< zero'] = df[cols].lt(0).sum(axis=1)
df['== zero'] = df[cols].eq(0).sum(axis=1)
print(df)
            GOOG    AAPL  XOM     IBM    Value  > zero  < zero  == zero
2011-01-10   0.0     0.0  0.0     0.0      0.0       0       0        4
2011-01-13   0.0 -1500.0  0.0  4000.0 -61900.0       1       1        2
```

## How to count non-zero values in a dataframe inside a range of a column

You can use:

```python
sum(df.loc[i:j, 'data'].ne(0))
# or
df.loc[i:j, 'data'].ne(0).sum()
```
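A minimal sketch, assuming an integer index and a 'data' column; note that .loc is label-based, so both endpoints i and j are included in the slice:

```python
import pandas as pd

df = pd.DataFrame({'data': [0, 3, 0, 7, 2, 0]})

# count non-zero values in rows 1 through 4, inclusive
i, j = 1, 4
count = df.loc[i:j, 'data'].ne(0).sum()
print(count)
```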

## Get column indices for non-zero values in each row in pandas data frame

One quick option is to apply numpy.flatnonzero to each row:

```python
import numpy as np

df.apply(np.flatnonzero, axis=1)

0                [0, 1]
1                   [0]
2                   [1]
3    [0, 1, 2, 5, 7, 8]
dtype: object
```

If you care about performance, here is a pure NumPy option. The caveat is that any row without non-zero values is dropped from the result, so choose the method that fits your needs:

```python
idx, idy = np.where(df != 0)
np.split(idy, np.flatnonzero(np.diff(idx) != 0) + 1)

[array([0, 1], dtype=int32),
 array([0], dtype=int32),
 array([1], dtype=int32),
 array([0, 1, 2, 5, 7, 8], dtype=int32)]
```
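A small check of that caveat, using made-up data whose middle row is all zeros; the row simply disappears from the result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 0],
                   [0, 0, 0],   # all-zero row: absent from the result
                   [3, 0, 4]])

# idx holds the row of each non-zero value, idy its column;
# splitting idy where idx changes groups the columns per row
idx, idy = np.where(df != 0)
parts = np.split(idy, np.flatnonzero(np.diff(idx) != 0) + 1)
print(parts)  # only two arrays; row 1 is skipped
```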

## pandas determine column labels that contribute to non-zero values in each row

To count the non-zeros in each row you can use count_nonzero from the numpy package and perform the operation row-wise:

```python
import numpy as np

df['non_zero_count'] = np.count_nonzero(df, axis=1)
```

```python
>>> df
      1     2     3     4     5     6      7  non_zero_count
0  8122     0     0     0     0     0      0               1
1     0     0     0  3292     0  1313      0               2
2     0  8675     0     0     0     0      0               1
3     0     0  1910     0   213     0  12312               3
4     0     0     0     0  4010     0      0               1
5     0     0     0     0     0  1002      0               1
6     0     0     0     0     0     0   1278               1
```

Then you can get the columns where a row contains a non-zero value with apply (be cautious here if you have a big dataset at hand):

```python
df['non_zero_label'] = df.drop('non_zero_count', axis=1)\
    .apply(lambda r: r.index[r.ne(0)].to_list(), axis=1)
```

```python
>>> df
      1     2     3     4     5     6      7  non_zero_count non_zero_label
0  8122     0     0     0     0     0      0               1            [1]
1     0     0     0  3292     0  1313      0               2         [4, 6]
2     0  8675     0     0     0     0      0               1            [2]
3     0     0  1910     0   213     0  12312               3      [3, 5, 7]
4     0     0     0     0  4010     0      0               1            [5]
5     0     0     0     0     0  1002      0               1            [6]
6     0     0     0     0     0     0   1278               1            [7]
```
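The label-extraction step can be sketched on a tiny made-up frame: for each row, r.ne(0) builds a boolean mask and r.index[...] keeps only the column labels where it is True:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 0], 'y': [0, 2], 'z': [3, 0]})

# per row, the labels of the non-zero columns
labels = df.apply(lambda r: r.index[r.ne(0)].to_list(), axis=1)
print(labels)
```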

## PySpark write a function to count non zero values of given columns

You can use a list comprehension to generate the list of aggregation expressions:

```python
import pyspark.sql.functions as F

def count_non_zero(df, features, grouping):
    return df.groupBy(*grouping).agg(
        *[F.count(F.when(F.col(c) != 0, 1)).alias(c) for c in features]
    )
```

## Add and count non-zero values of rows based on current date

We can filter the required columns using boolean indexing, then calculate and insert the total and active_months columns into df: total is computed by summing the values along axis=1, and active_months by counting the non-zero values along axis=1.

```python
m = pd.to_datetime(df.columns, errors='coerce') <= '1 May, 2021'
c = df.loc[:, m]

df.insert(2, 'total', c.sum(1))
df.insert(3, 'active_months', c.ne(0).sum(1))
```

```python
>>> df
   account_id contract_id   total  active_months  2020-12-01 00:00:00  2021-01-01 00:00:00  2021-02-01 00:00:00  2021-03-01 00:00:00  2021-04-01 00:00:00  2021-05-01 00:00:00  2021-06-01 00:00:00
0           1           A   200.0              1                200.0                  0.0                  0.0                  0.0                  0.0                  0.0                  0.0
1           1           B   600.0              2                300.0                300.0                  0.0                  0.0                  0.0                  0.0                  0.0
2           1           C  1200.0              3                  0.0                  0.0                  0.0                400.0                400.0                400.0                400.0
3           2           K   300.0              3                100.0                100.0                100.0                  0.0                  0.0                  0.0                  0.0
4           2           F   200.0              4                  0.0                  0.0                 50.0                 50.0                 50.0                 50.0                 50.0
```
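A minimal sketch of the date mask with made-up column names: pd.to_datetime(..., errors='coerce') turns non-date labels like account_id into NaT, and NaT compares as False, so those columns are excluded from the selection automatically:

```python
import pandas as pd

df = pd.DataFrame({'account_id': [1, 2],
                   '2021-01-01': [200.0, 0.0],
                   '2021-06-01': [0.0, 50.0]})

# non-date labels become NaT and fail the comparison
m = pd.to_datetime(df.columns, errors='coerce') <= '1 May, 2021'
print(m.tolist())

c = df.loc[:, m]
print(c.columns.tolist())
```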

## Count of non-zero values in multiple rows in Python?

You can do this using iloc for slicing together with numpy:

```python
np.sum((df.iloc[[0, 1], 1:] != 0).any(axis=0))
```

Here df.iloc[[0, 1], 1:] selects the first two rows (skipping the first column), and np.sum counts the columns that contain a non-zero value in at least one of the selected rows. Change df.iloc[[0, 1], 1:] to select any combination of rows.
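A minimal sketch with made-up data, where the first column is an id that the slice skips; the any(axis=0) collapses the two selected rows column-wise before counting:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'a': [0, 1, 0],
                   'b': [2, 0, 0]})

# columns (after 'id') where at least one of the first two rows is non-zero
count = np.sum((df.iloc[[0, 1], 1:] != 0).any(axis=0))
print(count)
```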