Count number of zeros per row, and remove rows with more than n zeros
It's not only possible, but very easy:
DF[rowSums(DF == 0) <= 4, ]
You could also use apply:
DF[apply(DF == 0, 1, sum) <= 4, ]
In Python, check for zeros in each row, if row has 3 or more zeros, remove the row. Current code does nothing to the file
Update
df = pd.read_csv('GiftYearTotal.csv', encoding='ISO-8859-1')
df = df.apply(lambda x: x.str.strip())
out = df[df.eq('$0.00').sum(1) <= 3]
Old answer
You can use:
out = df[df.eq('$0.00').sum(1) <= 3]
print(out)
# Output
Year 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
1 Person_B $100.00 $150.00 $1.00 $50.00 $0.25 $100.00 $0.00 $50.00 $60.00 $50.00 $0.00 $0.00 $1000.00
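If the amounts are needed as numbers later on, one option is to parse the currency strings first and count zeros numerically. This is only a minimal sketch: the small frame stands in for the CSV, and `to_number` is a hypothetical helper that assumes every value column looks like `$1,234.56`.

```python
import pandas as pd

# Hypothetical stand-in for GiftYearTotal.csv; note the stray spaces.
df = pd.DataFrame(
    {
        "Year": ["Person_A", "Person_B", "Person_C"],
        "2010": ["$0.00 ", "$100.00", " $0.00"],
        "2011": ["$0.00", "$150.00", "$5.00"],
        "2012": [" $0.00", "$0.00", "$0.00"],
        "2013": ["$0.00", "$0.00", "$7.50"],
    }
)

def to_number(col: pd.Series) -> pd.Series:
    """Strip whitespace, '$' and thousands separators, then parse as float."""
    cleaned = (
        col.str.strip()
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    return pd.to_numeric(cleaned)

amounts = df.drop(columns="Year").apply(to_number)
zero_counts = amounts.eq(0).sum(axis=1)  # number of zero amounts per row
out = df[zero_counts <= 3]               # same threshold as the answer above
print(out)
```

Comparing numbers rather than strings means that `$0.00`, `$0` and ` $0.00 ` are all treated as zero.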
Remove rows in a dataframe if 0 is found X number of times
Here is a one-liner. Note that rowSums is coded in C and is fast.
df[!(rowSums(df == 0) >= 2), , drop = FALSE]
Counting number of zeros per row by Pandas DataFrame?
Use a boolean comparison, which produces a boolean DataFrame. We can then cast this to int (True becomes 1, False becomes 0) and call sum with axis=1 to count row-wise:
In [56]:
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
df
Out[56]:
a b c
0 1 0 0
1 0 0 0
2 0 1 0
3 1 0 0
4 3 1 0
In [64]:
(df == 0).astype(int).sum(axis=1)
Out[64]:
0 2
1 3
2 2
3 2
4 1
dtype: int64
Breaking the above down:
In [65]:
(df == 0)
Out[65]:
a b c
0 False True True
1 True True True
2 True False True
3 False True True
4 False False True
In [66]:
(df == 0).astype(int)
Out[66]:
a b c
0 0 1 1
1 1 1 1
2 1 0 1
3 0 1 1
4 0 0 1
EDIT
As pointed out by david, the astype to int is unnecessary, since Boolean values are upcast to int when calling sum, so this simplifies to:
(df == 0).sum(axis=1)
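The reduction that counts zeros here is `sum`, not `count`. A short sketch on the same frame shows the difference: `count` reports the number of non-missing cells, so on a boolean frame it returns the number of columns for every row.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0, 1, 3], 'b': [0, 0, 1, 0, 1], 'c': [0, 0, 0, 0, 0]})

is_zero = df == 0

zeros_per_row = is_zero.sum(axis=1)    # True counts as 1, False as 0
cells_per_row = is_zero.count(axis=1)  # non-NA cells per row, not zeros

print(zeros_per_row.tolist())
print(cells_per_row.tolist())

# Keep only the rows with fewer than 3 zeros.
filtered = df[zeros_per_row < 3]
print(filtered)
```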
Deleting rows that have most of their values zero
We can use rowSums
df[rowSums(df == 0) < 3, ]
# i j k l m n
#b 8 6 34 1 0 0
#d 7 9 3 7 0 5
#f 2 3 9 6 8 9
#g 0 1 0 3 1 5
We can also use apply to count the number of 0's in each row and then subset:
df[apply(df == 0, 1, sum) < 3, ]
Pandas dataframe drop rows which store certain number of zeros in it
This will work:
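drop_indexes = []
for i in range(len(df)):
    # 4 is the minimum number of zeros a row must have to be dropped
    if (df.iloc[i, :] == 0).sum() >= 4:
        # store the row label, because drop() works on labels, not positions
        drop_indexes.append(df.index[i])
updated_df = df.drop(drop_indexes)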
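The loop can usually be replaced by a boolean mask, which also avoids mixing row positions with row labels. This is a sketch on a hypothetical frame with a non-default index; `min_zeros` is an illustrative name.

```python
import pandas as pd

# Hypothetical data with string labels instead of 0..n-1.
df = pd.DataFrame(
    [[0, 0, 0, 0, 5], [1, 2, 0, 0, 3], [0, 0, 0, 0, 0], [4, 0, 1, 0, 0]],
    index=["w", "x", "y", "z"],
    columns=list("abcde"),
)

min_zeros = 4  # rows with at least this many zeros are dropped

zero_counts = (df == 0).sum(axis=1)       # zeros per row
updated_df = df[zero_counts < min_zeros]  # keep rows below the threshold
print(updated_df)
```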
Excluding rows containing consecutive zeros from data frame
If we are looking for any consecutive zeros in each row and want to exclude that row, one way is to loop through the rows using apply with MARGIN=1. Check whether any pair of adjacent elements is equal and zero, negate the result, and subset the rows.
df1[!apply(df1[-(1:2)], 1, FUN = function(x) any((c(FALSE, x[-1]==x[-length(x)])) & !x)),]
# subj stimulus var1 var2 var3 var4
#1 1 A 25 30 15 36
#3 1 C 12 0 20 23
Or, if the run of consecutive zeros needs to have length 'n' (2 in this example), rle can be applied to each row: check whether any of the lengths for 'values' that are 0 equals 'n', negate, and subset the rows.
df1[!apply(df1[-(1:2)], 1, FUN = function(x) any(with(rle(x==0), lengths[values])==2)),]
# subj stimulus var1 var2 var3 var4
#1 1 A 25 30 15 36
#3 1 C 12 0 20 23
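For readers working in pandas, the run-length idea can be sketched with `itertools.groupby`. The frame below is hypothetical: it reuses the column names and the two surviving rows from the output above, while the other two rows are made up. `longest_zero_run` is an illustrative helper, not a pandas function.

```python
from itertools import groupby

import pandas as pd

df1 = pd.DataFrame(
    {
        "subj": [1, 1, 1, 2],
        "stimulus": ["A", "B", "C", "D"],
        "var1": [25, 0, 12, 0],
        "var2": [30, 0, 0, 5],
        "var3": [15, 10, 20, 0],
        "var4": [36, 4, 23, 0],
    }
)

def longest_zero_run(row: pd.Series) -> int:
    """Length of the longest run of consecutive zeros in one row."""
    runs = [sum(1 for _ in grp) for is_zero, grp in groupby(row == 0) if is_zero]
    return max(runs, default=0)

value_cols = df1.columns[2:]  # skip subj and stimulus
runs = df1[value_cols].apply(longest_zero_run, axis=1)

kept = df1[runs < 2]  # drop rows containing a run of two or more zeros
print(kept)
```

Note that this variant drops rows whose longest run is at least 2, whereas the rle version tests for a run of exactly that length.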
Pandas: drop row if more than one of multiple columns is zero
Apply the condition and count the True values.
(df == 0).sum(1)
ID1 2
ID2 0
ID3 1
dtype: int64
df[(df == 0).sum(1) < 2]
col0 col1 col2 col3
ID2 1 1 2 10
ID3 0 1 3 4
Alternatively, convert the integers to bool and sum that. A little more direct.
# df[(~df.astype(bool)).sum(1) < 2]
df[df.astype(bool).sum(1) > len(df.columns)-2] # no inversion needed
col0 col1 col2 col3
ID2 1 1 2 10
ID3 0 1 3 4
For performance, you can use np.count_nonzero:
# df[np.count_nonzero(df, axis=1) > len(df.columns)-2]
df[np.count_nonzero(df.values, axis=1) > len(df.columns)-2]
col0 col1 col2 col3
ID2 1 1 2 10
ID3 0 1 3 4
df = pd.concat([df] * 10000, ignore_index=True)
%timeit df[(df == 0).sum(1) < 2]
%timeit df[df.astype(bool).sum(1) > len(df.columns)-2]
%timeit df[np.count_nonzero(df.values, axis=1) > len(df.columns)-2]
7.13 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.28 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
997 µs ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
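Before choosing by speed, it is worth confirming that the three expressions select the same rows. This is a sketch on a frame shaped like the example above; the values in the `ID1` row are assumed, since the original input is not shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 5, 6], [1, 1, 2, 10], [0, 1, 3, 4]],  # ID1 values are assumed
    index=["ID1", "ID2", "ID3"],
    columns=["col0", "col1", "col2", "col3"],
)

ncols = len(df.columns)

by_eq = df[(df == 0).sum(axis=1) < 2]                  # count zeros directly
by_bool = df[df.astype(bool).sum(axis=1) > ncols - 2]  # count non-zeros
by_numpy = df[np.count_nonzero(df.to_numpy(), axis=1) > ncols - 2]

same = by_eq.equals(by_bool) and by_eq.equals(by_numpy)
print(same)
```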