How do I delete a column that contains only zeros in Pandas?
df.loc[:, (df != 0).any(axis=0)]
Here is a break-down of how it works:
In [74]: import pandas as pd
In [75]: df = pd.DataFrame([[1,0,0,0], [0,0,1,0]])
In [76]: df
Out[76]:
   0  1  2  3
0  1  0  0  0
1  0  0  1  0
[2 rows x 4 columns]
df != 0 creates a boolean DataFrame which is True where df is nonzero:
In [77]: df != 0
Out[77]:
       0      1      2      3
0   True  False  False  False
1  False  False   True  False
[2 rows x 4 columns]
(df != 0).any(axis=0) returns a boolean Series indicating which columns have nonzero entries. (The any operation aggregates values along the 0-axis, i.e. along the rows, into a single boolean value. Hence the result is one boolean value for each column.)
In [78]: (df != 0).any(axis=0)
Out[78]:
0 True
1 False
2 True
3 False
dtype: bool
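To make the axis argument concrete, here is a small sketch that runs the same reduction along both axes on the DataFrame from the example. The names per_column and per_row are only illustrative.

```python
import pandas as pd

# Same data as in the example session.
df = pd.DataFrame([[1, 0, 0, 0], [0, 0, 1, 0]])

# axis=0 collapses the rows: one boolean per column.
per_column = (df != 0).any(axis=0)

# axis=1 collapses the columns: one boolean per row.
per_row = (df != 0).any(axis=1)

print(per_column.tolist())  # [True, False, True, False]
print(per_row.tolist())     # [True, True]
```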
And df.loc can be used to select those columns:
In [79]: df.loc[:, (df != 0).any(axis=0)]
Out[79]:
   0  2
0  1  0
1  0  1
[2 rows x 2 columns]
To "delete" the zero-columns, reassign df:
df = df.loc[:, (df != 0).any(axis=0)]
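If this is needed in several places, the one-liner can be wrapped in a small helper. This is only a sketch; the function name drop_all_zero_columns is made up for illustration. Note that NaN compares as "not equal to 0", so a column of only NaN values is kept.

```python
import pandas as pd

def drop_all_zero_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` without the columns whose values are all 0."""
    keep = (frame != 0).any(axis=0)
    return frame.loc[:, keep].copy()

df = pd.DataFrame([[1, 0, 0, 0], [0, 0, 1, 0]])
df = drop_all_zero_columns(df)
print(df.columns.tolist())  # [0, 2]
```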
How do I delete columns that contain a zero value in Pandas?
Try:
df.loc[:,~df.eq(0).any()]
OR
as suggested by @sammywemmy
df.loc[:, df.ne(0).all()]
Other possible solutions:
df.mask(df.eq(0)).dropna(axis=1)
#OR
df.drop(columns=df.columns[df.eq(0).any()])
output of above code:
    Names  Henry  Jesscia
0  Robert     54        5
1     Dan     22       55
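The difference between "contains a zero" and "contains only zeros" is easy to mix up, so here is a sketch that puts both masks side by side. The input frame is an assumption: the original input is not shown, and the "Kelly" column is invented for illustration.

```python
import pandas as pd

# Hypothetical input: the "Kelly" column is invented for illustration.
df = pd.DataFrame({
    "Names": ["Robert", "Dan"],
    "Henry": [54, 22],
    "Kelly": [0, 35],      # contains one zero
    "Jesscia": [5, 55],
})

# Drop every column that contains at least one zero.
no_zero_anywhere = df.loc[:, ~df.eq(0).any()]

# Drop only the columns that consist entirely of zeros.
not_all_zero = df.loc[:, df.ne(0).any()]

print(no_zero_anywhere.columns.tolist())  # ['Names', 'Henry', 'Jesscia']
print(not_all_zero.columns.tolist())      # ['Names', 'Henry', 'Kelly', 'Jesscia']
```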
Fast removal of only zero columns in pandas dataframe
There is a much faster way to implement that using Numba.
Indeed, most NumPy-based implementations create huge temporary arrays that are slow to fill and read. Moreover, NumPy iterates over the full dataframe, which is often unnecessary (at least in your example). You can very quickly tell whether a column must be kept by checking its values one by one and stopping as soon as a nonzero value is found (typically near the beginning). Also, there is no need to copy the entire dataframe (about 1.9 GiB of memory) when all the columns are kept. Finally, you can perform the computation in parallel.
However, there are performance-critical low-level catches. First, Numba cannot deal with Pandas dataframes, but the conversion to a NumPy array is almost free using df.values (the same applies to the creation of a new dataframe). Moreover, depending on the memory layout of the array, it is better to iterate over either the rows or the columns in the innermost loop.
This layout can be determined by checking the strides of the dataframe's underlying NumPy array.
Note that the example uses a row-major dataframe due to the (unusual) NumPy random initialization, but most dataframes tend to be column-major.
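As a rough illustration of the strides check, the sketch below compares a row-major and a column-major NumPy array. The helper name is_column_major is hypothetical, and plain NumPy arrays stand in for df.values.

```python
import numpy as np

def is_column_major(values: np.ndarray) -> bool:
    """True when stepping down a column is a smaller memory jump than stepping along a row."""
    row_step, col_step = values.strides
    return row_step < col_step

c_order = np.zeros((1000, 50), dtype=np.int64)             # row-major (NumPy default)
f_order = np.zeros((1000, 50), dtype=np.int64, order="F")  # column-major

print(c_order.strides, is_column_major(c_order))  # (400, 8) False
print(f_order.strides, is_column_major(f_order))  # (8, 8000) True
```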
Here is an optimized implementation:
import numba as nb
import numpy as np
import pandas as pd

@nb.njit('int_[:,:](int_[:,:])', parallel=True)
def filterNullColumns(dfValues):
    n, m = dfValues.shape
    s0, s1 = dfValues.strides
    columnMajor = s0 < s1
    toKeep = np.full(m, False, dtype=np.bool_)
    # Find the columns to keep
    # Only-optimized for column-major dataframes (quite complex otherwise)
    for colId in nb.prange(m):
        for rowId in range(n):
            if dfValues[rowId, colId] != 0:
                toKeep[colId] = True
                break
    # Optimization: no columns are discarded
    if np.all(toKeep):
        return dfValues
    # Create a new dataframe
    newColCount = np.sum(toKeep)
    res = np.empty((n, newColCount), dtype=dfValues.dtype)
    if columnMajor:
        # Sequential loop: newColId depends on the previous iterations,
        # so it cannot be used safely as an index inside a prange
        newColId = 0
        for colId in range(m):
            if toKeep[colId]:
                for rowId in range(n):
                    res[rowId, newColId] = dfValues[rowId, colId]
                newColId += 1
    else:
        for rowId in nb.prange(n):
            newColId = 0
            for colId in range(m):
                # Copy only the kept columns so that a discarded trailing
                # column never writes past the end of the output row
                if toKeep[colId]:
                    res[rowId, newColId] = dfValues[rowId, colId]
                    newColId += 1
    return res

result = pd.DataFrame(filterNullColumns(df.values))
Here are the results on my 6-core machine:
Reference: 1094 ms
Valdi_Bo answer: 1262 ms
This implementation: 0.056 ms (300 ms with discarded columns)
Thus, this implementation is about 20 000 times faster than the reference implementation on the provided example (no discarded column) and 4.2 times faster on more pathological cases (only one column discarded).
If you want even better performance, you can perform the computation in-place (dangerous, especially due to Pandas) or use smaller datatypes (like np.uint8 or np.int16), since the computation is mainly memory-bound.
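To give a feel for the datatype point, the sketch below compares the memory footprint of the same values stored as int64 and as uint8. The array shape and value range are assumptions chosen for illustration. The explicit int_ signature of the Numba function would also have to be adjusted before it accepts such an array.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(0, 100, size=(1000, 200), dtype=np.int64)

# Safe here because every value fits in 0..255; check the range first on real data.
assert values.min() >= 0 and values.max() <= np.iinfo(np.uint8).max
small = values.astype(np.uint8)

print(values.nbytes)  # 1600000
print(small.nbytes)   # 200000
```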
How do I delete a column that contains only zeros from a given row in pandas
You can change the data structure:
df = df.reset_index().melt('index', var_name='columns').query('value != 0')
print (df)
   index columns  value
0      0       a      1
1      1       a      1
2      0       b      1
5      1       c      1
If you need a new column with the matching column names joined by ", ", compare the values for inequality with DataFrame.ne and use matrix multiplication with DataFrame.dot:
df['new'] = df.ne(0).dot(df.columns + ', ').str.rstrip(', ')
print (df)
   a  b  c  d   new
0  1  1  0  0  a, b
1  1  0  1  0  a, c
EDIT:
for i in df.index:
    row = df.loc[[i]]
    a = row.loc[:, (row != 0).any()]
    print ('Row {}'.format(i))
    print (a)
Or:
def f(x):
    print ('Row {}'.format(x.name))
    print (x[x!=0].to_frame().T)
df.apply(f, axis=1)
Row 0
   a  b
0  1  1
Row 1
   a  c
1  1  1
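If the per-row result is needed as data rather than printed output, one possible variation is to collect the nonzero column names in a dictionary. This is a sketch, not part of the original answer; the name nonzero_columns is illustrative and the frame is rebuilt from the output shown earlier.

```python
import pandas as pd

# Rebuilt from the printed example (columns a, b, c, d).
df = pd.DataFrame({"a": [1, 1], "b": [1, 0], "c": [0, 1], "d": [0, 0]})

# Map each row label to the names of its nonzero columns.
nonzero_columns = {
    label: row.index[row != 0].tolist()
    for label, row in df.iterrows()
}
print(nonzero_columns)  # {0: ['a', 'b'], 1: ['a', 'c']}
```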
Drop all columns where all values are zero
If it's a matter of 0s and not the sum, use df.any:
In [291]: df.T[df.any()].T
Out[291]:
   b
0  0
1 -1
2  0
3  1
Alternatively:
In [296]: df.T[(df != 0).any()].T # or df.loc[:, (df != 0).any()]
Out[296]:
   b
0  0
1 -1
2  0
3  1
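The input frame is not shown above, so the sketch below assumes one in which only column "b" has nonzero entries (columns "a" and "c" are invented). It checks that the transpose-based form and the df.loc form select the same column.

```python
import pandas as pd

# Hypothetical data: only column "b" has nonzero entries.
df = pd.DataFrame({"a": [0, 0, 0, 0], "b": [0, -1, 0, 1], "c": [0, 0, 0, 0]})

via_transpose = df.T[df.any()].T   # filter rows of the transposed frame
via_loc = df.loc[:, df.any()]      # filter columns directly

print(via_transpose.columns.tolist())  # ['b']
print(via_loc.columns.tolist())        # ['b']
```

The df.loc form avoids two transposes, which matters on frames with mixed column dtypes because transposing can force a common dtype.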
How to delete columns containing all zeros from a csv file in python?
The following approach could be used with the csv library:
- Read the header in
- Read the rows in
- Transpose the list of rows into a list of columns (using zip)
- Use a set to drop all columns that only contain 0
- Write out the new header
- Write out the transposed list of columns as a list of rows.
For example:
import csv
with open('file.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input) # read header
    columns = zip(*list(csv_input)) # read rows and transpose to columns
    data = [(h, c) for h, c in zip(header, columns) if set(c) != set('0')]
with open('file2.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(h for h, c in data) # write the new header
    csv_output.writerows(zip(*[c for h, c in data]))
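The set comparison above only matches the exact text 0, so a cell such as 0.0 would keep its column. The sketch below is a variation, not the original method: it selects columns by index instead of transposing with zip, and it parses each cell as a number. The helper names is_zero and drop_zero_columns are hypothetical, the file is assumed to be rectangular, and in-memory streams stand in for real files.

```python
import csv
import io

def is_zero(text: str) -> bool:
    """Treat '0', '0.0' and similar as zero; non-numeric text is not zero."""
    try:
        return float(text) == 0
    except ValueError:
        return False

def drop_zero_columns(src, dst) -> None:
    """Copy CSV data from `src` to `dst`, skipping columns that are entirely zero."""
    reader = csv.reader(src)
    header = next(reader)
    rows = list(reader)
    # Indexes of the columns that hold at least one nonzero cell.
    keep = [i for i in range(len(header))
            if not all(is_zero(row[i]) for row in rows)]
    writer = csv.writer(dst)
    writer.writerow([header[i] for i in keep])
    writer.writerows([row[i] for i in keep] for row in rows)

source = io.StringIO("a,b,c\n1,0,0.0\n0,0,2\n")
target = io.StringIO()
drop_zero_columns(source, target)
print(target.getvalue().splitlines())  # ['a,c', '1,0.0', '0,2']
```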
Best way to remove all columns and rows with zero sum from a pandas dataframe
here we go:
import numpy as np
import pandas as pd
ad = np.array([[1, 0, 1, 0, 1],
               [0, 0, 0, 0, 0],
               [1, 1, 1, 0, 1],
               [0, 1, 1, 0, 1],
               [1, 1, 0, 0, 0],
               [0, 0, 0, 0, 0],
               [0, 0, 1, 0, 1]])
df = pd.DataFrame(ad)
df.drop(df.loc[df.sum(axis=1)==0].index, inplace=True)
df.drop(columns=df.columns[df.sum()==0], inplace=True)
The code above drops a row or column when its sum is zero. This is achieved by calculating the sum along axis 1 for rows and axis 0 for columns, and then dropping the rows/columns with a sum of 0 (df.drop(...)).
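A zero sum is not the same as all zeros once negative values are involved. The following sketch, with data invented for illustration, shows a column that sums to zero without being all zeros.

```python
import pandas as pd

# Hypothetical data: column "x" sums to zero but is not all zeros.
df = pd.DataFrame({"x": [1, -1], "y": [0, 0], "z": [2, 3]})

zero_sum = df.columns[df.sum() == 0].tolist()
all_zero = df.columns[(df == 0).all()].tolist()

print(zero_sum)  # ['x', 'y']
print(all_zero)  # ['y']
```

For non-negative data such as the 0/1 matrix above, both tests select the same rows and columns.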
Drop columns with more than 70% zeros
Just change df.isnull().mean() to (df==0).mean():
df = df.loc[:, (df==0).mean() < .7]
Here's a demo:
df
Out:
   0  1  2  3  4
0  1  1  1  1  0
1  1  0  0  0  1
2  0  1  1  0  0
3  1  0  0  1  0
4  1  1  1  1  1
5  1  0  0  0  0
6  0  1  0  0  0
7  0  1  1  0  0
8  1  0  0  1  0
9  0  0  0  1  0
(df==0).mean()
Out:
0 0.4
1 0.5
2 0.6
3 0.5
4 0.8
dtype: float64
df.loc[:, (df==0).mean() < .7]
Out:
   0  1  2  3
0  1  1  1  1
1  1  0  0  0
2  0  1  1  0
3  1  0  0  1
4  1  1  1  1
5  1  0  0  0
6  0  1  0  0
7  0  1  1  0
8  1  0  0  1
9  0  0  0  1
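The same idea can be wrapped in a function with an adjustable cutoff. This is a sketch: the name drop_mostly_zero_columns and the threshold parameter are made up for illustration, and the frame reproduces the demo data.

```python
import pandas as pd

def drop_mostly_zero_columns(frame: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
    """Keep the columns whose share of zeros is strictly below `threshold`."""
    zero_share = (frame == 0).mean()
    return frame.loc[:, zero_share < threshold]

# Same values as the demo frame.
df = pd.DataFrame({
    0: [1, 1, 0, 1, 1, 1, 0, 0, 1, 0],
    1: [1, 0, 1, 0, 1, 0, 1, 1, 0, 0],
    2: [1, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    3: [1, 0, 0, 1, 1, 0, 0, 0, 1, 1],
    4: [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})
print(drop_mostly_zero_columns(df).columns.tolist())  # [0, 1, 2, 3]
```

With the strict comparison, a column with exactly 70% zeros is also dropped; use <= instead if only columns with more than 70% zeros should go.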