How to find which columns contain any NaN value in a Pandas DataFrame

UPDATE: using pandas 0.22.0

Newer pandas versions provide the methods DataFrame.isna() and DataFrame.notna(), which are aliases of isnull() and notnull():

In [71]: df
Out[71]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [72]: df.isna().any()
Out[72]:
a     True
b     True
c    False
dtype: bool

As a list of columns:

In [74]: df.columns[df.isna().any()].tolist()
Out[74]: ['a', 'b']

To select those columns (containing at least one NaN value):

In [73]: df.loc[:, df.isna().any()]
Out[73]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0
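
If you also want to know how many NaNs each of those columns contains, a small extension of the same idea (not part of the original answer) is to filter the per-column counts with the same boolean mask:

df.isna().sum()[df.isna().any()]

For the frame above this returns a count of 1 for 'a' and 2 for 'b'.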

OLD answer:

Try using isnull():

In [97]: df
Out[97]:
     a    b  c
0  NaN  7.0  0
1  0.0  NaN  4
2  2.0  NaN  4
3  1.0  7.0  0
4  1.0  3.0  9
5  7.0  4.0  9
6  2.0  6.0  9
7  9.0  6.0  4
8  3.0  0.0  9
9  9.0  0.0  1

In [98]: pd.isnull(df).sum() > 0
Out[98]:
a     True
b     True
c    False
dtype: bool

or, as @root proposed, a clearer version:

In [5]: df.isnull().any()
Out[5]:
a     True
b     True
c    False
dtype: bool

In [7]: df.columns[df.isnull().any()].tolist()
Out[7]: ['a', 'b']

To select a subset (all columns containing at least one NaN value):

In [31]: df.loc[:, df.isnull().any()]
Out[31]:
     a    b
0  NaN  7.0
1  0.0  NaN
2  2.0  NaN
3  1.0  7.0
4  1.0  3.0
5  7.0  4.0
6  2.0  6.0
7  9.0  6.0
8  3.0  0.0
9  9.0  0.0

How to check if any value is NaN in a Pandas DataFrame

jwilner's response is spot on. I was exploring to see if there's a faster option, since in my experience, summing flat arrays is (strangely) faster than counting. This code seems faster:

df.isnull().values.any()

(Benchmark plot omitted; it is generated by the perfplot script below and shows df.isnull().values.any() as the fastest option.)

import numpy as np
import pandas as pd
import perfplot

def setup(n):
    # random data with the values above 0.9 replaced by NaN
    df = pd.DataFrame(np.random.randn(n))
    df[df > 0.9] = np.nan
    return df

def isnull_any(df):
    return df.isnull().any()

def isnull_values_sum(df):
    return df.isnull().values.sum() > 0

def isnull_sum(df):
    return df.isnull().sum() > 0

def isnull_values_any(df):
    return df.isnull().values.any()

perfplot.save(
    "out.png",
    setup=setup,
    kernels=[isnull_any, isnull_values_sum, isnull_sum, isnull_values_any],
    n_range=[2 ** k for k in range(25)],
)

df.isnull().sum().sum() is a bit slower, but of course it also gives you extra information: the total number of NaNs.
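
For example, a minimal sketch of that trade-off (the toy frame here is my own, not from the benchmark):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0], 'b': [2.0, 3.0]})

df.isnull().values.any()   # True: fastest yes/no answer
df.isnull().sum().sum()    # 1: slower, but counts the NaNs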

Python pandas: filtering out NaN from a data selection of a column of strings

Just drop them:

nms.dropna(thresh=2)

This will drop all rows where there are at least two non-NaN values (this turned out to be mistaken; see the UPDATE below).

Then you could drop the rows where name is NaN:

In [87]: nms
Out[87]:
  movie    name rating
0   thg    John      3
1   thg     NaN      4
3   mol  Graham    NaN
4   lob     NaN    NaN
5   lob     NaN    NaN

[5 rows x 3 columns]
In [89]: nms = nms.dropna(thresh=2)

In [90]: nms[nms.name.notnull()]
Out[90]:
  movie    name rating
0   thg    John      3
3   mol  Graham    NaN

[2 rows x 3 columns]

EDIT

Actually, looking at what you originally want, you can do just this without the dropna call:

nms[nms.name.notnull()]
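
An equivalent spelling (my addition, not from the original answer) uses dropna with the subset argument, which drops exactly the rows where name is NaN:

nms.dropna(subset=['name'])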

UPDATE

Looking at this question 3 years later, there is a mistake: the thresh arg looks for at least n non-NaN values, so in fact the output should be:

In [4]: nms.dropna(thresh=2)
Out[4]:
  movie    name  rating
0   thg    John     3.0
1   thg     NaN     4.0
3   mol  Graham     NaN

Either I was mistaken 3 years ago or the version of pandas I was running had a bug; both scenarios are entirely possible.

How to select rows with NaN in particular column?

Try the following:

df[df['Col2'].isnull()]
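
A minimal self-contained sketch (the frame and values here are made up for illustration; only the column name 'Col2' comes from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4.0, np.nan, 6.0]})

print(df[df['Col2'].isnull()])
#    Col1  Col2
# 1     2   NaN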

Check if columns have a NaN value when a certain column has a specific value in a DataFrame

So you have an if-elif-else situation. We can use np.select for this: it needs the conditions and what to do when each is satisfied:

  • your if is:   "condition is 1 and a, b, c are all NaN"
  • your elif is: "condition is NaN"
  • what remains is the else, as usual
import numpy as np

conditions = [df.condition.eq(1) & df[["a", "b", "c"]].isna().all(axis=1),
              df.condition.isna()]

what_to_do = ["O", "-"]
else_case = "X"

df["check_result"] = np.select(conditions, what_to_do, default=else_case)

df

   condition    a    b    c check_result
0        1.0  NaN  NaN  3.0            X
1        NaN  4.0    2  2.0            -
2        NaN  5.0    e  1.0            -
3        NaN  6.0    2  2.0            -
4        1.0  NaN  NaN  NaN            O

Note that we don't write the else condition; anything not matched by the listed conditions falls through to default.
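
For completeness, a self-contained version of the example (the sample frame below is reconstructed from the output above):

import numpy as np
import pandas as pd

df = pd.DataFrame({'condition': [1.0, np.nan, np.nan, np.nan, 1.0],
                   'a': [np.nan, 4.0, 5.0, 6.0, np.nan],
                   'b': [np.nan, 2, 'e', 2, np.nan],
                   'c': [3.0, 2.0, 1.0, 2.0, np.nan]})

conditions = [df.condition.eq(1) & df[['a', 'b', 'c']].isna().all(axis=1),
              df.condition.isna()]

df['check_result'] = np.select(conditions, ['O', '-'], default='X')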

Pandas select all columns without NaN

You can keep only the columns that are not entirely NaN using

df = df[df.columns[~df.isnull().all()]]

Or

null_cols = df.columns[df.isnull().all()]
df.drop(null_cols, axis = 1, inplace = True)

If you wish to remove columns based on a certain percentage of NaNs, say columns with more than 90% of their data null:

cols_to_delete = df.columns[df.isnull().sum()/len(df) > .90]
df.drop(cols_to_delete, axis = 1, inplace = True)
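
An equivalent spelling (my addition) uses mean(), which gives the NaN fraction per column directly:

cols_to_delete = df.columns[df.isnull().mean() > 0.90]
df = df.drop(columns=cols_to_delete)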

Find columns in a DataFrame where every row has a value

To select the columns with no missing values, use:

df1 = df.loc[:, df.notna().all()]
# older pandas versions
# df1 = df.loc[:, df.notnull().all()]

print (df1)
   B  D
1  2  2
2  2  1
3  3  1

Explanation:

Check for non-missing values with notna:

print (df.notna())
       A     B      C     D
1   True  True   True  True
2   True  True  False  True
3  False  True   True  True

Check if all values in each column are True with DataFrame.all:

print (df.notna().all())
A    False
B     True
C    False
D     True
dtype: bool

To instead require that no values are empty strings, compare with DataFrame.ne (!=):

df = df.loc[:, df.ne('').all()]
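
If you need to guard against NaN and empty strings at the same time, a combined sketch (my addition) is:

mask = df.notna().all() & df.ne('').all()
df1 = df.loc[:, mask]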

How to check if a pandas dataframe contains only numeric values column-wise?

You can check that using to_numeric and coercing errors:

pd.to_numeric(df['column'], errors='coerce').notnull().all()

For all columns, you can iterate through the columns or just use apply:

df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())

E.g.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col' : [1, 2, 10, np.nan, 'a'],
                   'col2': ['a', 10, 30, 40, 50],
                   'col3': [1, 2, 3, 4, 5.0]})

Outputs

col     False
col2    False
col3     True
dtype: bool
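
To keep only the fully numeric columns, the same test can be reused as a column mask (a small extension of the answer above):

numeric_cols = df.columns[df.apply(lambda s: pd.to_numeric(s, errors='coerce').notnull().all())]
df_numeric = df[numeric_cols]   # here: just 'col3'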

