How to Drop Rows of Pandas Dataframe Whose Value in a Certain Column Is Nan

How to drop rows of Pandas DataFrame whose value in a certain column is NaN

Don't drop, just take the rows where EPS is not NA:

df = df[df['EPS'].notna()]
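
A minimal, self-contained sketch of this (the EPS column name comes from the question; the ticker column and values are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ticker': ['A', 'B', 'C'],
                   'EPS': [1.2, np.nan, 3.4]})

df = df[df['EPS'].notna()]
print (df)
#   ticker  EPS
# 0      A  1.2
# 2      C  3.4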

Python: How to drop a row whose particular column is empty/NaN?

Use dropna with the subset parameter to specify which column(s) to check for NaNs:

data = data.dropna(subset=['sms'])
print (data)
   id city department   sms  category
1   2  lhr    revenue  good         1
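
The question's input frame is not shown, so this reconstruction is a guess (the values in the dropped rows are invented); it is consistent with the output above and can be used to run the alternatives below:

import numpy as np
import pandas as pd

data = pd.DataFrame({'id': [1, 2, 3],
                     'city': ['khi', 'lhr', 'isb'],
                     'department': ['sales', 'revenue', 'hr'],
                     'sms': [np.nan, 'good', np.nan],
                     'category': [0, 1, 2]})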

Another solution with boolean indexing and notnull:

data = data[data['sms'].notnull()]
print (data)
   id city department   sms  category
1   2  lhr    revenue  good         1

Alternative with query (this works because NaN compares unequal to itself, so "sms == sms" is False exactly for the missing values):

print (data.query("sms == sms"))
   id city department   sms  category
1   2  lhr    revenue  good         1

Timings

#[300000 rows x 5 columns]
data = pd.concat([data]*100000).reset_index(drop=True)

In [123]: %timeit (data.dropna(subset=['sms']))
100 loops, best of 3: 19.5 ms per loop

In [124]: %timeit (data[data['sms'].notnull()])
100 loops, best of 3: 13.8 ms per loop

In [125]: %timeit (data.query("sms == sms"))
10 loops, best of 3: 23.6 ms per loop

How to drop rows with NaN in a column in a pandas DataFrame?

I think what you're doing is taking one column from a DataFrame, removing all the NaNs from it, and then adding that column back to the same DataFrame, where any positions missing from the shortened index are filled with NaNs again.

Do you want to remove that row from the entire DataFrame? If so, try df.dropna(subset=["col1"]).
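
A short sketch of that pitfall (the column name col1 comes from the answer; the data is invented): assigning the cleaned Series back aligns on the index, so the dropped rows reappear as NaN.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1.0, np.nan, 3.0]})

clean = df['col1'].dropna()   # index {0, 2}; the NaN row is gone
df['col1'] = clean            # assignment aligns on index: row 1 is NaN again
print (df)
#    col1
# 0   1.0
# 1   NaN
# 2   3.0

df = df.dropna(subset=['col1'])   # removes row 1 from the whole frame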

Trying to Drop values by column (I convert these values to nan but could be anything) not working

Passing axis is not supported for Dask DataFrames as of now. You can also print the function's docstring via ddf.dropna? and it will tell you the same:

Signature: ddf.dropna(how='any', subset=None, thresh=None)
Docstring:
Remove missing values.

This docstring was copied from pandas.core.frame.DataFrame.dropna.

Some inconsistencies with the Dask version may exist.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0 (Not supported in Dask)
Determine if rows or columns which contain missing values are
removed.

* 0, or 'index' : Drop rows which contain missing values.
* 1, or 'columns' : Drop columns which contain missing value.

.. versionchanged:: 1.0.0

Pass tuple or list to drop on multiple axes.
Only a single axis is allowed.

how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.

* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.

thresh : int, optional
Require that many non-NA values.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False (Not supported in Dask)
If True, do operation inplace and return None.

Returns
-------
DataFrame or None
DataFrame with NA entries dropped from it or None if ``inplace=True``.

Worth noting that the Dask documentation is copied from pandas in many instances like this, but wherever that happens, it specifically states:

This docstring was copied from pandas.core.frame.DataFrame.drop. Some
inconsistencies with the Dask version may exist.

Therefore it's always best to check the docstring of Dask's pandas-derived functions instead of relying on the documentation alone.
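
A minimal sketch of the supported route, assuming Dask is installed (the frame and column names are invented):

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [None, 5.0, 6.0]})
ddf = dd.from_pandas(pdf, npartitions=1)

# how/subset/thresh are accepted; axis is not (Dask drops rows only)
print (ddf.dropna(subset=['a']).compute())
#      a    b
# 0  1.0  NaN
# 2  3.0  6.0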

How to drop rows in a df based on NaN values in specific columns not using column names but integer position for the subset?

Instead of deleting the rows that you do not want, try keeping those that you want:

df[df.iloc[:,[2,3]].notnull().all(axis=1)]

But what is wrong with getting the column names by index?

df.dropna(subset=df.columns[[2,3]])
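
A quick sketch on an invented frame showing that the two forms agree:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4],
                   'c': [np.nan, 5.0], 'd': [6.0, 7.0]})

# keep rows where the columns at positions 2 and 3 are both non-null
kept = df[df.iloc[:, [2, 3]].notnull().all(axis=1)]

# equivalent: resolve the positions to labels first
same = df.dropna(subset=df.columns[[2, 3]])

print (kept.equals(same))   # True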

Drop row if column entry contains NaN

You can identify the index positions where the exploded values are NaN and then filter the original frame to the rows whose index is not in that array:

import numpy as np
import pandas as pd

ser = pd.DataFrame(data={"col": [[1, 2, 3, np.nan, np.nan], [3, 4, 5], [3, 9], [np.nan, 10]]})['col']

ser_exploded = ser.explode()
ser[~ser.index.isin(np.unique(ser_exploded[ser_exploded.isna()].index))]

1    [3, 4, 5]
2       [3, 9]
Name: col, dtype: object

How to drop entire record if more than 90% of features have missing value in pandas

You can use df.dropna() and set the thresh parameter to the minimum number of non-NA values a row must have to be kept. "More than 90% missing" means keeping rows with at least 10% of the columns filled, so thresh=50 here presumably corresponds to a 500-column frame:

df.dropna(axis=0, thresh=50, inplace=True)
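
Rather than hard-coding the threshold, it can be derived from the column count; a small self-contained sketch (the toy frame is invented):

import math
import numpy as np
import pandas as pd

# toy frame: 10 columns, so the 10% threshold is 1 non-NA value
df = pd.DataFrame(np.full((3, 10), np.nan))
df.iloc[0, 0] = 1.0                          # only row 0 has any data

min_non_na = math.ceil(0.1 * df.shape[1])    # 1 here; 50 for 500 columns
print (df.dropna(axis=0, thresh=min_non_na)) # keeps only row 0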

Squeeze dataframe rows with missing values

For each row, remove missing values with Series.dropna, rename the positional columns back to the original names with a dictionary, and finally restore any all-missing columns with DataFrame.reindex:

df = (df1.apply(lambda x: pd.Series(x.dropna().to_numpy()), axis=1)
         .rename(columns=dict(enumerate(df1.columns)))
         .reindex(df1.columns, axis=1))

print (df)
       A      B   C
0      1  100.0 NaN
1      2   20.0 NaN
2  300.0    NaN NaN
3    bla  400.0 NaN
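
The input df1 is not shown in the question, so this reconstruction is a guess; it is consistent with both outputs in this answer (the <NA> markers in the last output suggest column B used a nullable dtype):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, np.nan, 'bla'],
                    'B': pd.array([np.nan, 20.0, 300.0, np.nan], dtype='Float64'),
                    'C': [100.0, np.nan, np.nan, 400.0]})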

Another idea:

df = (df1.apply(lambda x: x.sort_values(key=lambda x: x.isna()).to_numpy(),
                axis=1,
                result_type='expand')
         .set_axis(df1.columns, axis=1)
         .mask(lambda x: x.isna())
      )
print (df)
       A      B   C
0      1  100.0 NaN
1      2   20.0 NaN
2  300.0    NaN NaN
3    bla  400.0 NaN


Without the final mask step, the original missing-value markers survive as they are, which is why <NA> and NaN both appear below:

df = (df1.apply(lambda x: x.sort_values(key=lambda x: x.isna()).to_numpy(),
                axis=1,
                result_type='expand')
         .set_axis(df1.columns, axis=1)
      )
print (df)
       A      B     C
0      1  100.0  <NA>
1      2   20.0   NaN
2  300.0    NaN   NaN
3    bla  400.0  <NA>

