What is the difference between NaN and None?
NaN is used consistently in pandas as a placeholder for missing data, and that consistency is valuable: I usually read/translate NaN as "missing". Also see the 'Working with missing data' section in the docs.
Wes writes in the docs 'choice of NA-representation':
After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
...
Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
Note the "gotcha": integer Series containing missing data are upcast to float.
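A quick sketch of that upcast (the variable names are just for illustration): reindexing an integer Series to a longer index introduces a missing value, which forces the whole Series to float:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)               # int64

# Reindexing to a longer index introduces a NaN at position 3...
s2 = s.reindex([0, 1, 2, 3])
print(s2.dtype)              # ...so the whole Series is upcast to float64
```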
In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype, see NA type promotions.
import numpy as np
import pandas as pd

# without forcing dtype=object, pandas converts None to NaN!
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])
In [13]: s_bad.dtype
Out[13]: dtype('O')
In [14]: s_good.dtype
Out[14]: dtype('float64')
Jeff comments (below) on this:
np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object dtype, which basically disables all efficiency in numpy. So repeat 3 times fast: object==bad, float==good
That said, many operations may still work just as well with None as with NaN (but they are not generally supported, i.e. they may sometimes give surprising results):
In [15]: s_bad.sum()
Out[15]: 1
In [16]: s_good.sum()
Out[16]: 1.0
To answer the second question:
You should use pd.isnull and pd.notnull to test for missing data (NaN).
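A minimal sketch of why these helpers are preferable to equality checks: pd.isnull and pd.notnull detect both np.nan and None, regardless of dtype:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])           # float64 Series with NaN
t = pd.Series(["a", None], dtype=object)  # object Series with None

print(pd.isnull(s).tolist())   # [False, True]  -- catches np.nan
print(pd.isnull(t).tolist())   # [False, True]  -- catches None too
print(pd.notnull(s).tolist())  # [True, False]
```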
How to make pandas discern the difference between None and NaN in python?
If pandas interprets a column's dtype as numeric, then all nulls, whether None or np.nan, will become np.nan. The only way for pandas to preserve both None and np.nan in the same column is to have the dtype be object. However, it is important to point out that with an object dtype you lose many of the benefits of a numeric dtype, such as efficient calculations.
pd.Series([1, None, np.nan, 2])
0 1.0
1 NaN
2 NaN
3 2.0
dtype: float64
pd.Series([1, None, np.nan, 2], dtype=object)
0 1
1 None
2 NaN
3 2
dtype: object
s1 = pd.Series([1, None, np.nan, 2])
s2 = pd.Series([1, None, np.nan, 2], dtype=object)
%timeit s1 + 1
%timeit s2 + 1
68 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
169 µs ± 5.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Differences between null and NaN in spark? How to deal with it?
null represents "no value" or "nothing"; it's not even an empty string or zero. It can be used to indicate that nothing useful exists.
NaN stands for "Not a Number"; it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.
One possible way to handle null values is to remove them with:
df.na.drop()
Or you can change them to an actual value (here I used 0) with:
df.na.fill(0)
Another way would be to select the rows where a specific column is null for further processing:
df.where(col("a").isNull())
df.where(col("a").isNotNull())
Rows with NaN can also be selected using the equivalent method:
from pyspark.sql.functions import isnan
df.where(isnan(col("a")))
Why does pandas use NaN from numpy, instead of its own null value?
A main dependency of pandas is numpy; in other words, pandas is built on top of numpy. Because pandas inherits and uses many numpy methods, it makes sense to keep things consistent, that is, missing numeric data are represented with np.NaN.
(This choice to build upon numpy has consequences for other things too. For instance, date and time operations are built upon the np.timedelta64 and np.datetime64 dtypes, not the standard datetime module.)
One thing you may not have known is that numpy has always been there with pandas:
import pandas as pd
pd.np?
pd.np.nan
Though you might think this behavior is convenient since it saves you an import of numpy, it is discouraged and will soon be deprecated in favor of importing numpy directly:
FutureWarning: The pandas.np module is deprecated and will be removed
from pandas in a future version. Import numpy directly instead
Is it conventional to use np.nan (rather than None) to represent null values in pandas?
If the data are numeric then yes, you should use np.NaN. None requires the dtype to be object, and with pandas you want numeric data stored in a numeric dtype. pandas will generally coerce to the proper null type upon creation or import so that it can use the correct dtype:
pd.Series([1, None])
#0 1.0
#1 NaN <- None became NaN so it can have dtype: float64
#dtype: float64
Why did pandas not have its own null value for most of its lifetime (until last year)? What was the motivation for adding one?
pandas did not have its own null value because it got by with np.NaN, which worked for the majority of circumstances. However, with pandas it's very common to have missing data; an entire section of the documentation is devoted to this. NaN, being a float, does not fit into an integer container, which means that any numeric Series with missing data is upcast to float. This can become problematic because of floating point math: some integers cannot be represented perfectly by a floating point number, so joins or merges could possibly fail.
# Gets upcast to float
pd.Series([1,2,np.NaN])
#0 1.0
#1 2.0
#2 NaN
#dtype: float64
# Can safely do merges/joins/math because things are still Int
pd.Series([1,2,np.NaN]).astype('Int64')
#0 1
#1 2
#2 <NA>
#dtype: Int64
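The missing value inside those nullable dtypes is pd.NA, the dedicated null scalar pandas introduced with the extension dtypes; a minimal sketch of how it behaves:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")  # nullable integer dtype

print(s.dtype)            # Int64 (capital I -- the extension dtype)
print(s.isna().tolist())  # [False, False, True]
print(s[2] is pd.NA)      # True -- the missing value is pd.NA, not np.nan
```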
Why does pandas isnull() work but ==None not work?
As the comment above states, missing data in pandas is represented by NaN, where NaN is a numerical value, i.e. of float type. However, None is a Python NoneType, so NaN will not be equivalent to None.
In [27]: np.nan == None
Out[27]: False
In this Github thread they discuss further, noting:
This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts None and np.nan on an equal (though not-consistent with python, BUT consistent with numpy) footing.
This means when you do df[df['label'] == None], you're checking elementwise whether np.nan == np.nan, which we know is False.
In [63]: np.nan == np.nan
Out[63]: False
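Because of this, a missing value can only be detected with a dedicated check, never with ==; a short sketch of the checks that do work (pd.isna also covers None):

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False -- NaN never compares equal, per IEEE 754
print(np.isnan(np.nan))  # True  -- works for floats, but raises on None
print(pd.isna(np.nan))   # True  -- pandas' check handles np.nan...
print(pd.isna(None))     # True  -- ...and None
```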
Additionally, you should not do df[df['label'] == None] when applying Boolean indexing; using == for a NoneType is not best practice, as PEP 8 mentions:
Comparisons to singletons like None should always be done with is or is not, never the equality operators.
For example, you could do tst.value.apply(lambda x: x is None), which yields the same outcome as .isnull(), illustrating how pandas treats these as NaNs. Note this is for the tst dataframe example below, where tst.value.dtypes is object and I've explicitly specified the NoneType elements.
There is a nice example in the pandas docs which illustrates this and its effect. For example, if you have two columns, one of type float and the other object, you can see how pandas deals with the None type in a nice way; notice that for float it uses NaN.
In [32]: tst = pd.DataFrame({"label" : [1, 2, None, 3, None], "value" : ["A", "B", None, "C", None]})

In [33]: tst
Out[33]:
   label value
0    1.0     A
1    2.0     B
2    NaN  None
3    3.0     C
4    NaN  None
In [51]: type(tst.value[2])
Out[51]: NoneType
In [52]: type(tst.label[2])
Out[52]: numpy.float64