Difference between NaN and None

What is the difference between NaN and None?

NaN is used consistently as the placeholder for missing data in pandas, and that consistency is a good thing: I usually read/translate NaN as "missing". Also see the 'working with missing data' section in the docs.

Wes writes in the docs 'choice of NA-representation':

After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.

...

Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.

Note the "gotcha": integer Series containing missing data are upcast to floats.
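For example (a quick sketch of the upcast; reindexing is one common way missing values sneak into an integer Series):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
s.dtype                        # dtype('int64')

# reindexing introduces a missing value, so the whole Series is upcast
s.reindex([0, 1, 2, 3]).dtype  # dtype('float64')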

In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype, see NA type promotions.

# without forcing dtype=object, the None would be changed to NaN!
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])

In [13]: s_bad.dtype
Out[13]: dtype('O')

In [14]: s_good.dtype
Out[14]: dtype('float64')

Jeff comments (below) on this:

np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.

So repeat 3 times fast: object==bad, float==good

That said, many operations may still work just as well with None as with NaN (but they are not officially supported, i.e. they may sometimes give surprising results):

In [15]: s_bad.sum()
Out[15]: 1

In [16]: s_good.sum()
Out[16]: 1.0
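To make those "surprising results" concrete (behavior observed with recent numpy/pandas; the exact exception text varies by version), numpy ufuncs generally refuse object dtype, so the same call that works on s_good fails on s_bad:

np.log(s_good)  # works: 1 -> 0.0, the NaN stays NaN
np.log(s_bad)   # raises TypeError: ufuncs don't support object dtype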

To answer the second question:

You should be using pd.isnull and pd.notnull to test for missing data (NaN).
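For instance (a quick check; both helpers treat None and np.nan alike):

pd.isnull(np.nan)  # True
pd.isnull(None)    # True
pd.notnull(1)      # True
pd.isnull(s_good)  # elementwise on a Series: 0 -> False, 1 -> True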

How to make pandas discern the difference between None and NaN in python?

If pandas interprets a column's dtype as numeric, then all nulls, whether None or np.nan, will become np.nan. The only way for pandas to preserve None and np.nan in the same column is for the dtype to be object. However, it is important to point out that with an object dtype you lose many of the benefits of a numeric dtype, like efficient vectorized calculations.

pd.Series([1, None, np.nan, 2])

0    1.0
1    NaN
2    NaN
3    2.0
dtype: float64

pd.Series([1, None, np.nan, 2], dtype=object)

0       1
1    None
2     NaN
3       2
dtype: object

s1 = pd.Series([1, None, np.nan, 2])
s2 = pd.Series([1, None, np.nan, 2], dtype=object)

%timeit s1 + 1
%timeit s2 + 1

68 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
169 µs ± 5.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Differences between null and NaN in spark? How to deal with it?

A null value represents "no value" or "nothing"; it's not even an empty string or zero. It can be used to indicate that nothing useful exists.

NaN stands for "Not a Number"; it's usually the result of a mathematical operation that doesn't make sense, e.g. 0.0/0.0.

One possible way to handle null values is to remove them with:

df.na.drop()

Or you can change them to an actual value (here I used 0) with:

df.na.fill(0)

Another way would be to select the rows where a specific column is null (or not null) for further processing:

from pyspark.sql.functions import col

df.where(col("a").isNull())
df.where(col("a").isNotNull())

Rows with NaN can also be selected using the equivalent method:

from pyspark.sql.functions import isnan
df.where(isnan(col("a")))
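Putting the two side by side makes the distinction visible. Below is a minimal sketch, assuming a running SparkSession; the column name a and the sample values are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan

spark = SparkSession.builder.getOrCreate()

# one float column containing a regular value, a null and a NaN
df = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["a"])

df.where(col("a").isNull()).show()  # matches only the null row
df.where(isnan(col("a"))).show()    # matches only the NaN row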

Why does pandas use NaN from numpy, instead of its own null value?

A main dependency of pandas is numpy; in other words, pandas is built on top of numpy. Because pandas inherits and uses many of the numpy methods, it makes sense to keep things consistent, that is, missing numeric data are represented with np.nan.

(This choice to build upon numpy has consequences for other things too. For instance, date and time operations are built upon the np.timedelta64 and np.datetime64 dtypes, not the standard library datetime module.)
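For instance (a small sketch; the missing value for datetimes is NaT, not NaN):

s = pd.Series([pd.Timestamp("2021-01-01"), None])
#0   2021-01-01
#1          NaT  <- None became NaT, the datetime64 missing value
#dtype: datetime64[ns]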


One thing you may not have known is that numpy has always shipped right alongside pandas:

import pandas as pd
pd.np?
pd.np.nan

Though it might seem convenient that you don't have to import numpy yourself, this access path is discouraged and will in the near future be deprecated in favor of importing numpy directly:

FutureWarning: The pandas.np module is deprecated and will be removed
from pandas in a future version. Import numpy directly instead


Is it conventional to use np.nan (rather than None) to represent null values in pandas?

If the data are numeric then yes, you should use np.nan. None requires the dtype to be object, and with pandas you want numeric data stored in a numeric dtype. pandas will generally coerce to the proper null type upon creation or import so that it can use the correct dtype:

pd.Series([1, None])
#0 1.0
#1 NaN <- None became NaN so it can have dtype: float64
#dtype: float64

Why did pandas not have its own null value for most of its lifetime (until last year)? What was the motivation for adding it?

pandas did not have its own null value because it got by with np.nan, which worked for the majority of circumstances. However, missing data is very common with pandas; an entire section of the documentation is devoted to handling it. NaN, being a float, does not fit into an integer container, which means that any numeric Series with missing data is upcast to float. This can become problematic because of floating-point math: some integers cannot be represented exactly by a floating-point number, so joins or merges on such columns could fail.

# Gets upcast to float
pd.Series([1, 2, np.nan])
#0 1.0
#1 2.0
#2 NaN
#dtype: float64

# Can safely do merges/joins/math because things are still Int
pd.Series([1, 2, np.nan]).astype('Int64')
#0 1
#1 2
#2 <NA>
#dtype: Int64
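To make the floating-point caveat concrete (plain Python arithmetic; float64 has a 53-bit significand, so the first integer it cannot represent exactly is 2**53 + 1):

n = 2**53 + 1
float(n) == n  # False: float(n) rounds to 2**53
int(float(n))  # 9007199254740992, i.e. 2**53 -- the original value is lost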

Why does pandas isnull() work but ==None not work?

As the comment above states, missing data in pandas is represented by NaN, where NaN is a numerical value, i.e. of float type. However, None is a Python NoneType, so NaN will not be equivalent to None.

In [27]: np.nan == None
Out[27]: False

In this GitHub thread they discuss it further, noting:

This was done quite a while ago to make the behavior of nulls consistent, in that they don't compare equal. This puts None and np.nan on an equal (though not-consistent with python, BUT consistent with numpy) footing.

This means that when you do df[df['label'] == None], you're elementwise checking whether np.nan == np.nan, which we know is False.

In [63]: np.nan == np.nan
Out[63]: False

Additionally, you should not do df[df['label'] == None] when applying Boolean indexing; using == with a NoneType is not best practice, as PEP 8 mentions:

Comparisons to singletons like None should always be done with is or is not, never the equality operators.

For example, you could do tst.value.apply(lambda x: x is None), which yields the same outcome as .isnull(), illustrating how pandas treats these as NaNs. Note this refers to the tst dataframe example below, where tst.value has object dtype and I've explicitly specified the NoneType elements.

There is a nice example in the pandas docs which illustrates this and its effect.

For example, if you have two columns, one of float type and the other object, you can see how pandas deals with the None type in a nice way; notice that for float it uses NaN.

In [32]: tst = pd.DataFrame({"label" : [1, 2, None, 3, None], "value" : ["A", "B", None, "C", None]})

In [39]: tst
Out[39]:
   label value
0    1.0     A
1    2.0     B
2    NaN  None
3    3.0     C
4    NaN  None

In [51]: type(tst.value[2])
Out[51]: NoneType

In [52]: type(tst.label[2])
Out[52]: numpy.float64
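As a quick check (continuing the same session; the In/Out numbers are illustrative), the lambda test mentioned above really does match .isnull() on this object column:

In [53]: tst.value.apply(lambda x: x is None)
Out[53]:
0    False
1    False
2     True
3    False
4     True
Name: value, dtype: bool

In [54]: (tst.value.apply(lambda x: x is None) == tst.value.isnull()).all()
Out[54]: True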

This post explains the difference between NaN and None really well; definitely take a look at it:

  • What is the difference between NaN and None?

