Pd.Timestamp Versus Np.Datetime64: Are They Interchangeable for Selected Uses

pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?

In my opinion, you should always prefer using a Timestamp - it can easily be converted back into a numpy datetime64 whenever that is needed.

numpy.datetime64 is essentially a thin wrapper around an int64. It has almost no date/time-specific functionality.

pd.Timestamp is a wrapper around a numpy.datetime64. It is backed by the same int64 value, but supports the entire datetime.datetime interface, along with useful pandas-specific functionality.

The in-array representation of the two is identical - a contiguous array of int64s. pd.Timestamp is a scalar box that makes working with individual values easier.
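A minimal round-trip sketch (names are illustrative) showing the conversion both ways:

import pandas as pd

ts = pd.Timestamp('2011-01-02 03:04:05')

# Timestamp -> datetime64; .asm8 is shorthand for .to_datetime64()
dt64 = ts.to_datetime64()   # numpy.datetime64('2011-01-02T03:04:05.000000000')
same = ts.asm8              # the same numpy scalar

# datetime64 -> Timestamp
back = pd.Timestamp(dt64)
assert back == ts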

Going back to the linked answer, you could write it like this, which is shorter and happens to be faster.

%timeit (df.index.values >= pd.Timestamp('2011-01-02').to_datetime64()) & \
(df.index.values < pd.Timestamp('2011-01-03').to_datetime64())
192 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
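The frame df isn't shown in the original answer; assuming it is indexed by a DatetimeIndex, a minimal setup for the timing above might look like this:

import numpy as np
import pandas as pd

# Hypothetical frame with a DatetimeIndex
idx = pd.date_range('2011-01-01', '2011-01-05', freq='h')
df = pd.DataFrame({'x': np.arange(len(idx))}, index=idx)

mask = (df.index.values >= pd.Timestamp('2011-01-02').to_datetime64()) & \
       (df.index.values < pd.Timestamp('2011-01-03').to_datetime64())
df[mask]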

Strange behaviour when comparing Timestamp and datetime64 in Python2.7

That's an interesting question. I've done some digging and did my best to explain some of this, although one thing I still don't understand is why pandas throws the error rather than numpy when we do b < a.

Regarding your question:

If a can be compared to b, I thought we should be able to compare the other way around?

That's not necessarily true. It just depends on the implementation of the comparison operators.

Take this test class for example:

class TestCom(int):
    def __init__(self, a):
        self.value = a

    def __gt__(self, other):
        print('TestComp __gt__ called')
        return True

    def __eq__(self, other):
        return self.value == other
Here I have defined my __gt__ (>) method to always return True, no matter what the other value is, while __eq__ (==) is left as an ordinary equality check.

Now check the following comparisons out:

a = TestCom(9)
print(a)
# Output: 9

# my def of __gt__
a > 100

# Output: TestComp __gt__ called
# True

a > '100'
# Output: TestComp __gt__ called
# True

'100' > a

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-486-8aee1b1d2500> in <module>()
1 # this will not use my def of __gt__
----> 2 '100' > a

TypeError: '>' not supported between instances of 'str' and 'TestCom'

So going back to your case and looking at the Timestamp source code (timestamps_sourceCode), the only thing I can think of is that pandas.Timestamp does some type checking and converts the other operand if possible.

When we compare a with b (pd.Timestamp against np.datetime64), the Timestamp.__richcmp__ function does the comparison; if the other operand is of type np.datetime64, it is converted to a pd.Timestamp and then compared.

# we can do the following to have a comparison of say b > a
# this converts a to np.datetime64 - .asm8 is equivalent to .to_datetime64()
b > a.asm8

# or we can convert b to datetime64[ms]
b.astype('datetime64[ms]') > a

# or convert to timestamp
pd.to_datetime(b) > a

What I found surprising - since I thought the issue was that Timestamp lacked nanoseconds - is that even if you do the following, the comparison between np.datetime64 and pd.Timestamp still fails.

a = pd.Timestamp('2013-03-24 05:32:00.00000001')
a.nanosecond # returns 10
# doing the comparison again where they're both ns still fails
b < a

Looking at the source code, it seems like we can use the == and != operators. But even they don't work as expected. Take a look at the following example:

a = pd.Timestamp('2013-03-24 05:32:00.00000000')
b = np.datetime64('2013-03-24 05:32:00.00000000', 'ns')

b == a # returns False

a == b # returns True

I think this is the result of lines 149-152 or 163-166, where they return False if you're using == and True for !=, without actually comparing the values.

Edit:
Nanosecond support was added in version 0.23.0, so you can do something like pd.Timestamp('2013-03-23T05:33:00.000000022', unit='ns'). So yes, when you compare against np.datetime64 it will be converted to a pd.Timestamp with nanosecond precision.

Just note that pd.Timestamp is intended to be a replacement for Python's datetime:

Timestamp is the pandas equivalent of python's Datetime
and is interchangeable with it in most cases.

But Python's datetime doesn't support nanoseconds - there is a good answer here (SO_Datetime) explaining why. pd.Timestamp does support comparison between the two even if your Timestamp has nanoseconds in it: when you compare a datetime object against a pd.Timestamp object carrying nanoseconds, the _compare_outside_nanorange method does the comparison.
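For illustration, a small sketch (assuming a recent pandas version) comparing a plain datetime against a Timestamp that carries nanoseconds:

from datetime import datetime
import pandas as pd

py_dt = datetime(2013, 3, 24, 5, 32, 0)               # microsecond precision at most
ts = pd.Timestamp('2013-03-24 05:32:00.000000010')    # 10 ns past py_dt

print(py_dt < ts)    # True  - the Timestamp's richcmp handles the nanoseconds
print(ts > py_dt)    # True
print(py_dt == ts)   # False - the extra 10 ns make them unequal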

Going back to np.datetime64, one thing to note here, as explained nicely in this post (SO), is that it's a wrapper around an int64 type. So it's not surprising that if I do the following:

1 > a
a > 1

both will throw an error: Cannot compare type 'Timestamp' with type 'int'.

So under the hood, when you do b > a the comparison must be done at the int level, and this comparison will be done by the np.greater() function (np.greater) - also take a look at the ufunc_docs.

Note: I'm unable to confirm this; the numpy docs are too complex to go through. If any numpy experts can comment on this, that would be helpful.

If this is the case - if the comparison of np.datetime64 is based on int - then the example above with a == b and b == a makes sense. When we do b == a we compare the int value of b against a pd.Timestamp, which will always return False for == and True for !=.

It's the same as doing, say, 123 == '123': the operation will not fail, it will just return False.
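A small illustration of the shared int64 backing (assuming nanosecond resolution on both sides):

import numpy as np
import pandas as pd

a = pd.Timestamp('2013-03-24 05:32:00')
b = np.datetime64('2013-03-24 05:32:00', 'ns')

# Both are backed by the same int64: nanoseconds since the Unix epoch
print(a.value)                       # integer nanoseconds
print(b.astype('int64'))             # the same integer
print(a.value == b.astype('int64'))  # True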

Pandas DatetimeIndex indexing dtype: datetime64 vs Timestamp

You are using numpy functions to manipulate pandas types. They are not always compatible.

The function np.in1d first converts both of its arguments to ndarrays. A DatetimeIndex has a built-in conversion, and an array of dtype np.datetime64 is returned (it's DatetimeIndex.values). But a Timestamp doesn't have such a facility, so it isn't converted.

Instead, you can use, for example, the Python keyword in (the most natural way):

a_datetimeindex[0] in a_datetimeindex

or the Index.isin method for a collection of elements:

a_datetimeindex.isin(a_list_or_index)

If you want to use np.in1d, explicitly convert both arguments to numpy types. Or call it on the underlying numpy arrays:

np.in1d(a_datetimeindex.values[0], a_datetimeindex.values)

Alternatively, it's probably safe to use np.in1d with two collections of the same type:

np.in1d(a_datetimeindex, another_datetimeindex)

or even

np.in1d(a_datetimeindex[[0]], a_datetimeindex)
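Putting the options together, a small self-contained sketch with a hypothetical index:

import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', periods=5, freq='D')

idx[0] in idx                         # True  - the `in` keyword
idx.isin([idx[0], idx[3]])            # array([ True, False, False,  True, False])
np.in1d(idx.values[[0]], idx.values)  # array([ True]) - both sides are numpy arrays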

Pandas Timestamp to datetime.datetime()

You can use pd.DatetimeIndex and its difference method. In general, using set with Pandas / NumPy objects is inefficient. Related: Pandas pd.Series.isin performance with set versus array.

import pandas as pd
from datetime import datetime

df = pd.DataFrame({"my_date": [pd.Timestamp('2019-01-01 00:00:00', tz=None),
                               pd.Timestamp('2019-01-10 00:00:00', tz=None)]})

datetime_list = [datetime(2019, 1, 1, 0, 0, 0)]

diff = pd.DatetimeIndex(df['my_date']).difference(pd.DatetimeIndex(datetime_list))

# DatetimeIndex(['2019-01-10'], dtype='datetime64[ns]', freq=None)

Numpy.minimum with Pandas.Series of Timestamps TypeError: Cannot compare 'Timestamp' with 'int'

In short: instead of using pd.to_datetime to create the upper bound, use np.datetime64.

s = pd.Series([pd.to_datetime('2018-01-16 21:44:00'), pd.to_datetime('2018-01-16 21:41:00')])
print (np.minimum(s, np.datetime64('2018-01-16 21:43:00')))
0 2018-01-16 21:43:00
1 2018-01-16 21:41:00
dtype: datetime64[ns]

Or even np.minimum(s, pd.to_datetime('2018-01-16 21:43:00').to_datetime64()) works.

To see a bit more: if you have a look at the dtype, or even at the element representation, of the two ways you create your data, you can see the differences.

print (s.values)
array(['2018-01-16T21:44:00.000000000', '2018-01-16T21:41:00.000000000'],
dtype='datetime64[ns]')
print (np.array([pd.to_datetime('2018-01-16 21:44:00'), pd.to_datetime('2018-01-16 21:41:00')]))
array([Timestamp('2018-01-16 21:44:00'), Timestamp('2018-01-16 21:41:00')],
dtype=object)

One interesting thing is to change the type of s.values, such as:

print (np.minimum(s.values.astype('datetime64[s]'), 
pd.to_datetime('2018-01-16 21:43:00')))
array([Timestamp('2018-01-16 21:43:00'),
datetime.datetime(2018, 1, 16, 21, 41)], dtype=object)

It works, but you can see that one element is a Timestamp and the other one is a datetime. It seems that when the type of s.values is datetime64[ns] the comparison is not possible, while with datetime64[s] or even datetime64[ms] it works.

Also have a look at this answer, it may help.

Is pandas.Timestamp immutable?

After looking at the source code, I found out that it inherits from datetime.datetime, which is immutable.

# in pandas/_lib/tslibs/timestamp.pyx
cdef class _Timestamp(datetime):
    # ...

class Timestamp(_Timestamp):  # This is the class that is exported

If you look inside the Python implementation of datetime, you see that it is intended to be immutable (via read-only properties):

# Read-only field accessors
@property
def year(self):
    """year (1-9999)"""
    return self._year

@property
def month(self):
    """month (1-12)"""
    return self._month

@property
def day(self):
    """day (1-31)"""
    return self._day

@property
def hour(self):
    """hour (0-23)"""
    return self._hour

@property
def minute(self):
    """minute (0-59)"""
    return self._minute

@property
def second(self):
    """second (0-59)"""
    return self._second

@property
def microsecond(self):
    """microsecond (0-999999)"""
    return self._microsecond
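A quick sketch of what that immutability means in practice (the exact error message may vary between versions):

import pandas as pd

ts = pd.Timestamp('2019-01-01 12:00:00')

try:
    ts.year = 2020          # the fields are read-only
except AttributeError as err:
    print(err)

# To "change" a field, build a new object instead, e.g. with replace()
ts2 = ts.replace(year=2020)
print(ts, ts2)              # 2019-01-01 12:00:00  2020-01-01 12:00:00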

TypeError: Cannot compare type 'Timestamp' with type 'date'

This converts it to date:

data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y").dt.date

However, I would not recommend filtering like this. This is much faster:

data_entries[data_entries['VOUCHER DATE'].between(start_date, end_date)]

Read this article.
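For illustration, a minimal self-contained version of that filter, using hypothetical data:

import pandas as pd

data_entries = pd.DataFrame({'VOUCHER DATE': ['01/05/2019', '02/20/2019', '03/15/2019']})
data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y")

# keep the column as datetime64[ns] and compare against Timestamps directly
start_date = pd.Timestamp('2019-01-01')
end_date = pd.Timestamp('2019-02-28')
data_entries[data_entries['VOUCHER DATE'].between(start_date, end_date)]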

How to convert from pandas.DatetimeIndex to numpy.datetime64?

The data inside is of datetime64 dtype (datetime64[ns], to be precise). Just take the values attribute of the index. Note that it will be in nanosecond units.
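A short sketch:

import pandas as pd

idx = pd.date_range('2021-01-01', periods=3, freq='D')
arr = idx.values            # numpy array, dtype datetime64[ns]
print(arr.dtype)            # datetime64[ns]
print(idx.to_numpy())       # the more explicit modern spelling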

Pandas optimise datetime comparison on two columns

The bottleneck is construction of the Boolean series / array used for indexing.

Dropping down to NumPy seems to give a reasonable (~2x) performance improvement. See related: pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?

# boundaries for testing
mindt = pd.to_datetime('2016-01-01')
maxdt = pd.to_datetime('2017-01-01')

x = ((df['start'] <= mindt) & (df['end'] >= maxdt)).values
y = (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())

# check results are the same
assert np.array_equal(x, y)

%timeit (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())
# 55.6 ms per loop

%timeit (df['start'] <= mindt) & (df['end'] >= maxdt)
# 108 ms per loop

Setup

import numpy as np
import pandas as pd

np.random.seed(0)

def random_dates(start, end, n):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    cols = ['start', 'end']
    df = pd.DataFrame({col: pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s') for col in cols})
    df = pd.DataFrame(np.sort(df.values, axis=1), columns=cols)
    df[cols] = df[cols].apply(pd.to_datetime, errors='raise')
    return df

# construct a dataframe of random dates
df = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01'), 10**7)

