pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?
In my opinion, you should always prefer using a Timestamp: it can easily be converted back into a numpy datetime when needed.
numpy.datetime64 is essentially a thin wrapper for int64. It has almost no date/time-specific functionality.
pd.Timestamp is a wrapper around a numpy.datetime64. It is backed by the same int64 value, but supports the entire datetime.datetime interface, along with useful pandas-specific functionality.
The in-array representation of these two is identical: it is a contiguous array of int64s. pd.Timestamp is a scalar box that makes working with individual values easier.
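The shared int64 backing can be checked directly. The following is a minimal sketch; the variable names are chosen for illustration only.

```python
import datetime

import numpy as np
import pandas as pd

ts = pd.Timestamp("2011-01-02")
dt64 = np.datetime64("2011-01-02", "ns")

# Timestamp.value is the number of nanoseconds since the Unix epoch.
ns_from_timestamp = ts.value
# Casting a nanosecond-resolution datetime64 to int64 exposes the same count.
ns_from_datetime64 = int(dt64.astype("int64"))

# Timestamp is also a datetime.datetime subclass, so the stdlib API is available.
is_datetime = isinstance(ts, datetime.datetime)

# Round trip back to NumPy when a raw datetime64 is needed.
round_trip = ts.to_datetime64()

print(ns_from_timestamp == ns_from_datetime64, is_datetime, round_trip == dt64)
```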
Going back to the linked answer, you could write it like this, which is shorter and happens to be faster.
%timeit (df.index.values >= pd.Timestamp('2011-01-02').to_datetime64()) & \
(df.index.values < pd.Timestamp('2011-01-03').to_datetime64())
192 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
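To try the comparison outside IPython, here is a minimal sketch. The DataFrame `df` and its hourly index are assumptions made for illustration, since the answer does not show how `df` was built.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three days of hourly observations.
index = pd.Timestamp("2011-01-01") + pd.to_timedelta(np.arange(72), unit="h")
df = pd.DataFrame({"x": np.arange(72)}, index=index)

lower = pd.Timestamp("2011-01-02").to_datetime64()
upper = pd.Timestamp("2011-01-03").to_datetime64()

# Compare on the raw datetime64 array, as in the timing above.
mask = (df.index.values >= lower) & (df.index.values < upper)
selected = df[mask]

# Partial-string indexing on the DatetimeIndex selects the same day.
same_day = df.loc["2011-01-02"]

print(len(selected), selected.index.equals(same_day.index))
```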
Strange behaviour when comparing Timestamp and datetime64 in Python2.7
That's an interesting question. I've done some digging and did my best to explain some of this, although one thing I still don't get is why pandas throws an error instead of numpy when we do b<a.
Regarding your question:
"If a can be compared to b, I thought we should be able to compare the other way around?"
That's not necessarily true. It just depends on the implementation of the comparison operators.
Take this test class for example:
class TestCom(int):
    def __init__(self, a):
        self.value = a
    def __gt__(self, other):
        print('TestComp __gt__ called')
        return True
    def __eq__(self, other):
        return self.value == other
Here I have defined my __gt__ (>) method to always return True no matter what the other value is, while __eq__ (==) is left the same. Now check the following comparisons out:
a = TestCom(9)
print(a)
# Output: 9
# my def of __gt__
a > 100
# Output: TestComp __gt__ called
# True
a > '100'
# Output: TestComp __gt__ called
# True
'100' > a
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-486-8aee1b1d2500> in <module>()
1 # this will not use my def of __gt__
----> 2 '100' > a
TypeError: '>' not supported between instances of 'str' and 'TestCom'
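The asymmetry comes from Python's reflected-operator rules: when the left operand returns NotImplemented, Python tries the mirrored method on the right operand. A minimal sketch, using a hypothetical class `AlwaysGreater` that is not part of the answer above:

```python
class AlwaysGreater:
    """Hypothetical class that implements only __gt__."""

    def __gt__(self, other):
        return True


x = AlwaysGreater()

# x > "100" calls AlwaysGreater.__gt__ directly.
direct = x > "100"

# "100" < x: str.__lt__ returns NotImplemented, so Python falls back to the
# reflected method x.__gt__("100").
reflected = "100" < x

# "100" > x: the reflected method would be x.__lt__, which this class does not
# implement, so Python raises TypeError.
try:
    "100" > x
    failed = False
except TypeError:
    failed = True

print(direct, reflected, failed)
```

So whether a comparison works in one direction says nothing about the other direction; it depends on which of the two methods each type implements.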
So going back to your case. Looking at the Timestamp source code, the only thing I can think of is that pandas.Timestamp does some type checking and conversion where possible. When we're comparing a with b (pd.Timestamp against np.datetime64), the Timestamp.__richcmp__ function does the comparison: if the other operand is of type np.datetime64, it converts it to pd.Timestamp and then does the comparison.
# we can do the following to have a comparison of say b > a
# this converts a to np.datetime64 - .asm8 is equivalent to .to_datetime64()
b > a.asm8
# or we can convert b to datetime64[ms]
b.astype('datetime64[ms]') > a
# or convert to timestamp
pd.to_datetime(b) > a
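A self-contained sketch of the explicit-conversion approach follows. The values of `a` and `b` are assumptions, since the originals come from the question and are not shown here. Both operands are converted to the same type before comparing, which avoids relying on mixed-type behaviour that may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Assumed values, for illustration only.
a = pd.Timestamp("2013-03-24 05:32:00")
b = np.datetime64("2013-03-24 05:33:00")

# Option 1: bring the Timestamp down to NumPy and compare datetime64 values.
numpy_side = bool(b > a.to_datetime64())

# Option 2: lift the datetime64 up to pandas and compare Timestamps.
pandas_side = bool(pd.Timestamp(b) > a)

print(numpy_side, pandas_side)
```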
What I found surprising, since I thought the issue was the Timestamp having no nanoseconds, is that even if you do the following, the comparison between np.datetime64 and pd.Timestamp still fails.
a = pd.Timestamp('2013-03-24 05:32:00.00000001')
a.nanosecond # returns 10
# doing the comparison again where they're both ns still fails
b < a
Looking at the source code, it seems like we can use the == and != operators. But even they don't work as expected. Take a look at the following example:
a = pd.Timestamp('2013-03-24 05:32:00.00000000')
b = np.datetime64('2013-03-24 05:32:00.00000000', 'ns')
b == a # returns False
a == b # returns True
I think this is the result of lines 149-152 or 163-166, where they return False if you're using == and True for !=, without actually comparing the values.
Edit:
The nanosecond feature was added in version 0.23.0, so you can do something like pd.Timestamp('2013-03-23T05:33:00.000000022', unit='ns'). So yes, when you compare np.datetime64 it will be converted to pd.Timestamp with ns precision.
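A short sketch of what nanosecond precision means in practice, and what is lost when converting to a plain Python datetime. The variable names are illustrative.

```python
import pandas as pd

ts = pd.Timestamp("2013-03-23T05:33:00.000000022")

# The nanosecond component survives inside the Timestamp.
nanos = ts.nanosecond

# datetime.datetime stops at microseconds, so converting drops the 22 ns.
# warn=False silences the warning pandas emits about the discarded nanoseconds.
as_datetime = ts.to_pydatetime(warn=False)

print(nanos, as_datetime.microsecond)
```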
Just note that pd.Timestamp is supposed to be a replacement for Python's datetime:
"Timestamp is the pandas equivalent of python's Datetime and is interchangeable with it in most cases."
But Python's datetime doesn't support nanoseconds (there is a good answer on SO explaining why).
pd.Timestamp supports comparison between the two even if your Timestamp has nanoseconds in it. When you compare a datetime object against a pd.Timestamp object with ns, they have _compare_outside_nanorange that will do the comparison.
Going back to np.datetime64, one thing to note here, as explained nicely in this SO post, is that it's a wrapper on an int64 type. So it's not surprising if I do the following:
1 > a
a > 1
Both will throw an error: Cannot compare type 'Timestamp' with type 'int'.
So under the hood, when you do b > a, the comparison must be done at the int level. This comparison will be done by the np.greater() function (also take a look at the ufunc docs).
Note: I'm unable to confirm this; the numpy docs are too complex to go through. If any numpy experts can comment on this, that would be helpful.
If this is the case, and the comparison of np.datetime64 is based on int, then the example above with a == b and b == a makes sense. When we do b == a, we compare the int value of b against a pd.Timestamp, which will always return False for == and True for !=. It's the same as doing, say, 123 == '123': this operation will not fail, it will just return False.
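The pieces of this argument that can be checked without reading the NumPy internals are sketched below. The values are assumptions for illustration, and the sketch does not confirm which code path NumPy takes for b == a.

```python
import numpy as np
import pandas as pd

b = np.datetime64("2013-03-24 05:32:00", "ns")
a = pd.Timestamp("2013-03-24 05:32:00")

# The int64 payload of the datetime64: nanoseconds since the Unix epoch.
b_as_int = int(b.astype("int64"))

# Equality between unrelated types does not raise; it is simply False.
unrelated_equal = 123 == "123"

# Ordering a Timestamp against a plain int is rejected with TypeError.
try:
    a > 1
    raised = False
except TypeError:
    raised = True

print(b_as_int == a.value, unrelated_equal, raised)
```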
Pandas DatetimeIndex indexing dtype: datetime64 vs Timestamp
You are using numpy functions to manipulate pandas types. They are not always compatible.
The function np.in1d first converts both of its arguments to ndarrays. A DatetimeIndex has a built-in conversion, and an array of dtype np.datetime64 is returned (it's DatetimeIndex.values). But a Timestamp doesn't have such a facility, so it's not converted.
Instead, you can use, for example, the Python keyword in (the most natural way):
a_datetimeindex[0] in a_datetimeindex
or the Index.isin method for a collection of elements:
a_datetimeindex.isin(a_list_or_index)
If you want to use np.in1d, explicitly convert both arguments to numpy types, or call it on the underlying numpy arrays:
np.in1d(a_datetimeindex.values[0], a_datetimeindex.values)
Alternatively, it's probably safe to use np.in1d with two collections of the same type:
np.in1d(a_datetimeindex, another_datetimeindex)
or even
np.in1d(a_datetimeindex[[0]], a_datetimeindex)
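The three membership checks can be compared side by side in a small sketch. The index contents are made up for illustration, and np.isin is used here in place of np.in1d because newer NumPy releases deprecate np.in1d.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-03"])

# Scalar membership with the `in` keyword.
scalar_found = idx[0] in idx

# Element-wise membership for a collection, staying in pandas.
pandas_mask = idx.isin([pd.Timestamp("2020-01-02")])

# NumPy route on the underlying datetime64 arrays (np.isin instead of np.in1d).
numpy_mask = np.isin(idx.values, idx.values[:1])

print(scalar_found, pandas_mask.tolist(), numpy_mask.tolist())
```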
Pandas Timestamp to datetime.datetime()
You can use pd.DatetimeIndex and its difference method. In general, using set with Pandas / NumPy objects is inefficient. Related: Pandas pd.Series.isin performance with set versus array.
from datetime import datetime
import pandas as pd
df = pd.DataFrame({"my_date": [pd.Timestamp('2019-01-01 00:00:00', tz=None),
                               pd.Timestamp('2019-01-10 00:00:00', tz=None)]})
datetime_list = [datetime(2019, 1, 1, 0, 0, 0)]
diff = pd.DatetimeIndex(df['my_date']).difference(pd.DatetimeIndex(datetime_list))
# DatetimeIndex(['2019-01-10'], dtype='datetime64[ns]', freq=None)
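If plain datetime.datetime objects are needed afterwards, the result can be converted with to_pydatetime. A minimal sketch, where `diff` is rebuilt by hand so the snippet runs on its own:

```python
from datetime import datetime

import pandas as pd

# Stand-in for the result of the difference() call above.
diff = pd.DatetimeIndex(["2019-01-10"])

# Index-level conversion returns an object array of datetime.datetime values.
py_datetimes = list(diff.to_pydatetime())

# Scalar conversion for a single Timestamp.
single = pd.Timestamp("2019-01-10").to_pydatetime()

print(py_datetimes, single)
```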
Numpy.minimum with Pandas.Series of Timestamps TypeError: Cannot compare 'Timestamp' with 'int'
In short, instead of using pd.to_datetime to create the upper bound, use np.datetime64:
s = pd.Series([pd.to_datetime('2018-01-16 21:44:00'), pd.to_datetime('2018-01-16 21:41:00')])
print (np.minimum(s, np.datetime64('2018-01-16 21:43:00')))
0 2018-01-16 21:43:00
1 2018-01-16 21:41:00
dtype: datetime64[ns]
or even this works:
np.minimum(s, pd.to_datetime('2018-01-16 21:43:00').to_datetime64())
To see a bit more: if you look at the dtype, or even the element representation, of the two ways you create your data, you can see the differences.
print (s.values)
array(['2018-01-16T21:44:00.000000000', '2018-01-16T21:41:00.000000000'],
dtype='datetime64[ns]')
print (np.array([pd.to_datetime('2018-01-16 21:44:00'), pd.to_datetime('2018-01-16 21:41:00')]))
array([Timestamp('2018-01-16 21:44:00'), Timestamp('2018-01-16 21:41:00')],
dtype=object)
One interesting experiment is to change the type of s.values, such as:
print (np.minimum(s.values.astype('datetime64[s]'),
                  pd.to_datetime('2018-01-16 21:43:00')))
array([Timestamp('2018-01-16 21:43:00'),
datetime.datetime(2018, 1, 16, 21, 41)], dtype=object)
It works, but you can see that one is a Timestamp and the other one is a datetime. It seems that when the type of s.values is datetime64[ns] the comparison is not possible, while with datetime64[s] or even datetime64[ms] it is. Also have a look at this answer, it may help.
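An alternative that stays entirely in pandas and sidesteps the NumPy ufunc is sketched below. Note that Series.where is swapped in for np.minimum here; it is not what the answer above uses.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2018-01-16 21:44:00", "2018-01-16 21:41:00"]))
upper = pd.Timestamp("2018-01-16 21:43:00")

# Keep values that are already at or below the bound; replace the rest with it.
capped = s.where(s <= upper, upper)

print(capped)
```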
Is pandas.Timestamp immutable?
After looking at the source code, I found that it inherits from datetime.datetime, which is immutable.
# in pandas/_libs/tslibs/timestamps.pyx
cdef class _Timestamp(datetime):
    # ...
class Timestamp(_Timestamp): # This is the class that is exported
If you look inside the Python implementation of datetime, you see that it is supposed to be immutable (via read-only properties):
# Read-only field accessors
@property
def year(self):
    """year (1-9999)"""
    return self._year
@property
def month(self):
    """month (1-12)"""
    return self._month
@property
def day(self):
    """day (1-31)"""
    return self._day
@property
def hour(self):
    """hour (0-23)"""
    return self._hour
@property
def minute(self):
    """minute (0-59)"""
    return self._minute
@property
def second(self):
    """second (0-59)"""
    return self._second
@property
def microsecond(self):
    """microsecond (0-999999)"""
    return self._microsecond
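The practical consequence can be seen in a short sketch: field assignment fails, and replace() is the way to get a modified copy. The timestamp value is illustrative.

```python
import pandas as pd

ts = pd.Timestamp("2019-01-01 12:30:00")

# Assigning to a field is rejected because the accessors are read-only.
try:
    ts.year = 2020
    mutated = True
except AttributeError:
    mutated = False

# replace() returns a new Timestamp and leaves the original untouched.
shifted = ts.replace(year=2020)

print(mutated, ts.year, shifted.year)
```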
TypeError: Cannot compare type 'Timestamp' with type 'date'
This converts it to date:
data_entries['VOUCHER DATE'] = pd.to_datetime(data_entries['VOUCHER DATE'], format="%m/%d/%Y").dt.date
However, I would not recommend filtering like this. This is much faster:
data_entries[data_entries['VOUCHER DATE'].between(start_date, end_date)]
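An end-to-end sketch of the between() approach follows. The contents of `data_entries`, the `amount` column, and the date bounds are all hypothetical, chosen only to make the example runnable.

```python
import pandas as pd

# Hypothetical data standing in for data_entries.
data_entries = pd.DataFrame(
    {"VOUCHER DATE": ["01/05/2020", "02/15/2020", "03/25/2020"], "amount": [10, 20, 30]}
)

# Parse once and keep the datetime64 dtype (no .dt.date), so comparisons stay vectorized.
data_entries["VOUCHER DATE"] = pd.to_datetime(data_entries["VOUCHER DATE"], format="%m/%d/%Y")

start_date = pd.Timestamp("2020-02-01")
end_date = pd.Timestamp("2020-03-31")

# between() is inclusive on both ends by default.
filtered = data_entries[data_entries["VOUCHER DATE"].between(start_date, end_date)]

print(filtered["amount"].tolist())
```

Keeping the column as datetime64 means the bounds can be Timestamps, so the Timestamp-versus-date comparison error does not arise.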
How to convert from pandas.DatetimeIndex to numpy.datetime64?
The data inside is of datetime64 dtype (datetime64[ns] to be precise). Just take the values attribute of the index. Note that it will be in nanosecond units.
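A minimal sketch of that conversion, with an index made up for illustration. The printed dtype shows which unit your pandas version uses.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(["2012-05-01", "2012-05-02", "2012-05-03"])

# .values exposes the underlying NumPy array of datetime64 values.
arr = idx.values

# Individual elements of that array are numpy.datetime64 scalars.
first = arr[0]

print(arr.dtype, type(first))
```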
Pandas optimise datetime comparison on two columns
The bottleneck is construction of the Boolean series / array used for indexing.
Dropping down to NumPy seems to give a reasonable (~2x) performance improvement. See related: pd.Timestamp versus np.datetime64: are they interchangeable for selected uses?
# boundaries for testing
mindt = pd.to_datetime('2016-01-01')
maxdt = pd.to_datetime('2017-01-01')
x = ((df['start'] <= mindt) & (df['end'] >= maxdt)).values
y = (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())
# check results are the same
assert np.array_equal(x, y)
%timeit (df['start'].values <= mindt.to_datetime64()) & (df['end'].values >= maxdt.to_datetime64())
# 55.6 ms per loop
%timeit (df['start'] <= mindt) & (df['end'] >= maxdt)
# 108 ms per loop
Setup
import numpy as np
import pandas as pd
np.random.seed(0)
def random_dates(start, end, n):
    start_u = start.value//10**9
    end_u = end.value//10**9
    cols = ['start', 'end']
    df = pd.DataFrame({col: pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s') for col in cols})
    df = pd.DataFrame(np.sort(df.values, axis=1), columns=cols)
    df[cols] = df[cols].apply(pd.to_datetime, errors='raise')
    return df
# construct a dataframe of random dates
df = random_dates(pd.to_datetime('2015-01-01'), pd.to_datetime('2018-01-01'), 10**7)
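For readers without IPython, the same comparison can be run with the standard timeit module. This is a sketch on a much smaller, independently generated frame; the helper names `pandas_mask` and `numpy_mask` are hypothetical, and the timings it prints will differ from the figures above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical, smaller frame than the 10**7 rows used above.
rng = np.random.default_rng(0)
n = 100_000
base = pd.Timestamp("2015-01-01")
offsets = np.sort(rng.integers(0, 3 * 365 * 86400, size=(n, 2)), axis=1)
df = pd.DataFrame({
    "start": base + pd.to_timedelta(offsets[:, 0], unit="s"),
    "end": base + pd.to_timedelta(offsets[:, 1], unit="s"),
})

mindt = pd.Timestamp("2016-01-01")
maxdt = pd.Timestamp("2017-01-01")


def pandas_mask():
    # Comparison on Series against Timestamp scalars.
    return ((df["start"] <= mindt) & (df["end"] >= maxdt)).values


def numpy_mask():
    # Comparison on raw datetime64 arrays against datetime64 scalars.
    return (df["start"].values <= mindt.to_datetime64()) & (
        df["end"].values >= maxdt.to_datetime64()
    )


same = np.array_equal(pandas_mask(), numpy_mask())
t_pandas = timeit.timeit(pandas_mask, number=20)
t_numpy = timeit.timeit(numpy_mask, number=20)

print(same, t_pandas, t_numpy)
```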