Pandas to_datetime Parsing Wrong Year

pandas to_datetime parsing wrong year

This appears to be the behavior of Python's datetime library. A quick test shows that the cutoff falls between 68 and 69:

>>> import datetime
>>> datetime.datetime.strptime('31-Dec-68', '%d-%b-%y').date()
datetime.date(2068, 12, 31)

>>> datetime.datetime.strptime('1-Jan-69', '%d-%b-%y').date()
datetime.date(1969, 1, 1)

Two-digit year ambiguity

With %y, two-digit years from 00 to 68 are assigned to the 2000s (2000 - 2068), and years from 69 to 99 are assigned to the 1900s (1969 - 1999).

Two digits can only cover 00 to 99, so the value becomes ambiguous as soon as the data spans more than one century.

If there is no overlap, you can process the data manually and annotate the century, which removes the ambiguity

I suggest processing your data manually and specifying the century yourself. For example, you can decide that any year between 17 and 68 in your data belongs to 1917 - 1968 rather than 2017 - 2068.
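As a rough sketch of that manual step, assuming dates shaped like '31-Dec-68' and a data set known to start in 1917, the helper below parses with %y and then moves the year into the 100-year window that begins at first_year. The names parse_with_century and first_year are made up for this illustration.

```python
from datetime import date, datetime


def parse_with_century(text, fmt="%d-%b-%y", first_year=1917):
    """Parse a two-digit-year date into the window [first_year, first_year + 99].

    Hypothetical helper: first_year is an assumption about your data, not
    something the parser can know. Note that %b depends on the locale.
    """
    parsed = datetime.strptime(text, fmt).date()
    two_digit = parsed.year % 100
    # Pick the only year inside the window that ends in the same two digits.
    year = first_year + (two_digit - first_year) % 100
    return parsed.replace(year=year)


print(parse_with_century("31-Dec-68"))  # 1968-12-31 (plain %y would give 2068)
print(parse_with_century("1-Jan-16"))   # 2016-01-01 (16 falls before the window start)
```

This only works when the whole data set fits inside a single 100-year window.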

If there is overlap, the two-digit year alone is not enough, unless the data is ordered and you have a reference point

With overlap, e.g. data from both 2016 and 1916 that was all logged as '16', the year is ambiguous and there isn't enough information to parse it. The exception is data ordered by date, where you can use a heuristic that switches the century as you parse.
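One possible version of such a heuristic is sketched below. It assumes the records are already in chronological order, that the first record's century is known, and that consecutive records are never 100 or more years apart. The name assign_centuries and its start_century parameter are invented for the example.

```python
def assign_centuries(two_digit_years, start_century=1900):
    """Expand chronologically ordered two-digit years to four-digit years."""
    century = start_century
    previous = None
    expanded = []
    for yy in two_digit_years:
        # In ordered data, a drop in the two-digit value means the
        # records have rolled over into the next century.
        if previous is not None and yy < previous:
            century += 100
        expanded.append(century + yy)
        previous = yy
    return expanded


print(assign_centuries([16, 45, 99, 3, 16]))
# [1916, 1945, 1999, 2003, 2016]
```

If the ordering is not guaranteed, this silently produces wrong centuries, so it is worth checking the result against any reference dates you have.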

Pandas to_datetime changes year unexpectedly

Use:

df['date'] = pd.to_datetime(df['date'].str[:-2] + '19' + df['date'].str[-2:])

Another solution with replace:

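# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['date'] = pd.to_datetime(df['date'].str.replace(r'-(\d+)$', r'-19\1', regex=True))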

Sample:

print (df)
date
0 01-06-70
1 01-06-69
2 01-06-68
3 01-06-67

df['date'] = pd.to_datetime(df['date'].str.replace(r'-(\d+)$', r'-19\1', regex=True))
print (df)
date
0 1970-01-06
1 1969-01-06
2 1968-01-06
3 1967-01-06

When converting into datetime why is the result parsing wrong year and month using pandas?

You can pass the origin parameter to to_datetime so that the integers are read as day counts from 1899-12-30:

df1['a_final'] = pd.to_datetime(df1['a'], unit='D', origin='1899-12-30').dt.strftime("%d/%m/%Y")
print (df1)
a a_final
0 44140 05/11/2020
1 44266 11/03/2021
2 44266 11/03/2021
3 44265 10/03/2021
4 44265 10/03/2021
39640 44143 08/11/2020
39641 44109 05/10/2020
39642 44232 05/02/2021
39643 44125 21/10/2020
39644 44222 26/01/2021
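
To see why that origin works, here is a minimal standard-library sketch. It assumes the integers in column a are spreadsheet-style serial day numbers counted from 1899-12-30. The name serial_to_date is invented for this example.

```python
from datetime import date, timedelta

# Assumed origin: day 0 of the serial numbering, matching origin='1899-12-30'.
SERIAL_ORIGIN = date(1899, 12, 30)


def serial_to_date(serial):
    """Convert a serial day number to a calendar date."""
    return SERIAL_ORIGIN + timedelta(days=serial)


print(serial_to_date(44140).strftime("%d/%m/%Y"))  # 05/11/2020
print(serial_to_date(44266).strftime("%d/%m/%Y"))  # 11/03/2021
```

Without an explicit origin, the same integers would be counted from the Unix epoch (1970-01-01), which shifts every date by about 70 years.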

pandas to_datetime converting 71 to 2071 instead of 1971

The year column is ambiguous: because no century is given, the parser has to pick one, and it interprets the dates according to its default rule for two-digit years.

A partial solution is to offset the affected years by 100 (one century). It is a janky fix, and you would want to apply it after getting your second dataframe.

import pandas as pd
import numpy as np

# Years after the cutoff are assumed to belong to the previous century, so subtract 100 years.
# The cutoff is 2022 because that was the current year; update it as the years progress.
df['Date'] = np.where(df['Date'].dt.year > 2022, df['Date'] - pd.offsets.DateOffset(years=100), df['Date'])

Pandas pandas.to_datetime(), incorrect parsing

Your format string is wrong:

"%Y%M%d"

%M means minutes, which is why your month defaulted to 1 and your datetimes contain minutes.

Use:

"%Y%m%d"

See the docs for the correct format specifiers.
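
The difference is easy to reproduce with the standard library's strptime, which uses the same directives you pass to the format argument of to_datetime. The input string '20210315' is a made-up value, since the original data is not shown here.

```python
from datetime import datetime

raw = "20210315"  # hypothetical input in year-month-day order

# %M reads "03" as minutes, so the month is never set and defaults to 1.
wrong = datetime.strptime(raw, "%Y%M%d")
# %m reads "03" as the month, which is what was intended.
right = datetime.strptime(raw, "%Y%m%d")

print(wrong)  # 2021-01-15 00:03:00
print(right)  # 2021-03-15 00:00:00
```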

pd.to_datetime errors = 'ignore' strange behavior

If errors is set to 'ignore', invalid parsing returns the input unchanged. When you pass a whole column, the input is result["Action"], so a single unparseable value causes the entire column to be returned as-is.

The solution is to apply pd.to_datetime row by row with errors='ignore'. Each row that does not follow the format is then returned unchanged, while the rows that do follow it are converted.

>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'Action': ['Tuesday November 30 2021', 'Appointment time clicked']})
>>> df
Action
0 Tuesday November 30 2021
1 Appointment time clicked
>>>
>>> def custom(action):
...     date_time = pd.to_datetime(action, format='%A %B %d %Y', errors='ignore')
...     return date_time
...
>>> df.Action = df.Action.apply(custom)
>>> df
Action
0 2021-11-30 00:00:00
1 Appointment time clicked
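
Recent pandas releases (2.2 and later) deprecate errors='ignore', so here is a hedged alternative that gives the same mixed column without it. It parses the whole column with errors='coerce' and then puts the original text back wherever parsing failed. The variable name parsed is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Action": ["Tuesday November 30 2021", "Appointment time clicked"]})

# Unparseable rows become NaT instead of aborting the whole conversion.
parsed = pd.to_datetime(df["Action"], format="%A %B %d %Y", errors="coerce")

# Keep the timestamp where parsing worked, otherwise fall back to the original string.
df["Action"] = parsed.astype(object).where(parsed.notna(), df["Action"])

print(df)
```

The resulting column has dtype object, because it mixes timestamps and strings, just like the row-wise version.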

Pandas Converting to Datetime, dateutilparser error

There are some bad values in the time column, such as 84, so use errors='coerce' to convert them to NaT.

df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')
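
Before discarding the bad values, it can help to list which rows were coerced. The following is a small sketch with made-up sample data and an explicit format. The names raw, parsed and bad_rows are illustrative only.

```python
import pandas as pd

# Hypothetical sample: one value that cannot be parsed as a datetime.
raw = pd.Series(["2021-03-15 10:00:00", "not a date", "2021-03-16 11:30:00"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Rows that were present in the input but turned into NaT are the bad ones.
bad_rows = raw[parsed.isna() & raw.notna()]
print(bad_rows)
```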

Pandas - Datetime format change to '%m/%d/%Y'

The reason you have to use errors="ignore" is that not all the dates you are parsing are in the correct format. If you use errors="coerce", as @phi mentioned, any dates that cannot be converted are set to NaT. The column's dtype is still converted to datetime64, so you can then format it as you like and handle the NaT values however you want.

Example

A dataframe with one item in Date that is not a valid Year/Month/Day value (month 25 does not exist):

>>> df = pd.DataFrame({'ID': [91060, 91061, 91062, 91063], 'Date': ['2017/11/10', '2022/05/01', '2022/04/01', '2055/25/25']})
>>> df
ID Date
0 91060 2017/11/10
1 91061 2022/05/01
2 91062 2022/04/01
3 91063 2055/25/25

>>> df.dtypes
ID int64
Date object
dtype: object

Using errors="ignore":

>>> df['Date'] = pd.to_datetime(df['Date'], errors='ignore')
>>> df
ID Date
0 91060 2017/11/10
1 91061 2022/05/01
2 91062 2022/04/01
3 91063 2055/25/25

>>> df.dtypes
ID int64
Date object
dtype: object

Column Date is still of type object because not all the values could be converted. Running df['Date'] = df['Date'].dt.strftime("%m/%d/%Y") will raise an AttributeError.

Using errors="coerce":

>>> df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
>>> df
ID Date
0 91060 2017-11-10
1 91061 2022-05-01
2 91062 2022-04-01
3 91063 NaT

>>> df.dtypes
ID int64
Date datetime64[ns]
dtype: object

Invalid dates are set to NaT, the column is now of type datetime64, and you can format it:

>>> df['Date'] = df['Date'].dt.strftime("%m/%d/%Y")
>>> df
ID Date
0 91060 11/10/2017
1 91061 05/01/2022
2 91062 04/01/2022
3 91063 NaN

Note: Formatting a datetime64 column with strftime converts it back to type object, so NaT values become NaN. The issue you are having comes from dirty data that is not in the correct format.


