Can Pandas Automatically Read Dates from a CSV File

Add parse_dates=['column name'] when reading; that is usually enough for pandas to parse the column automatically. (parse_dates=True on its own only tries to parse the index.) There are always unusual formats that have to be defined manually. In that case you can also pass a date parser function, which is the most flexible option.

Suppose you have a column 'datetime' with your string, then:

import pandas as pd
from datetime import datetime

# Parse one string such as '2021-12-30 08:15:00' into a datetime object.
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
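Note that date_parser is deprecated as of pandas 2.0. An approach that should behave the same across versions is to read the column as text and convert it afterwards with pd.to_datetime and an explicit format. A minimal sketch, assuming a made-up in-memory CSV (the csv_text data is hypothetical and only stands in for infile):

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the file `infile`.
csv_text = """datetime,value
2021-12-30 08:15:00,1.1
2021-12-31 09:30:00,1.2
"""

# Read the column as plain text first, without any date parsing.
df = pd.read_csv(io.StringIO(csv_text))

# Convert with the same format string the strptime-based parser would use.
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y-%m-%d %H:%M:%S")

print(df.dtypes)
```

Because the whole column is converted in one call, this is usually also faster than calling strptime once per row.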

This way you can even combine multiple columns into a single datetime column. The following merges a 'date' and a 'time' column into a single 'datetime' column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
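Combining columns inside read_csv is deprecated as of pandas 2.2, so the same merge can also be done after reading. A minimal sketch, assuming a hypothetical CSV with separate 'date' and 'time' columns (the sample data is made up for illustration):

```python
import io

import pandas as pd

# Hypothetical sample data with the date and the time in separate columns.
csv_text = """date,time,value
2021-12-30,08:15:00,1.1
2021-12-31,09:30:00,1.2
"""

# Read both columns as strings so pandas does not parse them implicitly.
df = pd.read_csv(io.StringIO(csv_text), dtype={"date": str, "time": str})

# Join the two string columns with a space, then parse with an explicit format.
df["datetime"] = pd.to_datetime(
    df["date"] + " " + df["time"], format="%Y-%m-%d %H:%M:%S"
)

# Drop the original columns once the combined column exists.
df = df.drop(columns=["date", "time"])

print(df)
```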

You can find the directives (i.e. the letters to be used for the different formats) for strptime and strftime in the Python documentation for the datetime module.

Reading CSV dates with pandas returns datetime instead of Timestamp

You can specify which date_parser function to use:

data = pd.read_csv('temp.csv',
                   parse_dates=["Local time"],
                   date_parser=pd.Timestamp)

Output:

>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-03:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-02:00 1.37356 1.37820 1.37211 1.37421 124863.8203

>>> type(data['Local time'][0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

From what I observed, pandas parses each entry as a datetime object when the timezone differs between individual observations.

The above should work if you really need to use pd.Timestamp.

Running the above, however, also gives me a FutureWarning, which I researched and found to be harmless for now.

EDIT

After a bit more research:

pandas tries to convert a date-type column to a DatetimeIndex, which makes datetime-based operations more efficient.
For this, however, pandas needs a common timezone for the entire column.

Explicitly trying to convert to pd.DatetimeIndex shows the problem:

>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-03:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-04:00 1.37356 1.37820 1.37211 1.37421 124863.8203

>>> pd.DatetimeIndex(data['Local time'])

ValueError: Array must be all same time zone

During handling of the above exception, another exception occurred:

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

So when the conversion to DatetimeIndex fails, pandas keeps the column with dtype object, and the individual entries are handled as datetime objects rather than Timestamps.

The documentation recommends specifying utc=True when the timezones in the data differ, so that the timezone is set to UTC and the time values are adjusted accordingly.

From the documentation:

pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.

To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True
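A minimal sketch of the utc=True conversion, assuming a hypothetical in-memory CSV with mixed offsets similar to the data above. Instead of passing a partially-applied to_datetime() to read_csv, this variant converts the column after reading, which avoids the deprecated date_parser argument:

```python
import io

import pandas as pd

# Hypothetical sample data: every row carries a different UTC offset.
csv_text = """Local time,Open
2014-02-03 02:00:00-02:00,1.37620
2014-03-03 02:00:00-03:00,1.37745
2014-04-03 02:00:00-04:00,1.37356
"""

# Without parse_dates the column is read as plain strings.
df = pd.read_csv(io.StringIO(csv_text))

# utc=True converts every value to UTC, so the column gets one common timezone.
df["Local time"] = pd.to_datetime(df["Local time"], utc=True)

print(df["Local time"])
print(type(df["Local time"][0]))
```

Each value is shifted by its own offset, so 02:00 at -02:00 becomes 04:00 UTC, and the elements are pandas Timestamp objects.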

For data that already shares a single timezone, DatetimeIndex works seamlessly:

>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-02:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-02:00 1.37356 1.37820 1.37211 1.37421 124863.8203

>>> pd.DatetimeIndex(data['Local time'])

DatetimeIndex(['2014-02-03 02:00:00-02:00', '2014-03-03 02:00:00-02:00',
'2014-04-03 02:00:00-02:00'],
dtype='datetime64[ns, pytz.FixedOffset(-120)]', name='Local time', freq=None)

>>> type(pd.DatetimeIndex(data['Local time'])[0])

<class 'pandas._libs.tslibs.timestamps.Timestamp'>

References:

  • https://pandas.pydata.org/docs/user_guide/io.html#io-csv-mixed-timezones
  • https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html
  • https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#parse_dates

Parsing date in pandas.read_csv

Just specify a list of the columns that should be converted to dates in the parse_dates= argument of pd.read_csv:

>>> df = pd.read_csv('file.csv', parse_dates=['date'])
>>> df
date a b c d
0 2021-12-30 1.1 1.2 1.3 1

>>> df.dtypes
date datetime64[ns]
a float64
b float64
c float64
d int64

How to read data from CSV as date format in Python pandas

Try this:

import pandas as pd
from datetime import datetime

# Parse month/day/year strings such as '12/30/2021'.
dateparse = lambda x: datetime.strptime(x, '%m/%d/%Y')

df = pd.read_csv('history.csv', parse_dates=['Month'], date_parser=dateparse)

Pandas changes date format while reading csv file although format in the file does not change

Use dayfirst=True as parameter of read_csv:

df = pd.read_csv('Test_Read_Date.csv', sep=';',
                 parse_dates=['timestamp'], dayfirst=True)

Output

>>> df

timestamp temperatures
0 2021-06-07 22:00:00 17.00
1 2021-06-07 22:15:00 16.88
2 2021-06-07 22:30:00 16.75
3 2021-06-07 22:45:00 16.63
4 2021-06-07 23:00:00 16.50
... ... ...
9699 2021-09-16 22:45:00 13.25
9700 2021-09-16 23:00:00 13.40
9701 2021-09-16 23:15:00 13.33
9702 2021-09-16 23:30:00 13.25
9703 2021-09-16 23:45:00 13.18

[9704 rows x 2 columns]

>>> df.loc[487:488]
timestamp temperatures
487 2021-06-12 23:45:00 18.38
488 2021-06-13 00:00:00 18.30
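To see what dayfirst=True changes, here is a minimal sketch with ambiguous day/month values. The sample data and its DD/MM/YYYY layout are assumptions for illustration; the original file's exact format is not shown above.

```python
import io

import pandas as pd

# Hypothetical sample data: 7 and 8 June 2021, written day first.
csv_text = """timestamp;temperatures
07/06/2021 22:00;17.00
08/06/2021 22:15;16.88
"""

# Default behaviour: ambiguous values are read month first.
month_first = pd.read_csv(io.StringIO(csv_text), sep=";",
                          parse_dates=["timestamp"])

# dayfirst=True tells the parser to read the first number as the day.
day_first = pd.read_csv(io.StringIO(csv_text), sep=";",
                        parse_dates=["timestamp"], dayfirst=True)

print(month_first["timestamp"].dt.month.tolist())
print(day_first["timestamp"].dt.month.tolist())
```

With the default settings the two rows land in July and August; with dayfirst=True both land in June, which matches the file.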

Can pandas format individual dates in a csv file?

Usually, pd.to_datetime() is smart enough to infer the format on its own. To convert a series or a column of the dataframe to the datetime format you can use:

df["date"] = pd.to_datetime(df["date"]) 

You can then convert the series back to a string with the desired format:

df["date"].dt.strftime('%Y-%m-%d')
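Put together, the round trip might look like the following minimal sketch. The sample values and the explicit input format are assumptions chosen for illustration; passing format avoids relying on inference.

```python
import pandas as pd

# Hypothetical sample column holding month/day/year strings.
df = pd.DataFrame({"date": ["12/30/2021", "01/15/2022"]})

# Convert the strings to datetime64 values using an explicit format.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Render the datetimes back to strings in ISO year-month-day form.
formatted = df["date"].dt.strftime("%Y-%m-%d")

print(formatted.tolist())
```

Note that dt.strftime returns strings, so keep the datetime column around if you still need date arithmetic.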

When working with (multiple) unusual formats, you might need a different method, such as passing an explicit format to pd.to_datetime() for each one.


