Read CSV with Dates and Numbers

Can pandas automatically read dates from a CSV file?

You should add parse_dates=True or parse_dates=['column name'] when reading; that's usually enough to parse the dates automatically. But there are always unusual formats that need to be defined manually. In that case you can also supply a date parser function, which is the most flexible approach.

Suppose you have a column 'datetime' with your string, then:

import pandas as pd
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column; the following merges a 'date' and a 'time' column into one 'datetime' column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

You can find the directives (i.e. the letters used for the different format codes) for strptime and strftime in the Python datetime documentation.
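
As a minimal, self-contained sketch of the two-column case (the inline CSV and the 'date'/'time'/'value' column names are assumptions for illustration only; note that recent pandas versions deprecate date_parser in favour of the date_format argument):

import io
from datetime import datetime
import pandas as pd

# Hypothetical inline CSV with separate 'date' and 'time' columns
csv_data = io.StringIO(
    "date,time,value\n"
    "2014-12-01,09:30:00,1.0\n"
    "2014-12-02,10:45:00,2.0\n"
)

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

# pandas joins the listed columns with a space and calls the parser on each combined value
df = pd.read_csv(csv_data, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
print(df.dtypes)  # 'datetime' should come out as datetime64[ns]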

read.csv in R reading dates differently

This can be solved by reading in the dates as a character vector and then calling strptime() inside transform():

transform(read.csv(path2, colClasses = c('character', rep('numeric', 6))),
          Date = as.Date(strptime(Date, '%m/%d/%Y')));
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions:

readStockData <- function(path) {
    tab <- read.csv(path, colClasses = c('character', rep('numeric', 6)));
    tab$Date <- as.Date(tab$Date,
                        if (grepl('^\\d+/\\d+/\\d+$', tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
    tab;
};
readStockData(path1);
##         Date   Open   High    Low  Close  Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461
readStockData(path2);
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value (tab$Date[1]) can be used for the detection.

How to read in a number from a csv as a time in Pandas

Actually, I think this is the answer I want:
Convert number to time in Python

All I'm looking for is the time; the date part of the datetime is irrelevant for my needs.

This answer also helps pad the values that are only two digits out to four: Add Leading Zeros to Strings in Pandas Dataframe
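
A rough sketch of that combination (the 'time_num' column name and the HHMM integer encoding are assumptions for illustration): zero-pad the number to four digits, parse it with the %H%M format, and keep only the time part.

import pandas as pd

# Hypothetical column of times encoded as integers, e.g. 930 meaning 09:30
df = pd.DataFrame({'time_num': [930, 1345, 45]})

# Pad to four digits ('0930', '1345', '0045'), parse as hours/minutes, drop the date part
padded = df['time_num'].astype(str).str.zfill(4)
df['time'] = pd.to_datetime(padded, format='%H%M').dt.time

print(df)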

Reading CSV dates with pandas returns datetime instead of Timestamp

You can specify which date_parser function should be used:

data = pd.read_csv('temp.csv',
                   parse_dates=["Local time"],
                   date_parser=pd.Timestamp)

Output:

>>> data
                  Local time     Open     High      Low    Close       Volume
0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1  2014-03-03 02:00:00-03:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2  2014-04-03 02:00:00-02:00  1.37356  1.37820  1.37211  1.37421  124863.8203

>>> type(data['Local time'][0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

From my observation, pandas parses each entry as an individual datetime when the timezones differ between observations.

The above should work if you really need to use pd.Timestamp.

Running the above, however, also gives me a FutureWarning, which I researched and found to be harmless for now.

EDIT

After a bit more research:

pandas tries to convert a date column to a DatetimeIndex for more efficient datetime-based operations.
But for this, pandas needs a common timezone for the entire column.

On explicitly trying to convert to pd.DatetimeIndex:

>>> data
                  Local time     Open     High      Low    Close       Volume
0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1  2014-03-03 02:00:00-03:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2  2014-04-03 02:00:00-04:00  1.37356  1.37820  1.37211  1.37421  124863.8203

>>> pd.DatetimeIndex(data['Local time'])

ValueError: Array must be all same time zone

During handling of the above exception, another exception occurred:

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

So when converting to a DatetimeIndex fails, pandas keeps the data as strings (dtype: object) internally, and the individual entries are processed as datetime objects.

The documentation recommends that if the timezones in the data differ, you specify utc=True, so the timezone is set to UTC and the time values are converted accordingly.

From Documentation:

pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.

To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True
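
A minimal sketch of that recommendation, reusing the same hypothetical file and column name as above:

from functools import partial
import pandas as pd

# Parse the mixed-timezone column as a single UTC datetime column
data = pd.read_csv('temp.csv',
                   parse_dates=['Local time'],
                   date_parser=partial(pd.to_datetime, utc=True))

print(data['Local time'].dtype)  # expected: datetime64[ns, UTC]

In newer pandas versions, where date_parser is deprecated, reading the column as plain strings and converting it afterwards with pd.to_datetime(data['Local time'], utc=True) gives the same result.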

With data that already has the same timezone, DatetimeIndex works seamlessly:

>>> data
                  Local time     Open     High      Low    Close       Volume
0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1  2014-03-03 02:00:00-02:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2  2014-04-03 02:00:00-02:00  1.37356  1.37820  1.37211  1.37421  124863.8203

>>> pd.DatetimeIndex(data['Local time'])

DatetimeIndex(['2014-02-03 02:00:00-02:00', '2014-03-03 02:00:00-02:00',
               '2014-04-03 02:00:00-02:00'],
              dtype='datetime64[ns, pytz.FixedOffset(-120)]', name='Local time', freq=None)

>>> type(pd.DatetimeIndex(data['Local time'])[0])

<class 'pandas._libs.tslibs.timestamps.Timestamp'>

References:

  • https://pandas.pydata.org/docs/user_guide/io.html#io-csv-mixed-timezones
  • https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html
  • https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#parse_dates

How to import date from CSV in dd/mm/yyyy hh:mm format into general number string in R

One potential solution could be to use lubridate to parse the text strings of the date/time columns after import. From this you can extract the date and time (using date() and hms::as_hms()):

library(readr)
library(dplyr)
library(lubridate)

read_csv("Date_time\n
01/09/2021 19:30\n
19/12/2020 12:45\n
16/03/2019 00:15") %>%
mutate(Date_time = dmy_hm(Date_time),
Date = date(Date_time),
Time = hms::as_hms(Date_time))

#> # A tibble: 3 x 3
#>   Date_time           Date       Time
#>   <dttm>              <date>     <time>
#> 1 2021-09-01 19:30:00 2021-09-01 19:30
#> 2 2020-12-19 12:45:00 2020-12-19 12:45
#> 3 2019-03-16 00:15:00 2019-03-16 00:15

This at least gives you tidy, workable data imported into R that can be formatted for printing. Does this get you to your solution? If it's not working on your data, then perhaps post a small (or representative) sample as an example to try to get working.

Created on 2021-12-07 by the reprex package (v2.0.1)


