Read CSV with Dates and Numbers

Can pandas automatically read dates from a CSV file?

You should add parse_dates=True or parse_dates=['column name'] when reading; that's usually enough to parse the dates automatically. But there are always unusual formats that need to be defined manually. In that case you can also supply a date parser function, which is the most flexible approach.

Suppose you have a column 'datetime' with your string, then:

import pandas as pd
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column; the following merges a 'date' and a 'time' column into one 'datetime' column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

You can find the directives (i.e. the letters used for the different format codes) for strptime and strftime in the Python datetime documentation.
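
As a minimal, self-contained sketch of the two-column case (the inline CSV and the 'date'/'time'/'value' column names are assumptions for illustration only; note that recent pandas versions deprecate date_parser in favour of the date_format argument):

import io
from datetime import datetime
import pandas as pd

# Hypothetical inline CSV with separate 'date' and 'time' columns
csv_data = io.StringIO(
    "date,time,value\n"
    "2014-12-01,09:30:00,1.0\n"
    "2014-12-02,10:45:00,2.0\n"
)

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

# pandas joins the listed columns with a space and calls the parser on each combined value
df = pd.read_csv(csv_data, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
print(df.dtypes)  # 'datetime' should come out as datetime64[ns]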

read.csv in R reading dates differently

This can be solved by reading in the dates as a character vector and then calling strptime() inside transform():

transform(read.csv(path2, colClasses = c('character', rep('numeric', 6))),
          Date = as.Date(strptime(Date, '%m/%d/%Y')));
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions:

readStockData <- function(path) {
    tab <- read.csv(path, colClasses = c('character', rep('numeric', 6)));
    tab$Date <- as.Date(tab$Date,
                        if (grepl('^\\d+/\\d+/\\d+$', tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
    tab;
};
readStockData(path1);
##         Date   Open   High    Low  Close  Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461
readStockData(path2);
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value (tab$Date[1]) can be used for the detection.

How to read in a number from a csv as a time in Pandas

Actually, I think this is the answer I want:
Convert number to time in Python

All I'm looking for is the time; the date part of the datetime is irrelevant for my needs.

This answer also helps pad the values that are only two digits out to four: Add Leading Zeros to Strings in Pandas Dataframe
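
A rough sketch of that combination (the 'time_num' column name and the HHMM integer encoding are assumptions for illustration): zero-pad the number to four digits, parse it with the %H%M format, and keep only the time part.

import pandas as pd

# Hypothetical column of times encoded as integers, e.g. 930 meaning 09:30
df = pd.DataFrame({'time_num': [930, 1345, 45]})

# Pad to four digits ('0930', '1345', '0045'), parse as hours/minutes, drop the date part
padded = df['time_num'].astype(str).str.zfill(4)
df['time'] = pd.to_datetime(padded, format='%H%M').dt.time

print(df)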

Reading CSV dates with pandas returns datetime instead of Timestamp

You can specify which date_parser function should be used:

data = pd.read_csv('temp.csv',
                   parse_dates=["Local time"],
                   date_parser=pd.Timestamp)

Output:

>>> data
                  Local time     Open     High      Low    Close       Volume
0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1  2014-03-03 02:00:00-03:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2  2014-04-03 02:00:00-02:00  1.37356  1.37820  1.37211  1.37421  124863.8203

>>> type(data['Local time'][0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>

From my observation, pandas parses each entry as an individual datetime when the timezones differ between observations.

The above should work if you really need to use pd.Timestamp.

Running the above, however, also gives me a FutureWarning, which I researched and found to be harmless for now.

EDIT

After a bit more research:

pandas tries to convert a date column to a DatetimeIndex for more efficient datetime-based operations.
But for this, pandas needs a common timezone for the entire column.

On explicitly trying to convert to pd.DatetimeIndex:

>>> data
                  Local time     Open     High      Low    Close       Volume
0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1  2014-03-03 02:00:00-03:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2  2014-04-03 02:00:00-04:00  1.37356  1.37820  1.37211  1.37421  124863.8203

>>> pd.DatetimeIndex(data['Local time'])

ValueError: Array must be all same time zone

During handling of the above exception, another exception occurred:

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

So when converting to a DatetimeIndex fails, pandas keeps the data as strings (dtype: object) internally, and the individual entries are processed as datetime objects.

The documentation recommends that if the timezones in the data differ, you specify utc=True, so the timezone is set to UTC and the time values are converted accordingly.

From Documentation:

pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.

To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True
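
A minimal sketch of that recommendation, reusing the same hypothetical file and column name as above:

from functools import partial
import pandas as pd

# Parse the mixed-timezone column as a single UTC datetime column
data = pd.read_csv('temp.csv',
                   parse_dates=['Local time'],
                   date_parser=partial(pd.to_datetime, utc=True))

print(data['Local time'].dtype)  # expected: datetime64[ns, UTC]

In newer pandas versions, where date_parser is deprecated, reading the column as plain strings and converting it afterwards with pd.to_datetime(data['Local time'], utc=True) gives the same result.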

With data that already has the same timezone, DatetimeIndex works seamlessly:

>>> data
                  Local time     Open     High      Low    Close       Volume
0  2014-02-03 02:00:00-02:00  1.37620  1.37882  1.37586  1.37745    5616.0400
1  2014-03-03 02:00:00-02:00  1.37745  1.37928  1.37264  1.37357  136554.6563
2  2014-04-03 02:00:00-02:00  1.37356  1.37820  1.37211  1.37421  124863.8203

>>> pd.DatetimeIndex(data['Local time'])

DatetimeIndex(['2014-02-03 02:00:00-02:00', '2014-03-03 02:00:00-02:00',
               '2014-04-03 02:00:00-02:00'],
              dtype='datetime64[ns, pytz.FixedOffset(-120)]', name='Local time', freq=None)

>>> type(pd.DatetimeIndex(data['Local time'])[0])

<class 'pandas._libs.tslibs.timestamps.Timestamp'>

References:

  • https://pandas.pydata.org/docs/user_guide/io.html#io-csv-mixed-timezones
  • https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html
  • https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#parse_dates

How to import date from CSV in dd/mm/yyyy hh:mm format into general number string in R

One potential solution could be to use lubridate to parse the text strings of the date/time columns after import. From this you can extract the date and time (using date() and hms::as_hms()):

library(readr)
library(dplyr)
library(lubridate)

read_csv("Date_time\n
01/09/2021 19:30\n
19/12/2020 12:45\n
16/03/2019 00:15") %>%
mutate(Date_time = dmy_hm(Date_time),
Date = date(Date_time),
Time = hms::as_hms(Date_time))

#> # A tibble: 3 x 3
#>   Date_time           Date       Time
#>   <dttm>              <date>     <time>
#> 1 2021-09-01 19:30:00 2021-09-01 19:30
#> 2 2020-12-19 12:45:00 2020-12-19 12:45
#> 3 2019-03-16 00:15:00 2019-03-16 00:15

This at least gives you tidy, workable data imported into R that can be formatted for printing. Does this get you to your solution? If it's not working on your data, then perhaps post a small (or representative) sample as an example to try to get working.

Created on 2021-12-07 by the reprex package (v2.0.1)


