Can pandas automatically read dates from a CSV file?
You should add parse_dates=True
, or parse_dates=['column name']
when reading, thats usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.
Suppose you have a column 'datetime' with your string, then:
from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
This way you can even combine multiple columns into a single datetime column, this merges a 'date' and a 'time' column into a single 'datetime' column:
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
You can find directives (i.e. the letters to be used for different formats) for strptime
and strftime
in this page.
read.csv in R reading dates differently
This can be solved by reading in the dates as a character vector and then calling strptime()
inside transform()
:
transform(read.csv(path2,colClasses=c('character',rep('numeric',6))),Date=as.Date(strptime(Date,'%m/%d/%Y')));
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400 71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600 72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400 68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200 65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400 67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600 66.41976
Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions:
readStockData <- function(path) {
tab <- read.csv(path,colClasses=c('character',rep('numeric',6)));
tab$Date <- as.Date(tab$Date,if (grepl('^\\d+/\\d+/\\d+$',tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
tab;
};
readStockData(path1);
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
readStockData(path2);
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400 71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600 72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400 68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200 65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400 67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600 66.41976
In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value (tab$Date[1]
) can be used for the detection.
How to read in a number from a csv as a time in Pandas
Actually I think this answer is the one I want:
Convert number to time in Python
All I'm looking for is the time, and the date part of the datetime is irrelevant for my needs.
Also this answer to help pad some of those values that are only two digits out to 4: Add Leading Zeros to Strings in Pandas Dataframe
Reading CSV dates with pandas returns datetime instead of Timestamp
You can specify which date_parser
function to be used:
data = pd.read_csv('temp.csv',
parse_dates = ["Local time"],
date_parser=pd.Timestamp)
Output:
>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-03:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-02:00 1.37356 1.37820 1.37211 1.37421 124863.8203
>>> type(data['Local time'][0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
By my observation pandas automatically parses each entry as datetime when timezone are different for individual observation.
The above should work if you really need to use pd.Timestamp
.
Running the above however also gives me a FutureWarning, which I researched and found to be harmless as of now.
EDIT
After a bit more research:
pandas tries to convert a date type column to DatetimeIndex
for more efficiency in datetime based operations.
But for this pandas needs to have a common timezone for the entire column.
On explicitly trying to convert to pd.DatetimeIndex
>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-03:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-04:00 1.37356 1.37820 1.37211 1.37421 124863.8203
>>> pd.DatetimeIndex(data['Local time'])
ValueError: Array must be all same time zone
During handling of the above exception, another exception occurred:
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
So when converting to DatetimeIndex
fails pandas then keeps the data as strings (dtype : object) internally and individual entries to be processed as datetime
.
Documentation recommends that if timezones in the data are different specify UTC=True, so the timezone would be set as UTC and time values would be changed accordingly.
From Documentation:
pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.
To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with utc=True
In a data that already has the same timezone DatetimeIndex works seamlessly:
>>> data
Local time Open High Low Close Volume
0 2014-02-03 02:00:00-02:00 1.37620 1.37882 1.37586 1.37745 5616.0400
1 2014-03-03 02:00:00-02:00 1.37745 1.37928 1.37264 1.37357 136554.6563
2 2014-04-03 02:00:00-02:00 1.37356 1.37820 1.37211 1.37421 124863.8203
>>> pd.DatetimeIndex(data['Local time'])
DatetimeIndex(['2014-02-03 02:00:00-02:00', '2014-03-03 02:00:00-02:00',
'2014-04-03 02:00:00-02:00'],
dtype='datetime64[ns, pytz.FixedOffset(-120)]', name='Local time', freq=None)
>>> type(pd.DatetimeIndex(data['Local time'])[0])
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
References:
- https://pandas.pydata.org/docs/user_guide/io.html#io-csv-mixed-timezones
- https://pandas.pydata.org/docs/reference/api/pandas.DatetimeIndex.html
- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#parse_dates
How to import date from CSV in dd/mm/yyyy hh:mm format in to general number string in R
One potential solution could be to use lubridate
to parse the text strings of the date/time columns after import. From this you can extract the date and time (using date()
and hms::as_hms()
):
library(readr)
library(dplyr)
library(lubridate)
read_csv("Date_time\n
01/09/2021 19:30\n
19/12/2020 12:45\n
16/03/2019 00:15") %>%
mutate(Date_time = dmy_hm(Date_time),
Date = date(Date_time),
Time = hms::as_hms(Date_time))
#> # A tibble: 3 x 3
#> Date_time Date Time
#> <dttm> <date> <time>
#> 1 2021-09-01 19:30:00 2021-09-01 19:30
#> 2 2020-12-19 12:45:00 2020-12-19 12:45
#> 3 2019-03-16 00:15:00 2019-03-16 00:15
This at least gives you tidy and workable data imported into R, able to be formatted for printing. Does this reach your solution? If it's not working on your data then perhaps post a small sample (or representative sample) as an example to try and get working.
Created on 2021-12-07 by the reprex package (v2.0.1)
Related Topics
Using Lapply with Changing Arguments
How to Create a Different Report for Each Subset of a Data Frame with R Markdown
Applying R Script Prepared for Single File to Multiple Files in the Directory
Ggplot2 Increase Space Between Legend Keys
How to Drop Columns by Passing Variable Name with Dplyr
How to Set Na.Rm to True Globally
R Reshape a Vector into Multiple Columns
Merge Data Frames and Overwrite Values
Cannot Install R Packages in Jupyter Notebook
How to Compute Correlations Between All Columns in R and Detect Highly Correlated Variables
How to Change Font Family in a Legend in an R-Plot
Read CSV with Dates and Numbers
How to Write an R Function That Evaluates an Expression Within a Data-Frame
How to Create Thiessen Polygons from Points Using R Packages