Automatically detect date columns when reading a file into a data.frame
Here is one I threw together quickly. It does not handle the last column properly because the as.Date function is not strict enough: for example, as.Date("1/1/2013", "%Y/%m/%d") parses without complaint.
my.read.table <- function(..., date.formats = c("%m/%d/%Y", "%Y/%m/%d")) {
  dat <- read.table(...)
  for (col.idx in seq_len(ncol(dat))) {
    x <- dat[, col.idx]
    # Skip anything that is not a plain character column (factors included)
    if (!is.character(x) || is.factor(x)) next
    if (all(is.na(x))) next
    for (f in date.formats) {
      d <- as.Date(as.character(x), f)
      # Skip this format if any non-missing value failed to parse
      if (any(is.na(d[!is.na(x)]))) next
      dat[, col.idx] <- d
    }
  }
  dat
}
dat <- my.read.table(fh, header = TRUE, stringsAsFactors = FALSE)
as.data.frame(sapply(dat, class))
#                  sapply(dat, class)
# num                         integer
# char                      character
# date.format1                   Date
# date.format2                   Date
# not.all.dates             character
# not.same.formats               Date
If you know a way to parse dates that is more strict around formats than as.Date
(see the example above), please let me know.
Edit: To make the date parsing super strict, I can add
if (!identical(x, format(d, f))) next
For it to work, all my input dates will need leading zeroes where needed, i.e. 01/01/2013 and not 1/1/2013. I can live with that if that's the standard way.
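For comparison, the same round-trip idea can be sketched in Python, the language of the pandas answers further down. This is only an illustration, not a port of my.read.table: the helper names strict_parse and detect_format are made up, and missing values are assumed to be represented as None.

```python
from datetime import datetime


def strict_parse(text, fmt):
    """Return a date only if `text` round-trips exactly through `fmt`."""
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return None
    # Re-format and compare: "1/1/2013" becomes "01/01/2013", so it is rejected
    return parsed.date() if parsed.strftime(fmt) == text else None


def detect_format(values, formats=("%m/%d/%Y", "%Y/%m/%d")):
    """Return the first format under which every non-missing value round-trips."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    for fmt in formats:
        if all(strict_parse(v, fmt) is not None for v in present):
            return fmt
    return None


print(detect_format(["01/31/2013", None, "12/01/2014"]))  # %m/%d/%Y
print(detect_format(["01/31/2013", "2013/01/31"]))        # None (mixed formats)
```

As in the R version, the price of the strict check is that unpadded dates such as 1/1/2013 are no longer recognised.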
Can pandas automatically read dates from a CSV file?
You should add parse_dates=True, or parse_dates=['column name'], when reading; that's usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.
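A minimal sketch of the simple case, using a made-up in-memory CSV (the column names id and when are only for illustration). Note that parse_dates=True on its own applies to the index, so it is paired with index_col here.

```python
import io

import pandas as pd

# Made-up sample data; any CSV with an ISO-like date column behaves the same way
csv_text = "id,when\n1,2013-01-01 12:00:00\n2,2013-01-02 13:30:00\n"

# Name the column(s) to parse explicitly
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["when"])
print(df.dtypes)

# parse_dates=True tries to parse the index instead
df_idx = pd.read_csv(io.StringIO(csv_text), index_col="when", parse_dates=True)
print(type(df_idx.index))
```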
For the custom parser, suppose you have a column 'datetime' holding your date strings; then:
import pandas as pd
from datetime import datetime
# Note: date_parser is deprecated since pandas 2.0 in favor of date_format
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
This way you can even combine multiple columns into a single datetime column. The following merges a 'date' and a 'time' column into a single 'datetime' column:
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
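If you would rather not pass a parser to read_csv, one alternative sketch is to read the columns as plain strings and convert them afterwards with pd.to_datetime and an explicit format. The sample CSV below is made up and stands in for infile.

```python
import io

import pandas as pd

# Made-up sample data standing in for `infile`
csv_text = (
    "datetime,date,time,value\n"
    "2013-01-01 12:00:00,2013-01-01,12:00:00,1\n"
    "2013-01-02 13:30:00,2013-01-02,13:30:00,2\n"
)
df = pd.read_csv(io.StringIO(csv_text))

fmt = "%Y-%m-%d %H:%M:%S"

# Single column: convert the strings with an explicit format
single = pd.to_datetime(df["datetime"], format=fmt)

# Two columns: join the strings with a space, then parse the result
combined = pd.to_datetime(df["date"] + " " + df["time"], format=fmt)

print(single.equals(combined))
```

With the default errors="raise", a value that does not match the format raises a ValueError instead of being silently guessed.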
You can find the directives (i.e. the letters to be used for different formats) for strptime and strftime on this page.
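To give a feel for the directives, here is a small sketch using only the standard library. The timestamp is arbitrary; only locale-independent numeric directives are used.

```python
from datetime import datetime

# Parse with one set of directives...
stamp = datetime.strptime("2013-01-31 18:05:09", "%Y-%m-%d %H:%M:%S")

# ...and format with another
print(stamp.strftime("%m/%d/%Y"))  # 01/31/2013
print(stamp.strftime("%H:%M"))     # 18:05
print(stamp.strftime("%y%j"))      # two-digit year plus day of year: 13031
```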
Turn all character dates columns in a dataframe to dates columns r
Updated
I would like to thank Mr. @Gregor Thomas for offering a valuable tip that I have added to my solution. We assume that all of your date columns have a date suffix, so that we can tell the across function to apply the Date transformation only to them.
library(dplyr)
df %>%
  mutate(across(ends_with("date"), ~ as.Date(.x, format = "%m/%d/%Y")))
# A tibble: 2 x 4
     id generic_name index_date ami_pre_date
  <dbl> <chr>        <date>     <date>
1     1 ato          2016-10-27 2015-10-20
2     2 sim          2017-07-12 2026-05-01
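For readers working in pandas rather than dplyr, roughly the same idea (pick the columns by their date suffix, then convert only those) might look like the sketch below. It rebuilds the sample data by hand and is an analogue of the R code above, not a translation of across.

```python
import pandas as pd

# The same sample data as in this answer, rebuilt by hand
df = pd.DataFrame({
    "id": [1, 2],
    "generic_name": ["ato", "sim"],
    "index_date": ["10/27/2016", "7/12/2017"],
    "ami_pre_date": ["10/20/2015", "5/1/2026"],
})

# Select the columns whose name ends in "date" and convert only those
date_cols = [c for c in df.columns if c.endswith("date")]
df[date_cols] = df[date_cols].apply(pd.to_datetime, format="%m/%d/%Y")

print(df.dtypes)
```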
Data
df <- structure(list(id = c(1, 2), generic_name = c("ato", "sim"),
    index_date = c("10/27/2016", "7/12/2017"), ami_pre_date = c("10/20/2015",
    "5/1/2026")), row.names = c(NA, -2L), class = c("tbl_df",
    "tbl", "data.frame"))
Date Format changes within DataFrame
I found a solution using an "annoying" workaround: extracting the day, month and year as strings.
import pandas as pd

v2x = r'E:\Model\Data\v2x.csv'
outfile = r'E:\Model\ModelSpecific\Input_shat2.txt'
data = pd.read_csv(v2x, sep=",")
# Redo the index because of the American timestamp:
# slice year, month and day out of the 'Date' strings by position
data['Year'] = data['Date'].str.slice(6, 10)
data['Month'] = data['Date'].str.slice(3, 5)
data['Day'] = data['Date'].str.slice(0, 2)
# Assemble a datetime Series from the three string columns
dates = pd.to_datetime(data[['Year', 'Month', 'Day']])
data = data.drop(['Date', 'Year', 'Month', 'Day'], axis=1)
data = pd.concat((dates, data), axis=1)
# The unnamed Series ends up as column 0 after concat
data = data.rename({0: 'Date'}, axis=1)
data = data.set_index('Date')
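A possibly less annoying variant is to hand the format to pd.to_datetime directly. The sketch below assumes the strings look like DD/MM/YYYY, which matches the slice positions above; the separator and the sample values are guesses, since the real file is not shown.

```python
import pandas as pd

# Made-up sample; the '/' separator and the values are assumptions
data = pd.DataFrame({"Date": ["27/10/2016", "01/02/2017"], "value": [1.0, 2.0]})

# The slicing workaround, for comparison
parts = pd.DataFrame({
    "year": data["Date"].str.slice(6, 10),
    "month": data["Date"].str.slice(3, 5),
    "day": data["Date"].str.slice(0, 2),
})
from_slices = pd.to_datetime(parts)

# Direct parse with an explicit day-first format
direct = pd.to_datetime(data["Date"], format="%d/%m/%Y")

# Use the parsed dates as the index, dropping the original string column
indexed = data.drop(columns="Date").set_index(direct.rename("Date"))
print(indexed)
```

Both routes give the same timestamps on this sample; the explicit format just saves the three helper columns.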