Automatically Detect Date Columns When Reading a File into a Data.Frame

Here is one I threw together quickly. It does not handle the last column properly because as.Date is not strict enough: for example, as.Date("1/1/2013", "%Y/%m/%d") parses without complaint.

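my.read.table <- function(..., date.formats = c("%m/%d/%Y", "%Y/%m/%d")) {
  dat <- read.table(...)
  for (col.idx in seq_len(ncol(dat))) {
    x <- dat[, col.idx]
    # Only character (or factor) columns are candidates for conversion
    if (!(is.character(x) || is.factor(x))) next
    # Skip columns that contain no data at all
    if (all(is.na(x))) next
    for (f in date.formats) {
      d <- as.Date(as.character(x), f)
      # Reject this format if any non-missing value failed to parse
      if (any(is.na(d[!is.na(x)]))) next
      dat[, col.idx] <- d
      # Stop at the first format that fits the whole column
      break
    }
  }
  dat
}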

dat <- my.read.table(fh, header = TRUE, stringsAsFactors = FALSE)
as.data.frame(sapply(dat, class))

#                  sapply(dat, class)
# num                         integer
# char                      character
# date.format1                   Date
# date.format2                   Date
# not.all.dates             character
# not.same.formats               Date

If you know a way to parse dates that is stricter about formats than as.Date (see the example above), please let me know.

Edit: To make the date parsing strict, I can add the following check to the inner loop:

if (!identical(x, format(d, f))) next

For this to work, all my input dates need leading zeros where applicable, i.e. 01/01/2013 and not 1/1/2013. I can live with that if that's the standard way.
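The same round-trip idea can be sketched in Python for comparison. This is a minimal illustration, not part of the original answer: the helper names `strict_parse` and `detect_date_format` are made up for this example, and the sample values are hypothetical.

```python
from datetime import datetime


def strict_parse(text, fmt):
    """Return a date if `text` matches `fmt` exactly, otherwise None."""
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return None
    # Round-trip check: formatting the parsed value must reproduce the input,
    # which rejects values such as "1/1/2013" that lack leading zeros.
    if parsed.strftime(fmt) != text:
        return None
    return parsed.date()


def detect_date_format(values, date_formats=("%m/%d/%Y", "%Y/%m/%d")):
    """Return the first format that strictly parses every non-missing value."""
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return None  # nothing to inspect, so leave the column alone
    for fmt in date_formats:
        if all(strict_parse(v, fmt) is not None for v in present):
            return fmt
    return None


print(detect_date_format(["01/31/2013", "02/01/2013"]))  # %m/%d/%Y
print(detect_date_format(["2013/01/31", None]))          # %Y/%m/%d
print(detect_date_format(["1/1/2013", "01/02/2013"]))    # None
```

As in the R version, the price of strictness is that inputs without leading zeros are rejected rather than guessed at.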

Can pandas automatically read dates from a CSV file?

You should add parse_dates=True or parse_dates=['column name'] when reading; that's usually enough to parse it automatically. But there are always unusual formats that need to be defined manually. In such a case you can also supply a date parser function, which is the most flexible approach.

Suppose you have a column 'datetime' with your string, then:

from datetime import datetime
import pandas as pd

# Parse strings such as "2013-01-31 14:05:09"
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
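In pandas 2.0 and later, date_parser is deprecated in favor of the date_format argument. One approach that should behave the same across versions is to read the column as text and convert it afterwards with pd.to_datetime and an explicit format. The sketch below assumes a small in-memory CSV (hypothetical data standing in for `infile`).

```python
import io

import pandas as pd

# Hypothetical sample standing in for `infile`
csv_text = """datetime,value
2013-01-01 12:00:00,1.5
2013-01-02 13:30:00,2.5
"""

df = pd.read_csv(io.StringIO(csv_text))

# With an explicit format, a value that does not match raises an error
# instead of being silently guessed at.
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y-%m-%d %H:%M:%S")

print(df.dtypes)
```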

This way you can even combine multiple columns into a single datetime column. The following merges a 'date' and a 'time' column into a single 'datetime' column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)
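Combining columns inside read_csv is deprecated as of pandas 2.2, so a version-independent alternative may be useful. The sketch below, which uses hypothetical in-memory data in place of `infile`, joins the two text columns after reading and parses the result with pd.to_datetime.

```python
import io

import pandas as pd

# Hypothetical sample standing in for `infile`
csv_text = """date,time,value
2013-01-01,12:00:00,1.5
2013-01-02,13:30:00,2.5
"""

df = pd.read_csv(io.StringIO(csv_text))

# Join the two string columns with a space, then parse the combined text
df["datetime"] = pd.to_datetime(
    df["date"] + " " + df["time"], format="%Y-%m-%d %H:%M:%S"
)

# The separate columns are no longer needed
df = df.drop(columns=["date", "time"])

print(df)
```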

You can find the directives (i.e. the letters used for the different format components) for strptime and strftime in the Python documentation for the datetime module.
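As a quick illustration of how a few common directives behave, here is a small sketch using only the standard library. The timestamp is an arbitrary example value.

```python
from datetime import datetime

# %Y = four-digit year, %m = month, %d = day of month,
# %H = hour (24-hour clock), %M = minute, %S = second
stamp = datetime.strptime("2013-01-31 14:05:09", "%Y-%m-%d %H:%M:%S")

# strftime uses the same directives to go from a datetime back to text
us_style = stamp.strftime("%m/%d/%Y")
time_only = stamp.strftime("%H:%M")

print(us_style)   # 01/31/2013
print(time_only)  # 14:05
```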

Turn all character date columns in a data frame into Date columns in R

Updated
I would like to thank Mr. @Gregor Thomas for a valuable tip that I have added to my solution. We assume that all of your date columns have a "date" suffix, so that we can tell the across function to apply the Date transformation only to them.

library(dplyr)

df %>%
  mutate(across(ends_with("date"), ~ as.Date(.x, format = "%m/%d/%Y")))

# A tibble: 2 x 4
     id generic_name index_date ami_pre_date
  <dbl> <chr>        <date>     <date>
1     1 ato          2016-10-27 2015-10-20
2     2 sim          2017-07-12 2026-05-01

Data

df <- structure(list(id = c(1, 2), generic_name = c("ato", "sim"),
    index_date = c("10/27/2016", "7/12/2017"), ami_pre_date = c("10/20/2015",
    "5/1/2026")), row.names = c(NA, -2L), class = c("tbl_df",
    "tbl", "data.frame"))
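For readers working in pandas rather than dplyr, a rough equivalent of the suffix-based selection might look like the sketch below. It rebuilds the same sample data and assumes, as above, that every date column name ends in "date".

```python
import pandas as pd

# Same sample data as the R example
df = pd.DataFrame({
    "id": [1, 2],
    "generic_name": ["ato", "sim"],
    "index_date": ["10/27/2016", "7/12/2017"],
    "ami_pre_date": ["10/20/2015", "5/1/2026"],
})

# Pick the columns by name suffix, mirroring ends_with("date")
date_cols = [col for col in df.columns if col.endswith("date")]

# Convert each selected column using the month/day/year format
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")

print(df.dtypes)
```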

Date Format changes within DataFrame

I found a solution, albeit with an "annoying" workaround: extracting the day, month, and year as strings.

import pandas as pd

v2x = r'E:\Model\Data\v2x.csv'
outfile = r'E:\Model\ModelSpecific\Input_shat2.txt'

data = pd.read_csv(v2x, sep=",")

# Rebuild the dates from their parts; the strings are day-first, so they
# must not be interpreted as American (month-first) timestamps
data['Year'] = data['Date'].str.slice(6, 10)
data['Month'] = data['Date'].str.slice(3, 5)
data['Day'] = data['Date'].str.slice(0, 2)
dates = pd.to_datetime(data[['Year', 'Month', 'Day']])

# Replace the helper columns with the parsed dates and use them as the index
data = data.drop(['Date', 'Year', 'Month', 'Day'], axis=1)
data = pd.concat((dates, data), axis=1)
data = data.rename({0: 'Date'}, axis=1)  # the unnamed series becomes column 0
data = data.set_index('Date')
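A shorter route may be to give pd.to_datetime an explicit day-first format, which avoids slicing the strings. The sketch below assumes dates written as DD/MM/YYYY (the actual separator in the file is not shown above) and uses hypothetical in-memory data in place of the CSV file.

```python
import io

import pandas as pd

# Hypothetical sample; the real file's separator and columns may differ
csv_text = """Date,Value
31/01/2019,10
01/02/2019,20
"""

data = pd.read_csv(io.StringIO(csv_text), sep=",")

# The explicit format states that the day comes first, so 01/02/2019
# is read as 1 February rather than 2 January
data["Date"] = pd.to_datetime(data["Date"], format="%d/%m/%Y")
data = data.set_index("Date")

print(data)
```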

