Convert String Date to R Date Fast for All Dates

convert string date to R Date FAST for all dates

I can get a little speedup by using the date package:

library(date)
set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
system.time(dDate <- as.Date(x))
# user system elapsed
# 6.54 0.01 6.56
system.time(ddate <- as.Date(as.date(x,"ymd")))
# user system elapsed
# 3.42 0.22 3.64

You might want to look at the C code it uses and see if you can modify it to be faster for your specific situation.

Fastest way to convert strings to dates in R

Using a lookup table approach (1) is 38x faster than as.Date for this example data. Approach (2) is still 18x faster for this exammple data and is more convenient if you don't have a good way of generating d. The particulars of your data will, of course, affect the speedup actually achieved.

# test data
set.seed(123)
d <- format(as.Date("2000-01-01") + 1:100, "%m/%d/%Y")
dd <- sample(d, 1000000, replace = TRUE)

# 1 - lookup table
system.time(as.Date(d, "%m/%d/%Y")[match(dd, d)])
## user system elapsed
## 0.06 0.02 0.07

# 2 - table lookup generating u on the fly instead of d
system.time({u <- unique(dd); as.Date(u, "%m/%d/%Y")[match(dd, u)] })
## user system elapsed
## 0.13 0.03 0.15

# 3 - as.Date
system.time(as.Date(dd, "%m/%d/%Y"))
## user system elapsed
## 2.61 0.07 2.67

Fastest way to parse a date-time string to class Date

Note that as.Date will ignore junk after the date so this takes less than 10 seconds on my not particularly fast laptop:

xx <- rep("10/17/2017 12:00:00 AM", 5000000) # test input
system.time(as.Date(xx, "%m/%d/%Y"))
## user system elapsed
## 9.57 0.20 9.82

R - Convert strings to dates

As hinted by @rawr in a comment an hour ago, the anydate() function from my anytime package was made for just this:

  • does not require a format string but checks a number of possible and sensible ones
  • does not require all elements of a vector to use the same format
  • does use vectorised and compiled operations so it is fast

Example

> anytime::anydate(c('Jan_30_2018','April_3-2018','07_June_2018')) 
[1] "2018-01-30" "2018-04-03" "2018-06-07"
>

Converting string to date in R

The frequency can be extracted with a little regex, and the strings can be parsed to dates with anytime::anydate (which inserts "01" for missing date components), but it parses all non-year numbers as months, so a little cleanup is necessary. In tidyverse grammar,

library(tidyverse)
library(lubridate)

df <- data_frame(date = c("2008Q1", "2008M1", "2008M2", "2008M3", "2008Q2", "2008M4", "2008M5",
"2008M6", "2008Q3", "2008M7", "2008M8", "2008M9", "2008Q4", "2008M10",
"2008M11", "2008M12", "2009", "2009Q1", "2009M1", "2009M2", "2009M3",
"2009Q2", "2009M4", "2009M5", "2009M6", "2009Q3", "2009M7", "2009M8",
"2009M9", "2009Q4", "2009M10", "2009M11", "2009M12", "2010"))

df %>%
mutate(frequency = recode(gsub('\\d', '', date), # remove all numbers...
'M' = 'Monthly', ...and recode as words
'Q' = 'Quarterly',
.default = 'Annually'),
date = anytime::anydate(date), # parse to year-month
date = {month(date) <- month(date) * recode(frequency, # ...and correct the month
'Annually' = 12,
'Quarterly' = 3,
.default = 1);
date})
#> # A tibble: 34 x 2
#> date frequency
#> <date> <chr>
#> 1 2008-03-01 Quarterly
#> 2 2008-01-01 Monthly
#> 3 2008-02-01 Monthly
#> 4 2008-03-01 Monthly
#> 5 2008-06-01 Quarterly
#> 6 2008-04-01 Monthly
#> 7 2008-05-01 Monthly
#> 8 2008-06-01 Monthly
#> 9 2008-09-01 Quarterly
#> 10 2008-07-01 Monthly
#> # ... with 24 more rows

This approach moves adjusts appropriately so as to change quarterly and annual data so that the dates line up with the first day of the last month of the period, as the desired result in the question does. Generally, it's actually much more useful to store the first day of the period, which you can obtain by leveraging the extreme versatility of lubridate::parse_date_time to build a proper parser for the mixed format:

df %>% 
mutate(frequency = recode(gsub('\\d', '', date),
'M' = 'Monthly',
'Q' = 'Quarterly',
.default = 'Annually'),
date = as_date(parse_date_time(
date,
c('Ym', 'Yq', 'Y'), # possible formats
select_formats = function(dates){ # function to determine format
recode(gsub('\\%.[a-z]?', '', names(dates)),
'M' = '%YM%m',
'Q' = '%YQ%q',
.default = '%Y')
})))
#> # A tibble: 34 x 2
#> date frequency
#> <date> <chr>
#> 1 2008-01-01 Quarterly
#> 2 2008-01-01 Monthly
#> 3 2008-02-01 Monthly
#> 4 2008-03-01 Monthly
#> 5 2008-04-01 Quarterly
#> 6 2008-04-01 Monthly
#> 7 2008-05-01 Monthly
#> 8 2008-06-01 Monthly
#> 9 2008-07-01 Quarterly
#> 10 2008-07-01 Monthly
#> # ... with 24 more rows

Convert any string to date in R

Building on @jpmam1's comment, it looks like you can just use lubridate::parse_date_time with an unlimited number of patterns. If you specify enough, it will match anything.

mydates <- c("April 11, 2020", "Apr 11", "4/11/20", "04-11", "April 11, 1 p.m.", "04/11/2020, 1:00pm")
parse_date_time(mydates,c("mdy","mdY","Bdy","bd","md","Bdh","mdYHM"))
#[1] "2020-04-11 00:00:00 UTC" "0000-04-11 00:00:00 UTC" "2020-04-11 00:00:00 UTC" "0000-04-11 00:00:00 UTC" "2020-04-11 01:00:00 UTC"
#[6] "2020-04-11 01:00:00 UTC"

It matches yearless dates with 0000, something you could fix afterwards.

Convert string to datatime object in r

As previous comments and answers suggested, the POSIXct (i.e., datetime) class in R always stores dates along with times. If you convert from a character object with just times to that class, today's date is added by default (if you want another date, you could do, for example, this: as.POSIXct(paste("2020-01-01", times), format = "%Y-%m-%d %H:%M")).

However, this should almost never be a problem since you can use format(times, format = "%H:%M") or for ggplot2 scale_x_datetime to get just the times back. For plotting, this would look something like this:

times <- c("00:30", "01:30", "02:30", "03:30", "04:30", "05:30", "06:30", "07:30", "08:30", "09:30", "10:30", "11:30", "12:30", "13:30", "14:30",
"15:30", "16:30", "17:30", "18:30", "19:30", "20:30", "21:30", "22:30", "23:30")

library(tidyverse)
df <- tibble(
time_chr = times,
time = as.POSIXct(times, format = "%H:%M"),
value = rnorm(length(times))
)
df
#> # A tibble: 24 x 3
#> time_chr time value
#> <chr> <dttm> <dbl>
#> 1 00:30 2020-03-12 00:30:00 0.352
#> 2 01:30 2020-03-12 01:30:00 -0.547
#> 3 02:30 2020-03-12 02:30:00 -0.574
#> 4 03:30 2020-03-12 03:30:00 0.843
#> 5 04:30 2020-03-12 04:30:00 0.798
#> 6 05:30 2020-03-12 05:30:00 -0.620
#> 7 06:30 2020-03-12 06:30:00 0.213
#> 8 07:30 2020-03-12 07:30:00 1.21
#> 9 08:30 2020-03-12 08:30:00 0.370
#> 10 09:30 2020-03-12 09:30:00 0.497
#> # … with 14 more rows

ggplot(df, aes(x = time, y = value)) +
geom_line() +
scale_x_datetime(date_labels = "%H:%M")

Sample Image

Created on 2020-03-12 by the reprex package (v0.3.0)

Faster date formatting in R?

Since I wrote this before it was pointed out this is a duplicate, I'll add it as an answer anyway. Basically package fasttime can help you IF you have dates AFTER 1970-01-01 00:00:00 AND they are GMT AND they are of the format year, month, day, hour, minute, second. If you can rewrite your dates to this format then fastPOSIXct will be quick:

#  data
date <- c( "2013/5/31 23:30" , "2013/5/31 23:35" , "2013/5/31 23:40" , "2013/5/31 23:45" )

require(fasttime)
# fasttime function
dates.ft <- fastPOSIXct( date , tz = "GMT" )

# base function
dates <- as.POSIXct( date , format= "%Y/%m/%d %H:%M")

# rough comparison
require(microbenchmark)
microbenchmark( fastPOSIXct( date , tz = "GMT" ) , as.POSIXct( date , format= "%Y/%m/%d %H:%M") , times = 100L )
#Unit: microseconds
# expr min lq median uq max neval
# fastPOSIXct(date, tz = "GMT") 19.598 21.699 24.148 25.5485 215.927 100
# as.POSIXct(date, format = "%Y/%m/%d %H:%M") 160.633 163.433 168.332 181.9800 278.220 100

But the question would be, is it quicker to transform your dates to a format fasttime can accept or just use as.POSIXct or buy a faster computer?!

Slow String to Date Conversion Function

Just do:

as.Date(dates, format = "%m/%d/%Y")
  1. You don't need to loop over the dates vector as as.Date() can handle a vector of characters just fine in a single shot. Your function is incurring length(dates) calls to as.Date() plus some assignments to other functions, which all have overhead that is totally unnecessary.
  2. You don't want to convert each individual date to a factor. You don't want to convert them at all (as.Date() will just convert them back to characters). If you did want to convert them, factor() is also vectorised, so you could (but you don't need this at all, anywhere in your function) remove the factor() line and insert dates <- as.factor(dates) outside the for() loop. But again, you don't need to do this at all!


Related Topics



Leave a reply



Submit