convert string date to R Date FAST for all dates
I can get a little speedup by using the date
package:
library(date)
set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
system.time(dDate <- as.Date(x))
# user system elapsed
# 6.54 0.01 6.56
system.time(ddate <- as.Date(as.date(x,"ymd")))
# user system elapsed
# 3.42 0.22 3.64
You might want to look at the C code it uses and see if you can modify it to be faster for your specific situation.
Fastest way to convert strings to dates in R
Using a lookup table approach (1) is 38x faster than as.Date
for this example data. Approach (2) is still 18x faster for this exammple data and is more convenient if you don't have a good way of generating d
. The particulars of your data will, of course, affect the speedup actually achieved.
# test data
set.seed(123)
d <- format(as.Date("2000-01-01") + 1:100, "%m/%d/%Y")
dd <- sample(d, 1000000, replace = TRUE)
# 1 - lookup table
system.time(as.Date(d, "%m/%d/%Y")[match(dd, d)])
## user system elapsed
## 0.06 0.02 0.07
# 2 - table lookup generating u on the fly instead of d
system.time({u <- unique(dd); as.Date(u, "%m/%d/%Y")[match(dd, u)] })
## user system elapsed
## 0.13 0.03 0.15
# 3 - as.Date
system.time(as.Date(dd, "%m/%d/%Y"))
## user system elapsed
## 2.61 0.07 2.67
Fastest way to parse a date-time string to class Date
Note that as.Date
will ignore junk after the date so this takes less than 10 seconds on my not particularly fast laptop:
xx <- rep("10/17/2017 12:00:00 AM", 5000000) # test input
system.time(as.Date(xx, "%m/%d/%Y"))
## user system elapsed
## 9.57 0.20 9.82
R - Convert strings to dates
As hinted by @rawr in a comment an hour ago, the anydate()
function from my anytime package was made for just this:
- does not require a format string but checks a number of possible and sensible ones
- does not require all elements of a vector to use the same format
- does use vectorised and compiled operations so it is fast
Example
> anytime::anydate(c('Jan_30_2018','April_3-2018','07_June_2018'))
[1] "2018-01-30" "2018-04-03" "2018-06-07"
>
Converting string to date in R
The frequency can be extracted with a little regex, and the strings can be parsed to dates with anytime::anydate
(which inserts "01" for missing date components), but it parses all non-year numbers as months, so a little cleanup is necessary. In tidyverse grammar,
library(tidyverse)
library(lubridate)
df <- data_frame(date = c("2008Q1", "2008M1", "2008M2", "2008M3", "2008Q2", "2008M4", "2008M5",
"2008M6", "2008Q3", "2008M7", "2008M8", "2008M9", "2008Q4", "2008M10",
"2008M11", "2008M12", "2009", "2009Q1", "2009M1", "2009M2", "2009M3",
"2009Q2", "2009M4", "2009M5", "2009M6", "2009Q3", "2009M7", "2009M8",
"2009M9", "2009Q4", "2009M10", "2009M11", "2009M12", "2010"))
df %>%
mutate(frequency = recode(gsub('\\d', '', date), # remove all numbers...
'M' = 'Monthly', ...and recode as words
'Q' = 'Quarterly',
.default = 'Annually'),
date = anytime::anydate(date), # parse to year-month
date = {month(date) <- month(date) * recode(frequency, # ...and correct the month
'Annually' = 12,
'Quarterly' = 3,
.default = 1);
date})
#> # A tibble: 34 x 2
#> date frequency
#> <date> <chr>
#> 1 2008-03-01 Quarterly
#> 2 2008-01-01 Monthly
#> 3 2008-02-01 Monthly
#> 4 2008-03-01 Monthly
#> 5 2008-06-01 Quarterly
#> 6 2008-04-01 Monthly
#> 7 2008-05-01 Monthly
#> 8 2008-06-01 Monthly
#> 9 2008-09-01 Quarterly
#> 10 2008-07-01 Monthly
#> # ... with 24 more rows
This approach moves adjusts appropriately so as to change quarterly and annual data so that the dates line up with the first day of the last month of the period, as the desired result in the question does. Generally, it's actually much more useful to store the first day of the period, which you can obtain by leveraging the extreme versatility of lubridate::parse_date_time
to build a proper parser for the mixed format:
df %>%
mutate(frequency = recode(gsub('\\d', '', date),
'M' = 'Monthly',
'Q' = 'Quarterly',
.default = 'Annually'),
date = as_date(parse_date_time(
date,
c('Ym', 'Yq', 'Y'), # possible formats
select_formats = function(dates){ # function to determine format
recode(gsub('\\%.[a-z]?', '', names(dates)),
'M' = '%YM%m',
'Q' = '%YQ%q',
.default = '%Y')
})))
#> # A tibble: 34 x 2
#> date frequency
#> <date> <chr>
#> 1 2008-01-01 Quarterly
#> 2 2008-01-01 Monthly
#> 3 2008-02-01 Monthly
#> 4 2008-03-01 Monthly
#> 5 2008-04-01 Quarterly
#> 6 2008-04-01 Monthly
#> 7 2008-05-01 Monthly
#> 8 2008-06-01 Monthly
#> 9 2008-07-01 Quarterly
#> 10 2008-07-01 Monthly
#> # ... with 24 more rows
Convert any string to date in R
Building on @jpmam1's comment, it looks like you can just use lubridate::parse_date_time
with an unlimited number of patterns. If you specify enough, it will match anything.
mydates <- c("April 11, 2020", "Apr 11", "4/11/20", "04-11", "April 11, 1 p.m.", "04/11/2020, 1:00pm")
parse_date_time(mydates,c("mdy","mdY","Bdy","bd","md","Bdh","mdYHM"))
#[1] "2020-04-11 00:00:00 UTC" "0000-04-11 00:00:00 UTC" "2020-04-11 00:00:00 UTC" "0000-04-11 00:00:00 UTC" "2020-04-11 01:00:00 UTC"
#[6] "2020-04-11 01:00:00 UTC"
It matches yearless dates with 0000
, something you could fix afterwards.
Convert string to datatime object in r
As previous comments and answers suggested, the POSIXct
(i.e., datetime) class in R
always stores dates along with times. If you convert from a character object with just times to that class, today's date is added by default (if you want another date, you could do, for example, this: as.POSIXct(paste("2020-01-01", times), format = "%Y-%m-%d %H:%M")
).
However, this should almost never be a problem since you can use format(times, format = "%H:%M")
or for ggplot2
scale_x_datetime
to get just the times back. For plotting, this would look something like this:
times <- c("00:30", "01:30", "02:30", "03:30", "04:30", "05:30", "06:30", "07:30", "08:30", "09:30", "10:30", "11:30", "12:30", "13:30", "14:30",
"15:30", "16:30", "17:30", "18:30", "19:30", "20:30", "21:30", "22:30", "23:30")
library(tidyverse)
df <- tibble(
time_chr = times,
time = as.POSIXct(times, format = "%H:%M"),
value = rnorm(length(times))
)
df
#> # A tibble: 24 x 3
#> time_chr time value
#> <chr> <dttm> <dbl>
#> 1 00:30 2020-03-12 00:30:00 0.352
#> 2 01:30 2020-03-12 01:30:00 -0.547
#> 3 02:30 2020-03-12 02:30:00 -0.574
#> 4 03:30 2020-03-12 03:30:00 0.843
#> 5 04:30 2020-03-12 04:30:00 0.798
#> 6 05:30 2020-03-12 05:30:00 -0.620
#> 7 06:30 2020-03-12 06:30:00 0.213
#> 8 07:30 2020-03-12 07:30:00 1.21
#> 9 08:30 2020-03-12 08:30:00 0.370
#> 10 09:30 2020-03-12 09:30:00 0.497
#> # … with 14 more rows
ggplot(df, aes(x = time, y = value)) +
geom_line() +
scale_x_datetime(date_labels = "%H:%M")
Created on 2020-03-12 by the reprex package (v0.3.0)
Faster date formatting in R?
Since I wrote this before it was pointed out this is a duplicate, I'll add it as an answer anyway. Basically package fasttime
can help you IF you have dates AFTER 1970-01-01 00:00:00
AND they are GMT
AND they are of the format year, month, day, hour, minute, second
. If you can rewrite your dates to this format then fastPOSIXct
will be quick:
# data
date <- c( "2013/5/31 23:30" , "2013/5/31 23:35" , "2013/5/31 23:40" , "2013/5/31 23:45" )
require(fasttime)
# fasttime function
dates.ft <- fastPOSIXct( date , tz = "GMT" )
# base function
dates <- as.POSIXct( date , format= "%Y/%m/%d %H:%M")
# rough comparison
require(microbenchmark)
microbenchmark( fastPOSIXct( date , tz = "GMT" ) , as.POSIXct( date , format= "%Y/%m/%d %H:%M") , times = 100L )
#Unit: microseconds
# expr min lq median uq max neval
# fastPOSIXct(date, tz = "GMT") 19.598 21.699 24.148 25.5485 215.927 100
# as.POSIXct(date, format = "%Y/%m/%d %H:%M") 160.633 163.433 168.332 181.9800 278.220 100
But the question would be, is it quicker to transform your dates to a format fasttime
can accept or just use as.POSIXct
or buy a faster computer?!
Slow String to Date Conversion Function
Just do:
as.Date(dates, format = "%m/%d/%Y")
- You don't need to loop over the
dates
vector asas.Date()
can handle a vector of characters just fine in a single shot. Your function is incurringlength(dates)
calls toas.Date()
plus some assignments to other functions, which all have overhead that is totally unnecessary. - You don't want to convert each individual date to a factor. You don't want to convert them at all (
as.Date()
will just convert them back to characters). If you did want to convert them,factor()
is also vectorised, so you could (but you don't need this at all, anywhere in your function) remove thefactor()
line and insertdates <- as.factor(dates)
outside thefor()
loop. But again, you don't need to do this at all!
Related Topics
Model Matrix with All Pairwise Interactions Between Columns
What Are Productive Ways to Debug Rcpp Compiled Code Loaded in R (On Os X Mavericks)
R - Waiting for Page to Load in Rselenium with Phantomjs
How to Speed Up R Packages Installation in Docker
Frequency Table with Several Variables in R
Ggplot2 Multiline Title, Different Indentations
Specifying the Scale for the Density in Ggplot2's Stat_Density2D
Check If Character String Is a Valid Color Representation
Multiply Columns in a Data Frame by a Vector
Subset Data Based on Partial Match of Column Names
How to Increase Stack Space Overflow for Pandoc in R
When Writing My Own R Package, I Can't Seem to Get Other Packages to Import Correctly
How to Better Create Stacked Bar Graphs with Multiple Variables from Ggplot2
Using Filtered Datatables in Shiny
Grouping with Custom Geom Fails - How to Inspect Internal Object from Draw_Panel()
Find *All* Duplicated Records in Data.Table (Not All-But-One)