Why are lubridate functions so slow when compared with as.POSIXct?
For the same reason cars are slow in comparison to riding on top of rockets. The added ease of use and safety make cars much slower than a rocket but you're less likely to get blown up and it's easier to start, steer, and brake a car. However, in the right situation (e.g., I need to get to the moon) the rocket is the right tool for the job. Now if someone invented a car with a rocket strapped to the roof we'd have something.
Start by looking at what dmy is doing and you'll see where the speed difference comes from (by the way, from your benchmarks I wouldn't say that lubridate is that much slower, as these are in milliseconds):
dmy
#type this into the command line and you get:
>dmy
function (..., quiet = FALSE, tz = "UTC")
{
dates <- unlist(list(...))
parse_date(num_to_date(dates), make_format("dmy"), quiet = quiet,
tz = tz)
}
<environment: namespace:lubridate>
Right away I see parse_date, num_to_date, and make_format. Makes one wonder what all these guys are. Let's see:
parse_date
> parse_date
function (x, formats, quiet = FALSE, seps = find_separator(x),
tz = "UTC")
{
fmt <- guess_format(head(x, 100), formats, seps, quiet)
parsed <- as.POSIXct(strptime(x, fmt, tz = tz))
if (length(x) > 2 & !quiet)
message("Using date format ", fmt, ".")
failed <- sum(is.na(parsed)) - sum(is.na(x))
if (failed > 0) {
message(failed, " failed to parse.")
}
parsed
}
<environment: namespace:lubridate>
num_to_date
> getAnywhere(num_to_date)
A single object matching ‘num_to_date’ was found
It was found in the following places
namespace:lubridate
with value
function (x)
{
if (is.numeric(x)) {
x <- as.character(x)
x <- paste(ifelse(nchar(x)%%2 == 1, "0", ""), x, sep = "")
}
x
}
<environment: namespace:lubridate>
make_format
> getAnywhere(make_format)
A single object matching ‘make_format’ was found
It was found in the following places
namespace:lubridate
with value
function (order)
{
order <- strsplit(order, "")[[1]]
formats <- list(d = "%d", m = c("%m", "%b"), y = c("%y",
"%Y"))[order]
grid <- expand.grid(formats, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
lapply(1:nrow(grid), function(i) unname(unlist(grid[i, ])))
}
<environment: namespace:lubridate>
Wow, we've got strsplit-ting, expand-ing.grid-s, paste-ing, ifelse-ing, unname-ing, etc., plus a Whole Lotta Error Checking Going On (play on the Zep song). So what we have here is some nice syntactic sugar. Mmmmm, tasty, but it comes with a price: speed.
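To make the cost of the format guessing a bit more concrete, here is a minimal base R sketch that copies the body of make_format from the listing above into a local function and lists the templates that dmy has to choose between. The name candidate_formats is hypothetical and used only for this illustration, and the sketch assumes the lubridate source shown above; other versions of the package may work differently.

```r
# Local copy of the make_format body listed above; `candidate_formats`
# is a hypothetical name, not part of lubridate.
candidate_formats <- function(order) {
  order <- strsplit(order, "")[[1]]
  formats <- list(d = "%d", m = c("%m", "%b"), y = c("%y", "%Y"))[order]
  grid <- expand.grid(formats, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
  lapply(seq_len(nrow(grid)), function(i) unname(unlist(grid[i, ])))
}

# Every combination of month style and year style becomes one candidate.
cands <- candidate_formats("dmy")
length(cands)
cands[[1]]
```

Each candidate then has to be tried against the head of the input before a single strptime call can be made, which is work that a call with an explicit format never does.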
Compare that to as.POSIXct:
getAnywhere(as.POSIXct) #tells us to use methods to see the business
methods('as.POSIXct') #tells us all the business
as.POSIXct.date #what I believe your code is using (I don't use dates though)
There's a lot more Internal coding and less error checking going on with as.POSIXct.
So you have to ask: do I want ease and safety, or speed and power? Depends on the job.
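For comparison, a minimal sketch of the "rocket" route in base R: when the layout of the strings is known in advance, the format can be passed explicitly so nothing has to be guessed. The input values here are made up for illustration.

```r
# Day/month/year strings with a known layout (illustrative values).
x <- c("04/03/2019", "25/12/2019")

# Explicit format and time zone: no guessing, no separator detection.
parsed <- as.POSIXct(x, format = "%d/%m/%Y", tz = "UTC")
format(parsed, "%Y-%m-%d")
```

The trade-off is the one described above: a string that does not match the format silently becomes NA instead of producing a friendly message.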
Why are my functions on lubridate dates so slow?
You're looping over every single row. It's not surprising it is slow. You could essentially do one replacement operation instead where you take a fixed difference from each date: 0 for M-F, -1 for Sat and -2 for Sun.
# 'big' sample data
x <- Sys.Date() + 0:100000
bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)
# since `weekdays()` is locale-specific, you could also be defensive and do:
bizdays <- function(x) x - match(format(x, "%w"), c("6","0"), nomatch=0)
system.time(bizdays(x))
# user system elapsed
# 0.36 0.00 0.35
system.time(previous_business_date_if_weekend(x))
# user system elapsed
# 45.45 0.00 45.57
identical(bizdays(x), previous_business_date_if_weekend(x))
#[1] TRUE
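As a small sanity check of the vectorised idea, here is a sketch that runs the locale-independent version on four known dates. The dates are chosen only for illustration (2023-01-06 is a Friday).

```r
# Saturday ("6") moves back 1 day, Sunday ("0") moves back 2, other days stay put.
bizdays <- function(x) x - match(format(x, "%w"), c("6", "0"), nomatch = 0)

# Friday, Saturday, Sunday, Monday.
d <- as.Date(c("2023-01-06", "2023-01-07", "2023-01-08", "2023-01-09"))
shifted <- bizdays(d)
format(shifted)
```

The whole vector is handled by one match call and one subtraction, which is where the speed-up over a row-by-row loop comes from.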
Lubridate speed up date time operation
You can use fast_strptime() from the lubridate package with an explicit format, then pass the result to ymd_hms():
library(lubridate)
library(dplyr)
"04/03/2019 07:22:14" %>%
fast_strptime(format = '%d/%m/%Y %H:%M:%S') %>%
ymd_hms()
# [1] "2019-03-04 07:22:14 UTC"
Why is as.Date slow on a character vector?
I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.
To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.
> as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>
> methods(as.Date)
[1] as.Date.character as.Date.date as.Date.dates as.Date.default
[5] as.Date.factor as.Date.IDate* as.Date.numeric as.Date.POSIXct
[9] as.Date.POSIXlt
Non-visible functions are asterisked
> as.Date.character
function (x, format = "", ...)
{
charToDate <- function(x) {
xx <- x[1L]
if (is.na(xx)) {
j <- 1L
while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
if (is.na(xx))
f <- "%Y-%m-%d"
}
if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
tz = "GMT")))
return(strptime(x, f))
stop("character string is not in a standard unambiguous format")
}
res <- if (missing(format))
charToDate(x)
else strptime(x, format, tz = "GMT") #### slow part, I think ####
as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>
Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through:
> as.POSIXct
function (x, tz = "", ...)
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>
> methods(as.POSIXct)
[1] as.POSIXct.date as.POSIXct.Date as.POSIXct.dates as.POSIXct.default
[5] as.POSIXct.IDate* as.POSIXct.ITime* as.POSIXct.numeric as.POSIXct.POSIXlt
Non-visible functions are asterisked
> as.POSIXlt.Date
function (x, ...)
{
y <- .Internal(Date2POSIXlt(x))
names(y$year) <- names(x)
y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>
Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.
~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
Now we know we need to look for D2POSIXlt :
~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
Oh, we could have guessed datetime.c. Anyway, look at the latest live copy of datetime.c. Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes, per date!
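The layout can also be inspected from R itself. This is a minimal sketch that counts only the classic numeric components (one double for the seconds plus eight integer fields); newer R versions may carry extra components such as a zone name, which are not counted here.

```r
# Look inside a POSIXlt built from a Date.
lt <- unclass(as.POSIXlt(as.Date("2012-01-01")))

# The eight integer fields that sit alongside the double `sec`.
int_fields <- c("min", "hour", "mday", "mon", "year", "wday", "yday", "isdst")
storage <- vapply(lt[c("sec", int_fields)], typeof, character(1))
storage

# 8 bytes for the double plus 4 bytes per integer field.
bytes_per_date <- 8 + 4 * length(int_fields)
bytes_per_date

# The year is stored as an offset from 1900.
lt$year + 1900
```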
So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.
Here's a reproducible example using the number of items stated in the question (3,000,000):
> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
user system elapsed
21.681 0.060 21.760
> system.time(strptime(Date, "%m/%d/%Y"))
user system elapsed
29.594 8.633 38.270
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
user system elapsed
19.785 0.000 19.802
Passing tz appears to speed up strptime, which as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?
speeding up as.POSIXct with large data / issue with storing as POSIXct in data.table
I would suggest two options.
I assume you use write.csv or similar, which converts POSIXct to character when writing it out. This slows down both the writing out and the reading in, as POSIXct objects are really numbers and not characters (more precisely, they are seconds since the "epoch"). So you can convert the column to numeric, write that out, and convert back to POSIXct after reading in (which will be super fast).
Another option, if you prefer to write out character columns, is to use fastPOSIXct from fasttime to speed up the conversion to POSIXct.
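The first option can be sketched in base R as follows. The column names, the temporary file and the UTC time zone are assumptions made for the example; the point is only that the round trip goes through plain numbers and never through strptime.

```r
# Two whole-second timestamps (illustrative values).
stamps <- as.POSIXct(c("2018-07-29 15:02:05", "2018-10-04 12:13:41"), tz = "UTC")

# Write the timestamps out as seconds since the epoch.
df <- data.frame(id = 1:2, when = as.numeric(stamps))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# Read back and rebuild POSIXct from the numbers; no string parsing involved.
back <- read.csv(path)
back$when <- as.POSIXct(back$when, origin = "1970-01-01", tz = "UTC")
unlink(path)

format(back$when, "%Y-%m-%d %H:%M:%S")
```

The time zone is not stored in the file, so it has to be supplied again when converting back.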
Why does lubridate::mdy() add day when it is missing from my input?
It is related to the order of parsing. According to ?mdy
In case of heterogeneous date formats, the ymd() family guesses formats based on a subset of the input vector. If the input vector contains many missing values or non-date strings, the subset might not contain meaningful dates
The original string includes a month followed by a 4-digit year, whereas mdy expects month, day, year, where the year can be either 2-digit or 4-digit. That creates an ambiguity: it takes the 2-digit year as '10' and parses the day as '20'. Instead, if we append a day and then use myd, the year is parsed as 4 digits:
lubridate::myd(paste(txt, '01'))
#[1] "2010-01-01"
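The same idea (append a day so that nothing is left to guess) can also be written with an explicit format in base R. This is only a sketch with a swapped-in technique, not the lubridate route above, and the input "01/2010" is a hypothetical stand-in, since the actual txt from the question is not shown here.

```r
# Hypothetical month/year input; the real `txt` may look different.
txt <- "01/2010"

# Append a day and state the field order explicitly.
d <- as.Date(paste0(txt, "/01"), format = "%m/%Y/%d")
format(d)
```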
Trouble dealing with POSIXct timezones and truncating the time out of POSIXct objects
If you don't specify a timezone then R will use your system's locale as POSIXct objects must have a timezone. The difference between CEST and CET is that one is summertime and one is not. That means if you define a date during the part of the year defined as summertime then R will decide to use the summertime version of the timezone. If you want to set dates that don't use summertime versions then define them as GMT from the beginning.
formatString = "%Y-%m-%d %H:%M:%OS"
x = as.POSIXct(strptime("2013-11-23 23:10:38.000000", formatString), tz="GMT")
y = as.POSIXct(strptime("2015-07-17 01:43:38.000000", formatString), tz="GMT")
If you want to truncate out the time, be careful with as.Date on a POSIXct object: a Date is not the same thing as a POSIXct object, and the result depends on the time zone the conversion is done in (the POSIXct method takes a tz argument), so the date can shift by a day. If you want to truncate POSIXct objects with base R then you'll have to wrap either round or trunc in as.POSIXct, but I would recommend checking out the lubridate package for dealing with dates and times (specifically POSIXct objects).
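A minimal base R sketch of both points, using a made-up timestamp shortly after midnight in Europe/Berlin (an assumed zone for the example): trunc wrapped in as.POSIXct keeps the object a POSIXct in its own time zone, while as.Date gives a different day depending on the tz it is told to use.

```r
# Half past midnight, Berlin time (illustrative value).
x <- as.POSIXct("2013-11-24 00:30:00", tz = "Europe/Berlin")

# Truncate to midnight and stay in POSIXct.
midnight <- as.POSIXct(trunc(x, "days"))
format(midnight, "%Y-%m-%d %H:%M:%S")

# The calendar date depends on the time zone used for the conversion.
as.Date(x, tz = "Europe/Berlin")
as.Date(x, tz = "UTC")
```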
If you want to keep CET but never use CEST you can use a location that doesn't observe daylight savings. According to http://www.timeanddate.com/time/zones/cet your only options are Algeria and Tunisia. According to https://en.wikipedia.org/wiki/List_of_tz_database_time_zones the valid tz would be "Africa/Algiers". Therefore you could do
formatString = "%Y-%m-%d %H:%M:%OS"
x = as.POSIXct(strptime("2013-11-23 23:10:38.000000", formatString), tz="Africa/Algiers")
y = as.POSIXct(strptime("2015-07-17 01:43:38.000000", formatString), tz="Africa/Algiers")
and both x and y would be in CET.
One more thing about setting timezones. If you tell R you want a generic timezone then it won't override daylight savings settings. That's why setting attr(y, "tzone") <- "CET" didn't have the desired result. If you did attr(y, "tzone") <- "Africa/Algiers" then it would have worked as you expected. Do be careful with conversions though, because when you change the timezone the displayed time changes to account for the new timezone. The package lubridate has the function force_tz, which changes the timezone without changing the clock time, for cases where the initial timezone setting was wrong but the time was right.
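The difference between the two kinds of change can be sketched in base R. The starting zone Europe/Berlin is an assumption for the example, and the last step is a base R stand-in for the force_tz idea (re-parsing the clock time in the new zone), not the lubridate function itself.

```r
# A summer timestamp in a zone that observes daylight saving time.
y0 <- as.POSIXct("2015-07-17 01:43:38", tz = "Europe/Berlin")
format(y0, "%H:%M:%S %z")

# Changing the tzone attribute keeps the instant and changes the clock time.
y <- y0
attr(y, "tzone") <- "Africa/Algiers"
format(y, "%H:%M:%S %z")

# Keeping the clock time instead means re-parsing it in the new zone,
# which yields a different instant.
same_clock <- as.POSIXct(format(y0, "%Y-%m-%d %H:%M:%S"), tz = "Africa/Algiers")
format(same_clock, "%H:%M:%S %z")
```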
How to fast convert different time formats in large data frames?
We may use %OS instead of %S to account for decimals in seconds.
help("strptime")
Specific to R is %OSn, which for output gives the seconds truncated to 0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it uses the setting of getOption("digits.secs"), or if that is unset, n = 0).
as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# [1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST"
# [3] "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"
This base R code is considerably faster than the package solutions, try it yourself.
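A minimal sketch of %OS on input and %OSn on output. The timestamp is made up, with a fraction (.500) that is exactly representable so truncation does not get in the way, and tz = "UTC" is passed explicitly on the assumption that the trailing Z is meant as UTC; the Z in the format is only matched as a literal character.

```r
stamp <- "2018-10-04T12:13:41.500Z"

# %OS reads the fractional seconds; the time zone comes from `tz`, not from "Z".
x <- as.POSIXct(stamp, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# The fraction is stored even though the default print does not show it.
format(x, "%H:%M:%OS3")
format(x, "%H:%M:%S")
```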
Update 1
time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")
This one is trickier. ?strptime says we should use %z for offsets from UTC, but somehow it won't work with as.POSIXct. Instead we could do this,
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
{os <- as.numeric(el(strsplit(substring(time2, 24), "\\:")))
(os[1]*60 + os[2])*60}
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"
which cuts the unreadable part from the string, converts it to seconds and adds it to the "POSIXct" object.
If there are only hours as in time2, we could also say:
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
as.numeric(substr(time2, 24, 26))*3600
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"
That the code is slightly longer now should not obscure the fact that it runs practically as fast as the one at top of the answer.
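An alternative sketch, assuming the obstacle is the colon inside the offset (%z documents the -hhmm form): strip the colon with a regular expression and let %z do the arithmetic per element. Unlike the variants above, this treats the offset as part of the timestamp and reports the resulting instant in UTC, so the printed clock times differ from the local-time outputs shown earlier.

```r
stamps <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")

# "+02:00" -> "+0200", the form that %z reads.
no_colon <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", stamps)

# Each element is shifted by its own offset; display in UTC.
utc <- as.POSIXct(no_colon, format = "%Y-%m-%dT%H:%M:%OS%z", tz = "UTC")
format(utc, "%Y-%m-%d %H:%M:%S")
```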
Update 2
You could wrap the three variants into one function that splits the input by nchar(x) and converts each group with the matching format, such as this one:
fixDateTime <- function(x) {
  s <- split(x, nchar(x))
  # separate if statements, so that every length group that is present gets converted
  if ("20" %in% names(s))
    s$`20` <- as.POSIXct(s$`20`, format="%Y-%m-%dT%H:%M:%SZ")
  if ("24" %in% names(s))
    s$`24` <- as.POSIXct(s$`24`, format="%Y-%m-%dT%H:%M:%OSZ")
  if ("29" %in% names(s))
    s$`29` <- as.POSIXct(substr(s$`29`, 1, 23), format="%Y-%m-%dT%H:%M:%OS") +
      # el() takes the offset of the first element only, so all offsets in this group must be equal
      {os <- as.numeric(el(strsplit(substring(s$`29`, 24), "\\:")))
      (os[1]*60 + os[2])*60}
  return(unsplit(s, nchar(x)))
}
res <- fixDateTime(time3)
res
# [1] "2018-07-29 15:02:05 CEST" "2018-10-04 12:13:41 CEST" "2018-10-01 14:42:37 CEST"
str(res)
# POSIXct[1:3], format: "2018-07-29 15:02:05" "2018-10-04 12:13:41" "2018-10-01 14:42:37"
Compared to the packages, only fixDateTime can handle all three defined date-time types. According to the concluding benchmark the function is still very fast.
Note: The function logically fails if different date formats have the same nchar, and it should be customized in that case (e.g. by another split condition)! Not tested: daylight saving time behavior when adding seconds to POSIXct.
Benchmark
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# fixDateTime 35.46387 35.94761 40.07578 36.05923 39.54706 68.46211 10 c
# as.POSIXct 20.32820 20.45985 21.00461 20.62237 21.16019 23.56434 10 b # to compare
# lubridate 11.59311 11.68956 12.88880 12.01077 13.76151 16.54479 10 a # produces NAs!
# anytime 198.57292 201.06483 203.95131 202.91368 203.62130 212.83272 10 d # produces NAs!
Data
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z", "2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
time2 <- c("2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z")
time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z",
"2018-10-01T11:42:37.000+03:00")
Benchmark code
n <- 1e3
t1 <- sample(time2, n, replace=TRUE)
t2 <- sample(time3, n, replace=TRUE)
library(lubridate)
library(anytime)
microbenchmark::microbenchmark(fixDateTime=fixDateTime(t2),
as.POSIXct=as.POSIXct(t1, format="%Y-%m-%dT%H:%M:%OSZ"),
lubridate=parse_date_time(t2, "ymd_HMS"),
anytime=anytime(t2),
times=10L)