Why Are Lubridate Functions So Slow When Compared with As.Posixct

Why are lubridate functions so slow when compared with as.POSIXct?

For the same reason cars are slow in comparison to riding on top of rockets. The added ease of use and safety make cars much slower than a rocket but you're less likely to get blown up and it's easier to start, steer, and brake a car. However, in the right situation (e.g., I need to get to the moon) the rocket is the right tool for the job. Now if someone invented a car with a rocket strapped to the roof we'd have something.

Start by looking at what dmy is doing and you'll see where the speed difference comes from (by the way, from your benchmarks I wouldn't say that lubridate is that much slower, as these are in milliseconds):

dmy #type this into the command line and you get:

>dmy
function (..., quiet = FALSE, tz = "UTC") 
{
    dates <- unlist(list(...))
    parse_date(num_to_date(dates), make_format("dmy"), quiet = quiet, 
        tz = tz)
}
<environment: namespace:lubridate>

Right away I see parse_date and num_to_date and make_format. Makes one wonder what all these guys are. Let's see:

parse_date

> parse_date
function (x, formats, quiet = FALSE, seps = find_separator(x), 
    tz = "UTC") 
{
    fmt <- guess_format(head(x, 100), formats, seps, quiet)
    parsed <- as.POSIXct(strptime(x, fmt, tz = tz))
    if (length(x) > 2 & !quiet) 
        message("Using date format ", fmt, ".")
    failed <- sum(is.na(parsed)) - sum(is.na(x))
    if (failed > 0) {
        message(failed, " failed to parse.")
    }
    parsed
}
<environment: namespace:lubridate>

num_to_date

> getAnywhere(num_to_date)
A single object matching ‘num_to_date’ was found
It was found in the following places
  namespace:lubridate
with value

function (x) 
{
    if (is.numeric(x)) {
        x <- as.character(x)
        x <- paste(ifelse(nchar(x)%%2 == 1, "0", ""), x, sep = "")
    }
    x
}
<environment: namespace:lubridate>

make_format

> getAnywhere(make_format)
A single object matching ‘make_format’ was found
It was found in the following places
  namespace:lubridate
with value

function (order) 
{
    order <- strsplit(order, "")[[1]]
    formats <- list(d = "%d", m = c("%m", "%b"), y = c("%y", 
        "%Y"))[order]
    grid <- expand.grid(formats, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
    lapply(1:nrow(grid), function(i) unname(unlist(grid[i, ])))
}
<environment: namespace:lubridate>

Wow we got strsplit-ting, expand-ing.grid-s, paste-ing, ifelse-ing, unname-ing etc. plus a Whole Lotta Error Checking Going On (play on the Zep song). So what we have here is some nice syntactic sugar. Mmmmm tasty but it comes with a price, speed.
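To get a feel for how much guessing is involved, the format grid that make_format builds can be reproduced in base R. This is only a sketch that copies the expand.grid step from the listing above; `candidates` is an illustrative name, not something lubridate exposes.

```r
# Candidate strptime formats for a "dmy" order, mirroring make_format above
formats <- list(d = "%d", m = c("%m", "%b"), y = c("%y", "%Y"))
grid <- expand.grid(formats, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
candidates <- lapply(seq_len(nrow(grid)), function(i) unname(unlist(grid[i, ])))

length(candidates)  # 4 day/month/year combinations to try
candidates[[1]]     # "%d" "%m" "%y"
```

Every call has to build this grid and then test the candidates against the head of the input before any real parsing happens.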

Compare that to as.POSIXct:

getAnywhere(as.POSIXct)  #tells us to use methods to see the business
methods('as.POSIXct')    #tells us all the business
as.POSIXct.date          #what I believe your code is using (I don't use dates though)

There's a lot more Internal coding and less error checking going on with as.POSIXct. So you have to ask: do I want ease and safety, or speed and power? Depends on the job.

Why are my functions on lubridate dates so slow?

You're looping over every single row. It's not surprising it is slow. You could essentially do one replacement operation instead where you take a fixed difference from each date: 0 for M-F, -1 for Sat and -2 for Sun.

# 'big' sample data
x <- Sys.Date() + 0:100000

bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)

# since `weekdays()` is locale-specific, you could also be defensive and do:
bizdays <- function(x) x - match(format(x, "%w"), c("6","0"), nomatch=0)

system.time(bizdays(x))
#   user  system elapsed 
#   0.36    0.00    0.35 

system.time(previous_business_date_if_weekend(x))
#   user  system elapsed 
#  45.45    0.00   45.57 

identical(bizdays(x), previous_business_date_if_weekend(x))
#[1] TRUE
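A quick sanity check of the vectorised version on a handful of known dates may help. This is a small sketch using the locale-independent variant; the dates are chosen for illustration (2024-01-05 is a Friday).

```r
# Defensive (locale-independent) version from above
bizdays <- function(x) x - match(format(x, "%w"), c("6", "0"), nomatch = 0)

# Friday, Saturday, Sunday, Monday
d <- as.Date(c("2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"))
bizdays(d)
# [1] "2024-01-05" "2024-01-05" "2024-01-05" "2024-01-08"
```

Saturday and Sunday both roll back to the Friday, and weekdays are left untouched, all in one subtraction.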

Lubridate speed up date time operation

You can use fast_strptime() with an explicit format and then ymd_hms(), both from the lubridate package:

library(lubridate)
library(dplyr)

"04/03/2019 07:22:14" %>%
    fast_strptime(format = '%d/%m/%Y %H:%M:%S') %>% 
    ymd_hms()
# [1] "2019-03-04 07:22:14 UTC"
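For comparison, when the format is known up front the same string can be parsed in base R with no guessing at all. This is a sketch of that alternative, not a claim about which is faster on your data:

```r
x <- "04/03/2019 07:22:14"

# Explicit format and time zone, so nothing has to be inferred
p <- as.POSIXct(x, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")
format(p, "%Y-%m-%d %H:%M:%S")
# [1] "2019-03-04 07:22:14"
```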

Why is as.Date slow on a character vector?

I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.

To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.

> as.Date
function (x, ...) 
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>

> methods(as.Date)
[1] as.Date.character as.Date.date      as.Date.dates     as.Date.default  
[5] as.Date.factor    as.Date.IDate*    as.Date.numeric   as.Date.POSIXct  
[9] as.Date.POSIXlt  
   Non-visible functions are asterisked

> as.Date.character
function (x, format = "", ...) 
{
    charToDate <- function(x) {
        xx <- x[1L]
        if (is.na(xx)) {
            j <- 1L
            while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
            if (is.na(xx)) 
                f <- "%Y-%m-%d"
        }
        if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d", 
            tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d", 
            tz = "GMT"))) 
            return(strptime(x, f))
        stop("character string is not in a standard unambiguous format")
    }
    res <- if (missing(format)) 
        charToDate(x)
    else strptime(x, format, tz = "GMT")       ####  slow part, I think  ####
    as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>

Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through :

> as.POSIXct
function (x, tz = "", ...) 
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>

> methods(as.POSIXct)
[1] as.POSIXct.date    as.POSIXct.Date    as.POSIXct.dates   as.POSIXct.default
[5] as.POSIXct.IDate*  as.POSIXct.ITime*  as.POSIXct.numeric as.POSIXct.POSIXlt
   Non-visible functions are asterisked

> as.POSIXlt.Date
function (x, ...) 
{
    y <- .Internal(Date2POSIXlt(x))
    names(y$year) <- names(x)
    y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>

Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.

~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
$

Now we know we need to look for D2POSIXlt :

~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
$

Oh, we could have guessed datetime.c. Anyway, so looking at latest live copy :

datetime.c

Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes, per date!
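The layout can also be inspected from R itself without reading the C source. A minimal sketch (the exact set of components printed by names() depends on the R version, so no output is shown for it):

```r
d  <- as.Date("2012-01-01")
lt <- as.POSIXlt(d)
lt$year + 1900        # 2012

parts <- unclass(lt)  # the plain list underneath the POSIXlt class
typeof(parts$sec)     # "double"
typeof(parts$year)    # "integer"
names(parts)          # component names; varies with R version
```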

So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.

Here's a reproducible example using the number of items stated in question (3,000,000) :

> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
   user  system elapsed 
 21.681   0.060  21.760 
> system.time(strptime(Date, "%m/%d/%Y"))
   user  system elapsed 
 29.594   8.633  38.270 
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
   user  system elapsed 
 19.785   0.000  19.802

Passing tz appears to speed up strptime, and that is what as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds on your machine?
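If strptime is the bottleneck, one way to call it less often is to parse each distinct string once and look the results up. This is not from the answer above; it is a sketch that assumes the column contains many repeated dates (here at most about 4,400 distinct values), and it uses a smaller sample so it runs quickly.

```r
set.seed(1)
Range <- seq(as.Date("2000-01-01"), as.Date("2012-01-01"), by = "days")
Date  <- format(sample(Range, 1e5, replace = TRUE), "%m/%d/%Y")

u    <- unique(Date)                             # at most length(Range) strings
fast <- as.Date(u, "%m/%d/%Y")[match(Date, u)]   # parse once, then look up

all(fast == as.Date(Date, "%m/%d/%Y"))           # TRUE
```

The gain depends entirely on how repetitive the data is; with all-unique strings this does strictly more work.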

speeding up as.POSIXct with large data / issue with storing as POSIXct in data.table

I would suggest two options.

I assume you use write.csv or similar, which converts POSIXct to character when writing it out. This slows down both the writing out and reading in, as POSIXct objects are really numbers and not characters (more precisely, they are seconds since the "epoch"). So you can convert the column to numeric, write that out, and convert back to POSIXct after reading in (which will be super fast).

Another option, if you prefer to write out character columns, is to use fastPOSIXct from fasttime to speed up the conversion to POSIXct.
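The first option (the numeric round trip) could look roughly like this. It is a sketch with made-up column names and a temporary file standing in for your real one; it assumes you are happy to fix the time zone to UTC when converting back.

```r
x  <- as.POSIXct(c("2013-11-23 23:10:38", "2015-07-17 01:43:38"), tz = "UTC")
df <- data.frame(id = 1:2, when = as.numeric(x))  # seconds since the epoch

path <- tempfile(fileext = ".csv")                # hypothetical file for the demo
write.csv(df, path, row.names = FALSE)

back <- read.csv(path)
# No string parsing here, just attaching a class to numbers
back$when <- as.POSIXct(back$when, origin = "1970-01-01", tz = "UTC")
all(back$when == x)  # TRUE
```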

Why does lubridate::mdy() add day when it is missing from my input?

It is related to the order of parsing. According to ?mdy

In case of heterogeneous date formats, the ymd() family guesses formats based on a subset of the input vector. If the input vector contains many missing values or non-date strings, the subset might not contain meaningful dates

The original string consists of a month followed by a 4-digit year, whereas mdy expects month, day, year, where the year can be either 2 or 4 digits. That makes the input ambiguous: mdy parses '20' as the day and '10' as a 2-digit year. If we instead append a day and use myd (month, year, day), the year is parsed as 4 digits:

lubridate::myd(paste(txt, '01'))
#[1] "2010-01-01"

Trouble dealing with POSIXct timezones and truncating the time out of POSIXct objects

If you don't specify a timezone then R will use your system's locale as POSIXct objects must have a timezone. The difference between CEST and CET is that one is summertime and one is not. That means if you define a date during the part of the year defined as summertime then R will decide to use the summertime version of the timezone. If you want to set dates that don't use summertime versions then define them as GMT from the beginning.

formatString = "%Y-%m-%d %H:%M:%OS"
x = as.POSIXct(strptime("2013-11-23 23:10:38.000000", formatString), tz="GMT")
y = as.POSIXct(strptime("2015-07-17 01:43:38.000000", formatString), tz="GMT")

If you want to truncate out the time, don't use as.Date on a POSIXct object since as.Date is meant to convert character objects to Date objects (which aren't the same as POSIXct objects). If you want to truncate POSIXct objects with base R then you'll have to wrap either round or trunc in as.POSIXct but I would recommend checking out the lubridate package for dealing with dates and times (specifically POSIXct objects).
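In base R, the trunc-inside-as.POSIXct approach mentioned above might look like this (a minimal sketch with a GMT timestamp so the result does not depend on your local zone):

```r
x <- as.POSIXct("2013-11-23 23:10:38", tz = "GMT")

# trunc() returns POSIXlt, so convert back to POSIXct afterwards
day_start <- as.POSIXct(trunc(x, units = "days"))
format(day_start, "%Y-%m-%d %H:%M:%S")
# [1] "2013-11-23 00:00:00"
```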

If you want to keep CET but never use CEST you can use a location that doesn't observe daylight savings. According to http://www.timeanddate.com/time/zones/cet your only options are Algeria and Tunisia. According to https://en.wikipedia.org/wiki/List_of_tz_database_time_zones the valid tz would be "Africa/Algiers". Therefore you could do

formatString = "%Y-%m-%d %H:%M:%OS"
x = as.POSIXct(strptime("2013-11-23 23:10:38.000000", formatString), tz="Africa/Algiers")
y = as.POSIXct(strptime("2015-07-17 01:43:38.000000", formatString), tz="Africa/Algiers")

and both x and y would be in CET.

One more thing about setting timezones. If you tell R you want a generic timezone then it won't override daylight savings settings. That's why setting attr(y, "tzone") <- "CET" didn't have the desired result. If you did attr(y, "tzone") <- "Africa/Algiers" then it would have worked as you expected. Do be careful with conversions though because when you change the timezone it will change the time to account for the new timezone. The package lubridate has the function force_tz which changes the timezone without changing the time for cases where the initial timezone setting was wrong but the time was right.
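The difference between converting a time zone and re-labelling it can be shown in base R. This is a sketch, not lubridate's implementation: the second half re-parses the clock time in the new zone, which is only a rough base R analogue of what force_tz is described as doing above.

```r
x <- as.POSIXct("2015-07-17 01:43:38", tz = "GMT")

# Changing the attribute keeps the instant and changes the displayed clock time
y <- x
attr(y, "tzone") <- "Africa/Algiers"
format(y, "%Y-%m-%d %H:%M:%S")   # "2015-07-17 02:43:38"
as.numeric(y) == as.numeric(x)   # TRUE

# Re-parsing the clock time in the new zone keeps the clock time and moves the instant
z <- as.POSIXct(format(x, "%Y-%m-%d %H:%M:%S"), tz = "Africa/Algiers")
format(z, "%Y-%m-%d %H:%M:%S")   # "2015-07-17 01:43:38"
as.numeric(x) - as.numeric(z)    # 3600
```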

How to fast convert different time formats in large data frames?

We may use %OS instead of %S to account for decimals in seconds.

help("strptime")

Specific to R is %OSn, which for output gives the seconds truncated to
0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it
uses the setting of getOption("digits.secs"), or if that is unset, n =
0).

as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# [1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST"
# [3] "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"

This base R code is considerably faster than the package solutions, try it yourself.
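To see why %OS matters here, compare both specifiers on a string with fractional seconds. This is a small sketch with tz = "UTC" added so the result does not depend on the local zone:

```r
s <- "2018-10-04T12:13:41.333Z"

# %S reads "41" and then expects the literal "Z", but finds ".333Z"
is.na(as.POSIXct(s, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"))   # TRUE

# %OS consumes the fractional part as well
x <- as.POSIXct(s, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
as.numeric(x) %% 1   # roughly 0.333, so the fraction is kept internally
```

The fraction is stored even though the default print method hides it unless options(digits.secs = ...) is set.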

Update 1

time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")

This one is trickier. ?strptime says we should use %z for offsets from UTC, but somehow it won't work with as.POSIXct. Instead we could do this,

as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + 
  {os <- as.numeric(el(strsplit(substring(time2, 24), "\\:")))
  (os[1]*60 + os[2])*60}
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"

which cuts the unreadable part from the string, converts it to seconds and adds it to the "POSIXct" object.

If there are only hours as in time2, we could also say:

as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + 
  as.numeric(substr(time2, 24, 26))*3600
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"

That the code is slightly longer now should not obscure the fact that it runs practically as fast as the one at top of the answer.
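When the offsets differ between elements, they can also be read per element rather than once for the whole vector. The sketch below makes a different assumption from the snippets above: it treats the result as a UTC instant, so it parses the clock time as UTC and subtracts the offset. The names `stamps`, `off` and `secs` are illustrative.

```r
stamps <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")

# Clock time as written, read as if it were UTC
local <- as.POSIXct(substr(stamps, 1, 23), format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# Offset such as "+02:00", converted to signed seconds for each element
off  <- substring(stamps, 24)
sgn  <- ifelse(substr(off, 1, 1) == "-", -1, 1)
secs <- sgn * (as.numeric(substr(off, 2, 3)) * 3600 + as.numeric(substr(off, 5, 6)) * 60)

utc <- local - secs
format(utc, "%Y-%m-%d %H:%M:%S")
# [1] "2018-09-01 10:42:37" "2018-10-01 08:42:37"
```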

Update 2

You could wrap the current three variants into a function with if (nchar(x) == 29) ... else structure, such as this one:

fixDateTime <- function(x) {
  s <- split(x, nchar(x))
  if ("20" %in% names(s))
    s$`20` <- as.POSIXct(s$`20` , format="%Y-%m-%dT%H:%M:%SZ")
  else if ("24" %in% names(s))
    s$`24` <- as.POSIXct(s$`24`, format="%Y-%m-%dT%H:%M:%OSZ")
  else if ("29" %in% names(s))
    s$`29` <- as.POSIXct(substr(s$`29`, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + 
      {os <- as.numeric(el(strsplit(substring(s[[3]], 24), "\\:")))
      (os[1]*60 + os[2])*60}
  return(unsplit(s, nchar(x)))
}

res <- fixDateTime(time3)
res
# [1] "2018-07-29 15:02:05 CEST" "2018-10-04 00:00:00 CEST" "2018-10-01 00:00:00 CEST"
str(res)
# POSIXct[1:3], format: "2018-07-29 15:02:05" "2018-10-04 00:00:00" "2018-10-01 00:00:00"

Compared to the packages, only fixDateTime can handle all three defined date-time types. According to the concluding benchmark, the function is still very fast.

Note: The function logically fails if different date formats have the same nchar, and it should be customized in that case (e.g. by another split condition)! Not tested: daylight saving time behavior when adding seconds to POSIXct.
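The split/unsplit mechanism that the function relies on can be tried in isolation. This sketch only groups the strings by length and puts them back, without any date conversion, using the same three strings as time3:

```r
x <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z",
       "2018-10-01T11:42:37.000+03:00")
n <- nchar(x)

s <- split(x, n)             # one group per string length
names(s)                     # "20" "24" "29"
identical(unsplit(s, n), x)  # TRUE: original order is restored
```

Each group can then be converted with its own format before unsplit reassembles the vector, which is also why two formats with the same length end up in the same group.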

Benchmark

# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval  cld
# fixDateTime  35.46387  35.94761  40.07578  36.05923  39.54706  68.46211    10   c 
#  as.POSIXct  20.32820  20.45985  21.00461  20.62237  21.16019  23.56434    10  b   # to compare
#   lubridate  11.59311  11.68956  12.88880  12.01077  13.76151  16.54479    10 a    # produces NAs! 
#     anytime 198.57292 201.06483 203.95131 202.91368 203.62130 212.83272    10    d # produces NAs!

Data

time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z", "2018-10-04T12:13:41.333Z", 
"2018-10-04T12:13:45.479Z")
time2 <- c("2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z") 
time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z", 
           "2018-10-01T11:42:37.000+03:00")

Benchmark code

n <-  1e3
t1 <- sample(time2, n, replace=TRUE)
t2 <- sample(time3, n, replace=TRUE)

library(lubridate)
library(anytime)
microbenchmark::microbenchmark(fixDateTime=fixDateTime(t2),
                               as.POSIXct=as.POSIXct(t1, format="%Y-%m-%dT%H:%M:%OSZ"),
                               lubridate=parse_date_time(t2, "ymd_HMS"),
                               anytime=anytime(t2),
                               times=10L)
