Speed up Conversion of 2 Million Rows of Date Strings to POSIXct

You want the small and simple fasttime package by Simon Urbanek, which does this in the fastest possible way: not by calling time-parsing functions, but by doing the string parsing directly in C.

It does not support as many formats as strptime; in fact, it does not take a format string at all. But well-formed ISO-style variants, i.e. yyyy-mm-dd hh:mm:ss.fff, will work, and your / separator may just work too.
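
For instance, a minimal sketch (assuming fasttime is installed; whether the / separator round-trips cleanly is worth verifying on your version of the package):

library(fasttime)
fastPOSIXct("2014-01-01 12:30:45.123")   # canonical ISO variant
fastPOSIXct("2014/01/01 12:30:45")       # "/"-separated variant may work too
fastPOSIXct("2014-01-01")                # date-only input, parsed as midnight UTC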

How to most efficiently convert a character string like 01 Jan 2014 to POSIXct, i.e. 2014-01-01 (yyyy-mm-dd)

What about using lubridate:

x <- "01 Jan 2014"
x
[1] "01 Jan 2014"
library(lubridate)
dmy(x)
[1] "2014-01-01 UTC"

Of course, lubridate functions accept a tz argument too. To see a complete list of acceptable time zone names, see OlsonNames().
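
For example, a short sketch (the time zone name here is just an illustration):

library(lubridate)
head(OlsonNames())                        # valid time zone names
dmy("01 Jan 2014", tz = "Europe/London")  # parse directly into a chosen time zone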

Benchmark

I decided to update this answer with some empirical data, using the microbenchmark package and the lubridate option to use fasttime.

library(microbenchmark)
microbenchmark(dmy(x), times = 10000)
Unit: milliseconds
   expr      min      lq     mean   median      uq     max neval
 dmy(x) 1.992639 2.02567 2.142212 2.041514 2.07153 39.1384 10000

options(lubridate.fasttime = T)

microbenchmark(dmy(x), times = 10000)
Unit: milliseconds
   expr      min      lq     mean   median       uq      max neval
 dmy(x) 1.993326 2.02488 2.136748 2.039467 2.065326 163.2008 10000

Why is as.Date slow on a character vector?

I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.
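
A minimal sketch of that chain, using nothing beyond base R:

x <- "2014-01-01"
# strptime() does the actual parsing and returns a POSIXlt ...
class(strptime(x, "%Y-%m-%d", tz = "GMT"))
# [1] "POSIXlt" "POSIXt"
# ... which as.Date() then collapses to a Date
class(as.Date(strptime(x, "%Y-%m-%d", tz = "GMT")))
# [1] "Date"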

To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.

> as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>

> methods(as.Date)
[1] as.Date.character as.Date.date as.Date.dates as.Date.default
[5] as.Date.factor as.Date.IDate* as.Date.numeric as.Date.POSIXct
[9] as.Date.POSIXlt
Non-visible functions are asterisked

> as.Date.character
function (x, format = "", ...)
{
    charToDate <- function(x) {
        xx <- x[1L]
        if (is.na(xx)) {
            j <- 1L
            while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
            if (is.na(xx))
                f <- "%Y-%m-%d"
        }
        if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
            tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
            tz = "GMT")))
            return(strptime(x, f))
        stop("character string is not in a standard unambiguous format")
    }
    res <- if (missing(format))
        charToDate(x)
    else strptime(x, format, tz = "GMT")   #### slow part, I think ####
    as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>

Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through:

> as.POSIXct
function (x, tz = "", ...)
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>

> methods(as.POSIXct)
[1] as.POSIXct.date as.POSIXct.Date as.POSIXct.dates as.POSIXct.default
[5] as.POSIXct.IDate* as.POSIXct.ITime* as.POSIXct.numeric as.POSIXct.POSIXlt
Non-visible functions are asterisked

> as.POSIXlt.Date
function (x, ...)
{
    y <- .Internal(Date2POSIXlt(x))
    names(y$year) <- names(x)
    y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>

Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.

~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$

Now we know we need to look for D2POSIXlt :

~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$

Oh, we could have guessed datetime.c. Anyway, so looking at the latest live copy:

datetime.c

Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes, per date!
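
You can see that layout from R itself (a quick sketch; recent R versions may add zone and gmtoff fields on top of these):

# unclass() exposes the parallel vectors that make up a POSIXlt
str(unclass(as.POSIXlt(as.Date("2014-01-01"))))
# sec is a double; min, hour, mday, mon, year, wday, yday and isdst are integers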

So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.


Here's a reproducible example using the number of items stated in the question (3,000,000):

> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
user system elapsed
21.681 0.060 21.760
> system.time(strptime(Date, "%m/%d/%Y"))
user system elapsed
29.594 8.633 38.270
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
user system elapsed
19.785 0.000 19.802

Passing tz appears to speed up strptime, which is what as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see whether it takes 90 seconds on your machine?

Faster date formatting in R?

Since I wrote this before it was pointed out that this is a duplicate, I'll add it as an answer anyway. Basically, the fasttime package can help you IF your dates are AFTER 1970-01-01 00:00:00, AND they are GMT, AND they are ordered year, month, day, hour, minute, second. If you can rewrite your dates into this format, then fastPOSIXct will be quick:

#  data
date <- c( "2013/5/31 23:30" , "2013/5/31 23:35" , "2013/5/31 23:40" , "2013/5/31 23:45" )

require(fasttime)
# fasttime function
dates.ft <- fastPOSIXct( date , tz = "GMT" )

# base function
dates <- as.POSIXct( date , format= "%Y/%m/%d %H:%M")

# rough comparison
require(microbenchmark)
microbenchmark( fastPOSIXct( date , tz = "GMT" ) , as.POSIXct( date , format= "%Y/%m/%d %H:%M") , times = 100L )
#Unit: microseconds
#                                        expr     min      lq  median       uq     max neval
#               fastPOSIXct(date, tz = "GMT")  19.598  21.699  24.148  25.5485 215.927   100
# as.POSIXct(date, format = "%Y/%m/%d %H:%M") 160.633 163.433 168.332 181.9800 278.220   100

But the question would be: is it quicker to transform your dates into a format fasttime can accept, to just use as.POSIXct, or to buy a faster computer?!
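
If your strings are, say, in US-style m/d/Y order, a rough sketch of the "transform first" route looks like this (the regex is illustrative only and would need checking against real data):

# hypothetical input in "%m/%d/%Y %H:%M" order
x <- c("05/31/2013 23:30", "06/01/2013 00:15")

# reorder to year/month/day so fasttime can parse it, then convert
iso <- sub("^(\\d{1,2})/(\\d{1,2})/(\\d{4})", "\\3/\\1/\\2", x)
fasttime::fastPOSIXct(iso, tz = "GMT")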

Slow String to Date Conversion Function

Just do:

as.Date(dates, format = "%m/%d/%Y")
  1. You don't need to loop over the dates vector: as.Date() handles a whole character vector in a single shot. Your function incurs length(dates) separate calls to as.Date(), plus assignments and calls to other functions, all of which is overhead that is totally unnecessary (see the sketch after this list).
  2. You don't want to convert each individual date to a factor. In fact you don't want to convert them at all, since as.Date() will just convert them back to characters. If you did want factors, factor() is also vectorised, so you could remove the factor() line and insert dates <- as.factor(dates) outside the for() loop. But again, you don't need to do this at all, anywhere in your function.
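
For contrast, here is a sketch of the looping pattern being criticised next to the vectorised call; the slow version is a hypothetical reconstruction of the kind of function in the question, not the asker's exact code:

# hypothetical slow version: one factor() and one as.Date() call per element
slow_convert <- function(dates) {
  out <- as.Date(rep(NA_character_, length(dates)))
  for (i in seq_along(dates)) {
    out[i] <- as.Date(factor(dates[i]), format = "%m/%d/%Y")
  }
  out
}

# vectorised version: a single call over the whole vector
fast_convert <- function(dates) as.Date(dates, format = "%m/%d/%Y")

dates <- format(Sys.Date() - 0:999, "%m/%d/%Y")
identical(slow_convert(dates), fast_convert(dates))  # same result, far fewer calls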

convert character to date *quickly* in R

Simon Urbanek's fasttime library is very fast for a subset of parseable datetimes:

R> now <- Sys.time()
R> now
[1] "2012-10-15 10:07:28.981 CDT"
R> fasttime::fastPOSIXct(format(now))
[1] "2012-10-15 05:07:28.980 CDT"
R> as.Date(fasttime::fastPOSIXct(format(now)))
[1] "2012-10-15"
R>

However, it only parses ISO-style formats and assumes the input is in UTC.
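
The tz argument of fastPOSIXct only controls how the result is displayed; the string itself is always read as UTC. A small sketch, assuming fasttime is installed:

x <- "2012-10-15 10:07:28"

# the string is interpreted as UTC regardless of tz ...
utc <- fasttime::fastPOSIXct(x, tz = "UTC")
chi <- fasttime::fastPOSIXct(x, tz = "America/Chicago")

# ... so both represent the same instant, just printed differently
identical(as.numeric(utc), as.numeric(chi))  # TRUE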

Edit after 3 1/2 years: Some commenters appear to think that the fasttime package is difficult to install. I beg to differ. Here I (once again) use install.r, a simple wrapper script built on littler (and also shipped as an example with it):

edd@max:~$ install.r fasttime
trying URL 'https://cran.rstudio.com/src/contrib/fasttime_1.0-1.tar.gz'
Content type 'application/x-gzip' length 2646 bytes
==================================================
downloaded 2646 bytes

* installing *source* package ‘fasttime’ ...
** package ‘fasttime’ successfully unpacked and MD5 sums checked
** libs
ccache gcc -I/usr/share/R/include -DNDEBUG -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -O3 -Wall -pipe -pedantic -std=gnu99 -c tparse.c -o tparse.o
ccache gcc -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o fasttime.so tparse.o -L/usr/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/fasttime/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (fasttime)

The downloaded source packages are in
‘/tmp/downloaded_packages’
edd@max:~$

As you can see, the package has zero external dependencies and a single source file, and it builds without the slightest hitch. We can also see that fasttime is now on CRAN, which was not the case when this answer was written. With that, Windows and OS X binaries now exist on its CRAN page, and installation will be just as easy even if you do not install from source.

Populate a large data frame with calculated values

The following should work. Suppose we generate a data frame of 2 million rows:

> N <- 2e6
> R <- data.frame(year = sample(2000:2009,N,TRUE),
+ dayofyear = sample(365,N,TRUE),
+ time = floor(runif(N,0,12))*100+floor(runif(N,0,60)),
+ humidity = 99,
+ temp = floor(runif(N,15,40)))
> R$date <- as.Date(with(R,strptime(paste(year,dayofyear),
+ "%Y %j", tz="GMT")))
> nrow(R)
[1] 2000000
> head(R)
year dayofyear time humidity temp date
1 2000 206 307 99 39 2000-07-24
2 2009 101 1019 99 16 2009-04-11
3 2004 307 547 99 21 2004-11-02
4 2003 270 1158 99 33 2003-09-27
5 2006 21 330 99 22 2006-01-21
6 2005 154 516 99 21 2005-06-03
>

In this case, date is already a Date column, but if yours is a character column, then:

> R$date <- as.Date(R$date)

should only take a few seconds.

Now, get a list of all the unique date values. This should be quite fast:

> dates <- unique(R$date)
> print(length(dates))
[1] 3650
>

Now, run getSunlightTimes on this vector. This only took a couple of seconds on my machine using suncalc version 0.4 and R version 3.4.4:

> times <- suncalc::getSunlightTimes(dates, lat=0, lon=0)

Now, generate an index vector giving the position of each date in R$date within the vector of unique dates, dates:

> i <- match(R$date, dates)

Now, select rows of the times dataframe by this same index:

> solarNoons <- times[i,]
> nrow(solarNoons)
[1] 2000000
>

If we pick a row of R:

> R[1234567,]
year dayofyear time humidity temp date
1234567 2002 24 535 99 17 2002-01-24

you'll see that the corresponding row of solarNoons is the result for that date:

> solarNoons[1234567,]
date lat lon solarNoon nadir
2616.352 2002-01-24 12:00:00 0 0 2002-01-24 12:13:14 2002-01-24 00:13:14
sunrise sunset sunriseEnd
2616.352 2002-01-24 06:09:42 2002-01-24 18:16:46 2002-01-24 06:11:58
sunsetStart dawn dusk
2616.352 2002-01-24 18:14:30 2002-01-24 05:47:49 2002-01-24 18:38:39
nauticalDawn nauticalDusk nightEnd
2616.352 2002-01-24 05:22:22 2002-01-24 19:04:06 2002-01-24 04:56:50
night goldenHourEnd goldenHour
2616.352 2002-01-24 19:29:38 2002-01-24 06:38:39 2002-01-24 17:47:49
>

If you want, you can merge the two data frames together:

> R2 <- cbind(R, solarNoons)

This all assumes that "1.65 MM" meant 1.65 million. If you meant 1.65 million million (i.e., an American trillion), then you're going to need a bigger computer.


