Speed up Conversion of 2 Million Rows of Date Strings to POSIXct

You want the small and simple fasttime package by Simon Urbanek, which does this in the fastest possible way: not by calling time-parsing functions, but by doing the string parsing directly in C.

It does not support as many formats as strptime; in fact, it does not take a format string at all. But well-formed ISO-style variants, i.e. yyyy-mm-dd hh:mm:ss.fff, will work, and your / separator may just work too.
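
For instance, a minimal sketch (assuming fasttime is installed; whether the / separator round-trips cleanly is worth verifying on your version of the package):

library(fasttime)
fastPOSIXct("2014-01-01 12:30:45.123")   # canonical ISO variant
fastPOSIXct("2014/01/01 12:30:45")       # "/"-separated variant may work too
fastPOSIXct("2014-01-01")                # date-only input, parsed as midnight UTC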

How to most efficiently convert a character string like 01 Jan 2014 to POSIXct, i.e. 2014-01-01 (yyyy-mm-dd)

What about using lubridate:

x <- "01 Jan 2014"
x
[1] "01 Jan 2014"
library(lubridate)
dmy(x)
[1] "2014-01-01 UTC"

Of course, lubridate functions accept a tz argument too. To see a complete list of acceptable time zone names, see OlsonNames().
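
For example, a short sketch (the time zone name here is just an illustration):

library(lubridate)
head(OlsonNames())                        # valid time zone names
dmy("01 Jan 2014", tz = "Europe/London")  # parse directly into a chosen time zone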

Benchmark

I decided to update this answer with some empirical data, using the microbenchmark package and the lubridate option to use fasttime.

library(microbenchmark)
microbenchmark(dmy(x), times = 10000)
Unit: milliseconds
   expr      min      lq     mean   median      uq     max neval
 dmy(x) 1.992639 2.02567 2.142212 2.041514 2.07153 39.1384 10000

options(lubridate.fasttime = T)

microbenchmark(dmy(x), times = 10000)
Unit: milliseconds
   expr      min      lq     mean   median       uq      max neval
 dmy(x) 1.993326 2.02488 2.136748 2.039467 2.065326 163.2008 10000

Why is as.Date slow on a character vector?

I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.
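
A minimal sketch of that chain, using nothing beyond base R:

x <- "2014-01-01"
# strptime() does the actual parsing and returns a POSIXlt ...
class(strptime(x, "%Y-%m-%d", tz = "GMT"))
# [1] "POSIXlt" "POSIXt"
# ... which as.Date() then collapses to a Date
class(as.Date(strptime(x, "%Y-%m-%d", tz = "GMT")))
# [1] "Date"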

To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.

> as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>

> methods(as.Date)
[1] as.Date.character as.Date.date as.Date.dates as.Date.default
[5] as.Date.factor as.Date.IDate* as.Date.numeric as.Date.POSIXct
[9] as.Date.POSIXlt
Non-visible functions are asterisked

> as.Date.character
function (x, format = "", ...)
{
    charToDate <- function(x) {
        xx <- x[1L]
        if (is.na(xx)) {
            j <- 1L
            while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
            if (is.na(xx))
                f <- "%Y-%m-%d"
        }
        if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
            tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
            tz = "GMT")))
            return(strptime(x, f))
        stop("character string is not in a standard unambiguous format")
    }
    res <- if (missing(format))
        charToDate(x)
    else strptime(x, format, tz = "GMT")   #### slow part, I think ####
    as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>

Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through:

> as.POSIXct
function (x, tz = "", ...)
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>

> methods(as.POSIXct)
[1] as.POSIXct.date as.POSIXct.Date as.POSIXct.dates as.POSIXct.default
[5] as.POSIXct.IDate* as.POSIXct.ITime* as.POSIXct.numeric as.POSIXct.POSIXlt
Non-visible functions are asterisked

> as.POSIXlt.Date
function (x, ...)
{
    y <- .Internal(Date2POSIXlt(x))
    names(y$year) <- names(x)
    y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>

Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.

~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$

Now we know we need to look for D2POSIXlt :

~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$

Oh, we could have guessed datetime.c. Anyway, so looking at the latest live copy:

datetime.c

Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes, per date!
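
You can see that layout from R itself (a quick sketch; recent R versions may add zone and gmtoff fields on top of these):

# unclass() exposes the parallel vectors that make up a POSIXlt
str(unclass(as.POSIXlt(as.Date("2014-01-01"))))
# sec is a double; min, hour, mday, mon, year, wday, yday and isdst are integers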

So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.


Here's a reproducible example using the number of items stated in the question (3,000,000):

> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
user system elapsed
21.681 0.060 21.760
> system.time(strptime(Date, "%m/%d/%Y"))
user system elapsed
29.594 8.633 38.270
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
user system elapsed
19.785 0.000 19.802

Passing tz appears to speed up strptime, which is what as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see whether it takes 90 seconds on your machine?

Faster date formatting in R?

Since I wrote this before it was pointed out that this is a duplicate, I'll add it as an answer anyway. Basically, the fasttime package can help you IF your dates are AFTER 1970-01-01 00:00:00, AND they are GMT, AND they are ordered year, month, day, hour, minute, second. If you can rewrite your dates into this format, then fastPOSIXct will be quick:

#  data
date <- c( "2013/5/31 23:30" , "2013/5/31 23:35" , "2013/5/31 23:40" , "2013/5/31 23:45" )

require(fasttime)
# fasttime function
dates.ft <- fastPOSIXct( date , tz = "GMT" )

# base function
dates <- as.POSIXct( date , format= "%Y/%m/%d %H:%M")

# rough comparison
require(microbenchmark)
microbenchmark( fastPOSIXct( date , tz = "GMT" ) , as.POSIXct( date , format= "%Y/%m/%d %H:%M") , times = 100L )
#Unit: microseconds
#                                        expr     min      lq  median       uq     max neval
#               fastPOSIXct(date, tz = "GMT")  19.598  21.699  24.148  25.5485 215.927   100
# as.POSIXct(date, format = "%Y/%m/%d %H:%M") 160.633 163.433 168.332 181.9800 278.220   100

But the question would be: is it quicker to transform your dates into a format fasttime can accept, to just use as.POSIXct, or to buy a faster computer?!
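
If your strings are, say, in US-style m/d/Y order, a rough sketch of the "transform first" route looks like this (the regex is illustrative only and would need checking against real data):

# hypothetical input in "%m/%d/%Y %H:%M" order
x <- c("05/31/2013 23:30", "06/01/2013 00:15")

# reorder to year/month/day so fasttime can parse it, then convert
iso <- sub("^(\\d{1,2})/(\\d{1,2})/(\\d{4})", "\\3/\\1/\\2", x)
fasttime::fastPOSIXct(iso, tz = "GMT")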

Slow String to Date Conversion Function

Just do:

as.Date(dates, format = "%m/%d/%Y")
  1. You don't need to loop over the dates vector: as.Date() handles a whole character vector in a single shot. Your function incurs length(dates) separate calls to as.Date(), plus assignments and calls to other functions, all of which is overhead that is totally unnecessary (see the sketch after this list).
  2. You don't want to convert each individual date to a factor. In fact you don't want to convert them at all, since as.Date() will just convert them back to characters. If you did want factors, factor() is also vectorised, so you could remove the factor() line and insert dates <- as.factor(dates) outside the for() loop. But again, you don't need to do this at all, anywhere in your function.
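
For contrast, here is a sketch of the looping pattern being criticised next to the vectorised call; the slow version is a hypothetical reconstruction of the kind of function in the question, not the asker's exact code:

# hypothetical slow version: one factor() and one as.Date() call per element
slow_convert <- function(dates) {
  out <- as.Date(rep(NA_character_, length(dates)))
  for (i in seq_along(dates)) {
    out[i] <- as.Date(factor(dates[i]), format = "%m/%d/%Y")
  }
  out
}

# vectorised version: a single call over the whole vector
fast_convert <- function(dates) as.Date(dates, format = "%m/%d/%Y")

dates <- format(Sys.Date() - 0:999, "%m/%d/%Y")
identical(slow_convert(dates), fast_convert(dates))  # same result, far fewer calls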

convert character to date *quickly* in R

Simon Urbanek's fasttime library is very fast for a subset of parseable datetimes:

R> now <- Sys.time()
R> now
[1] "2012-10-15 10:07:28.981 CDT"
R> fasttime::fastPOSIXct(format(now))
[1] "2012-10-15 05:07:28.980 CDT"
R> as.Date(fasttime::fastPOSIXct(format(now)))
[1] "2012-10-15"
R>

However, it only parses ISO-style formats and assumes the input is in UTC.
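
The tz argument of fastPOSIXct only controls how the result is displayed; the string itself is always read as UTC. A small sketch, assuming fasttime is installed:

x <- "2012-10-15 10:07:28"

# the string is interpreted as UTC regardless of tz ...
utc <- fasttime::fastPOSIXct(x, tz = "UTC")
chi <- fasttime::fastPOSIXct(x, tz = "America/Chicago")

# ... so both represent the same instant, just printed differently
identical(as.numeric(utc), as.numeric(chi))  # TRUE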

Edit after 3 1/2 years: Some commenters appear to think that the fasttime package is difficult to install. I beg to differ. Here I (once again) use install.r, a simple wrapper script built on littler (and also shipped as an example with it):

edd@max:~$ install.r fasttime
trying URL 'https://cran.rstudio.com/src/contrib/fasttime_1.0-1.tar.gz'
Content type 'application/x-gzip' length 2646 bytes
==================================================
downloaded 2646 bytes

* installing *source* package ‘fasttime’ ...
** package ‘fasttime’ successfully unpacked and MD5 sums checked
** libs
ccache gcc -I/usr/share/R/include -DNDEBUG -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -O3 -Wall -pipe -pedantic -std=gnu99 -c tparse.c -o tparse.o
ccache gcc -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o fasttime.so tparse.o -L/usr/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/fasttime/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (fasttime)

The downloaded source packages are in
‘/tmp/downloaded_packages’
edd@max:~$

As you can see, the package has zero external dependencies and a single source file, and it builds without the slightest hitch. We can also see that fasttime is now on CRAN, which was not the case when this answer was written. With that, Windows and OS X binaries now exist on its CRAN page, and installation will be just as easy even if you do not install from source.

Populate a large data frame with calculated values

The following should work. Suppose we generate a data frame of 2 million rows:

> N <- 2e6
> R <- data.frame(year = sample(2000:2009,N,TRUE),
+ dayofyear = sample(365,N,TRUE),
+ time = floor(runif(N,0,12))*100+floor(runif(N,0,60)),
+ humidity = 99,
+ temp = floor(runif(N,15,40)))
> R$date <- as.Date(with(R,strptime(paste(year,dayofyear),
+ "%Y %j", tz="GMT")))
> nrow(R)
[1] 2000000
> head(R)
year dayofyear time humidity temp date
1 2000 206 307 99 39 2000-07-24
2 2009 101 1019 99 16 2009-04-11
3 2004 307 547 99 21 2004-11-02
4 2003 270 1158 99 33 2003-09-27
5 2006 21 330 99 22 2006-01-21
6 2005 154 516 99 21 2005-06-03
>

In this case, date is already a Date column, but if yours is a character column, then:

> R$date <- as.Date(R$date)

should only take a few seconds.

Now, get a list of all the unique date values. This should be quite fast:

> dates <- unique(R$date)
> print(length(dates))
[1] 3650
>

Now, run getSunlightTimes on this vector. This only took a couple of seconds on my machine using suncalc version 0.4 and R version 3.4.4:

> times <- suncalc::getSunlightTimes(dates, lat=0, lon=0)

Now, generate an index vector giving the position of each date in R$date within the vector of unique dates, dates:

> i <- match(R$date, dates)

Now, select rows of the times dataframe by this same index:

> solarNoons <- times[i,]
> nrow(solarNoons)
[1] 2000000
>

If we pick a row of R:

> R[1234567,]
year dayofyear time humidity temp date
1234567 2002 24 535 99 17 2002-01-24

you'll see that the corresponding row of solarNoons is the result for that date:

> solarNoons[1234567,]
date lat lon solarNoon nadir
2616.352 2002-01-24 12:00:00 0 0 2002-01-24 12:13:14 2002-01-24 00:13:14
sunrise sunset sunriseEnd
2616.352 2002-01-24 06:09:42 2002-01-24 18:16:46 2002-01-24 06:11:58
sunsetStart dawn dusk
2616.352 2002-01-24 18:14:30 2002-01-24 05:47:49 2002-01-24 18:38:39
nauticalDawn nauticalDusk nightEnd
2616.352 2002-01-24 05:22:22 2002-01-24 19:04:06 2002-01-24 04:56:50
night goldenHourEnd goldenHour
2616.352 2002-01-24 19:29:38 2002-01-24 06:38:39 2002-01-24 17:47:49
>

If you want, you can merge the two data frames together:

> R2 <- cbind(R, solarNoons)

This all assumes that "1.65 MM" meant 1.65 million. If you meant 1.65 million million (i.e., an American trillion), then you're going to need a bigger computer.


