Why Is as.Date Slow on a Character Vector

Why is as.Date slow on a character vector?

I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.

To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.

> as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>

> methods(as.Date)
[1] as.Date.character as.Date.date as.Date.dates as.Date.default
[5] as.Date.factor as.Date.IDate* as.Date.numeric as.Date.POSIXct
[9] as.Date.POSIXlt
Non-visible functions are asterisked

> as.Date.character
function (x, format = "", ...)
{
    charToDate <- function(x) {
        xx <- x[1L]
        if (is.na(xx)) {
            j <- 1L
            while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
            if (is.na(xx))
                f <- "%Y-%m-%d"
        }
        if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
            tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
            tz = "GMT")))
            return(strptime(x, f))
        stop("character string is not in a standard unambiguous format")
    }
    res <- if (missing(format))
        charToDate(x)
    else strptime(x, format, tz = "GMT")    #### slow part, I think ####
    as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>

Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through:

> as.POSIXct
function (x, tz = "", ...)
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>

> methods(as.POSIXct)
[1] as.POSIXct.date as.POSIXct.Date as.POSIXct.dates as.POSIXct.default
[5] as.POSIXct.IDate* as.POSIXct.ITime* as.POSIXct.numeric as.POSIXct.POSIXlt
Non-visible functions are asterisked

> as.POSIXlt.Date
function (x, ...)
{
    y <- .Internal(Date2POSIXlt(x))
    names(y$year) <- names(x)
    y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>

Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.

~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$

Now we know we need to look for D2POSIXlt :

~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$

Oh, we could have guessed datetime.c. Anyway, so looking at the latest live copy:

datetime.c

Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes, per date!
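
You can see the same structure from the R prompt, without reading the C source, by unclassing a POSIXlt value (a quick illustrative check, not output from the original answer):

str(unclass(as.POSIXlt(as.Date("2000-01-01"))))
# sec is a double vector; min, hour, mday, mon, year, wday, yday and isdst
# are integer vectors -- hence 8 + 8*4 = 40 bytes per date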

So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.


Here's a reproducible example using the number of items stated in the question (3,000,000):

> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
user system elapsed
21.681 0.060 21.760
> system.time(strptime(Date, "%m/%d/%Y"))
user system elapsed
29.594 8.633 38.270
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
user system elapsed
19.785 0.000 19.802

Passing tz appears to speed up strptime (which as.Date.character does anyway), so maybe it depends on your locale. Either way, strptime appears to be the culprit, not data.table. Perhaps rerun this example and see whether it takes 90 seconds for you on your machine?
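
One indirect way to avoid most of the strptime calls, when the character vector contains many repeated dates (as the sampled example above does), is to parse only the unique strings and index back into the result. A sketch reusing the Date vector built above (timings not shown):

u <- unique(Date)                              # only ~4,400 distinct strings
res <- as.Date(u, "%m/%d/%Y")[match(Date, u)]  # strptime runs once per unique value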

Slow String to Date Conversion Function

Just do:

as.Date(dates, format = "%m/%d/%Y")
  1. You don't need to loop over the dates vector, as as.Date() can handle a vector of characters just fine in a single shot (see the sketch after this list). Your function incurs length(dates) calls to as.Date(), plus assignments and calls to other functions, all of which have overhead that is totally unnecessary.
  2. You don't want to convert each individual date to a factor. You don't want to convert them at all (as.Date() will just convert them back to characters). If you did want to convert them, factor() is also vectorised, so you could remove the factor() line from the loop and insert dates <- as.factor(dates) before it. But again, you don't need to do this at all, anywhere in your function.
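
The looping function from the question isn't reproduced here, but the contrast described above looks roughly like this sketch (with made-up example data):

dates <- format(Sys.Date() - sample(1000, 1e5, TRUE), "%m/%d/%Y")

# loop version: one as.Date() call (plus a factor() conversion) per element
slow <- as.Date(rep(NA_character_, length(dates)))
for (i in seq_along(dates)) slow[i] <- as.Date(factor(dates[i]), format = "%m/%d/%Y")

# vectorised version: a single call over the whole vector
fast <- as.Date(dates, format = "%m/%d/%Y")

identical(slow, fast)   # should be TRUE, with the loop being far slower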

convert string date to R Date FAST for all dates

I can get a little speedup by using the date package:

library(date)
set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
system.time(dDate <- as.Date(x))
# user system elapsed
# 6.54 0.01 6.56
system.time(ddate <- as.Date(as.date(x,"ymd")))
# user system elapsed
# 3.42 0.22 3.64

You might want to look at the C code it uses and see if you can modify it to be faster for your specific situation.

Using sapply on a vector of dates: Function very slow. Why?

As has been pointed out in the comments, passing the vector of dates directly to the function is way faster. Additionally, ifelse has a ton of overhead, so replacing ifelse(month(date)>=6, 0, -1) with floor((x/5.6) - (x^2)*0.001) - 1L, where x is the month number, will be much faster.
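
A quick sanity check of that arithmetic replacement (x stands for the month number):

x <- 1:12
floor((x/5.6) - (x^2)*0.001) - 1L
# [1] -1 -1 -1 -1 -1  0  0  0  0  0  0  0   i.e. -1 for Jan-May, 0 for Jun-Dec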

DetermineWaterYearNew <- function(date, return.interval=FALSE){
    # month(), year(), ymd() and interval() come from lubridate
    x <- month(date)
    wy <- year(date) + floor((x/5.6) - (x^2)*0.001) - 1L
    if(return.interval==FALSE){
        return(wy)
    } else {
        # build the water-year boundary strings and return the interval
        interval <- interval(ymd(paste(wy, '06-01', sep='-')),
                             ymd(paste(wy+1, '05-31', sep='-')))
        return(interval)
    }
}
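
For reference, a small usage sketch (tempdates from the benchmarks below isn't defined in this excerpt, so a toy vector is assumed):

library(lubridate)
d <- ymd(c("2020-05-31", "2020-06-01"))
DetermineWaterYearNew(d)                          # 2019 2020
DetermineWaterYearNew(d, return.interval = TRUE)  # 2019-06-01--2020-05-31, 2020-06-01--2021-05-31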

Here are some benchmarks:

microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:1000]),
OldVectorized=DetermineWaterYear(tempdates[1:1000]),
NonVectorized=sapply(tempdates[1:1000],DetermineWaterYear))
Unit: microseconds
expr min lq mean median uq max neval
NewVectorized 341.954 364.1215 418.7311 395.7300 460.7955 602.627 100
OldVectorized 417.077 437.3970 496.0585 462.8485 545.1555 802.954 100
NonVectorized 42601.719 45148.3070 46452.6843 45902.4100 47341.2415 62898.476 100

Only comparing the vectorized solutions on the full gamut of dates we have:

microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:190000]),
OldVectorized=DetermineWaterYear(tempdates[1:190000]))
Unit: milliseconds
expr min lq mean median uq max neval
NewVectorized 26.30660 27.26575 28.97715 27.84169 29.19391 102.1697 100
OldVectorized 38.98637 40.78153 44.07461 42.55287 43.77947 114.9616 100

Converting datetime character vector into date-time format

You can use the as_datetime() function from the lubridate package:

library(lubridate)
#> Warning: package 'lubridate' was built under R version 3.6.3
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
dttime = c("2021-08-03 11:59:59", "2021-08-03 12:59:59",
"2021-08-03", "2021-08-03 16:59:59")
as_datetime(dttime, tz = "UTC")
#> [1] "2021-08-03 11:59:59 UTC" "2021-08-03 12:59:59 UTC"
#> [3] "2021-08-03 00:00:00 UTC" "2021-08-03 16:59:59 UTC"

You can parse the strings in, or convert the result to, another timezone; see ?as_datetime.
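
For example (an illustrative sketch; the timezone name is just an example), you can either interpret the strings as clock times in another timezone, or keep the UTC instants and merely display them elsewhere with with_tz():

as_datetime(dttime, tz = "America/New_York")                    # parse as New York clock times
with_tz(as_datetime(dttime, tz = "UTC"), "America/New_York")    # same UTC instants, shown in New York time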

Why are my functions on lubridate dates so slow?

You're looping over every single row, so it's not surprising it's slow. You could essentially do one vectorised operation instead, where you add a fixed offset to each date: 0 for Mon-Fri, -1 for Sat and -2 for Sun.

# 'big' sample data
x <- Sys.Date() + 0:100000

bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)

# since `weekdays()` is locale-specific, you could also be defensive and do:
bizdays <- function(x) x - match(format(x, "%w"), c("6","0"), nomatch=0)

system.time(bizdays(x))
# user system elapsed
# 0.36 0.00 0.35

system.time(previous_business_date_if_weekend(x))
# user system elapsed
# 45.45 0.00 45.57

identical(bizdays(x), previous_business_date_if_weekend(x))
#[1] TRUE

convert character to date *quickly* in R

Simon Urbanek's fasttime package is very fast for a subset of parseable datetimes:

R> now <- Sys.time()
R> now
[1] "2012-10-15 10:07:28.981 CDT"
R> fasttime::fastPOSIXct(format(now))
[1] "2012-10-15 05:07:28.980 CDT"
R> as.Date(fasttime::fastPOSIXct(format(now)))
[1] "2012-10-15"
R>

However, it only parses ISO formats and assumes UTC as the timezone.
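
If your strings aren't ISO-ordered, one workaround (a sketch assuming mm/dd/yyyy input, not from the original answer) is to rearrange the fields into ISO order with substr() and then hand them to fastPOSIXct():

d <- c("10/15/2012", "01/02/2013")
iso <- paste(substr(d, 7, 10), substr(d, 1, 2), substr(d, 4, 5), sep = "-")
as.Date(fasttime::fastPOSIXct(iso))
# [1] "2012-10-15" "2013-01-02"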

Edit after 3 1/2 years: Some commenters appear to think that the fasttime package is difficult to install. I beg to differ. Here I (once again) use install.r, which is just a simple wrapper using littler (and also shipped as an example with it):

edd@max:~$ install.r fasttime
trying URL 'https://cran.rstudio.com/src/contrib/fasttime_1.0-1.tar.gz'
Content type 'application/x-gzip' length 2646 bytes
==================================================
downloaded 2646 bytes

* installing *source* package ‘fasttime’ ...
** package ‘fasttime’ successfully unpacked and MD5 sums checked
** libs
ccache gcc -I/usr/share/R/include -DNDEBUG -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -O3 -Wall -pipe -pedantic -std=gnu99 -c tparse.c -o tparse.o
ccache gcc -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o fasttime.so tparse.o -L/usr/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/fasttime/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (fasttime)

The downloaded source packages are in
‘/tmp/downloaded_packages’
edd@max:~$

As you can see, the package has zero external dependencies and a single source file, and it builds without the slightest hitch. We can also see that fasttime is now on CRAN, which was not the case when this answer was first written. With that, Windows and OS X binaries now exist on that page, and installation will be as easy as it was for me even if you do not install from source.

Why does as.Date return NA in one case but not in another?

The parsing of date strings depends on the machine's language settings. If you want to work with English date strings, set the locale to (British or American) English:

> Sys.setlocale("LC_ALL", 'en_GB.UTF-8')
[1] "LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=es_ES.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=es_ES.UTF-8;LC_IDENTIFICATION=C"
> as.Date('Dec 15, 2000', format = '%b %d, %Y')
[1] "2000-12-15"

Edit

To be more specific, the environment variable LC_TIME is the one that determines the parsing behaviour of date strings:

Sys.setlocale("LC_TIME", 'en_GB.UTF-8')
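
If you only need English parsing temporarily, it can be worth saving and restoring the current setting (a small sketch; the locale name assumes a Unix-like system):

old <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "en_GB.UTF-8")
as.Date("Dec 15, 2000", format = "%b %d, %Y")   # "2000-12-15"
Sys.setlocale("LC_TIME", old)                   # put the original locale back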

Why is by on a vector not from a data.table column very slow?

Seems like I forgot to update this post.

This was fixed long back in commit #1039 of v1.8.11. From NEWS:

Fixed #5106 where DT[, .N, by=y] where y is a vector with length(y) = nrow(DT), but y is not a column in DT. Thanks to colinfang for reporting.

Testing on v1.8.11 commit 1187:

require(data.table)
test <- data.table(x=sample.int(10, 1000000, replace=TRUE))
y <- test$x

system.time(ans1 <- test[,.N, by=x])
# user system elapsed
# 0.015 0.000 0.016

system.time(ans2 <- test[,.N, by=y])
# user system elapsed
# 0.015 0.000 0.015

setnames(ans2, "y", "x")
identical(ans1, ans2) # [1] TRUE

