Why is as.Date slow on a character vector?
I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.
To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.
> as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>
> methods(as.Date)
[1] as.Date.character as.Date.date as.Date.dates as.Date.default
[5] as.Date.factor as.Date.IDate* as.Date.numeric as.Date.POSIXct
[9] as.Date.POSIXlt
Non-visible functions are asterisked
> as.Date.character
function (x, format = "", ...)
{
charToDate <- function(x) {
xx <- x[1L]
if (is.na(xx)) {
j <- 1L
while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
if (is.na(xx))
f <- "%Y-%m-%d"
}
if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
tz = "GMT")))
return(strptime(x, f))
stop("character string is not in a standard unambiguous format")
}
res <- if (missing(format))
charToDate(x)
else strptime(x, format, tz = "GMT") #### slow part, I think ####
as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>
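Some methods in the methods() listing are asterisked as non-visible, so typing their name at the prompt will not print them. A minimal sketch, assuming only the utils package that ships with R: getS3method() looks a method up by generic and class whether or not it is visible.

```r
# Fetch the character method of as.Date by generic name and class.
m <- getS3method("as.Date", "character")

# The formals show which arguments the method accepts. The exact list
# varies between R versions, but `format` is among them.
names(formals(m))

# getAnywhere() is an alternative that also finds non-exported objects.
getAnywhere("as.Date.character")
```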
Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through:
> as.POSIXct
function (x, tz = "", ...)
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>
> methods(as.POSIXct)
[1] as.POSIXct.date as.POSIXct.Date as.POSIXct.dates as.POSIXct.default
[5] as.POSIXct.IDate* as.POSIXct.ITime* as.POSIXct.numeric as.POSIXct.POSIXlt
Non-visible functions are asterisked
> as.POSIXlt.Date
function (x, ...)
{
y <- .Internal(Date2POSIXlt(x))
names(y$year) <- names(x)
y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>
Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to know which .c file to look at.
~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
Now we know we need to look for D2POSIXlt :
~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
Oh, we could have guessed datetime.c. Anyway, look at the latest live copy of datetime.c and search in there for D2POSIXlt, and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes, per date!
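The layout described above can be inspected from R itself, without reading the C source. This is a small sketch using only base R; the exact set of components is an assumption that depends on the R version (newer versions may carry extra fields such as a zone name or a GMT offset).

```r
# Convert a single Date to POSIXlt and strip the class to see the raw list.
lt <- as.POSIXlt(as.Date("2012-01-01"))
parts <- unclass(lt)

# Storage type of each component: sec is a double, the calendar fields
# (min, hour, mday, mon, year, wday, yday, isdst) are integers.
sapply(parts, typeof)

# year is stored as years since 1900, hence the +1900 in the question.
parts$year + 1900L
```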
So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.
Here's a reproducible example using the number of items stated in question (3,000,000) :
> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
user system elapsed
21.681 0.060 21.760
> system.time(strptime(Date, "%m/%d/%Y"))
user system elapsed
29.594 8.633 38.270
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
user system elapsed
19.785 0.000 19.802
Passing tz appears to speed up strptime, which as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?
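If strptime is the bottleneck, one workaround (my own suggestion, not something the timings above measure) is to call it less often. The example draws its strings from only a few thousand distinct days, so each distinct string can be parsed once and the results mapped back with match(). The helper name parse_unique_dates is hypothetical, and the sample is smaller than 3,000,000 to keep the sketch quick.

```r
# Hypothetical helper: parse each distinct string once, then map back.
parse_unique_dates <- function(x, format) {
  u <- unique(x)
  as.Date(u, format = format)[match(x, u)]
}

set.seed(1)
Range <- seq(as.Date("2000-01-01"), as.Date("2012-01-01"), by = "days")
Date <- format(sample(Range, 100000, replace = TRUE), "%m/%d/%Y")

fast <- parse_unique_dates(Date, "%m/%d/%Y")
slow <- as.Date(Date, "%m/%d/%Y")

# Both routes should give the same dates.
all(fast == slow)
```

This only helps when values repeat heavily; for a vector of mostly unique strings the unique() and match() steps are pure overhead.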
Slow String to Date Conversion Function
Just do:
as.Date(dates, format = "%m/%d/%Y")
- You don't need to loop over the dates vector, as as.Date() can handle a vector of characters just fine in a single shot. Your function is incurring length(dates) calls to as.Date() plus some assignments to other functions, which all have overhead that is totally unnecessary.
- You don't want to convert each individual date to a factor. You don't want to convert them at all (as.Date() will just convert them back to characters). If you did want to convert them, factor() is also vectorised, so you could (but you don't need this at all, anywhere in your function) remove the factor() line and insert dates <- as.factor(dates) outside the for() loop. But again, you don't need to do this at all!
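To make the first point concrete, here is a small sketch comparing an element-by-element loop with the single vectorised call. The loop is a hypothetical stand-in for the function in the question, which is not reproduced here, and the three input strings are made up for illustration.

```r
dates <- c("01/15/2020", "02/29/2020", "12/31/2021")

# Hypothetical loop version: one as.Date() call per element.
looped <- as.Date(rep(NA, length(dates)))
for (i in seq_along(dates)) {
  looped[i] <- as.Date(dates[i], format = "%m/%d/%Y")
}

# Vectorised version: one call for the whole vector.
vectorised <- as.Date(dates, format = "%m/%d/%Y")

# Same result, far fewer function calls.
all(looped == vectorised)
```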
convert string date to R Date FAST for all dates
I can get a little speedup by using the date package:
library(date)
set.seed(21)
x <- as.character(Sys.Date()-sample(40000, 1e6, TRUE))
system.time(dDate <- as.Date(x))
# user system elapsed
# 6.54 0.01 6.56
system.time(ddate <- as.Date(as.date(x,"ymd")))
# user system elapsed
# 3.42 0.22 3.64
You might want to look at the C code it uses and see if you can modify it to be faster for your specific situation.
Using sapply on a vector of dates: Function very slow. Why?
As has been pointed out in the comments, passing the vector of dates directly to the function is way faster. Additionally, ifelse has a ton of overhead, so substituting ifelse(month(date)>=6, 0, -1) with floor((x/5.6) - (x^2)*0.001) - 1L will be much faster.
DetermineWaterYearNew <- function(date, return.interval=FALSE){
x <- month(date)
wy <- year(date) + floor((x/5.6) - (x^2)*0.001) - 1L
if(return.interval==FALSE){
return(wy)
} else {
# paste0() builds the date strings; cat() only prints and returns NULL
interval <- interval(ymd(paste0(wy, '-06-01')), ymd(paste0(wy + 1, '-05-31')))
return(interval)
}
}
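The arithmetic shortcut used in the function above is not obvious at a glance, so it is worth checking that it agrees with the ifelse() it replaces for every month. A quick base R check, with no lubridate needed:

```r
m <- 1:12

# Original branch: 0 from June onwards, -1 before June.
old <- ifelse(m >= 6, 0, -1)

# Arithmetic replacement: the floor() term is 0 for months 1-5
# and 1 for months 6-12, so subtracting 1 gives -1 and 0.
new <- floor((m / 5.6) - (m^2) * 0.001) - 1L

all(old == new)
```

The two agree only for month numbers 1 to 12, which is all that month() can return.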
Here are some benchmarks:
microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:1000]),
OldVectorized=DetermineWaterYear(tempdates[1:1000]),
NonVectorized=sapply(tempdates[1:1000],DetermineWaterYear))
Unit: microseconds
expr min lq mean median uq max neval
NewVectorized 341.954 364.1215 418.7311 395.7300 460.7955 602.627 100
OldVectorized 417.077 437.3970 496.0585 462.8485 545.1555 802.954 100
NonVectorized 42601.719 45148.3070 46452.6843 45902.4100 47341.2415 62898.476 100
Only comparing the vectorized solutions on the full gamut of dates we have:
microbenchmark(NewVectorized=DetermineWaterYearNew(tempdates[1:190000]),
OldVectorized=DetermineWaterYear(tempdates[1:190000]))
Unit: milliseconds
expr min lq mean median uq max neval
NewVectorized 26.30660 27.26575 28.97715 27.84169 29.19391 102.1697 100
OldVectorized 38.98637 40.78153 44.07461 42.55287 43.77947 114.9616 100
Converting datetime character vector into date-time format
You can use the as_datetime function from the lubridate package:
library(lubridate)
#> Warning: package 'lubridate' was built under R version 3.6.3
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
dttime = c("2021-08-03 11:59:59", "2021-08-03 12:59:59",
"2021-08-03", "2021-08-03 16:59:59")
as_datetime(dttime, tz = "UTC")
#> [1] "2021-08-03 11:59:59 UTC" "2021-08-03 12:59:59 UTC"
#> [3] "2021-08-03 00:00:00 UTC" "2021-08-03 16:59:59 UTC"
To get the result in a different time zone, change the tz argument; see ?as_datetime.
Why are my functions on lubridate dates so slow?
You're looping over every single row. It's not surprising it is slow. You could essentially do one replacement operation instead where you take a fixed difference from each date: 0 for M-F, -1 for Sat and -2 for Sun.
# 'big' sample data
x <- Sys.Date() + 0:100000
bizdays <- function(x) x - match(weekdays(x), c("Saturday","Sunday"), nomatch=0)
# since `weekdays()` is locale-specific, you could also be defensive and do:
bizdays <- function(x) x - match(format(x, "%w"), c("6","0"), nomatch=0)
system.time(bizdays(x))
# user system elapsed
# 0.36 0.00 0.35
system.time(previous_business_date_if_weekend(x))
# user system elapsed
# 45.45 0.00 45.57
identical(bizdays(x), previous_business_date_if_weekend(x))
#[1] TRUE
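A tiny worked example shows what the fixed offsets do. It uses the locale-independent version of bizdays from above on three consecutive days, a Friday, a Saturday and a Sunday; the dates themselves are picked only for illustration.

```r
bizdays <- function(x) x - match(format(x, "%w"), c("6", "0"), nomatch = 0)

# 2021-08-06 is a Friday, followed by Saturday and Sunday.
x <- as.Date(c("2021-08-06", "2021-08-07", "2021-08-08"))

# match() yields 0, 1 and 2, so all three map back to the Friday.
bizdays(x)
```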
convert character to date *quickly* in R
Simon Urbanek's fasttime library is very fast for a subset of parseable datetimes:
R> now <- Sys.time()
R> now
[1] "2012-10-15 10:07:28.981 CDT"
R> fasttime::fastPOSIXct(format(now))
[1] "2012-10-15 05:07:28.980 CDT"
R> as.Date(fasttime::fastPOSIXct(format(now)))
[1] "2012-10-15"
R>
However, it only parses ISO formats and assumes UTC as the time zone.
Edit after 3 1/2 years: Some commenters appear to think that the fasttime package is difficult to install. I beg to differ. Here I (once again) use install.r, which is just a simple wrapper using littler (and is also shipped as an example with it):
edd@max:~$ install.r fasttime
trying URL 'https://cran.rstudio.com/src/contrib/fasttime_1.0-1.tar.gz'
Content type 'application/x-gzip' length 2646 bytes
==================================================
downloaded 2646 bytes
* installing *source* package ‘fasttime’ ...
** package ‘fasttime’ successfully unpacked and MD5 sums checked
** libs
ccache gcc -I/usr/share/R/include -DNDEBUG -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -g -O3 -Wall -pipe -pedantic -std=gnu99 -c tparse.c -o tparse.o
ccache gcc -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o fasttime.so tparse.o -L/usr/lib/R/lib -lR
installing to /usr/local/lib/R/site-library/fasttime/libs
** R
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (fasttime)
The downloaded source packages are in
‘/tmp/downloaded_packages’
edd@max:~$
As you can see, the package has zero external dependencies, one source file and builds without the slightest hitch. We can also see that fasttime is now on CRAN which was not the case when the answer was written. With that, Windows and OS X binaries now do exist at that page and the installation will be as easy as it was for me even when you do not install from source.
Why does as.Date return NA in one case, and doesn't return in another?
The parsing of date strings depends on the machine's language settings. If you want to work with English date strings, set the locale to (British or American) English:
> Sys.setlocale("LC_ALL", 'en_GB.UTF-8')
[1] "LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=es_ES.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=es_ES.UTF-8;LC_IDENTIFICATION=C"
> as.Date('Dec 15, 2000', format = '%b %d, %Y')
[1] "2000-12-15"
Edit
To be more specific, the locale category LC_TIME is the one that determines the parsing behaviour of date strings:
Sys.setlocale("LC_TIME", 'en_GB.UTF-8')
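Changing the locale for the whole session can affect other output, so it may be preferable to switch LC_TIME only for the duration of the parse. The sketch below is an assumption on my part rather than part of the answer: the helper name parse_in_c_locale is hypothetical, and it uses the "C" locale, which provides English month names and should be available everywhere, instead of en_GB.UTF-8.

```r
# Hypothetical helper: parse with English month names, then restore LC_TIME.
parse_in_c_locale <- function(x, format) {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  as.Date(x, format = format)
}

d <- parse_in_c_locale("Dec 15, 2000", "%b %d, %Y")
d
```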
Why is by on a vector not from a data.table column very slow?
Seems like I forgot to update this post.
This was fixed long back in commit #1039 of v1.8.11. From NEWS:
Fixed #5106 where DT[, .N, by=y] where y is a vector with length(y) = nrow(DT), but y is not a column in DT. Thanks to colinfang for reporting.
Testing on v1.8.11 commit 1187:
require(data.table)
test <- data.table(x=sample.int(10, 1000000, replace=TRUE))
y <- test$x
system.time(ans1 <- test[,.N, by=x])
# user system elapsed
# 0.015 0.000 0.016
system.time(ans2 <- test[,.N, by=y])
# user system elapsed
# 0.015 0.000 0.015
setnames(ans2, "y", "x")
identical(ans1, ans2) # [1] TRUE