Large integers in data.table. Grouping results different in 1.9.2 compared to 1.8.10

Yes, the result in v1.8.10 was the correct behaviour. We improved the method of rounding in v1.9.2. That's best explained here :

Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2

That meant we went backwards on supporting integers > 2^31 stored in type numeric. That's now addressed in v1.9.3 (available from R-Forge), see NEWS :

o bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs, and to Clayton Stanley.

Reminder: fread() has been able to detect and read integer64 for a while.
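For example (a quick sketch of mine, not from the original answers), fread detects such a column automatically when bit64 is installed:

library(data.table)
library(bit64)
dt <- fread("id
1234567890123
1234567890124")
class(dt$id)
#[1] "integer64"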

o New function setNumericRounding() may be used to reduce to 1 byte
or 0 byte rounding when joining to or grouping columns of type numeric, #5369.
See example in ?setNumericRounding and NEWS item from v1.9.2.
getNumericRounding() returns the current setting.

So you can either call setNumericRounding(0) to switch off rounding globally for all numeric columns, or better, use the more appropriate type for the column: bit64::integer64 now that it's supported.
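For example (illustration only, not from the original answer), the setting can be inspected and changed at any time:

getNumericRounding()    # the default in these versions is 2 (bytes)
#[1] 2
setNumericRounding(0)   # switch off rounding for all numeric joins and groupings
getNumericRounding()
#[1] 0
setNumericRounding(2)   # restore the default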

The change in v1.9.2 was :

o Numeric data is still joined and grouped within tolerance as before but instead of tolerance
being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default)
the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate
for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle.
A few functions provided a 'tolerance' argument but this wasn't being passed through so has
been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release [DONE].
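To make "rounded to the last 2 bytes" concrete, here is a rough sketch of mine: it truncates rather than rounds, assumes a little-endian platform, and is not the package's actual C implementation, but it shows why roughly 11 significant figures survive:

mask2bytes <- function(x) {
  b <- writeBin(as.double(x), raw(), size = 8L)  # one double -> 8 raw bytes
  b[1:2] <- as.raw(0L)                           # zero the 2 least significant mantissa bytes
  readBin(b, what = "double", size = 8L)
}
mask2bytes(1.0000000000001)   # differs from 1 only beyond ~11 s.f., so collapses to exactly 1
#[1] 1
mask2bytes(1.23456789012345)  # agrees with the input to roughly 11 significant figures
#[1] 1.234568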

The example in ?setNumericRounding is :

> DT = data.table(a=seq(0,1,by=0.2), b=1:2, key="a")
> DT
     a b
1: 0.0 1
2: 0.2 2
3: 0.4 1
4: 0.6 2
5: 0.8 1
6: 1.0 2
> setNumericRounding(0)   # turn off rounding; i.e. if we didn't round
> DT[.(0.4)]   # works
     a b
1: 0.4 1
> DT[.(0.6)]   # no match! confusing to users
     a  b      # 0.6 is clearly there in DT, and 0.4 worked ok!
1: 0.6 NA
>
> setNumericRounding(2)   # restore default
> DT[.(0.6)]   # now works as the user expects
     a b
1: 0.6 2
>
> # using type 'numeric' for integers > 2^31 (typically ids)
> DT = data.table(id = c(1234567890123, 1234567890124, 1234567890125), val=1:3)
> DT[,.N,by=id]   # 1 row (the last digit has been rounded)
             id N
1: 1.234568e+12 3
> setNumericRounding(0)   # turn off rounding
> DT[,.N,by=id]   # 3 rows (the last digit wasn't rounded)
             id N
1: 1.234568e+12 1
2: 1.234568e+12 1
3: 1.234568e+12 1
> # but, better to use bit64::integer64 for such ids instead of numeric
> setNumericRounding(2)   # restore default, preferred
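Following on from that last comment (my addition, not part of the ?setNumericRounding example), the same ids stored as bit64::integer64 group exactly in v1.9.3+ without touching the rounding setting:

> library(bit64)
> DT = data.table(id = as.integer64(c("1234567890123", "1234567890124", "1234567890125")), val = 1:3)
> DT[,.N,by=id]   # 3 rows; integer64 is exact, no rounding involved
              id N
1: 1234567890123 1
2: 1234567890124 1
3: 1234567890125 1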

Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2

It is worth reading R FAQ 7.31 and thinking about the accuracy of floating point representations.

I can't reproduce this in the current CRAN version (1.9.2), using:

R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)

My guess is that the change in behaviour is related to this NEWS item:

o Numeric data is still joined and grouped within tolerance as before but instead of tolerance
being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default)
the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate
for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle.
A few functions provided a 'tolerance' argument but this wasn't being passed through so has
been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release.


Update from Matt

Yes this was a deliberate change in v1.9.2 and data.table now distinguishes 0.0000000000000000000000000001 from 0 (as user3340145 rightly thought it should) due to the improved rounding method highlighted above from NEWS.

I've also added the for loop test from Rick's answer to the test suite.

Btw, #5369 is now implemented in v1.9.3 (although neither of these is needed for this question) :

o bit64::integer64 now works in grouping and joins, #5369. Thanks to
James Sams for highlighting UPCs.

o New function setNumericRounding() may be used to reduce to 1 byte
or 0 byte rounding when joining to or grouping columns of type 'numeric', #5369.
See example in ?setNumericRounding and NEWS item from v1.9.2.
getNumericRounding() returns the current setting.

Notice that rounding is now (as from v1.9.2) about the accuracy of the significand; i.e. the number of significant figures. 0.0000000000000000000000000001 == 1.0e-28 is accurate to just 1 s.f., so the new rounding method doesn't group this together with 0.0.
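A minimal illustration of that point (mine, not from the original answer):

> DT = data.table(x = c(1e-28, 0.0), y = 1:2)
> DT[,.N,by=x]   # two groups in v1.9.2+; v1.8.10 put both rows into one group
       x N
1: 1e-28 1
2: 0e+00 1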

In short, the answer to the question is : upgrade from v1.8.10 to v1.9.2 or greater.

Rounding milliseconds of POSIXct in data.table v1.9.2 (ok in 1.8.10)

Yes, I reproduced your result with v1.9.2.

library(data.table)

DT <- data.table(timestamp=c(as.POSIXct("2013-01-01 17:51:00.707"),
                             as.POSIXct("2013-01-01 17:51:59.996"),
                             as.POSIXct("2013-01-01 17:52:00.059"),
                             as.POSIXct("2013-01-01 17:54:23.901"),
                             as.POSIXct("2013-01-01 17:54:23.914")))

options(digits.secs=3) # usually placed in .Rprofile

DT
                 timestamp
1: 2013-01-01 17:51:00.707
2: 2013-01-01 17:51:59.996
3: 2013-01-01 17:52:00.059
4: 2013-01-01 17:54:23.901
5: 2013-01-01 17:54:23.914

duplicated(DT)
## [1] FALSE FALSE FALSE FALSE TRUE

Update from Matt (v1.9.3)

There was a change to rounding in v1.9.2 which affected milliseconds of POSIXct. More info here :

Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2

Large integers in data.table. Grouping results different in 1.9.2 compared to 1.8.10

So, the workaround now available in v1.9.3 is :

> setNumericRounding(1)   # default is 2
> duplicated(DT)
[1] FALSE FALSE FALSE FALSE FALSE
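For intuition on why one byte is enough here (my illustration, not part of the original answer): as plain numbers of seconds since the epoch, rows 4 and 5 agree to 11 significant figures and differ only after that, so the default 2-byte rounding (roughly 11 s.f.) merges them while 1-byte rounding (roughly 13-14 s.f.) keeps them apart:

> options(digits = 15)
> as.numeric(DT$timestamp[4:5])   # e.g. 1357062863.901 and 1357062863.914 in UTC; exact values depend on your timezone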

Hope you understand why the change was made and agree that we're going in the right direction.

Of course, you shouldn't have to call setNumericRounding(), that's just a workaround.

I've filed a new item on the tracker :

#5445 numeric rounding should be 0 or 1 automatically for POSIXct

Dealing with large integers in R

You are passing a floating point number to as.integer64. The loss of precision is already in your input to as.integer64:

is.double(18495608239531729)
#[1] TRUE

sprintf("%20.5f", 18495608239531729)
#[1] "18495608239531728.00000"

Pass a character string to avoid that:

library(bit64)
as.integer64("18495608239531729")
#integer64
#[1] 18495608239531729
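Some background (mine, not part of the original answer): every integer up to 2^53 is exactly representable as a double, but above that gaps appear, and 18495608239531729 lies well beyond that limit, so R already picks the nearest representable double when parsing the literal:

sprintf("%.0f", 2^53)    # up to here every integer has an exact double representation
#[1] "9007199254740992"
(2^53 + 1) == 2^53       # TRUE: the +1 cannot be represented
#[1] TRUE
18495608239531729 > 2^53
#[1] TRUE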

R data.table select rows (integer comparison)

Use bit64::integer64:

require(data.table)
options(digits=15)
library(bit64)
data <- fread("A
1000200030001
1000200030002
1000200030003", colClasses = "integer64")

data[A == as.integer64("1000200030001")]
#               A
#1: 1000200030001
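A quick check (my addition) that the colClasses hint took effect:

class(data$A)
#[1] "integer64"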

Alternatively, deactivate auto-indexing (and lose the performance advantage from it):

options(datatable.auto.index=FALSE)
data <- data.table(A=c(1000200030001,1000200030002,1000200030003))
data[(A==1000200030001)]
#               A
#1: 1000200030001

