Large integers in data.table. Grouping results different in 1.9.2 compared to 1.8.10
Yes, the result in v1.8.10 was the correct behaviour. We improved the method of rounding in v1.9.2. That's best explained here:
Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2
That meant we went backwards on supporting integers > 2^31 stored in type numeric. That's now addressed in v1.9.3 (available from R-Forge); see NEWS:
o bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs, and to Clayton Stanley. Reminder: fread() has been able to detect and read integer64 for a while.

o New function setNumericRounding() may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type numeric, #5369. See example in ?setNumericRounding and NEWS item from v1.9.2. getNumericRounding() returns the current setting.
So you can either call setNumericRounding(0) to switch off rounding globally for all numeric columns or, better, use the more appropriate type for the column, bit64::integer64, now that it's supported.
The change in v1.9.2 was:
o Numeric data is still joined and grouped within tolerance as before
but instead of tolerance being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default) the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle. A few functions provided a 'tolerance' argument but this wasn't being passed through so has been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release [DONE].
The example in ?setNumericRounding is:
> DT = data.table(a=seq(0,1,by=0.2),b=1:2, key="a")
> DT
a b
1: 0.0 1
2: 0.2 2
3: 0.4 1
4: 0.6 2
5: 0.8 1
6: 1.0 2
> setNumericRounding(0) # turn off rounding; i.e. if we didn't round
> DT[.(0.4)] # works
a b
1: 0.4 1
> DT[.(0.6)] # no match! Confusing to users
a  b # 0.6 is clearly there in DT, and 0.4 worked ok!
1: 0.6 NA
>
> setNumericRounding(2) # restore default
> DT[.(0.6)] # now works as user expects
a b
1: 0.6 2
>
> # using type 'numeric' for integers > 2^31 (typically ids)
> DT = data.table(id = c(1234567890123, 1234567890124, 1234567890125), val=1:3)
> DT[,.N,by=id] # 1 row (the last digit has been rounded)
id N
1: 1.234568e+12 3
> setNumericRounding(0) # turn off rounding
> DT[,.N,by=id] # 3 rows (the last digit wasn't rounded)
id N
1: 1.234568e+12 1
2: 1.234568e+12 1
3: 1.234568e+12 1
> # but, better to use bit64::integer64 for such ids instead of numeric
> setNumericRounding(2) # restore default, preferred
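Both effects above come from how doubles behave once the last 2 bytes of the significand are rounded away. Here is a rough sketch of the idea in Python (the point is language-agnostic); `round_significand` is an illustrative stand-in, not data.table's actual C code:

```python
import struct

def round_significand(x, nbytes=2):
    """Round away the last `nbytes` bytes of an IEEE-754 double's
    significand (illustrative sketch of setNumericRounding's idea)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    shift = 8 * nbytes
    bits = (bits + (1 << (shift - 1))) & ~((1 << shift) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# Why 0.6 had no match with rounding off: seq() builds it as 3 * 0.2,
# which is one ulp away from the literal 0.6; rounding reunites them.
print(0.2 * 3 == 0.6)                                        # False
print(round_significand(0.2 * 3) == round_significand(0.6))  # True

# The flip side: large ids stored as numeric collapse under rounding,
# so all three fall into one group.
ids = [1234567890123.0, 1234567890124.0, 1234567890125.0]
print(len({round_significand(v) for v in ids}))              # 1
```

This is why setNumericRounding(0) makes the three ids distinct again, at the cost of exact-match joins like DT[.(0.6)] failing on values built by arithmetic.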
Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2
It is worth reading R FAQ 7.31 and thinking about the accuracy of floating point representations.
I can't reproduce this in the current CRAN version (1.9.2), using
R version 3.0.3 (2014-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
My guess is that the change in behaviour is related to this NEWS item.
o Numeric data is still joined and grouped within tolerance as before but instead of tolerance
being sqrt(.Machine$double.eps) == 1.490116e-08 (the same as base::all.equal's default)
the significand is now rounded to the last 2 bytes, apx 11 s.f. This is more appropriate
for large (1.23e20) and small (1.23e-20) numerics and is faster via a simple bit twiddle.
A few functions provided a 'tolerance' argument but this wasn't being passed through so has
been removed. We aim to add a global option (e.g. 2, 1 or 0 byte rounding) in a future release.
Update from Matt
Yes, this was a deliberate change in v1.9.2, and data.table now distinguishes 0.0000000000000000000000000001 from 0 (as user3340145 rightly thought it should) due to the improved rounding method highlighted above from NEWS. I've also added the for loop test from Rick's answer to the test suite.
Btw, #5369 is now implemented in v1.9.3 (although neither of these is needed for this question):

o bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs.

o New function setNumericRounding() may be used to reduce to 1 byte or 0 byte rounding when joining to or grouping columns of type 'numeric', #5369. See example in ?setNumericRounding and NEWS item from v1.9.2. getNumericRounding() returns the current setting.
Notice that rounding is now (as from v1.9.2) about the accuracy of the significand, i.e. the number of significant figures. 0.0000000000000000000000000001 == 1.0e-28 is accurate to just 1 s.f., so the new rounding method doesn't group it together with 0.0.
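That distinction can be seen without any data.table machinery. A small illustration in Python (the point is language-agnostic), using decimal rounding to roughly 11 significant figures as a stand-in for the binary 2-byte significand rounding:

```python
# Rounding to ~11 significant figures trims precision in the significand
# but never touches the exponent, so 1e-28 stays distinct from 0.0.
x = 1e-28
rounded = float(f"{x:.10e}")  # keep 11 significant decimal figures
print(rounded == 0.0)  # False
print(rounded == x)    # True: 1e-28 needs only 1 s.f. anyway
```

No amount of significand rounding turns a nonzero number into zero, which is exactly why v1.9.2 groups 1e-28 apart from 0.0 where v1.8.10's absolute tolerance of 1.490116e-08 did not.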
In short, the answer to the question is : upgrade from v1.8.10 to v1.9.2 or greater.
Rounding milliseconds of POSIXct in data.table v1.9.2 (ok in 1.8.10)
Yes I reproduced your result with v1.9.2.
library(data.table)
DT <- data.table(timestamp=c(as.POSIXct("2013-01-01 17:51:00.707"),
as.POSIXct("2013-01-01 17:51:59.996"),
as.POSIXct("2013-01-01 17:52:00.059"),
as.POSIXct("2013-01-01 17:54:23.901"),
as.POSIXct("2013-01-01 17:54:23.914")))
options(digits.secs=3) # usually placed in .Rprofile
DT
timestamp
1: 2013-01-01 17:51:00.707
2: 2013-01-01 17:51:59.996
3: 2013-01-01 17:52:00.059
4: 2013-01-01 17:54:23.901
5: 2013-01-01 17:54:23.914
duplicated(DT)
## [1] FALSE FALSE FALSE FALSE TRUE
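The collapse of the last two timestamps is a scale effect: POSIXct stores seconds since the epoch as a double, and at ~1.36e9 seconds, roughly 11 significant figures leave only one digit after the decimal point. A hedged illustration in Python, using decimal 11-s.f. rounding as a proxy for the binary 2-byte significand rounding (the epoch values below are illustrative, not taken from the question's timezone):

```python
# Two timestamps 13 ms apart, as seconds since the epoch.
t1 = 1357059263.901   # e.g. 2013-01-01 17:54:23.901
t2 = 1357059263.914   # e.g. 2013-01-01 17:54:23.914

# At ~11 significant figures, only tenths of a second survive,
# so the two values compare equal after rounding.
print(float(f"{t1:.10e}") == float(f"{t2:.10e}"))  # True
```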
Update from Matt for v1.9.3
There was a change to rounding in v1.9.2 which affected the milliseconds of POSIXct. More info here:
Grouping very small numbers (e.g. 1e-28) and 0.0 in data.table v1.8.10 vs v1.9.2
Large integers in data.table. Grouping results different in 1.9.2 compared to 1.8.10
So, the workaround now available in v1.9.3 is:
> setNumericRounding(1) # default is 2
> duplicated(DT)
[1] FALSE FALSE FALSE FALSE FALSE
Hope you understand why the change was made and agree that we're going in the right direction.
Of course, you shouldn't have to call setNumericRounding(); that's just a workaround.
I've filed a new item on the tracker :
#5445 numeric rounding should be 0 or 1 automatically for POSIXct
Dealing with large integers in R
You are passing a floating point number to as.integer64. The loss of precision is already in your input to as.integer64:
is.double(18495608239531729)
#[1] TRUE
sprintf("%20.5f", 18495608239531729)
#[1] "18495608239531728.00000"
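The same loss is visible in any language that parses the literal as a double first; a quick check in Python (illustrative, language-agnostic):

```python
# 18495608239531729 lies above 2**53, beyond which not every integer
# is representable as an IEEE-754 double, so the parser rounds the
# literal (here to the nearest multiple of 4) before any conversion
# function ever sees it.
n = 18495608239531729
print(n > 2**53)       # True
print(int(float(n)))   # 18495608239531728
print(float(n) == n)   # False: one off from the intended value
```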
Pass a character string to avoid that:
library(bit64)
as.integer64("18495608239531729")
#integer64
#[1] 18495608239531729
R data.table select rows (integer comparison)
Use bit64::integer64:
require(data.table)
options(digits=15)
library(bit64)
data <- fread("A
1000200030001
1000200030002
1000200030003", colClasses = "integer64")
data[A == as.integer64("1000200030001")]
#A
#1: 1000200030001
Alternatively, deactivate auto-indexing (and lose the performance advantage from it):
options(datatable.auto.index=FALSE)
data <- data.table(A=c(1000200030001,1000200030002,1000200030003))
data[(A==1000200030001)]
# A
#1: 1000200030001