Data.Table Is Not Handling Integer64 in by Statement

data.table is not handling integer64 in by statement

Update: This is now implemented in v1.9.3 (available from R-Forge); see NEWS:

o bit64::integer64 now works in grouping and joins, #5369. Thanks to James Sams for highlighting UPCs and Clayton Stanley.

Reminder: fread() has been able to detect and read integer64 for a while.

On OP's example above:

test[, .N, by=ID]
#                    ID N
# 1: 432706205348805058 2
# 2: 432706205348805059 1
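For readers without the OP's data, here is a minimal self-contained sketch of the same grouping. The table contents and the extra column x are made up for illustration, and it assumes data.table v1.9.3 or later with bit64 installed:

```r
library(data.table)
library(bit64)

# Hypothetical stand-in for the OP's table: IDs are built from strings
# so that no precision is lost on the way into integer64.
test <- data.table(
  ID = as.integer64(c("432706205348805058",
                      "432706205348805058",
                      "432706205348805059")),
  x  = 1:3
)

# Count rows per integer64 ID; groups appear in order of first occurrence.
res <- test[, .N, by = ID]
print(res)
```

On a version without integer64 grouping support, the two distinct IDs may not be separated correctly, which is the behaviour the question reported.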

Original answer (before v1.9.3): integer64 isn't yet implemented for data.table operations such as setkey or by. It was implemented only in fread (first released to CRAN on 6 March 2013) as a first step. Even so, it can be useful as a value column, for example.

I may have confused matters by filing a bug report relating to this (the one @Arun linked to). Strictly speaking, it isn't a bug but a feature request. I think of the bug list more like 'important things to resolve before the next release'.

Contributions are very welcome.

datatable.integer64 argument is not working for me; should it?

This is implemented in v1.8.11, which is on R-Forge but not yet on CRAN. From NEWS:

o fread's integer64 argument implemented. Allows reading of integer64 data as 'double' or 'character'
instead of bit64::integer64 (which remains the default as before). Thanks to Chris Neff for the
suggestion. The default can be changed globally; e.g, options(datatable.integer64="character")

Regarding:

If colClasses is the answer, I think it does not allow to specify a single column name or index and the table I load has tens of columns so unpracticable...

colClasses in fread does let you override the type for one or a few columns (by name or by number); the rest are detected automatically, for exactly the reason you state. If it doesn't, please report it as a bug. An alternative to colClasses is the datatable.integer64 global option, which tells fread to load any column it detects as integer64 as character or double instead (also in v1.8.11).
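As a rough illustration of both routes, the sketch below reads a tiny made-up CSV. The column names id and val are hypothetical, and the text= argument assumes a reasonably recent data.table; on older versions a file path would be needed instead:

```r
library(data.table)

# Hypothetical two-column CSV; 'id' is too large for a 32-bit integer.
csv <- "id,val\n432706205348805058,1\n432706205348805059,2\n"

# Route 1: override the type of a single column by name;
# the remaining columns are still detected automatically.
dt1 <- fread(text = csv, colClasses = c(id = "character"))

# Route 2: tell fread to load anything it detects as integer64 as character.
dt2 <- fread(text = csv, integer64 = "character")

# The same default can be set globally, as described in NEWS:
# options(datatable.integer64 = "character")

print(sapply(dt1, class))
print(sapply(dt2, class))
```

Route 1 is the better fit when only one known column needs special treatment; route 2 avoids listing columns when a wide table has many such IDs.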

Dealing with large integers in R

You are passing a floating point number to as.integer64. The loss of precision is already in your input to as.integer64:

is.double(18495608239531729)
#[1] TRUE

sprintf("%20.5f", 18495608239531729)
#[1] "18495608239531728.00000"

Pass a character string to avoid that:

library(bit64)
as.integer64("18495608239531729")
#integer64
#[1] 18495608239531729
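The underlying reason is that a double has a 53-bit significand, so not every integer above 2^53 can be stored exactly. A small base R sketch of this, with illustrative variable names:

```r
# Number of significand bits in a double on this platform (53 for IEEE 754)
sig_bits <- .Machine$double.digits

# Above 2^53, adjacent integers collapse onto the same double
limit_hit <- (2^53 + 1 == 2^53)

# The literal from the question is silently rounded to a representable neighbour
x <- 18495608239531729
same_as_neighbour <- (x == 18495608239531728)

print(sig_bits)
print(limit_hit)
print(same_as_neighbour)
print(sprintf("%.0f", x))
```

Because the rounding happens when the literal is parsed, no later conversion can recover the last digit; the value has to arrive as a string.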

fread() fails with missing values in integer64 columns

This apparently is an issue with the bit64 package, not fread() or data.table. From the bit64 documentation (http://cran.r-project.org/web/packages/bit64/bit64.pdf):

"Subscripting non-existing elements and subscripting with NAs is currently not supported. Such subscripting currently returns 9218868437227407266 instead of NA (the NA value of the underlying double code). Following the full R behaviour here would either destroy performance or require extensive C-coding."

I tried reassigning the 9218868437227407266 value to NA, thinking it would work. For example:

# Returns nothing, probably because the numeric literal is a double and
# cannot represent 9218868437227407266 exactly
DT[V8==9218868437227407266, ]
# Returns the rows with 9218868437227407266 in V8
DT[V8==max(V8), ]
# But this does not reassign the value
DT[V8==max(V8), V8:=NA]
# Not that this makes sense, but I tried it just in case...
DT[V8==max(V8), V8:=NA_character_]

So, as the documentation pretty clearly states, a vector of class integer64 won't recognize NA or missing values in this situation. I'm going to avoid bit64 just so I don't have to deal with this...
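If you would rather keep integer64, one possible workaround is to build the sentinel from a string and overwrite it with bit64's own missing value. This is only a sketch: it assumes an installed bit64 version that exports NA_integer64_, uses a hypothetical vector x in place of DT$V8, and has not been checked against the fread() case above:

```r
library(bit64)

# Hypothetical stand-in for the affected column
x <- as.integer64(c("1", "9218868437227407266", "3"))

# Build the sentinel from a string so no precision is lost
sentinel <- as.integer64("9218868437227407266")

# Replace sentinel entries with the integer64 missing value
x[x == sentinel] <- NA_integer64_

print(is.na(x))
```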
