Reading in Chunks at a Time Using fread in Package data.table

A bug with using fread (data.table in R) when reading large numbers?

fread automatically assumed the first column's class to be integer64. From its help file:

integer64 = "integer64" (default) reads columns detected as containing integers
larger than 2^31 as type bit64::integer64. Alternatively,
"double"|"numeric" reads as base::read.csv does; i.e., possibly with
loss of precision and if so silently. Or, "character".

The values in the first column are 201500000001, 201500000002, etc. Treated as numbers, they are larger than 2^31 (i.e. 2147483648), so fread interpreted them as integer64 values, which made them display strangely.

data.table will automatically load the bit64 package for you in this situation so that the numbers display properly. However, when you don't have bit64 installed, as is likely the case here, it is supposed to warn you and ask you to install it. The missing warning is bug fix 5 in the development version v1.10.5. From NEWS:

When fread() and print() see integer64 columns are present but package bit64 is not installed, the warning is now displayed as intended. Thanks to a question by Santosh on r-help and forwarded by Bill Dunlap.

So, just install.packages("bit64") and you're good. You don't need to reload the data. It just affects how those columns are printed.

Alternatively, if you add the argument integer64 = "numeric" to your fread call, the result will match what you got from read.csv. But if it's an ID column, conceptually it should be a character or factor rather than an integer; use the argument colClasses = c("Num_Acc" = "character") for that.
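
For instance, a minimal sketch of both options (the file name accidents.csv is just a placeholder; Num_Acc is the ID column from the question):

library(data.table)

# read the large IDs as doubles, matching read.csv's behaviour
dt1 <- fread("accidents.csv", integer64 = "numeric")

# or, since Num_Acc is an identifier, read it as character instead
dt2 <- fread("accidents.csv", colClasses = c(Num_Acc = "character"))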

reading strand (+, -) column with fread, data.table package

Thanks for reporting. Now fixed in v1.8.9 commit 849. + and - are now read as character, test added.

Btw, we also intend to add colClasses so that you can override the column type that fread detects. The outstanding to-do list relating to fread is at the top of the source file here:

https://r-forge.r-project.org/scm/viewvc.php/pkg/src/fread.c?view=markup&root=datatable
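
colClasses has since been added; a minimal sketch of overriding the detected type, assuming a file calls.txt with a column named strand (both placeholder names):

library(data.table)

# force the +/- column to be read as character rather than auto-detected
dt <- fread("calls.txt", colClasses = c(strand = "character"))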

data.table fread and ISO8601

Not sure if this is by design or not, but the culprit is keepLeadingZeros=TRUE, an option I set for other reasons.

withr::with_options(
  list(datatable.keepLeadingZeros = FALSE),
  fread(text = c("now", "2020-07-24T10:11:12.134Z"), sep = ",")
)
#                        now
#                     <POSc>
# 1: 2020-07-24 10:11:12.134

withr::with_options(
  list(datatable.keepLeadingZeros = TRUE),
  fread(text = c("now", "2020-07-24T10:11:12.134Z"), sep = ",")
)
#                         now
#                      <char>
# 1: 2020-07-24T10:11:12.134Z

After the fact, I found the duplicate issue https://github.com/Rdatatable/data.table/issues/4869, "keepLeadingZeros interferes with date recognition".
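
If you need keepLeadingZeros=TRUE for other columns, one possible workaround (a sketch, not taken from the linked issue) is to let the column come in as character and parse it afterwards:

library(data.table)

dt <- withr::with_options(
  list(datatable.keepLeadingZeros = TRUE),
  fread(text = c("now", "2020-07-24T10:11:12.134Z"), sep = ",")
)

# convert the ISO 8601 string to POSIXct manually (the trailing "Z" is matched literally)
dt[, now := as.POSIXct(now, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")]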


FYI (to others and to my future self), the way I found this was to start R --vanilla --no-init-file --no-save, install data.table, and start testing:

### in "failing" environment:
opts <- options()
opts <- opts[ !sapply(opts, inherits, c("list", "function")) ]
dput(opts) # paste into the fresh R instance as opts2

### in the "fresh" environment:
# opts2 <- structure(...) # 'opts' from above
opts <- options()
opts <- opts[ !sapply(opts, inherits, c("list", "function")) ]
str(opts2[ setdiff(names(opts2), names(opts)) ])  # options set in the failing session but missing from the fresh one

and then enabling options one by one until the automatic datetime conversion failed.
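
In this case the culprit surfaced quickly; the final toggle-and-retest step looked roughly like this:

library(data.table)

# with the option off (the default), the timestamp column is auto-detected as POSIXct
fread(text = c("now", "2020-07-24T10:11:12.134Z"), sep = ",")

# with the option on, the same input stays character, reproducing the problem
options(datatable.keepLeadingZeros = TRUE)
fread(text = c("now", "2020-07-24T10:11:12.134Z"), sep = ",")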

Fast reading and combining several files using data.table (with fread)

Use rbindlist(), which is designed to rbind a list of data.tables together...

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )

And as @Roland says, do not set the key in each iteration of your function!

So in summary, this is best:

l <- lapply(all.files, fread, sep = ",")
dt <- rbindlist(l)
setkey(dt, ID, date)
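
A slightly fuller sketch, assuming the CSVs all sit in one directory (the path and the idcol name are illustrative):

library(data.table)

all.files <- list.files("path/to/csvs", pattern = "\\.csv$", full.names = TRUE)

l <- lapply(all.files, fread, sep = ",")
names(l) <- basename(all.files)      # so idcol below records the source file name
dt <- rbindlist(l, idcol = "file")   # extra column saying which file each row came from
setkey(dt, ID, date)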

Is there a faster way than fread() to read big data?

You can use the select argument to load only the relevant columns, without saturating your memory. For example:

dt <- fread("./file.csv", select = c("column1", "column2", "column3"))

I have used read.delim() to read a file that fread() could not load completely, so you could also convert your data to .txt and use read.delim().

However, why not open a connection to the SQL server you're pulling your data from? You can open connections to SQL servers with library(odbc) and write your query like you normally would, which also keeps memory usage down.

Check out this short introduction to odbc.
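
A minimal sketch with DBI plus odbc (the driver string, server, credentials, and query are all placeholders; adjust for your own setup):

library(DBI)

con <- dbConnect(
  odbc::odbc(),
  Driver   = "ODBC Driver 17 for SQL Server",  # placeholder driver name
  Server   = "my-server",                      # placeholder host
  Database = "my_db",
  UID      = "my_user",
  PWD      = Sys.getenv("DB_PWD")
)

# pull only the columns and rows you actually need, then work in data.table
dt <- data.table::as.data.table(
  dbGetQuery(con, "SELECT column1, column2, column3 FROM big_table WHERE year = 2020")
)

dbDisconnect(con)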


