Fastest way to read in 100,000 .dat.gz files

I'm somewhat surprised that this actually worked; hopefully it works for your case too. I'm quite curious to know how its speed compares with reading the compressed files from disk directly in R instead (albeit with a penalty for non-vectorization).

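library(data.table)
# Take the column names from the first line of the concatenated stream
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
# Read every file in one stream, dropping the header line repeated in each file
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl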
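
For comparison, here is a minimal Python sketch of the same idea: take the header from the first file, then stream the data lines of every file while skipping the repeated header. The function names are made up for illustration, and it assumes (as the grep above does) that every header line starts with "Day".

```python
import glob
import gzip


def read_header(pattern):
    """Return the first line of the first matching file (the column names)."""
    first = sorted(glob.glob(pattern))[0]
    with gzip.open(first, "rt") as fh:
        return fh.readline().rstrip("\n")


def iter_data_lines(pattern, header_prefix="Day"):
    """Yield data lines from every matching file, skipping header lines.

    Assumption: header lines are exactly those starting with header_prefix.
    """
    for path in sorted(glob.glob(pattern)):
        # "rt" decompresses on the fly and yields text lines
        with gzip.open(path, "rt") as fh:
            for line in fh:
                if not line.startswith(header_prefix):
                    yield line.rstrip("\n")
```

Because iter_data_lines is a generator, nothing is held in memory beyond the current line; something like list(iter_data_lines("*dat.gz")) would materialize all rows at once.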

Subsetting many .dat.gz files using fread and awk

This question received an answer in the comments.

OK, it seems to work with the following:

command <- "cat ./practice/*dat.gz | gunzip | awk -F, '!/^Day/' | awk '$14 != 0 || $15 != 0'"
Is this taking two passes over the data? It seems like it might slow things down over many, many files, but it does seem to work.

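No, this isn't two passes over the data. It's pretty efficient: the stages of a pipeline run concurrently, so each line streams through both awk commands as it is decompressed. I did miss one other minor optimization before, though: you can further simplify to gunzip -c ./path/to/files*.dat.gz | awk ...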
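
To make the single-pass point concrete, here is a rough Python analogue in which each awk stage becomes a generator. The function names are invented for this sketch, and it assumes comma-separated lines with numeric values in fields 14 and 15.

```python
def drop_headers(lines, prefix="Day"):
    """Stage 1: like awk '!/^Day/', skip lines that start with the prefix."""
    for line in lines:
        if not line.startswith(prefix):
            yield line


def nonzero_14_or_15(lines, sep=","):
    """Stage 2: like awk '$14 != 0 || $15 != 0' (awk fields are 1-based)."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Assumption: fields 14 and 15 exist and hold numbers.
        if len(fields) >= 15 and (float(fields[13]) != 0 or float(fields[14]) != 0):
            yield line


def pipeline(lines):
    """Chain both stages; each line passes through both before the next is read."""
    return nonzero_14_or_15(drop_headers(lines))
```

Nothing is read twice here: asking the pipeline for one output line pulls only as many input lines as are needed to produce it, which is the same way data flows through a shell pipe.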

Decompress gz file using R

If you really want to uncompress the file, just use the untar function, which supports gzip.
E.g.:

untar('chadwick-0.5.3.tar.gz')

How to get few lines from a .gz compressed file without uncompressing

zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.

Switch to gzip -cd in place of zcat and your command should work fine:

 gzip -cd CONN.20111109.0057.gz | head

Explanation

   -c --stdout --to-stdout
          Write output on standard output; keep original files unchanged.  If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.

   -d --decompress --uncompress
          Decompress.
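
If you would rather do this from a script, the following is a small Python sketch of the same "decompress to a stream and stop after a few lines" idea. head_gz is a made-up helper name; only as much of the file is decompressed as is needed to produce the requested lines.

```python
import gzip
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip file, like `gzip -cd file | head`."""
    with gzip.open(path, "rt") as fh:
        # islice stops reading as soon as n lines have been produced
        return [line.rstrip("\n") for line in islice(fh, n)]
```

For example, head_gz("CONN.20111109.0057.gz") would return the first ten lines as a list of strings.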

Split equivalent of gzip files in python

I don't believe that split works the way you think it does. It doesn't split the gzip file into smaller gzip files, i.e. you can't call gunzip on the individual files it creates. It literally breaks the data up into smaller chunks, and if you want to gunzip it, you have to concatenate all the chunks back together first. So, to emulate that behavior with Python, we'd do something like:

infile_name = "file.dat.gz"

chunk = 50*1024*1024 # 50MB

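# Read the compressed file as raw bytes and write each chunk to its own part file.
# The parts are not valid gzip files on their own; concatenate them before gunzipping.
with open(infile_name, 'rb') as infile:
    for n, raw_bytes in enumerate(iter(lambda: infile.read(chunk), b'')):
        print(n, len(raw_bytes))  # part number and number of bytes in this part
        # infile_name[:-3] strips the trailing ".gz"
        with open('{}.part-{}'.format(infile_name[:-3], n), 'wb') as outfile:
            outfile.write(raw_bytes)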

In reality we'd read multiple smaller input chunks to make one output chunk to use less memory.

We might be able to break the file into smaller files that we can individually gunzip and still hit our target size. Using something like a BytesIO stream, we could gunzip the file and re-gzip it into that in-memory stream until it reached the target size, then write it out and start a new BytesIO stream.

With compressed data, you have to measure the size of the output, not the size of the input as we can't predict how well the data will compress.
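
A rough sketch of that second approach follows. Instead of a BytesIO buffer it writes each part straight to disk through a zlib compressor in gzip mode (wbits=31) and counts the compressed bytes produced so far; this is a swapped-in variation, not the exact scheme described above. split_gzip and its parameters are hypothetical names, and part sizes are only approximate because the compressor buffers internally.

```python
import gzip
import zlib


def split_gzip(infile_name, target_size, read_size=64 * 1024):
    """Re-compress infile_name into parts that can each be gunzipped alone.

    Returns the list of part file names. A part is closed once at least
    target_size compressed bytes have been written, so it may overshoot.
    Assumption: infile_name ends in ".gz".
    """
    part_names = []
    out = None
    compressor = None
    written = 0
    with gzip.open(infile_name, "rb") as infile:
        # Read decompressed data in small blocks to keep memory use low
        for block in iter(lambda: infile.read(read_size), b""):
            if out is None:
                name = "{}.part-{}.gz".format(infile_name[:-3], len(part_names))
                part_names.append(name)
                out = open(name, "wb")
                compressor = zlib.compressobj(wbits=31)  # 31 = gzip container
                written = 0
            data = compressor.compress(block)
            out.write(data)
            written += len(data)  # measure the output, not the input
            if written >= target_size:
                out.write(compressor.flush())  # finish this gzip member
                out.close()
                out = None
    if out is not None:
        out.write(compressor.flush())
        out.close()
    return part_names
```

Each resulting part is a complete gzip file, so it can be passed to gunzip on its own, and decompressing the parts in order reproduces the original data.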

fread together with grepl

Firstly, I discovered that grepl function does not work properly since fread makes the data as one column indicated also in this question.

But that question's accepted answer says that problem was fixed in v1.9.6. Which version are you using? That's why we ask you to please state the version number up front, to save time answering.

It is a great example file and the question is great.

I would not try to reinvent the wheel: operations like these have long been implemented as command-line tools, which you can use together with fread directly. The advantage is that you won't churn through R's memory; you can leave the filtering to the command-line tool, which can be much more efficient. For example, if you load all the lines as strings into R, those strings will be cached in R's global string cache (at least temporarily). Doing the filtering outside R first saves that cost.

I downloaded your great file and tested the following which works.

> fread("grep -v TRIAL sum_data.txt")
         V1   V2 V3      V4      V5      V6      V7      V8      V9     V10     V11     V12    V13     V14    V15          V16       V17
     1:   2  0.1  0 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709  0.0035  0.0079 -0.9754 0.0081  0.0023 0.9997 -1.35324e-10 0.0278754
     2:   2  0.2  0 -0.0121  0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -1.33521e-10 0.0425567
     3:   2  0.3  0  0.0193 -0.0068 -0.9884  0.0040  0.0139 -0.9782 -0.0158  0.0150 -0.9814 0.0054 -0.0008 0.9997 -1.34103e-10 0.0255356
     4:   2  0.4  0 -0.0157  0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040  0.0010 0.9998 -1.34819e-10 0.0257300
     5:   2  0.5  0 -0.0402  0.0300 -0.9832 -0.0093  0.0269 -0.9781 -0.0326  0.0247 -0.9802 0.0044 -0.0010 0.9997 -1.31515e-10 0.0440350
    ---
124247: 250 49.5  0 -0.0040  0.0141  0.9802 -0.0152  0.0203 -0.9877 -0.0015  0.0123 -0.9901 0.0069  0.0003 0.9997 -1.30220e-10 0.0213215
124248: 250 49.6  0 -0.0006  0.0284  0.9819  0.0021  0.0248 -0.9920  0.0264  0.0408 -0.9919 0.0028 -0.0028 0.9997 -1.30295e-10 0.0284142
124249: 250 49.7  0  0.0378  0.0305  0.9779 -0.0261  0.0232 -0.9897 -0.0236  0.0137 -0.9928 0.0102 -0.0023 0.9997 -1.29890e-10 0.0410760
124250: 250 49.8  0  0.0569 -0.0203  0.9800 -0.0028 -0.0009 -0.9906 -0.0139 -0.0169 -0.9918 0.0039 -0.0017 0.9997 -1.31555e-10 0.0513482
124251: 250 49.9  0  0.0234 -0.0358  0.9840 -0.0340  0.0114 -0.9873 -0.0255  0.0134 -0.9888 0.0006  0.0009 0.9997 -1.30862e-10 0.0334976
>

The -v makes grep return all lines except those containing the string TRIAL. Given the number of high-quality engineers who have looked at the command-line tool grep over the years, it is most likely as fast as you can get, as well as being correct, convenient, well documented online, and easy to learn and to search for solutions to specific tasks. If you need more complicated string filters (e.g. strings at the beginning or the end of lines), grep syntax is very powerful. Learning its syntax is a skill that transfers to other languages and environments.

For further examples on the use of command line tools in fread, you may check the article Convenience features of fread. Please note that "On Windows we recommend Cygwin (run one .exe to install) which includes the command line tools such as grep".
