Quickest Way to Read a Subset of Rows of a CSV

Quickest way to read a subset of rows of a CSV

I think this should work pretty quickly, but let me know, since I have not tried it with big data yet.

write.csv(iris,"iris.csv")

fread("shuf -n 5 iris.csv")

V1 V2 V3 V4 V5 V6
1: 37 5.5 3.5 1.3 0.2 setosa
2: 88 6.3 2.3 4.4 1.3 versicolor
3: 84 6.0 2.7 5.1 1.6 versicolor
4: 125 6.7 3.3 5.7 2.1 virginica
5: 114 5.7 2.5 5.0 2.0 virginica

This takes a random sample of N = 5 rows from the iris dataset.

To avoid any chance of the header row ending up in the sample, this might be a useful modification:

fread("tail -n+2 iris.csv | shuf -n 5", header=FALSE)

Is it possible to efficiently get a subset of rows from a large fixed-width CSV file?

As I proposed in the comment, you can compress each data field to two bits:

-- 00
AA 01
AB 10
BB 11

That cuts your file size by a factor of 12, so it'll be ~20 GB. Considering that your processing is likely IO-bound, you may speed up processing by roughly the same factor.

The resulting file will have a record length of 20,000 bytes, so it is easy to calculate the offset of any given record. No newline characters to consider :)

Here is how I build that binary file:

#include <fstream>
#include <iostream>
#include <string>
#include <chrono>
#include <cstdint>

int main()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    std::ifstream src("data.txt", std::ios::binary);
    std::ofstream bin("data.bin", std::ios::binary);
    size_t length = 80'000 * 3 + 9 + 2; // the `2` is the length of CR/LF on my Windows; use `1` on other systems
    std::string str(length, '\0');
    while (src.read(&str[0], length))
    {
        // skip the leading ID field up to and including the first comma
        size_t pos = str.find(',') + 1;
        for (int group = 0; group < 2500; ++group) {
            uint64_t compressed(0), field(0);
            // pack 32 two-bit fields into one 64-bit word, first field in the top bits
            for (int i = 0; i < 32; ++i, pos += 3) {
                if (str[pos] == '-')
                    field = 0;
                else if (str[pos] == 'B')
                    field = 3;
                else if (str[pos + 1] == 'B')
                    field = 2;
                else
                    field = 1;

                compressed <<= 2;
                compressed |= field;
            }
            bin.write(reinterpret_cast<char*>(&compressed), sizeof compressed);
        }
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << std::endl;

    // clear `bad` bit set by trying to read past EOF
    src.clear();
    // rewind to the first record
    src.seekg(0);
    src.read(&str[0], length);
    // read next (second) record
    src.read(&str[0], length);
    // read forty-second record from start (skip 41)
    src.seekg(41 * length, std::ios_base::beg);
    src.read(&str[0], length);
    // read next (forty-third) record
    src.read(&str[0], length);
    // read fiftieth record (skip 6 from the current position)
    src.seekg(6 * length, std::ios_base::cur);
    src.read(&str[0], length);

    return 0;
}

This can encode about 1,600 records per second, so the whole file will take ~15 minutes. How long does it take you to process it now?

UPDATE:

Added example of how to read individual records from src.

I only managed to make seekg() work in binary mode.
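
If you then want to pull individual records back into R, something along these lines should work. This is only a sketch, and it assumes the layout described above: 20,000 bytes per record, each 8-byte word holding 32 two-bit fields with the first field in the most significant bits, and the file written on a little-endian machine.

codes <- c("--", "AA", "AB", "BB")                 # 00, 01, 10, 11 as in the table above

read_record <- function(path, k, record_bytes = 20000L) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = (k - 1) * record_bytes)        # jump straight to record k (1-based)
  bytes <- as.integer(readBin(con, what = "raw", n = record_bytes))
  out <- integer(length(bytes) * 4L)
  i <- 1L
  for (w in seq(1L, length(bytes), by = 8L)) {     # one 8-byte word at a time
    for (j in (w + 7L):w) {                        # most significant byte first (little-endian file)
      for (shift in c(6L, 4L, 2L, 0L)) {           # top bit pair first
        out[i] <- bitwAnd(bitwShiftR(bytes[j], shift), 3L)
        i <- i + 1L
      }
    }
  }
  codes[out + 1L]                                  # back to the two-character codes
}

first <- read_record("data.bin", 1)                # 80,000 values for the first record

The loops are written for clarity rather than speed; for production use you would vectorise the bit extraction.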

How to get some subset of data from a CSV file for big data (comparing CSVs)?

Read in your data with an appropriate import function like read.csv and then join the two tables.

library(dplyr)

## read your files (possibly you need to adjust some arguments in read.csv)
file1 <- read.csv("path/to/file1.csv", header = TRUE)
file2 <- read.csv("path/to/file2.csv", header = TRUE)

file2 %>%
  left_join(file1, by = "SYMBOL")

Strategies for reading in CSV files in pieces?

You could read it into a database using RSQLite, say, and then use an SQL statement to get the portion you need.

If you need only a single portion, then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so R's limitations (primarily RAM in this scenario) won't apply. Second, after loading the data into the database, sqldf reads the output of the specified SQL statement into R and finally destroys the database. Depending on how fast it works with your data, you might be able to just repeat the whole process for each portion if you have several.

Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.

DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)
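
For example, something like this (just a sketch; in read.csv.sql the SQL statement refers to the file's contents as a table named file, and the row numbers here are only illustrative) would pull rows 2,001 through 3,000:

library(sqldf)
DF <- read.csv.sql("myfile.csv", sql = "select * from file limit 1000 offset 2000")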

See ?read.csv.sql and ?sqldf and also the sqldf home page.

How can I partially read a huge CSV file?

Use chunksize:

import pandas as pd

for df in pd.read_csv('matrix.txt', sep=',', header=None, chunksize=1):
    pass  # do something with each one-row chunk

To answer your second part, do this:

df = pd.read_csv('matrix.txt', sep=',', header=None, skiprows=1000, nrows=1000)

This skips the first 1000 rows and then reads only the next 1000, giving you rows 1000-2000. It is unclear whether you need the end points included, but you can fiddle with the numbers to get exactly what you want.


