Reading a 40 GB CSV File into R Using bigmemory

I don't know about bigmemory, but to address your problem you don't need to read the whole file into R. Simply pipe the file through some bash/awk/sed/python/whatever preprocessing that does the steps you want, i.e. throws out the NULL lines and randomly selects N lines, and then read the result in.

Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).

# Keep m = 100 random non-NULL lines out of a file with N = 1000000 lines
df <- read.csv(pipe("awk -F, 'BEGIN { srand(); m = 100; N = 1000000 }
  !/NULL/ {
    if (rand() < m / (N - NR + 1)) {
      print; m--
      if (m == 0) exit
    }
  }' filename"))

It wasn't obvious to me what you meant by NULL, so I took it literally (any line containing the string NULL is dropped), but it should be easy to modify the pattern to fit your needs.
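
If an exact sample size from a known total line count isn't important, a shorter variant lets the shell do both steps. This is only a sketch: it assumes GNU coreutils' shuf is available, that NULL is a literal string to filter out, and the same filename as above.

# Read the header separately, then sample 100 random non-NULL data lines
header <- names(read.csv("filename", nrows = 1))
df <- read.csv(pipe("tail -n +2 filename | grep -v NULL | shuf -n 100"),
               header = FALSE, col.names = header)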

Trimming a huge (3.5 GB) csv file to read into R

My try with readLines. This piece of code creates a CSV containing only the selected years.

file_in  <- file("in.csv", "r")
file_out <- file("out.csv", "a")
x <- readLines(file_in, n = 1)
writeLines(x, file_out)        # copy the header line

B <- 300000                    # chunk size: lines read per pass, adjust to your memory
while (length(x)) {
  # keep lines whose third semicolon-separated field starts with 2009 or 2010
  ind <- grep("^[^;]*;[^;]*; 20(09|10)", x)
  if (length(ind)) writeLines(x[ind], file_out)
  x <- readLines(file_in, n = B)
}
close(file_in)
close(file_out)
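
Once out.csv has been written, the filtered subset fits in memory and can be read back normally; a minimal sketch, assuming the file is semicolon-separated as the grep pattern suggests:

result <- read.csv("out.csv", sep = ";")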

How to read a huge csv file into R by row condition?

You can use the RSQLite package:

library(RSQLite)
# Create/Connect to a database
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")

# read csv file into sql database
# Warning: this is going to take some time and disk space,
# as your complete CSV file is transferred into an SQLite database.
dbWriteTable(con, name = "sample_table", value = "Your_Big_CSV_File.csv",
             row.names = FALSE, header = TRUE, sep = ",")

# Query your data as you like
yourData <- dbGetQuery(con, "SELECT * FROM sample_table LIMIT 10")

dbDisconnect(con)

Next time you want to access your data you can leave out the dbWriteTable call, as the SQLite table is stored on disk.

Note: writing the CSV data to the SQLite file does not load all of the data into memory first, so the memory you end up using is limited to the amount of data your query returns.
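
A minimal sketch of such a later session, reusing the database file and table created above (the column name price in the WHERE clause is only a placeholder):

library(RSQLite)

# Reconnect to the on-disk database created earlier and query it directly
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")
subset_df <- dbGetQuery(con, "SELECT * FROM sample_table WHERE price > 100")  # 'price' is hypothetical
dbDisconnect(con)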

Strategies for reading in CSV files in pieces?

You could read it into a database using RSQLite, say, and then use an sql statement to get a portion.

If you need only a single portion, then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so R's limitations (primarily RAM in this scenario) do not apply. Second, after loading the data into the database, sqldf reads the output of a specified sql statement into R and finally destroys the database. Depending on how fast it works with your data, you may be able to simply repeat the whole process for each portion if you have several.

Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.

DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)

See ?read.csv.sql and ?sqldf and also the sqldf home page.
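
For a concrete sketch, assume myfile.csv has a column named Year (hypothetical here) and that you only want 2009 and 2010; inside the sql statement the data is referred to by the default table name file:

library(sqldf)

# Only the rows matching the WHERE clause are returned to R
DF <- read.csv.sql("myfile.csv",
                   sql = "select * from file where Year in (2009, 2010)")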


