Trimming a huge (3.5 GB) csv file to read into R
My try with readLines. This piece of code creates a csv with the selected years.
file_in <- file("in.csv","r")
file_out <- file("out.csv","a")
x <- readLines(file_in, n=1)
writeLines(x, file_out) # copy headers
B <- 300000 # chunk size: number of lines to read per iteration
while(length(x)) {
  ind <- grep("^[^;]*;[^;]*; 20(09|10)", x)  # keep rows whose third ;-separated field is a 2009/2010 value
  if (length(ind)) writeLines(x[ind], file_out)
  x <- readLines(file_in, n=B)
}
close(file_in)
close(file_out)
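Once out.csv has been written, it should be small enough to read in the usual way. A minimal sketch (assuming the file really is semicolon-separated, as the grep pattern above suggests):
df <- read.table("out.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
str(df)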
how to read huge csv file into R by row condition?
You can use the RSQLite
package:
library(RSQLite)
# Create/Connect to a database
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")
# read csv file into sql database
# Warning: this is going to take some time and disk space,
# as your complete CSV file is transferred into an SQLite database.
dbWriteTable(con, name="sample_table", value="Your_Big_CSV_File.csv",
row.names=FALSE, header=TRUE, sep = ",")
# Query your data as you like
yourData <- dbGetQuery(con, "SELECT * FROM sample_table LIMIT 10")
dbDisconnect(con)
Next time you want to access your data you can leave out the dbWriteTable
, as the SQLite table is stored on disk.
Note: the writing of the CSV data to the SQLite file does not load all data in memory first. So the memory you will use in the end will be limited to the amount of data that your query returns.
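To match the "by row condition" part of the question, you can also push the filter into the SQL itself, so that only matching rows ever reach R. A minimal sketch (the column name year and the values 2009/2010 are hypothetical placeholders):
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db.sqlite")
matching_rows <- dbGetQuery(con, "SELECT * FROM sample_table WHERE year IN (2009, 2010)")
dbDisconnect(con)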
Quickly reading very large tables as dataframes
An update, several years later
This answer is old, and R has moved on. Tweaking read.table
to run a bit faster has precious little benefit. Your options are:
- Using vroom from the tidyverse for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
- Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer. (A minimal sketch of vroom and fread appears after this list.)
- Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
- read.csv.raw from iotools provides a third option for quickly reading CSV files.
- Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also the RODBC package, and the reverse-depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance; import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.
- Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
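To make the first two options concrete, here is a minimal sketch (big_file.csv is a placeholder file name) showing fread and vroom side by side:
library(data.table)
dt <- fread("big_file.csv")   # returns a data.table; the separator is auto-detected

library(vroom)
tb <- vroom("big_file.csv")   # returns a tibble; columns are parsed lazily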
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
- Set nrows = the number of records in your data (nmax in scan).
- Make sure that comment.char = "" to turn off interpretation of comments.
- Explicitly define the classes of each column using colClasses in read.table. (A minimal sketch using these settings appears after this list.)
- Setting multi.line = FALSE may also improve performance in scan.
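As a concrete illustration of those settings, here is a minimal sketch (the file name, row count, separator, and column classes are placeholders to replace with values that match your data):
df <- read.table("in.csv",
                 header = TRUE,
                 sep = ",",
                 nrows = 1000000,          # known (or slightly over-estimated) record count
                 comment.char = "",        # turn off comment scanning
                 colClasses = c("integer", "numeric", "character"),  # one class per column
                 stringsAsFactors = FALSE)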
If none of these things works, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS, so that next time you can retrieve it faster with readRDS.
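A minimal sketch of that read-once-then-cache pattern (file names are placeholders):
# slow: parse the text file once
df <- read.table("in.csv", header = TRUE, sep = ",")
# cache it as a binary blob on disk
saveRDS(df, "in.rds")
# later sessions: restore it in a fraction of the time
df <- readRDS("in.rds")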
Row-wise Manipulation of Large Files
I can't say I've done this myself before, but I think this should work.
library( data.table )
# set the input and output files
input.file <- "foo.csv"
output.file <- sub( "\\.csv$", "_output\\.csv", input.file )
# get column names by importing the first few lines
column.names <- names( fread( input.file, header = TRUE, nrows = 3L ) )
# write those column names as a line of text (header)
cat( paste( c( column.names, "MM" ), collapse = "," ),
file = output.file, append = FALSE )
cat( "\n", file = output.file, append = TRUE )
# decide how many rows to read at a time
rows.at.a.time <- 1E4L
# begin looping
start.row <- 1L
while( TRUE ) {
# read in only the specified lines
input <- fread( input.file,
header = FALSE,
skip = start.row,
nrows = rows.at.a.time
)
# stop looping if no data was read
if( nrow( input ) == 0L ) break
# create the "MM" column
input[ , MM := rowSums( .SD[ , 5:7 ] ) ]
# append the data to the output file
fwrite( input,
file = output.file,
append = TRUE, col.names = FALSE )
# bump the `start.row` parameter
start.row <- start.row + rows.at.a.time
# stop reading if the end of the file was reached
if( nrow( input ) < rows.at.a.time ) break
}
UPDATE: to preserve character strings, you can import all columns as character by specifying in the fread
call within the loop:
colClasses = rep( "character", 280 )
Then, to perform the row sums (since you now have all character columns), you need to include a conversion there. The following would replace the single line (the one with this same comment above it) in the code:
# create the "MM" column
input[ , MM := .SD[ , 5:7 ] %>%
lapply( as.numeric ) %>%
do.call( what = cbind ) %>%
rowSums()
]
Where 5:7 is specified here, you could replace it with any vector of column references to be passed to rowSums(). Note that if you use the %>% pipes above, you'll need library(magrittr) at the top of your code to make the pipe operator available.
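If you'd rather not add the magrittr dependency, a pipe-free version of that line could look like this (a sketch under the same assumption that columns 5:7 hold the values to sum):
# create the "MM" column without %>% pipes
input[ , MM := rowSums( do.call( cbind, lapply( .SD[ , 5:7 ], as.numeric ) ) ) ]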
Reading 40 GB csv file into R using bigmemory
I don't know about bigmemory
, but to satisfy your challenges you don't need to read the file in. Simply pipe some bash/awk/sed/python/whatever processing to do the steps you want, i.e. throw out NULL
lines and randomly select N
lines, and then read that in.
Here's an example using awk (assuming you want 100 random lines from a file that has 1M lines).
# note: "length" is a built-in awk function and can't be used as a variable name,
# so the total line count is stored in "n" instead
read.csv(pipe('awk -F, \'BEGIN{srand(); m = 100; n = 1000000;}
              !/NULL/{if (rand() < m/(n - NR + 1)) {
                print; m--;
                if (m == 0) exit;
              }}\' filename'
)) -> df
It wasn't obvious to me what you meant by NULL, so I took it literally (the awk pattern skips any line containing the text "NULL"), but it should be easy to modify it to fit your needs.
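If you don't know the total number of lines up front (the 1000000 hard-coded above), one approach is to count them first and splice the result into the awk command. A sketch (assumes a Unix-like shell with wc available; filename is a placeholder):
# count the lines once, then build the sampling command around that number
total.lines <- as.integer(system("wc -l < filename", intern = TRUE))
cmd <- sprintf(
  "awk -F, 'BEGIN{srand(); m = 100; n = %d;} !/NULL/{if (rand() < m/(n - NR + 1)) {print; m--; if (m == 0) exit;}}' filename",
  total.lines)
df <- read.csv(pipe(cmd))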