Importing and extracting a random sample from a large .CSV in R
I don't think there is a good R tool for reading a file at random (perhaps this could be added as an extension to read.table or to fread from the data.table package).
Using perl you can easily do this task. For example, to read a random 1% of your file, you can do this:
xx= system(paste("perl -ne 'print if (rand() < .01)'",big_file),intern=TRUE)
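Here I am calling it from R using system. xx now contains roughly 1% of the lines of your file, as a character vector with one element per line. Note that the header line is sampled like any other line, so it will usually be missing from the result.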
You can wrap all this in a function:
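read_partial_rand <- function(big_file, percent) {
  # percent is a fraction between 0 and 1 (0.01 keeps about 1% of the lines)
  cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
  cmd <- paste(cmd, big_file)
  # intern = TRUE returns the printed lines as a character vector
  system(cmd, intern = TRUE)
}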
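If perl is not available, the same idea (keep each line with a fixed probability) can be sketched in base R. This is only a minimal sketch, not the method used above: it assumes the file has a single header row and no quoted fields containing line breaks, and the names sample_csv_lines and chunk_size are hypothetical, not part of any package.

```r
# Hypothetical helper: keep each data line with probability p, always keep the header.
sample_csv_lines <- function(path, p, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)           # assumed: exactly one header row
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)  # read the file in pieces
    if (length(chunk) == 0L) break
    kept[[length(kept) + 1L]] <- chunk[runif(length(chunk)) < p]
  }
  # Parse the header plus the sampled lines into a data frame
  read.csv(text = c(header, unlist(kept)), stringsAsFactors = FALSE)
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, x = rnorm(1000)), tmp, row.names = FALSE)
smp <- sample_csv_lines(tmp, p = 0.1)
```

Reading in chunks keeps memory use bounded by the chunk size plus the sample, at the cost of being slower than the perl one-liner on very large files.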
Random Sampling multiple Dataframes by Rows in a folder using R
First, sample the data:
library(dplyr)  # provides sample_n()
listdf <- lapply(listdf, FUN = function(i) sample_n(i, 300))
Then use rbind inside do.call to bind all the data. Note that this only works as long as all the data frames have the same number of columns and the same column names:
data <- do.call("rbind", listdf)
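Putting the two steps together for a folder of files, a minimal base-R sketch might look like this. It assumes every .csv in the folder has the same columns; sample_folder is a hypothetical name, and the sample size is capped at the number of rows so that short files do not raise an error.

```r
# Hypothetical helper: read every CSV in a folder, sample up to n rows from each, bind them.
sample_folder <- function(dir, n) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  listdf <- lapply(files, read.csv, stringsAsFactors = FALSE)
  sampled <- lapply(listdf, function(df) {
    # Take at most n rows, without replacement
    df[sample(nrow(df), min(n, nrow(df))), , drop = FALSE]
  })
  do.call("rbind", sampled)
}

# Usage with three small temporary files
demo_dir <- tempfile("csv_folder_")
dir.create(demo_dir)
for (i in 1:3) {
  write.csv(data.frame(file_id = i, value = rnorm(10)),
            file.path(demo_dir, paste0("part_", i, ".csv")),
            row.names = FALSE)
}
combined <- sample_folder(demo_dir, n = 4)
```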
How to take multiple Sample() vector outputs and combine them into a data frame
Assume that the data is held in files named for their day, like mydata_2020_05_17.csv
library(tidyverse)
# Read one day's file and return a random sample of its rows.
# `date` is an ISO date string such as "2020-05-17".
readDay <- function(date, dir, sampleN) {
  # File names use underscores, e.g. mydata_2020_05_17.csv
  path <- file.path(dir, paste0("mydata_", gsub("-", "_", date), ".csv"))
  read_csv(path) %>%
    # You may not need this if the records already have the date
    mutate(DATE = as.Date(date)) %>%
    sample_n(sampleN, replace = FALSE)
}
# Let's start on the first Sunday of the month
days <- as.character(seq.Date(from = as.Date("2020-05-03"), length.out = 6, by = 1))
answerWeek <- map_df(days, ~ readDay(.x, "~/nefarious/data", sampleN = 20))
# NOT RUN because I don't have a folder full of dated csv data.
Let us know if I've mis-interpreted what you're looking for.
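For anyone who wants to try the approach without real data, here is a hypothetical dry run in base R. It writes six small dated files to a temporary folder, then reads, tags and samples each one. The file contents, the name read_day_base and the column value are made up for illustration, and base functions stand in for the tidyverse calls above.

```r
# Build six small dated files in a temporary folder (made-up data)
demo_dir <- tempfile("dated_csv_")
dir.create(demo_dir)
days <- as.character(seq(as.Date("2020-05-03"), by = 1, length.out = 6))
for (d in days) {
  write.csv(data.frame(value = rnorm(50)),
            file.path(demo_dir, paste0("mydata_", gsub("-", "_", d), ".csv")),
            row.names = FALSE)
}

# Base-R stand-in for readDay(): read one file, add the date, sample rows
read_day_base <- function(date, dir, sample_n) {
  df <- read.csv(file.path(dir, paste0("mydata_", gsub("-", "_", date), ".csv")))
  df$DATE <- as.Date(date)
  df[sample(nrow(df), sample_n), , drop = FALSE]
}

# One sampled data frame per day, stacked into a single data frame
answer_week <- do.call(rbind, lapply(days, read_day_base, dir = demo_dir, sample_n = 20))
```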
Quickly reading very large tables as dataframes
An update, several years later
This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:
1. Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.
2. Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.
3. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readr readme explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).
4. read.csv.raw from iotools provides another option for quickly reading CSV files.
5. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.)
   - read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page.
   - MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function.
   - dplyr allows you to work directly with data stored in several types of database.
6. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
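As a small illustration of the fread option, the sketch below reads a temporary CSV with data.table::fread when the data.table package is installed and falls back to read.csv otherwise. The file and its columns are made up for the example; on a file this small no speed difference will be visible.

```r
# Made-up example file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, label = letters[1:5]), tmp, row.names = FALSE)

if (requireNamespace("data.table", quietly = TRUE)) {
  # fread detects the separator and column types automatically
  dt <- data.table::fread(tmp)
} else {
  # Fallback so the sketch still runs without data.table
  dt <- read.csv(tmp, stringsAsFactors = FALSE)
}
```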
The original answer
There are a couple of simple things to try, whether you use read.table or scan.
1. Set nrows = the number of records in your data (nmax in scan).
2. Make sure that comment.char = "" to turn off interpretation of comments.
3. Explicitly define the classes of each column using colClasses in read.table.
4. Setting multi.line = FALSE may also improve performance in scan.
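The read.table settings above can be combined in a single call. The sketch below uses a made-up three-column file; the column classes and the row count are assumptions that you would replace with the values for your own data.

```r
# Made-up example file with 100 records
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:100,
                     score = runif(100),
                     group = rep(c("a", "b"), 50)),
          tmp, row.names = FALSE)

df <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                 nrows = 100,          # number of records, so R can allocate up front
                 comment.char = "",    # do not scan for comment characters
                 colClasses = c("integer", "numeric", "character"))  # skip type guessing
```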
If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.
The other alternative is filtering your data before you read it into R.
Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, then save the data frame as a binary blob with saveRDS; next time you can retrieve it faster with readRDS.
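A minimal sketch of that read-once-then-cache pattern is shown below. The name load_cached and the choice to store the cache next to the CSV with an added .rds suffix are assumptions made for illustration; the sketch also does not check whether the CSV has changed since the cache was written.

```r
# Hypothetical helper: parse the CSV on the first call, reuse the binary copy afterwards.
load_cached <- function(csv_path, rds_path = paste0(csv_path, ".rds")) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))   # fast path: binary copy already exists
  }
  df <- read.csv(csv_path, stringsAsFactors = FALSE)
  saveRDS(df, rds_path)         # store the parsed data frame for next time
  df
}

# Usage with a made-up temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, x = rnorm(10)), tmp, row.names = FALSE)
first  <- load_cached(tmp)   # parses the CSV and writes the .rds file
second <- load_cached(tmp)   # reads the .rds file
```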