Reading subsamples of big data in R
I think CSV is a poor choice of data format at these file sizes - why not convert it into a SQLite (or an "actual") database and extract your subsets with SQL queries (using DBI/RSQLite)?
You only need to import once, and there is no need to load the entire thing into memory, because you can import CSV files directly into SQLite.
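A minimal sketch of that workflow with DBI/RSQLite; the file names, table name, and column names below are illustrative (RSQLite's dbWriteTable can take a CSV file path as its value, streaming it into the database):

```r
library(DBI)  # install.packages(c("DBI", "RSQLite")) if needed

# Open (or create) the on-disk database file
con <- dbConnect(RSQLite::SQLite(), "big_data.sqlite")

# One-time import: RSQLite reads the CSV straight into the table,
# so the whole file never has to sit in R's memory
dbWriteTable(con, "big_table", "big_file.csv")

# Afterwards, pull only the subset you need:
subset <- dbGetQuery(con,
  "SELECT column1, column2 FROM big_table WHERE column1 > 100")

dbDisconnect(con)
```

Subsequent sessions just reconnect to `big_data.sqlite` and query; the import cost is paid once.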
If you generally want to work with datasets larger than your memory, you might also want to have a look at the bigmemory package.
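For example, bigmemory's read.big.matrix can create a file-backed matrix, so the data lives on disk rather than in RAM (file names are illustrative; note that big.matrix objects hold a single numeric type, not mixed columns):

```r
library(bigmemory)  # install.packages("bigmemory") if needed

# Build a file-backed matrix from the CSV; only a descriptor stays in RAM
x <- read.big.matrix("big_file.csv", header = TRUE, type = "double",
                     backingfile = "big_file.bin",
                     descriptorfile = "big_file.desc")

# x can then be indexed like an ordinary matrix:
x[1:5, ]
```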
Is there a faster way than fread() to read big data?
You can use select = columns to load only the relevant columns without saturating your memory. For example:
dt <- fread("./file.csv", select = c("column1", "column2", "column3"))
I used read.delim() to read a file that fread() could not load completely. So you could convert your data into .txt and use read.delim().
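A sketch of that fallback; the file name and column types are illustrative. Setting an entry of colClasses to "NULL" drops that column at read time, which also helps keep memory down:

```r
# read.delim() defaults to tab-separated .txt files;
# "NULL" in colClasses skips the second column entirely
df <- read.delim("./file.txt",
                 colClasses = c("numeric", "NULL", "character"))
```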
However, why don't you open a connection to the SQL server you're pulling your data from? You can open connections to SQL servers with library(odbc) and write your query as you normally would, which lets you optimize your memory usage. Check out this short introduction to odbc.
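A sketch of that approach with DBI + odbc; every connection parameter below (driver name, server, database, credentials, table) is a placeholder you would replace with your own:

```r
library(DBI)
library(odbc)  # requires an ODBC driver installed on your system

con <- dbConnect(odbc::odbc(),
                 Driver   = "SQL Server",   # placeholder driver name
                 Server   = "my-server",
                 Database = "my_db",
                 UID      = "user",
                 PWD      = "password")

# Only the query result is transferred into R's memory:
res <- dbGetQuery(con,
  "SELECT column1, column2 FROM big_table WHERE year = 2020")

dbDisconnect(con)
```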
Importing and extracting a random sample from a large .CSV in R
I don't think there is a good R tool to read a file in a random way (maybe it could be an extension to read.table or fread in the data.table package).
Using perl you can easily do this task. For example, to read 1% of your file at random, you can do this:
xx <- system(paste("perl -ne 'print if (rand() < .01)'", big_file), intern = TRUE)
Here I am calling it from R using system. xx now contains only 1% of your file's lines.
You can wrap all this in a function:
read_partial_rand <- function(big_file, percent) {
  cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
  cmd <- paste(cmd, big_file)
  system(cmd, intern = TRUE)
}
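A usage sketch: the function returns a character vector of raw lines, so you can parse it back into a data frame via textConnection (the file name and fraction are illustrative; note the perl filter samples the header line like any other, so header = FALSE is the safe default):

```r
lines <- read_partial_rand("big_file.csv", 0.01)  # ~1% of the lines
df <- read.csv(textConnection(lines), header = FALSE)
```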
Select a subsample of columns when reading a file in the tidyverse
You can use the col_types argument of read_csv(), passing cols_only() with the columns you need and their types (which can also be guessed):
read_csv('loadTest.csv',
         col_types = cols_only('col1' = col_integer(), # col1 is integer
                               'col2' = 'c',           # col2 is character
                               'col8' = col_guess(),   # guess type
                               'col10' = '?'           # guess type
         )
)
Sample random rows in dataframe
First make some data:
> df = data.frame(matrix(rnorm(20), nrow=10))
> df
X1 X2
1 0.7091409 -1.4061361
2 -1.1334614 -0.1973846
3 2.3343391 -0.4385071
4 -0.9040278 -0.6593677
5 0.4180331 -1.2592415
6 0.7572246 -0.5463655
7 -0.8996483 0.4231117
8 -1.0356774 -0.1640883
9 -0.3983045 0.7157506
10 -0.9060305 2.3234110
Then select some rows at random:
> df[sample(nrow(df), 3), ]
X1 X2
9 -0.3983045 0.7157506
2 -1.1334614 -0.1973846
10 -0.9060305 2.3234110
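If you need the same subsample on every run (for a reproducible report, say), fix the RNG seed before sampling; the seed value here is arbitrary:

```r
set.seed(42)                 # any fixed seed makes the draw repeatable
df[sample(nrow(df), 3), ]    # same 3 rows every time with this seed
```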