Reading subsamples of big data in R
I think CSV is a poor choice of data format at these file sizes - why not convert it into a SQLite (or an "actual") database and extract your subsets with SQL queries (using DBI/RSQLite)?
You only need to import once, and there is no need to load the entire thing into memory, because you can import CSV files directly into SQLite.
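A minimal sketch of that workflow with DBI/RSQLite; the file names, table name, and column names below are illustrative (RSQLite's dbWriteTable can take a CSV file path as its value, streaming it into the database):

```r
library(DBI)  # install.packages(c("DBI", "RSQLite")) if needed

# Open (or create) the on-disk database file
con <- dbConnect(RSQLite::SQLite(), "big_data.sqlite")

# One-time import: RSQLite reads the CSV straight into the table,
# so the whole file never has to sit in R's memory
dbWriteTable(con, "big_table", "big_file.csv")

# Afterwards, pull only the subset you need:
subset <- dbGetQuery(con,
  "SELECT column1, column2 FROM big_table WHERE column1 > 100")

dbDisconnect(con)
```

Subsequent sessions just reconnect to `big_data.sqlite` and query; the import cost is paid once.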
If you generally want to work with datasets larger than your memory, you might also want to have a look at the bigmemory package.
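For example, bigmemory's read.big.matrix can create a file-backed matrix, so the data lives on disk rather than in RAM (file names are illustrative; note that big.matrix objects hold a single numeric type, not mixed columns):

```r
library(bigmemory)  # install.packages("bigmemory") if needed

# Build a file-backed matrix from the CSV; only a descriptor stays in RAM
x <- read.big.matrix("big_file.csv", header = TRUE, type = "double",
                     backingfile = "big_file.bin",
                     descriptorfile = "big_file.desc")

# x can then be indexed like an ordinary matrix:
x[1:5, ]
```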
Is there a faster way than fread() to read big data?
You can use select = columns to load only the relevant columns without saturating your memory. For example:
dt <- fread("./file.csv", select = c("column1", "column2", "column3"))
I used read.delim() to read a file that fread() could not load completely. So you could convert your data into .txt and use read.delim().
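A sketch of that fallback; the file name and column types are illustrative. Setting an entry of colClasses to "NULL" drops that column at read time, which also helps keep memory down:

```r
# read.delim() defaults to tab-separated .txt files;
# "NULL" in colClasses skips the second column entirely
df <- read.delim("./file.txt",
                 colClasses = c("numeric", "NULL", "character"))
```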
However, why don't you open a connection to the SQL server you're pulling your data from? You can open connections to SQL servers with library(odbc) and write your query as you normally would, which lets you optimize your memory usage. Check out this short introduction to odbc.
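A sketch of that approach with DBI + odbc; every connection parameter below (driver name, server, database, credentials, table) is a placeholder you would replace with your own:

```r
library(DBI)
library(odbc)  # requires an ODBC driver installed on your system

con <- dbConnect(odbc::odbc(),
                 Driver   = "SQL Server",   # placeholder driver name
                 Server   = "my-server",
                 Database = "my_db",
                 UID      = "user",
                 PWD      = "password")

# Only the query result is transferred into R's memory:
res <- dbGetQuery(con,
  "SELECT column1, column2 FROM big_table WHERE year = 2020")

dbDisconnect(con)
```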
Importing and extracting a random sample from a large .CSV in R
I don't think there is a good R tool to read a file in a random way (maybe it could be an extension to read.table or fread in the data.table package).
Using perl you can easily do this task. For example, to read 1% of your file at random, you can do this:
xx <- system(paste("perl -ne 'print if (rand() < .01)'", big_file), intern = TRUE)
Here I am calling it from R using system. xx now contains only 1% of your file's lines.
You can wrap all this in a function:
read_partial_rand <- function(big_file, percent) {
  cmd <- paste0("perl -ne 'print if (rand() < ", percent, ")'")
  cmd <- paste(cmd, big_file)
  system(cmd, intern = TRUE)
}
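A usage sketch: the function returns a character vector of raw lines, so you can parse it back into a data frame via textConnection (the file name and fraction are illustrative; note the perl filter samples the header line like any other, so header = FALSE is the safe default):

```r
lines <- read_partial_rand("big_file.csv", 0.01)  # ~1% of the lines
df <- read.csv(textConnection(lines), header = FALSE)
```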
Select a subsample of columns when reading a file in the tidyverse
You can use the col_types argument of read_csv(), passing cols_only() with the columns you need and their types (which can also be guessed):
read_csv('loadTest.csv',
         col_types = cols_only('col1' = col_integer(), # col1 is integer
                               'col2' = 'c',           # col2 is character
                               'col8' = col_guess(),   # guess type
                               'col10' = '?'           # guess type
         )
)
Sample random rows in dataframe
First make some data:
> df = data.frame(matrix(rnorm(20), nrow=10))
> df
X1 X2
1 0.7091409 -1.4061361
2 -1.1334614 -0.1973846
3 2.3343391 -0.4385071
4 -0.9040278 -0.6593677
5 0.4180331 -1.2592415
6 0.7572246 -0.5463655
7 -0.8996483 0.4231117
8 -1.0356774 -0.1640883
9 -0.3983045 0.7157506
10 -0.9060305 2.3234110
Then select some rows at random:
> df[sample(nrow(df), 3), ]
X1 X2
9 -0.3983045 0.7157506
2 -1.1334614 -0.1973846
10 -0.9060305 2.3234110
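If you need the same subsample on every run (for a reproducible report, say), fix the RNG seed before sampling; the seed value here is arbitrary:

```r
set.seed(42)                 # any fixed seed makes the draw repeatable
df[sample(nrow(df), 3), ]    # same 3 rows every time with this seed
```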