How do I import a large (6 GB) .csv file into R efficiently and quickly, without the R REPL crashing?
R will crash if you try to load a file that is larger than your available memory, so check that you have at least 6 GB of RAM free (a 6 GB .csv takes roughly 6 GB in memory as well). Python has the same problem (apparently someone asked essentially the same question for Python a few years ago).
For reading large csv files, use either readr::read_csv() or data.table::fread(); both are much faster than base::read.table().
readr::read_csv_chunked() supports reading csv files in chunks, so if you don't need the whole data set at once, that can help. You could also read only the columns of interest, to keep the memory footprint smaller.
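As a minimal sketch of both ideas (chunked reading plus reading only one column), assuming a csv with a numeric column named `value` — the file name and column names below are made up for the example:

```r
library(readr)

# Write a small stand-in file so the example is self-contained.
write_csv(data.frame(id = 1:10, value = (1:10) * 1.5), "big.csv")

# Accumulate a running total instead of holding all rows in memory.
total <- 0
read_csv_chunked(
  "big.csv",
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    total <<- total + sum(chunk$value)
  }),
  chunk_size = 4,                              # rows per chunk
  col_types = cols_only(value = col_double())  # read only this column
)
print(total)  # 82.5
```

With a real multi-gigabyte file you would raise `chunk_size` substantially; the callback sees each chunk in turn, so peak memory is bounded by the chunk size rather than the file size.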
R - How can I import a huge .csv with chunks?
You can read a csv file in chunks with readr::read_csv() using the skip and n_max arguments: skip is the number of lines to skip at the start, and n_max is the maximum number of lines to read after that.
library("readr")
# Example uses `#` as the separator
file <- "
lineno#X#Y#Z
1#a#b#c
2#d#e#f
3#g#h#i
4#j#k#l
5#m#n#o
6#p#q#r
7#s#t#u
8#v#w#
9#x#y#z
"
# Increase the chunk size appropriately
chunk_size <- 3
# Assumption: There is a header on the first line
# but we don't know what it is.
col_names <- TRUE
line_num <- 1
while (TRUE) {
  chunk <- read_delim(
    file, "#",
    skip = line_num,
    n_max = chunk_size,
    # On the first iteration, `col_names` is TRUE, so the first
    # non-skipped line ("lineno#X#Y#Z") is treated as the header.
    # On every subsequent iteration, `col_names` is a character
    # vector of the actual column names.
    col_names = col_names
  )
  # If the chunk has no rows, we have reached the end of the file
  if (!nrow(chunk)) {
    break
  }
  # Do something with the chunk of data
  print(chunk)
  # Update `col_names` so that it is equal to the actual column names
  col_names <- colnames(chunk)
  # Move to the next chunk. Add 1 for the header.
  line_num <- line_num + chunk_size + (line_num == 1)
}
#> # A tibble: 3 x 4
#>   lineno X     Y     Z
#>    <dbl> <chr> <chr> <chr>
#> 1      1 a     b     c
#> 2      2 d     e     f
#> 3      3 g     h     i
#> # A tibble: 3 x 4
#>   lineno X     Y     Z
#>    <dbl> <chr> <chr> <chr>
#> 1      4 j     k     l
#> 2      5 m     n     o
#> 3      6 p     q     r
#> # A tibble: 3 x 4
#>   lineno X     Y     Z
#>    <dbl> <chr> <chr> <chr>
#> 1      7 s     t     u
#> 2      8 v     w     <NA>
#> 3      9 x     y     z
Created on 2019-10-31 by the reprex package (v0.3.0)
How to deal with a 50GB large csv file in r language?
You can use R with SQLite behind the curtains via the sqldf package. Use the read.csv.sql function from sqldf, and then query the data however you want to obtain a smaller data frame.
The example from the docs:
library(sqldf)
iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa' ")
I've used this library on VERY large CSV files with good results.
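The same approach can push aggregation into SQLite so that only the summarised result ever reaches R. A sketch under assumed names (`big.csv`, columns `grp` and `value` are hypothetical):

```r
library(sqldf)

# Write a small stand-in file so the example is self-contained.
write.csv(data.frame(grp = rep(c("a", "b"), each = 5), value = 1:10),
          "big.csv", row.names = FALSE, quote = FALSE)

# SQLite does the grouping; R only receives the two-row summary.
summary_df <- read.csv.sql(
  "big.csv",
  sql = "select grp, sum(value) as total from file group by grp order by grp"
)
print(summary_df)
```

For a genuinely large file this avoids ever materialising the full table as an R data frame, which is the main point of the sqldf approach.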
Losing a column of data when importing a large csv file with read_csv
I was bitten by a similar problem recently, so this is a guess based on that experience.
By default, readr::read_csv guesses each column's type from its first 1000 entries. If those first 1000 entries are all NA, the column's type is guessed accordingly, and later values that don't match the guessed type are set to NA, so the whole column can come back empty. You can control this by setting the guess_max argument. Here is the documentation:
guess_max: Maximum number of records to use for guessing column types.
For example,
library(readr)
dat <- read_csv("file.csv", guess_max=100000)
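A minimal sketch of the type-guessing problem itself (the data below is made up for illustration):

```r
library(readr)

# The first three values of column `x` are empty; the fourth is text.
txt <- "x,y\n,1\n,2\n,3\nhello,4\n"

# With guess_max = 3, readr only inspects the empty values, guesses a
# type for `x` from them, and "hello" then fails to parse, becoming NA.
bad <- read_csv(I(txt), guess_max = 3)

# With a larger guess_max, "hello" is inspected too and `x` is character.
good <- read_csv(I(txt), guess_max = 1000)
```

On a 6 GB file the default of 1000 rows is a tiny sample, so a column that is sparse near the top of the file can easily be guessed wrong; raising guess_max (at some cost in read time) fixes it.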