R Reading a Huge CSV

How do I import a large (6 GB) .csv file into R quickly and efficiently, without the R REPL crashing?

R will crash if you try to load a file that is larger than your available memory, so you should first check that you have at least 6 GB of RAM free (a 6 GB .csv takes roughly 6 GB in memory as well). Python has the same problem (apparently someone asked essentially the same question for Python a few years ago).
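
As a rough sanity check before loading (a sketch; "big.csv" is a placeholder file name), you can compare the file size against the free RAM reported by your operating system:

# File size in GiB; compare it against the free memory your OS reports
# (e.g. `free -h` on Linux, Activity Monitor on macOS, Task Manager on Windows)
file.size("big.csv") / 1024^3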

For reading large csv files, you should either use readr::read_csv() or data.table::fread(), as both are much faster than base::read.table().
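
As a minimal sketch (with "big.csv" as a placeholder file name), the two calls look like this:

# Both are typically much faster than base::read.table()
library(readr)
dat <- read_csv("big.csv")

# or
library(data.table)
dt <- fread("big.csv")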

readr::read_csv_chunked supports reading csv files in chunks, so if you don't need all of the data at once, that might help. You could also read only the columns of interest to keep the memory footprint smaller.
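
Here is a hedged sketch of both ideas (the file name "big.csv", the column names X and Y, and the per-chunk summary are placeholders): read_csv_chunked processes the file piece by piece and keeps only what the callback returns, and cols_only restricts which columns are parsed at all.

library(readr)

# Process 10,000 rows at a time; only the callback's results are retained
row_counts <- read_csv_chunked(
  "big.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    # Filter or summarise each chunk here; this example just counts rows
    data.frame(start_row = pos, rows = nrow(chunk))
  }),
  chunk_size = 10000
)

# Read only the columns of interest (column names and types are hypothetical)
dat <- read_csv("big.csv", col_types = cols_only(X = col_character(),
                                                 Y = col_double()))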

R - How can I import a huge .csv with chunks?

You can read a csv file in chunks with readr::read_csv (or read_delim, as used below) via the skip and n_max arguments: skip is the number of lines to skip at the start, and n_max is the number of rows to read after that.

library("readr")

# Example uses `#` as the separator
file <- "
lineno#X#Y#Z
1#a#b#c
2#d#e#f
3#g#h#i
4#j#k#l
5#m#n#o
6#p#q#r
7#s#t#u
8#v#w#
9#x#y#z
"

# The chunk size is kept tiny for this demo; use a much larger value in practice
chunk_size <- 3

# Assumption: there is a header row, but we don't know what it is.
# `line_num` starts at 1 so that the first pass also skips the blank
# line produced by the leading newline in `file`.
col_names <- TRUE
line_num <- 1

while (TRUE) {
  chunk <- read_delim(
    file, "#",
    skip = line_num,
    n_max = chunk_size,
    # On the first iteration, col_names is TRUE,
    # so the first unskipped line "lineno#X#Y#Z" is treated as the header.
    # On subsequent iterations, col_names is a character vector
    # of the actual column names.
    col_names = col_names
  )

  # If the chunk has no rows, we have reached the end of the file
  if (!nrow(chunk)) {
    break
  }

  # Do something with the chunk of data
  print(chunk)

  # Update `col_names` so that it equals the actual column names
  col_names <- colnames(chunk)

  # Move to the next chunk; the extra 1 on the first pass accounts for the header
  line_num <- line_num + chunk_size + (line_num == 1)
}
#> # A tibble: 3 x 4
#>   lineno X     Y     Z
#>    <dbl> <chr> <chr> <chr>
#> 1      1 a     b     c
#> 2      2 d     e     f
#> 3      3 g     h     i
#> # A tibble: 3 x 4
#>   lineno X     Y     Z
#>    <dbl> <chr> <chr> <chr>
#> 1      4 j     k     l
#> 2      5 m     n     o
#> 3      6 p     q     r
#> # A tibble: 3 x 4
#>   lineno X     Y     Z
#>    <dbl> <chr> <chr> <chr>
#> 1      7 s     t     u
#> 2      8 v     w     <NA>
#> 3      9 x     y     z

Created on 2019-10-31 by the reprex package (v0.3.0)
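
The same chunking idea also works with data.table::fread() via its skip and nrows arguments. This is a sketch I'm adding as an alternative, not part of the original answer; "big.csv" and the chunk size are placeholders.

library(data.table)

chunk_size <- 1e6
col_names  <- names(fread("big.csv", nrows = 0))   # read the header only
n_rows     <- nrow(fread("big.csv", select = 1L))  # count rows via one column

for (start in seq(0, n_rows - 1, by = chunk_size)) {
  chunk <- fread(
    "big.csv",
    skip      = start + 1,                      # + 1 skips the header line
    nrows     = min(chunk_size, n_rows - start),
    header    = FALSE,
    col.names = col_names
  )
  # ... process `chunk` here ...
}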

How to deal with a 50 GB csv file in R?

You can use R with SQLite behind the scenes via the sqldf package. The read.csv.sql function imports the file into a temporary SQLite database, and you can then query it however you want to obtain a smaller data frame.

The example from the docs:

library(sqldf)

iris2 <- read.csv.sql("iris.csv",
                      sql = "select * from file where Species = 'setosa' ")

I've used this library on VERY large CSV files with good results.
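
For a genuinely large file, the same pattern lets you pull out only the rows and columns you need. Here is a hedged sketch (the file name "big.csv" and the column names X, Y and Z are placeholders):

library(sqldf)

# The file is loaded into a temporary SQLite database and only the
# query result is returned to R, so memory use stays small
small <- read.csv.sql(
  "big.csv",
  sql = "select X, Y from file where Z > 100"
)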

Losing a column of data when importing a large csv file with read_csv

I was bitten by a similar problem recently, so this is a guess based on that experience.

By default, readr::read_csv guesses each column's type from the first 1000 rows. If the first 1000 entries of a column are all NA, the type is guessed wrongly and the column's remaining values end up as NA as well. You can control how many rows are used for guessing with the guess_max argument. From the documentation:

guess_max: Maximum number of records to use for guessing column types.

For example,

library(readr)
dat <- read_csv("file.csv", guess_max=100000)
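
An alternative I'd add here (not part of the original answer) is to skip type guessing entirely by declaring the column types up front; the column names and types below are hypothetical:

library(readr)

dat <- read_csv(
  "file.csv",
  col_types = cols(
    id    = col_integer(),
    value = col_double(),
    note  = col_character()
  )
)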

