Write.CSV for Large Data.Table

write.csv for large data.table

UPDATE 2019.01.07:

fwrite has been on CRAN since 2016-11-25.

install.packages("data.table")
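
Once installed from CRAN it can be used directly. A minimal sketch (the explicit thread setting of 4 is only an illustration; data.table picks a sensible default on its own):

library(data.table)
setDTthreads(4)                      # optional: control how many threads data.table uses
data(USArrests)
fwrite(USArrests, "USArrests.csv")   # the thread count can also be set per call via nThread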

UPDATE 08.04.2016:

fwrite has recently been added to the development version of the data.table package. It also runs in parallel (implicitly).

# Install development version of data.table
install.packages("data.table",
                 repos = "https://Rdatatable.github.io/data.table", type = "source")

# Load package
library(data.table)

# Load data
data(USArrests)

# Write CSV
fwrite(USArrests, "USArrests_fwrite.csv")

According to the detailed benchmarks under "Speeding up the performance of write.table", fwrite is roughly 17x faster than write.csv there (YMMV).


UPDATE 15.12.2015:

In the future there might be an fwrite function in the data.table package; see https://github.com/Rdatatable/data.table/issues/580.
That thread links to a GIST with a prototype of such a function that speeds up the process by a factor of 2 (according to its author): https://gist.github.com/oseiskar/15c4a3fd9b6ec5856c89.

ORIGINAL ANSWER:

I had the same problem (trying to write even larger CSV files) and in the end decided against using CSV files.

I would recommend using SQLite instead, as it is much faster than dealing with CSV files:

require("RSQLite")
# Set up database
drv <- dbDriver("SQLite")
con <- dbConnect(drv, dbname = "test.db")
# Load example data
data(USArrests)
# Write the data set "USArrests" to the table "arrests" in database "test.db"
dbWriteTable(con, "arrests", USArrests)

# Test if the data was correctly stored in the database, i.e.
# run an exemplary query on the newly created database
dbGetQuery(con, "SELECT * FROM arrests WHERE Murder > 10")
# row_names Murder Assault UrbanPop Rape
# 1 Alabama 13.2 236 58 21.2
# 2 Florida 15.4 335 80 31.9
# 3 Georgia 17.4 211 60 25.8
# 4 Illinois 10.4 249 83 24.0
# 5 Louisiana 15.4 249 66 22.2
# 6 Maryland 11.3 300 67 27.8
# 7 Michigan 12.1 255 74 35.1
# 8 Mississippi 16.1 259 44 17.1
# 9 Nevada 12.2 252 81 46.0
# 10 New Mexico 11.4 285 70 32.1
# 11 New York 11.1 254 86 26.1
# 12 North Carolina 13.0 337 45 16.1
# 13 South Carolina 14.4 279 48 22.5
# 14 Tennessee 13.2 188 59 26.9
# 15 Texas 12.7 201 80 25.5

# Close the connection to the database
dbDisconnect(con)

For further information, see http://cran.r-project.org/web/packages/RSQLite/RSQLite.pdf

You can also use software like http://sqliteadmin.orbmu2k.de/ to access the database and export it to CSV, etc.
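
If you later need a CSV export after all, it can also be done from within R. A small sketch building on the code above (the output file name is just an example):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "test.db")
# Pull the table back into R and write it out as CSV
arrests <- dbGetQuery(con, "SELECT * FROM arrests")
write.csv(arrests, "arrests_export.csv", row.names = FALSE)
dbDisconnect(con)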

--

How to export a large dataset from R to CSV?

We can use

write.csv(a, "Lucas1.csv", quote=FALSE, row.names=FALSE)
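
For a really large data set, data.table::fwrite is usually much faster than write.csv. A hedged equivalent of the call above (assuming a is the data set from the question):

library(data.table)
fwrite(a, "Lucas1.csv", quote = FALSE)   # row names are omitted by default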

Writing multiple csv files with specified number of rows for large dataframe

This should work

library( data.table )
#create sample data
dt = data.table( 1:2000 )
#split dt into a list, based on (f = ) the integer division (+ 1) of the 'row numbers'
# by the preferred chunk size (99)
# use keep.by = TRUE to keep the integer division (+ 1) result
# for naming the files when saving
l <- split( dt, f = dt[, .I] %/% 99 + 1, keep.by = TRUE )
#simple for loop, writing each list element, using its name in the filename
for (i in seq_along(l)) {
  write.csv( data.frame( l[[i]] ), file = paste0( "./testfile_", names(l[i]), ".csv" ) )
}

This results in a set of files named testfile_1.csv, testfile_2.csv, and so on, in the working directory.
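
If speed becomes an issue for a large data.frame, the same chunking idea can be combined with fwrite. A small sketch using the 99-row chunks from above:

library(data.table)
dt <- data.table(x = 1:2000)
# split into chunks of roughly 99 rows, as above
l <- split(dt, f = dt[, .I] %/% 99 + 1)
for (i in seq_along(l)) {
  fwrite(l[[i]], file = paste0("./testfile_", names(l)[i], ".csv"))
}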

Write a Data.Table as a csv file

Load the right package, look at its help page, search for "csv", follow the Usage section:

library(data.table)
help(package = "data.table")
fwrite(df2, file = "~/test.csv")  # "~" is the home directory; adjust the path for your OS/setup

Another approach might be:

 as.data.frame( lapply(df2, unlist) )
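
Putting the two together, a possible sketch (here df2 stands for whatever data.table you want to write, and the unlist step assumes its columns can be flattened to atomic vectors):

flat <- as.data.frame( lapply(df2, unlist) )   # flatten list columns into atomic vectors
fwrite(flat, "~/test.csv")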

How to export a large table in csv format with 8 millions rows in mysql?

Have you tried SELECT ... INTO OUTFILE?

The solution:

SELECT
    orderNumber, status, orderDate, requiredDate, comments
FROM
    orders
WHERE
    status = 'Cancelled'
INTO OUTFILE 'C:/tmp/cancelled_orders.csv'
FIELDS ENCLOSED BY '"'
    TERMINATED BY ';'
    ESCAPED BY '"'
LINES TERMINATED BY '\r\n';
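
If the export has to happen from R instead of on the MySQL server, one possible sketch is to pull the query result over DBI and write it with fwrite (the RMariaDB driver, connection details, and database name are placeholders; note this loads the full result into R's memory first):

library(DBI)
library(data.table)
con <- dbConnect(RMariaDB::MariaDB(), dbname = "classicmodels",
                 host = "localhost", username = "user", password = "password")
cancelled <- dbGetQuery(con, "
  SELECT orderNumber, status, orderDate, requiredDate, comments
  FROM orders
  WHERE status = 'Cancelled'")
fwrite(cancelled, "cancelled_orders.csv")
dbDisconnect(con)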

Large data table to multiple csv files of specific size in .net

Thanks @H.G.Sandhagen and @jdweng for the inputs. For now I have written the following code, which does what is needed. I know it is not perfect; it can surely be enhanced and made more efficient if we can pre-determine the length of the data table's item array, as pointed out by Nick.McDermaid. For now, I will go with this code to unblock myself and will post the final optimized version when I have it coded.

// Requires: using System.Data; using System.IO; using System.Linq;
// (DataTable.AsEnumerable() also needs a reference to System.Data.DataSetExtensions)
// 'path' is expected to be a format string with a {0} placeholder for the file number
public void WriteToCsv(DataTable table, string path, int size)
{
    int fileNumber = 0;
    StreamWriter sw = new StreamWriter(string.Format(path, fileNumber), false);

    // Write the header row (column names); note it is only written to the first file
    for (int i = 0; i < table.Columns.Count; i++)
    {
        sw.Write(table.Columns[i]);
        if (i < table.Columns.Count - 1)
        {
            sw.Write(",");
        }
    }
    sw.Write(sw.NewLine);

    // Write the data rows, rolling over to a new file once the current one exceeds 'size' bytes
    foreach (DataRow row in table.AsEnumerable())
    {
        sw.WriteLine(string.Join(",", row.ItemArray.Select(x => x.ToString())));
        if (sw.BaseStream.Length > size) // Time to create a new file!
        {
            sw.Close();
            sw.Dispose();
            fileNumber++;
            sw = new StreamWriter(string.Format(path, fileNumber), false);
        }
    }

    sw.Close();
}

How to append several large data.table objects into a single data.table and export to csv quickly without running out of memory?

Update 12/23/2013 - The following solution runs entirely in R without running out of memory (thanks @AnandaMahto).

The major caveat with this method is that you must be absolutely sure that the files you are reading in and writing out each time have exactly the same header columns, in exactly the same order (or your R processing code must ensure this), since write.table does not check this for you.

for ( loop through folders ) {

    for ( loop through files ) {

        filename = list.files( ... )
        file = as.data.table( read.csv( gzfile( filename ), stringsAsFactors = FALSE ) )
        gc()

        ...do some processing to file...

        # append file to the running master file
        if ( first time through inner loop ) {
            write.table( file,
                         "masterfile.csv",
                         sep = ",",
                         dec = ".",
                         qmethod = "double",
                         row.names = FALSE )
        } else {
            write.table( file,
                         "masterfile.csv",
                         sep = ",",
                         dec = ".",
                         qmethod = "double",
                         row.names = FALSE,
                         append = TRUE,
                         col.names = FALSE )
        }
        rm( file, filename )
        gc()
    }
    gc()
}
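
With current versions of data.table, a similar append loop could be sketched with fread/fwrite, which writes the header only once and is typically much faster (the folder, file pattern, and processing step are placeholders):

library(data.table)
files <- list.files("data_folder", pattern = "\\.csv\\.gz$",
                    recursive = TRUE, full.names = TRUE)
for (i in seq_along(files)) {
  file <- fread(files[i])      # reading .gz directly may require the R.utils package
  # ...do some processing to file...
  fwrite(file, "masterfile.csv",
         append = (i > 1),     # overwrite on the first file, append afterwards
         col.names = (i == 1)) # write the header only once
}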

My Initial Solution:

for ( loop through folders ) {

    for ( loop through files ) {

        filename = list.files( ... )
        file = as.data.table( read.csv( gzfile( filename ), stringsAsFactors = FALSE ) )
        gc()

        ...do some processing to file...

        # write out the file
        write.csv( file, ... )
        rm( file, filename )
        gc()
    }
    gc()
}

I then downloaded and installed GnuWin32's sed package and used Windows command line tools to append the files as follows:

copy /b *common_pattern*.csv master_file.csv

This appends together all of the individual .csv files whose names have the text pattern "common_pattern" in them, headers and all.

Then I use sed.exe to remove all but the first header line as follows:

"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -i 2,${/header_pattern/d;} master_file.csv

-i tells sed to just overwrite the specified file (in-place).

2,$ tells sed to look at the range from the 2nd row to the last row ($).

{/header_pattern/d;} tells sed to find all lines in that range containing the text "header_pattern" and delete (d) them.

To make sure this was doing what I wanted, I first printed the lines I was planning to delete:

"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -n 2,${/header_pattern/p;} master_file.csv

Works like a charm; I just wish I could do it all in R.
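
For what it's worth, if the combined result fits into memory, the append-and-strip-headers step can nowadays be done in R alone. A rough sketch (the file pattern is a placeholder):

library(data.table)
parts  <- list.files(pattern = "common_pattern.*\\.csv$")
master <- rbindlist(lapply(parts, fread), use.names = TRUE)   # headers are handled per file
fwrite(master, "master_file.csv")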


