Write.CSV for Large Data.Table

write.csv for large data.table

UPDATE 2019.01.07:

fwrite has been on CRAN since 2016-11-25.

install.packages("data.table")
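
Once installed from CRAN it can be used directly. A minimal sketch (the explicit thread setting of 4 is only an illustration; data.table picks a sensible default on its own):

library(data.table)
setDTthreads(4)                      # optional: control how many threads data.table uses
data(USArrests)
fwrite(USArrests, "USArrests.csv")   # the thread count can also be set per call via nThread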

UPDATE 08.04.2016:

fwrite has recently been added to the development version of the data.table package. It also runs in parallel (implicitly).

# Install development version of data.table
install.packages("data.table",
                 repos = "https://Rdatatable.github.io/data.table", type = "source")

# Load package
library(data.table)

# Load data
data(USArrests)

# Write CSV
fwrite(USArrests, "USArrests_fwrite.csv")

According to the detailed benchmarks under "Speeding up the performance of write.table", fwrite is roughly 17x faster than write.csv there (YMMV).


UPDATE 15.12.2015:

In the future there might be an fwrite function in the data.table package; see https://github.com/Rdatatable/data.table/issues/580.
That thread links to a GIST with a prototype of such a function that speeds up the process by a factor of 2 (according to its author): https://gist.github.com/oseiskar/15c4a3fd9b6ec5856c89.

ORIGINAL ANSWER:

I had the same problem (trying to write even larger CSV files) and in the end decided against using CSV files.

I would recommend using SQLite instead, as it is much faster than dealing with CSV files:

require("RSQLite")
# Set up database
drv <- dbDriver("SQLite")
con <- dbConnect(drv, dbname = "test.db")
# Load example data
data(USArrests)
# Write the data set "USArrests" to the table "arrests" in database "test.db"
dbWriteTable(con, "arrests", USArrests)

# Test if the data was correctly stored in the database, i.e.
# run an exemplary query on the newly created database
dbGetQuery(con, "SELECT * FROM arrests WHERE Murder > 10")
# row_names Murder Assault UrbanPop Rape
# 1 Alabama 13.2 236 58 21.2
# 2 Florida 15.4 335 80 31.9
# 3 Georgia 17.4 211 60 25.8
# 4 Illinois 10.4 249 83 24.0
# 5 Louisiana 15.4 249 66 22.2
# 6 Maryland 11.3 300 67 27.8
# 7 Michigan 12.1 255 74 35.1
# 8 Mississippi 16.1 259 44 17.1
# 9 Nevada 12.2 252 81 46.0
# 10 New Mexico 11.4 285 70 32.1
# 11 New York 11.1 254 86 26.1
# 12 North Carolina 13.0 337 45 16.1
# 13 South Carolina 14.4 279 48 22.5
# 14 Tennessee 13.2 188 59 26.9
# 15 Texas 12.7 201 80 25.5

# Close the connection to the database
dbDisconnect(con)

For further information, see http://cran.r-project.org/web/packages/RSQLite/RSQLite.pdf

You can also use software like http://sqliteadmin.orbmu2k.de/ to access the database and export it to CSV, etc.
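
If you later need a CSV export after all, it can also be done from within R. A small sketch building on the code above (the output file name is just an example):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "test.db")
# Pull the table back into R and write it out as CSV
arrests <- dbGetQuery(con, "SELECT * FROM arrests")
write.csv(arrests, "arrests_export.csv", row.names = FALSE)
dbDisconnect(con)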

--

How to export a large dataset from R to CSV?

We can use

write.csv(a, "Lucas1.csv", quote=FALSE, row.names=FALSE)
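
For a really large data set, data.table::fwrite is usually much faster than write.csv. A hedged equivalent of the call above (assuming a is the data set from the question):

library(data.table)
fwrite(a, "Lucas1.csv", quote = FALSE)   # row names are omitted by default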

Writing multiple csv files with specified number of rows for large dataframe

This should work

library( data.table )
#create sample data
dt = data.table( 1:2000 )
#split dt into a list, based on (f = ) the integer division (+ 1) of the 'row numbers'
# by the preferred chunk size (99)
# use keep.by = TRUE to keep the integer division (+ 1) result
# for naming the files when saving
l <- split( dt, f = dt[, .I] %/% 99 + 1, keep.by = TRUE )
#simple for loop, writing each list element, using its name in the filename
for (i in seq_along(l)) {
  write.csv( data.frame( l[[i]] ), file = paste0( "./testfile_", names(l[i]), ".csv" ) )
}

This results in a set of files named testfile_1.csv, testfile_2.csv, and so on, in the working directory.
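
If speed becomes an issue for a large data.frame, the same chunking idea can be combined with fwrite. A small sketch using the 99-row chunks from above:

library(data.table)
dt <- data.table(x = 1:2000)
# split into chunks of roughly 99 rows, as above
l <- split(dt, f = dt[, .I] %/% 99 + 1)
for (i in seq_along(l)) {
  fwrite(l[[i]], file = paste0("./testfile_", names(l)[i], ".csv"))
}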

Write a Data.Table as a csv file

Load the right package, look at its help page, search for "csv", follow the Usage section:

library(data.table)
help(package = "data.table")
fwrite(df2, file = "~/test.csv")  # "~" is the home directory; adjust the path for your OS/setup

Another approach might be:

 as.data.frame( lapply(df2, unlist) )
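
Putting the two together, a possible sketch (here df2 stands for whatever data.table you want to write, and the unlist step assumes its columns can be flattened to atomic vectors):

flat <- as.data.frame( lapply(df2, unlist) )   # flatten list columns into atomic vectors
fwrite(flat, "~/test.csv")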

How to export a large table in csv format with 8 millions rows in mysql?

Have you tried SELECT ... INTO OUTFILE?

The solution:

SELECT
    orderNumber, status, orderDate, requiredDate, comments
FROM
    orders
WHERE
    status = 'Cancelled'
INTO OUTFILE 'C:/tmp/cancelled_orders.csv'
FIELDS ENCLOSED BY '"'
    TERMINATED BY ';'
    ESCAPED BY '"'
LINES TERMINATED BY '\r\n';
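
If the export has to happen from R instead of on the MySQL server, one possible sketch is to pull the query result over DBI and write it with fwrite (the RMariaDB driver, connection details, and database name are placeholders; note this loads the full result into R's memory first):

library(DBI)
library(data.table)
con <- dbConnect(RMariaDB::MariaDB(), dbname = "classicmodels",
                 host = "localhost", username = "user", password = "password")
cancelled <- dbGetQuery(con, "
  SELECT orderNumber, status, orderDate, requiredDate, comments
  FROM orders
  WHERE status = 'Cancelled'")
fwrite(cancelled, "cancelled_orders.csv")
dbDisconnect(con)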

Large data table to multiple csv files of specific size in .net

Thanks @H.G.Sandhagen and @jdweng for the inputs. For now I have written the following code, which does what is needed. I know it is not perfect; it can surely be enhanced and made more efficient if we can pre-determine the length of the data table's item array, as pointed out by Nick.McDermaid. For now, I will go with this code to unblock myself and will post the final optimized version when I have it coded.

// Requires: using System.Data; using System.IO; using System.Linq;
// (DataTable.AsEnumerable() also needs a reference to System.Data.DataSetExtensions)
// 'path' is expected to be a format string with a {0} placeholder for the file number
public void WriteToCsv(DataTable table, string path, int size)
{
    int fileNumber = 0;
    StreamWriter sw = new StreamWriter(string.Format(path, fileNumber), false);

    // Write the header row (column names); note it is only written to the first file
    for (int i = 0; i < table.Columns.Count; i++)
    {
        sw.Write(table.Columns[i]);
        if (i < table.Columns.Count - 1)
        {
            sw.Write(",");
        }
    }
    sw.Write(sw.NewLine);

    // Write the data rows, rolling over to a new file once the current one exceeds 'size' bytes
    foreach (DataRow row in table.AsEnumerable())
    {
        sw.WriteLine(string.Join(",", row.ItemArray.Select(x => x.ToString())));
        if (sw.BaseStream.Length > size) // Time to create a new file!
        {
            sw.Close();
            sw.Dispose();
            fileNumber++;
            sw = new StreamWriter(string.Format(path, fileNumber), false);
        }
    }

    sw.Close();
}

How to append several large data.table objects into a single data.table and export to csv quickly without running out of memory?

Update 12/23/2013 - The following solution runs entirely in R without running out of memory (thanks @AnandaMahto).

The major caveat with this method is that you must be absolutely sure that the files you are reading in and writing out each time have exactly the same header columns, in exactly the same order (or your R processing code must ensure this), since write.table does not check this for you.

for ( loop through folders ) {

    for ( loop through files ) {

        filename = list.files( ... )
        file = as.data.table( read.csv( gzfile( filename ), stringsAsFactors = FALSE ) )
        gc()

        ...do some processing to file...

        # append file to the running master file
        if ( first time through inner loop ) {
            write.table( file,
                         "masterfile.csv",
                         sep = ",",
                         dec = ".",
                         qmethod = "double",
                         row.names = FALSE )
        } else {
            write.table( file,
                         "masterfile.csv",
                         sep = ",",
                         dec = ".",
                         qmethod = "double",
                         row.names = FALSE,
                         append = TRUE,
                         col.names = FALSE )
        }
        rm( file, filename )
        gc()
    }
    gc()
}
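
With current versions of data.table, a similar append loop could be sketched with fread/fwrite, which writes the header only once and is typically much faster (the folder, file pattern, and processing step are placeholders):

library(data.table)
files <- list.files("data_folder", pattern = "\\.csv\\.gz$",
                    recursive = TRUE, full.names = TRUE)
for (i in seq_along(files)) {
  file <- fread(files[i])      # reading .gz directly may require the R.utils package
  # ...do some processing to file...
  fwrite(file, "masterfile.csv",
         append = (i > 1),     # overwrite on the first file, append afterwards
         col.names = (i == 1)) # write the header only once
}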

My Initial Solution:

for ( loop through folders ) {

    for ( loop through files ) {

        filename = list.files( ... )
        file = as.data.table( read.csv( gzfile( filename ), stringsAsFactors = FALSE ) )
        gc()

        ...do some processing to file...

        # write out the file
        write.csv( file, ... )
        rm( file, filename )
        gc()
    }
    gc()
}

I then downloaded and installed GnuWin32's sed package and used Windows command line tools to append the files as follows:

copy /b *common_pattern*.csv master_file.csv

This appends together all of the individual .csv files whose names have the text pattern "common_pattern" in them, headers and all.

Then I use sed.exe to remove all but the first header line as follows:

"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -i 2,${/header_pattern/d;} master_file.csv

-i tells sed to just overwrite the specified file (in-place).

2,$ tells sed to look at the range from the 2nd row to the last row ($).

{/header_pattern/d;} tells sed to find all lines in that range containing the text "header_pattern" and delete (d) them.

To make sure this was doing what I wanted, I first printed the lines I was planning to delete:

"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -n 2,${/header_pattern/p;} master_file.csv

Works like a charm; I just wish I could do it all in R.
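
For what it's worth, if the combined result fits into memory, the append-and-strip-headers step can nowadays be done in R alone. A rough sketch (the file pattern is a placeholder):

library(data.table)
parts  <- list.files(pattern = "common_pattern.*\\.csv$")
master <- rbindlist(lapply(parts, fread), use.names = TRUE)   # headers are handled per file
fwrite(master, "master_file.csv")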


