Speeding up the performance of write.table

If all of your columns are of the same class, converting to a matrix before writing out provides a nearly 6x speed-up. Also, you can look into using write.matrix() from package MASS, though it did not prove faster for this example. Maybe I didn't set something up properly:

#Fake data
m <- matrix(runif(256*65536), nrow = 256)
#As a data.frame
system.time(write.csv(as.data.frame(m), "dataframe.csv"))
#----------
# user system elapsed
# 319.53 13.65 333.76

#As a matrix
system.time(write.csv(m, "matrix.csv"))
#----------
# user system elapsed
# 52.43 0.88 53.59

#Using write.matrix()
require(MASS)
system.time(write.matrix(m, "writematrix.csv"))
#----------
# user system elapsed
# 113.58 59.12 172.75

EDIT

To address the concern raised below that the results above are unfair to data.frame, here are some more results and timings to show that the overall message is still "convert your data object to a matrix if possible. If not possible, deal with it. Alternatively, reconsider why you need to write out a 200MB+ file in CSV format if timing is of the utmost importance":

#This is a data.frame
m2 <- as.data.frame(matrix(runif(256*65536), nrow = 256))
#This is still 6x slower
system.time(write.csv(m2, "dataframe.csv"))
# user system elapsed
# 317.85 13.95 332.44
#This even includes the overhead of converting with as.matrix() in the timing
system.time(write.csv(as.matrix(m2), "asmatrix.csv"))
# user system elapsed
# 53.67 0.92 54.67

So, nothing really changes. To confirm this is reasonable, consider the relative time costs of as.data.frame():

m3 <- as.matrix(m2)
system.time(as.data.frame(m3))
# user system elapsed
# 0.77 0.00 0.77

So, the conversion itself is not a big deal, nor does it skew the results as much as the comment below suggests. If you're still not convinced that using write.csv() on large data.frames is a bad idea performance-wise, consult the manual under the Note:

write.table can be slow for data frames with large numbers (hundreds or more) of
columns: this is inevitable as each column could be of a different class and so must be
handled separately. If they are all of the same class, consider using a matrix instead.

Finally, consider moving to a native RData object if you're still losing sleep over saving things faster:

system.time(save(m2, file = "thisisfast.RData"))
# user system elapsed
# 21.67 0.12 21.81

R: speed up writing a huge data.frame to a text file?

Many people use write.csv() to write to a flat file. However, there is a relatively new package called 'readr' that reads and writes much more quickly:

http://cran.r-project.org/web/packages/readr/readr.pdf
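
A minimal sketch of the readr approach (the object and file names here are placeholders, not from the original question):

#Using readr::write_csv() in place of write.csv()
library(readr)
df <- as.data.frame(matrix(runif(256*65536), nrow = 256))  #stand-in data
write_csv(df, "readr.csv")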

This is about twice as fast as write.csv, and never writes row names.

There, I got you down to 1.5 days (and still running).

Other tricks are:

  • Only write the data you need: dropping row.names is an obvious example.
  • Use round() to round any numeric fields to the minimum number of decimal places that you need.
  • Benchmark. Try writing 1% of your data to disk, time it, apply some of the tricks or packages mentioned above, and time it again to see what works (see the sketch after this list).
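
A rough sketch of the last two tricks, with made-up names (df stands in for your data, ndigits for the precision you can live with):

#Stand-in data
df <- as.data.frame(matrix(runif(1e5), ncol = 10))

#Round numeric columns to a fixed number of decimal places before writing
ndigits <- 3
num_cols <- sapply(df, is.numeric)
df[num_cols] <- lapply(df[num_cols], round, digits = ndigits)

#Benchmark on a 1% sample before committing to the full write
idx <- sample(nrow(df), size = ceiling(0.01 * nrow(df)))
system.time(write.csv(df[idx, ], "sample.csv", row.names = FALSE))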

How can I speed up my R code when multiplying across tables?

I think this is a little faster than your method:

library(data.table)
vec <- dtB[colnames(dtA), multi, on = 'id']
dtA <- as.data.table(mapply(function(x,y) x*y, dtA, vec, SIMPLIFY = FALSE))

Can't really see where you could get a significant speed up though, as I don't think you are doing anything particularly wasteful in your attempt.

EDIT:

This is about 50% faster again on larger examples:

cn <- colnames(dtA)
dtA[,(cn) := mapply(function(x,y) x*y, dtA, dtB[cn, multi, on = 'id'], SIMPLIFY = FALSE)]
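
As a small, made-up illustration of what the join is doing (each column of dtA is matched by name against dtB$id and scaled by the corresponding multi):

library(data.table)

#Toy data, invented for illustration: one multiplier per column of dtA
dtA <- data.table(a = 1:3, b = 4:6)
dtB <- data.table(id = c("a", "b"), multi = c(10, 100))

cn <- colnames(dtA)
dtA[,(cn) := mapply(function(x,y) x*y, dtA, dtB[cn, multi, on = 'id'], SIMPLIFY = FALSE)]
#Column a is now multiplied by 10, column b by 100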

Does Relations between tables in database speed up performance of queries?

To boost performance, you should use indexes and appropriate data types (storing a number as a string takes more space, and comparisons may be less efficient).

Relations between tables, i.e. foreign keys, are constraints: you cannot insert a value into the referencing table unless it points to an existing record in the referenced table. They are a way to keep data integrity, e.g.:

Table1
id  table2_id
1   1
2   1
3   3

Table2
id  some_column
1   123
2   123
3   null

Here, Table1.table2_id references Table2.id, so you won't be able to insert a row like (4, 4) into Table1, because there is no id = 4 in Table2.

Insertion speed slowdown as the table grows in mysql

Edit your /etc/mysql/my.cnf file and make sure you allocate enough memory to the InnoDB buffer pool. If this is a dedicated server, you could probably use up to 80% of your system memory.

# Provide a buffer pool for InnoDB - up to 80% of memory for a dedicated database server
innodb_buffer_pool_size=614M

The primary key index is a B-tree, so inserts will always take O(log N) time, and once you run out of cache, they will start swapping like mad. When this happens, you will probably want to partition the data to keep your insertion speed up. See http://dev.mysql.com/doc/refman/5.1/en/partitioning.html for more info on partitioning.

Good luck!

Speeding up an UPDATE in MySQL

@WilsonHauck This project has been through a lot of optimizations. There are several tables to be migrated and many more millions of records; I have experimented with many things, like different buffer sizes, MEMORY engines, etc. I have benchmarks in place, and those approaches did not work for my use case.

@RickJames This particular UPDATE was the last statement that resisted optimization. Indeed, working on small ranges has been the key: I got a 5x speed-up in my test environment for this particular statement. The compromise has been ranges of 5K and a pool of 20 threads (there are other threads doing other work in parallel). The test machine has 8 cores, but the production machine has 48, so I expect the speed-up to be even greater.

I would like to understand the lock errors I was getting when the ranges were on the order of hundreds of thousands (that is, to actually know what they were rather than conjecture, and thus to understand why they do not appear with small ranges), and also to understand why I needed to hand-code a more performant version of the UPDATE.

But that is just to better understand the details; this 5x speed-up is incredible and enough for my purposes.

BTW, I believe an I/O-bound task is precisely the kind that can use more threads than cores, because the CPU wait times can be filled by other threads. It is with CPU-bound tasks that you won't squeeze out more performance with more threads.

@Solarflare Since the multithreaded approach is what I was looking for, I didn't experiment with STRAIGHT_JOIN, but in the new approach the cardinalities are reversed and MySQL starts with invoice now. Perhaps we also got an extra boost from starting there, as per your remark.

Database speed optimization: few tables with many rows, or many tables with few rows?

Smaller tables are faster. Period.

If you have history that is rarely used, then getting the history into other tables will be faster.

This is what a data warehouse is about -- separate operational data from historical data.

You can run a periodic extract from the operational tables and a load into the historical ones. All the data is kept; it is just segregated.


