How to Write a Large Buffer into a Binary File in C++, Fast

How to write a large buffer into a binary file in C++, fast?

This did the job (in the year 2012):

#include <stdio.h>
const unsigned long long size = 8ULL*1024ULL*1024ULL;
unsigned long long a[size];

int main()
{
    FILE* pFile;
    pFile = fopen("file.binary", "wb");
    for (unsigned long long j = 0; j < 1024; ++j){
        //Some calculations to fill a[]
        fwrite(a, 1, size*sizeof(unsigned long long), pFile);
    }
    fclose(pFile);
    return 0;
}

I just timed 8GB in 36sec, which is about 220MB/s, and I think that maxes out my SSD. Also worth noting: the code in the question used one core at 100%, whereas this code only uses 2-5%.

Thanks a lot to everyone.

Update: 5 years have passed; it's 2017 now. Compilers, hardware, libraries, and my requirements have changed. That's why I made some changes to the code and took new measurements.

First up the code:

#include <fstream>
#include <chrono>
#include <vector>
#include <cstdint>
#include <numeric>
#include <random>
#include <algorithm>
#include <iostream>
#include <cassert>

std::vector<uint64_t> GenerateData(std::size_t bytes)
{
    assert(bytes % sizeof(uint64_t) == 0);
    std::vector<uint64_t> data(bytes / sizeof(uint64_t));
    std::iota(data.begin(), data.end(), 0);
    std::shuffle(data.begin(), data.end(), std::mt19937{ std::random_device{}() });
    return data;
}

long long option_1(std::size_t bytes)
{
    std::vector<uint64_t> data = GenerateData(bytes);

    auto startTime = std::chrono::high_resolution_clock::now();
    auto myfile = std::fstream("file.binary", std::ios::out | std::ios::binary);
    myfile.write((char*)&data[0], bytes);
    myfile.close();
    auto endTime = std::chrono::high_resolution_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count();
}

long long option_2(std::size_t bytes)
{
    std::vector<uint64_t> data = GenerateData(bytes);

    auto startTime = std::chrono::high_resolution_clock::now();
    FILE* file = fopen("file.binary", "wb");
    fwrite(&data[0], 1, bytes, file);
    fclose(file);
    auto endTime = std::chrono::high_resolution_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count();
}

long long option_3(std::size_t bytes)
{
    std::vector<uint64_t> data = GenerateData(bytes);

    std::ios_base::sync_with_stdio(false);
    auto startTime = std::chrono::high_resolution_clock::now();
    auto myfile = std::fstream("file.binary", std::ios::out | std::ios::binary);
    myfile.write((char*)&data[0], bytes);
    myfile.close();
    auto endTime = std::chrono::high_resolution_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count();
}

int main()
{
    const std::size_t kB = 1024;
    const std::size_t MB = 1024 * kB;
    const std::size_t GB = 1024 * MB;

    for (std::size_t size = 1 * MB; size <= 4 * GB; size *= 2)
        std::cout << "option1, " << size / MB << "MB: " << option_1(size) << "ms" << std::endl;
    for (std::size_t size = 1 * MB; size <= 4 * GB; size *= 2)
        std::cout << "option2, " << size / MB << "MB: " << option_2(size) << "ms" << std::endl;
    for (std::size_t size = 1 * MB; size <= 4 * GB; size *= 2)
        std::cout << "option3, " << size / MB << "MB: " << option_3(size) << "ms" << std::endl;

    return 0;
}

This code compiles with Visual Studio 2017 and g++ 7.2.0 (a new requirement).
I ran the code with two setups:

  • Laptop, Core i7, SSD, Ubuntu 16.04, g++ Version 7.2.0 with -std=c++11 -march=native -O3
  • Desktop, Core i7, SSD, Windows 10, Visual Studio 2017 Version 15.3.1 with /Ox /Ob2 /Oi /Ot /GT /GL /Gy

Which gave the following measurements (after ditching the values for 1MB, because they were obvious outliers):
[Benchmark charts for the two setups: laptop (Ubuntu/g++) and desktop (Windows/VS2017)]
Both times, option1 and option3 max out my SSD. I didn't expect to see this, because option2 used to be the fastest code on my old machine back then.

TL;DR: My measurements indicate that std::fstream should be preferred over FILE.

C++ Any faster method to write a large binary file?

I'd suggest removing the substr call in the inner loop. You are allocating a new string and then destroying it for each character you process. Replace this code:

for (::UINT Index2 = 7, Inc = 1; Index2 + 1 != 0; --Index2, Inc += Inc)
    if (BinaryStr.substr(Index1 * 8, 8)[Index2] == '1')
        Dec += Inc;

by something like:

for (::UINT Index2 = 7, Inc = 1; Index2 + 1 != 0; --Index2, Inc += Inc)
    if (BinaryStr[Index1 * 8 + Index2] == '1')
        Dec += Inc;
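The same indexing idea can be hoisted into a small helper. A sketch (PackBits is an invented name; it assumes the string length is a multiple of 8 with the most significant bit first within each byte, as in the loop above):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Pack a string of '0'/'1' characters into bytes, indexing
// directly into the string -- no per-character substr allocation.
std::vector<uint8_t> PackBits(const std::string& bits)
{
    assert(bits.size() % 8 == 0);
    std::vector<uint8_t> out;
    out.reserve(bits.size() / 8);
    for (std::size_t i = 0; i < bits.size(); i += 8) {
        uint8_t byte = 0;
        for (std::size_t j = 0; j < 8; ++j)
            byte = static_cast<uint8_t>((byte << 1) | (bits[i + j] == '1'));
        out.push_back(byte);
    }
    return out;
}
```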

Write binary file to disk super fast in MEX

As indicated in some posts, very large buffers tend to decrease performance, so the buffer is written to the file part by part. For me, 8 MiB gives the best performance.

void writeBinFilePartByPart(int16_t *int_data, size_t size)
{
    size_t part = 8 * 1024 * 1024;

    size = size * sizeof(int16_t);

    char *data = reinterpret_cast<char *>(int_data);

    HANDLE file = CreateFileA (
        "windows_test.bin",
        GENERIC_WRITE,
        0,
        NULL,
        CREATE_ALWAYS,
        FILE_FLAG_SEQUENTIAL_SCAN,
        NULL);

    if (file == INVALID_HANDLE_VALUE)
        return;

    // Expand file size (SetFilePointerEx handles files larger than 2 GiB,
    // which the LONG offset of plain SetFilePointer cannot)
    LARGE_INTEGER offset;
    offset.QuadPart = static_cast<LONGLONG>(size);
    SetFilePointerEx (file, offset, NULL, FILE_BEGIN);
    SetEndOfFile (file);
    offset.QuadPart = 0;
    SetFilePointerEx (file, offset, NULL, FILE_BEGIN);

    DWORD written;
    if (size < part)
    {
        WriteFile (file, data, static_cast<DWORD>(size), &written, NULL);
        CloseHandle (file);
        return;
    }

    size_t rem = size % part;
    for (size_t i = 0; i < size - rem; i += part)
    {
        WriteFile (file, data + i, static_cast<DWORD>(part), &written, NULL);
    }

    if (rem)
        WriteFile (file, data + size - rem, static_cast<DWORD>(rem), &written, NULL);

    CloseHandle (file);
}

The output is compared to the C++ standard library method mentioned by @Cris Luengo:

[Chart comparing the WinAPI part-by-part write with the standard library method]
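For readers not on Windows, the same part-by-part idea can be sketched portably with std::ofstream (writeBinFileInChunks is an invented name; there is no preallocation step, since the standard library has no portable equivalent of SetEndOfFile):

```cpp
#include <cstddef>
#include <fstream>

// Write `size` bytes in fixed-size chunks; the 8 MiB default
// mirrors the Win32 version above. Returns false on stream error.
bool writeBinFileInChunks(const char* path, const char* data, std::size_t size,
                          std::size_t chunk = 8 * 1024 * 1024)
{
    std::ofstream out(path, std::ios::binary);
    if (!out)
        return false;
    for (std::size_t i = 0; i < size; i += chunk) {
        std::size_t n = (size - i < chunk) ? size - i : chunk;
        out.write(data + i, static_cast<std::streamsize>(n));
    }
    return static_cast<bool>(out);
}
```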

Fastest way to read every 30th byte of large binary file?

Performance test. If you want to use it yourself, note that the integrity check (printing total) only works if "step" divides BUFSZ, and MEGS is small enough that you don't read off the end of the file. This is due to (a) laziness and (b) a desire not to obscure the real code. rand1.data is a few GB copied from /dev/urandom using dd.

#include <stdio.h>
#include <stdlib.h>

const long long size = 1024LL*1024*MEGS;
const int step = 32;

int main() {
    FILE *in = fopen("/cygdrive/c/rand1.data", "rb");
    int total = 0;
#ifdef SEEK
    long long i = 0;
    char buf[1];
    while (i < size) {
        fread(buf, 1, 1, in);
        total += (unsigned char) buf[0];
        fseek(in, step - 1, SEEK_CUR);
        i += step;
    }
#endif
#ifdef BUFSZ
    long long i = 0;
    char buf[BUFSZ];
    while (i < size) {
        fread(buf, BUFSZ, 1, in);
        i += BUFSZ;
        for (int j = 0; j < BUFSZ; j += step)
            total += (unsigned char) buf[j];
    }
#endif
    printf("%d\n", total);
    fclose(in);
    return 0;
}

Results:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real 0m1.391s
user 0m0.030s
sys 0m0.030s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real 0m0.172s
user 0m0.108s
sys 0m0.046s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=20 && time ./buff2
83595817

real 0m0.031s
user 0m0.030s
sys 0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=20 && time ./buff2
83595817

real 0m0.141s
user 0m0.140s
sys 0m0.015s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DSEEK -DMEGS=20 && time ./buff2
83595817

real 0m20.797s
user 0m1.733s
sys 0m9.140s

Summary:

I'm using 20MB of data initially, which of course fits in cache. The first time I read it (using a 32KB buffer) takes 1.4s, bringing it into cache. The second time (using a 32 byte buffer) takes 0.17s. The third time (back with the 32KB buffer again) takes 0.03s, which is too close to the granularity of my timer to be meaningful. fseek takes over 20s, even though the data is already in disk cache.

At this point I'm pulling fseek out of the ring so the other two can continue:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real 0m33.437s
user 0m0.749s
sys 0m1.562s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real 0m6.078s
user 0m5.030s
sys 0m0.484s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real 0m1.141s
user 0m0.280s
sys 0m0.500s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=1000 && time ./buff2
-117681741

real 0m6.094s
user 0m4.968s
sys 0m0.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=1000 && time ./buff2
-117681741

real 0m1.140s
user 0m0.171s
sys 0m0.640s

1000MB of data also appears to be substantially cached. A 32KB buffer is 6 times faster than a 32 byte buffer. But the difference is all user time, not time spent blocked on disk I/O. Now, 8000MB is much more than I have RAM, so I can avoid caching:

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real 3m25.515s
user 0m5.155s
sys 0m12.640s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32 -DMEGS=8000 && time ./buff2
-938074821

real 3m59.015s
user 1m11.061s
sys 0m10.999s

$ gcc -std=c99 buff2.c -obuff2 -O3 -DBUFSZ=32*1024 -DMEGS=8000 && time ./buff2
-938074821

real 3m42.423s
user 0m5.577s
sys 0m14.484s

Ignore the first of those three; it benefited from the first 1000MB of the file already being in RAM.

Now, the version with the 32KB buffer is only slightly faster in wall clock time (and I can't be bothered to re-run, so let's ignore it for now), but look at the difference in user+sys time: 20s vs. 82s. I think that my OS's speculative read-ahead disk caching has saved the 32-byte buffer's bacon here: while the 32 byte buffer is being slowly refilled, the OS is loading the next few disk sectors even though nobody has asked for them. Without that I suspect it would have been a minute (20%) slower than the 32KB buffer, which spends less time in user-land before requesting the next read.

Moral of the story: standard I/O buffering doesn't cut it in my implementation, and the performance of fseek is atrocious, as the questioner says. When the file is cached in the OS, buffer size is a big deal. When the file is not cached in the OS, buffer size doesn't make a whole lot of difference to wall clock time, but my CPU was busier.

incrediman's fundamental suggestion to use a read buffer is vital, since fseek is appalling. Arguing over whether the buffer should be a few KB or a few hundred KB is most likely pointless on my machine, probably because the OS has done such a good job of ensuring that the operation is tightly I/O bound. But I'm pretty sure this is down to OS disk read-ahead, not standard I/O buffering, because if it were the latter then fseek would be better than it is. Actually, it could be that standard I/O is doing the read-ahead, but a too-simple implementation of fseek is discarding the buffer every time. I haven't looked into the implementation (and I couldn't follow it across the boundary into the OS and filesystem drivers if I did).
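One stdio knob worth trying in experiments like these is setvbuf, which replaces the default buffer with one you size yourself, provided you call it before the first read. A sketch (sumEveryNth is an invented name; like the benchmark above, the sampling is only correct when step divides bufsz, and whether a bigger buffer actually helps depends on the implementation, as the measurements suggest):

```cpp
#include <cstdio>
#include <vector>

// Sum every `step`-th byte of a file after handing stdio a
// caller-sized, fully buffered I/O buffer.
long sumEveryNth(const char* path, std::size_t bufsz, std::size_t step = 32)
{
    FILE* in = std::fopen(path, "rb");
    if (!in)
        return -1;
    std::vector<char> iobuf(bufsz);
    std::setvbuf(in, iobuf.data(), _IOFBF, bufsz); // must precede any I/O
    long total = 0;
    std::vector<char> chunk(bufsz);
    std::size_t n;
    while ((n = std::fread(chunk.data(), 1, bufsz, in)) > 0)
        for (std::size_t j = 0; j < n; j += step)
            total += static_cast<unsigned char>(chunk[j]);
    std::fclose(in);
    return total;
}
```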

How to concurrently write to a file in C++ (in other words, what's the fastest way to write to a file)

The fastest method to write to a file is to use hardware assist. Write your output to memory (a.k.a. buffer), then tell the hardware device to transfer from memory to the file (disk).

The next fastest method is to write all the data to a buffer then block write the data to the file. If you want other tasks or threads to execute during your writing, then create a thread that writes the buffer to the file.

When writing to a file, the more data per transaction, the more efficient the write will be. For example, 1 write of 1024 bytes is faster than 1024 writes of one byte.
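That difference is why a small staging buffer pays off. A sketch of the idea (writeCoalesced is an invented name; it accumulates bytes and issues one fwrite per full block instead of one call per byte):

```cpp
#include <cstdio>
#include <vector>

// Coalesce many small writes into one large fwrite per block:
// one 1024-byte write replaces 1024 one-byte writes.
void writeCoalesced(FILE* out, const unsigned char* src, std::size_t n,
                    std::size_t block = 1024)
{
    std::vector<unsigned char> buf;
    buf.reserve(block);
    for (std::size_t i = 0; i < n; ++i) {
        buf.push_back(src[i]);
        if (buf.size() == block) {
            std::fwrite(buf.data(), 1, buf.size(), out);
            buf.clear();
        }
    }
    if (!buf.empty())                       // flush the partial last block
        std::fwrite(buf.data(), 1, buf.size(), out);
}
```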

The idea is to keep the data streaming. Slowing down the transfer rate may be faster than a burst write, delay, burst write, delay, etc.

Remember that the disk is essentially a serial device (unless you have a special hard drive). Bits are laid down on the platters using a bit stream. Writing data in parallel will have adverse effects because the head will have to be moved between the parallel activities.

Remember that if you use more than one core, there will be more traffic on the data bus. The transfer to the file will have to pause while other threads/tasks are using the data bus. So, if you can, block all tasks, then transfer your data. :-)

I've written programs that copy from slow memory to fast memory, then transfer from fast memory to the hard drive, also using interrupts (threads).

Summary

Fast writing to a file involves:

  1. Keep the data streaming; minimize the pauses.
  2. Write in binary mode (no translations, please).
  3. Write in blocks (format into memory as necessary before writing the block).
  4. Maximize the data in a transaction.
  5. Use a separate writing thread, if you want other tasks running "concurrently".
  6. The hard drive is a serial device, not parallel. Bits are written to the platters in a serial stream.
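Point 5 can be sketched as a single background writer fed through a queue (AsyncFileWriter is an invented name; this is a minimal illustration, not production code -- no error handling and no backpressure):

```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A single writer thread drains filled buffers so the producer
// can keep computing while blocks go to disk.
class AsyncFileWriter {
public:
    explicit AsyncFileWriter(const char* path)
        : out_(path, std::ios::binary), worker_([this] { run(); }) {}

    ~AsyncFileWriter() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();                       // drains remaining blocks
    }

    void write(std::vector<char> block) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(block)); }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            while (!q_.empty()) {
                std::vector<char> block = std::move(q_.front());
                q_.pop();
                lk.unlock();                  // write without holding the lock
                out_.write(block.data(),
                           static_cast<std::streamsize>(block.size()));
                lk.lock();
            }
            if (done_)
                return;
        }
    }

    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::vector<char>> q_;
    bool done_ = false;
    std::thread worker_;                      // declared last: starts after
                                              // the other members exist
};
```

Blocks are written in the order they were queued, since a single worker drains a FIFO queue.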

