Posix_Fadvise(Willneed) Makes Io Slower

What is the Fastest Method for High Performance Sequential File I/O in C++?

Are there generally accepted guidelines for achieving the fastest possible sequential file I/O in C++?

Rule 0: Measure. Use all available profiling tools and get to know them. It's almost a commandment in programming that if you didn't measure it you don't know how fast it is, and for I/O this is even more true. Make sure to test under actual work conditions if you possibly can. A process that has no competition for the I/O system can be over-optimized, fine-tuned for conditions that don't exist under real loads.

  1. Use mapped memory instead of writing to files. This isn't always faster but it allows the opportunity to optimize the I/O in an operating system-specific but relatively portable way, by avoiding unnecessary copying, and taking advantage of the OS's knowledge of how the disk actually being used. ("Portable" if you use a wrapper, not an OS-specific API call).

  2. Try and linearize your output as much as possible. Having to jump around memory to find the buffers to write can have noticeable effects under optimized conditions, because cache lines, paging and other memory subsystem issues will start to matter. If you have lots of buffers look into support for scatter-gather I/O which tries to do that linearizing for you.

Some possible considerations:

  • Guidelines for choosing the optimal buffer size

Page size for starters, but be ready to tune from there.

  • Will a portable library like boost::asio be too abstracted to expose the intricacies
    of a specific platform, or can they be assumed to be optimal?

Don't assume it's optimal. It depends on how thoroughly the library gets exercised on your platform, and how much effort the developers put into making it fast. Having said that a portable I/O library can be very fast, because fast abstractions exist on most systems, and it's usually possible to come up with a general API that covers a lot of the bases. Boost.Asio is, to the best of my limited knowledge, fairly fine tuned for the particular platform it is on: there's a whole family of OS and OS-variant specific APIs for fast async I/O (e.g. epoll, /dev/epoll, kqueue, Windows overlapped I/O), and Asio wraps them all.

  • Is asynchronous I/O always preferable to synchronous? What if the application is not otherwise CPU-bound?

Asynchronous I/O isn't faster in a raw sense than synchronous I/O. What asynchronous I/O does is ensure that your code is not wasting time waiting for the I/O to complete. It is faster in a general way than the other method of not wasting that time, namely using threads, because it will call back into your code when I/O is ready and not before. There are no false starts or concerns with idle threads needing to be terminated.

python multiprocessing read file cost too much time

You have a few problems. First, you're not parallelizing. You do:

result = pool.apply_async(gainOneFile,(abs_from_filename,)) 
arr.append(result.get())

over and over, dispatching a task, then immediately calling .get() which waits for it to complete before you dispatch any additional tasks; you never actually have more than one worker running at once. Store all the results without calling .get(), then call .get() later. Or just use Pool.map or related methods and save yourself some hassle from manual individual result management, e.g. (using imap_unordered to minimize overhead since you're just sorting anyway):

# Make generator of paths to load
paths = (os.path.join(path, "outputDict"+str(i)) for i in xrange(1, 40))
# Load them all in parallel, and sort the results by length (lambda is redundant)
arr = sorted(pool.imap_unordered(gainOneFile, paths), key=len)

Second, multiprocessing has to pickle and unpickle all arguments and return values sent between the main process and the workers, and it's all sent over pipes that incur system call overhead to boot. Since your file system isn't likely to gain substantial speed from parallelizing the reads, it's likely to be a net loss, not a gain.

You might be able to get a bit of a boost by switching to a thread based pool; change the import to import multiprocessing.dummy as mp and you'll get a version of Pool implemented in terms of threads; they don't work around the CPython GIL, but since this code is almost certainly I/O bound, that hardly matters, and it removes the pickling and unpickling as well as the IPC involved in worker communications.

Lastly, if you're using Python 3.3 or higher on a UNIX like system, you may be able to get the OS to help you out by having it pull files into the system cache more aggressively. If you can open the file, then use os.posix_fadvise on the file descriptor (.fileno() on file objects) with either WILLNEED or SEQUENTIAL it might improve read performance when you read from the file at some later point by aggressively prefetching file data before you request it.

mmap problem, allocates huge amounts of memory

No, what you're doing is mapping the file into memory. This is different to actually reading the file into memory.

Were you to read it in, you would have to transfer the entire contents into memory. By mapping it, you let the operating system handle it. If you attempt to read or write to a location in that memory area, the OS will load the relevant section for you first. It will not load the entire file unless the entire file is needed.

That is where you get your performance gain. If you map the entire file but only change one byte then unmap it, you'll find that there's not much disk I/O at all.

Of course, if you touch every byte in the file, then yes, it will all be loaded at some point but not necessarily in physical RAM all at once. But that's the case even if you load the entire file up front. The OS will swap out parts of your data if there's not enough physical memory to contain it all, along with that of the other processes in the system.

The main advantages of memory mapping are:

  • you defer reading the file sections until they're needed (and, if they're never needed, they don't get loaded). So there's no big upfront cost as you load the entire file. It amortises the cost of loading.
  • The writes are automated, you don't have to write out every byte. Just close it and the OS will write out the changed sections. I think this also happens when the memory is swapped out as well (in low physical memory situations), since your buffer is simply a window onto the file.

Keep in mind that there is most likely a disconnect between your address space usage and your physical memory usage. You can allocate an address space of 4G (ideally, though there may be OS, BIOS or hardware limitations) in a 32-bit machine with only 1G of RAM. The OS handles the paging to and from disk.

And to answer your further request for clarification:

Just to clarify. So If I need the entire file, mmap will actually load the entire file?

Yes, but it may not be in physical memory all at once. The OS will swap out bits back to the filesystem in order to bring in new bits.

But it will also do that if you've read the entire file in manually. The difference between those two situations is as follows.

With the file read into memory manually, the OS will swap parts of your address space (may include the data or may not) out to the swap file. And you will need to manually rewrite the file when your finished with it.

With memory mapping, you have effectively told it to use the original file as an extra swap area for that file/memory only. And, when data is written to that swap area, it affects the actual file immediately. So no having to manually rewrite anything when you're done and no affecting the normal swap (usually).

It really is just a window to the file:

     
     
     
     
memory mapped file image



Related Topics



Leave a reply



Submit