How Often Does Python Flush to a File

How often does Python flush to a file?

For file operations, Python uses the operating system's default buffering unless you configure it to do otherwise. You can specify a buffer size, request unbuffered output, or request line buffering.

For example, the open function takes a buffer size argument.

http://docs.python.org/library/functions.html#open

"The optional buffering argument specifies the file’s desired buffer size:"

  • 0 means unbuffered,
  • 1 means line buffered,
  • any other positive value means use a buffer of (approximately) that size.
  • A negative buffering means to use the system default, which is usually line buffered for tty devices and fully buffered for other files.
  • If omitted, the system default is used.

code:

bufsize = 0
f = open('file.txt', 'wb', buffering=bufsize)  # in Python 3, unbuffered (0) requires binary mode
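
For comparison, a minimal line-buffering sketch (the file name is illustrative); with buffering=1 in text mode, the buffer is flushed each time a newline is written:

bufsize = 1
f = open('file.txt', 'w', buffering=bufsize)  # line buffered (text mode)
f.write('one line\n')  # the newline pushes this write out to the operating system
f.close()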

What exactly is file.flush() doing?

There are typically two levels of buffering involved:

  1. Internal buffers
  2. Operating system buffers

The internal buffers are buffers created by the runtime/library/language that you're programming against, and they are meant to speed things up by avoiding system calls for every write. Instead, when you write to a file object, you write into its buffer, and whenever the buffer fills up, the data is written to the actual file using system calls.

However, due to the operating system buffers, this might not mean that the data is written to disk. It may just mean that the data is copied from the buffers maintained by your runtime into the buffers maintained by the operating system.

If you write something, and it ends up in the buffer (only), and the power is cut to your machine, that data is not on disk when the machine turns off.

So, to help with that, you have the flush and fsync methods on their respective objects.

The first, flush, will simply write out any data that lingers in a program buffer to the actual file. Typically this means that the data will be copied from the program buffer to the operating system buffer.

Specifically what this means is that if another process has that same file open for reading, it will be able to access the data you just flushed to the file. However, it does not necessarily mean it has been "permanently" stored on disk.

To do that, you need to call the os.fsync method, which ensures all operating system buffers are synchronized with the storage devices they're for; in other words, it copies the data from the operating system buffers to the disk.

Typically you don't need to bother with either method, but if you're in a scenario where paranoia about what actually ends up on disk is a good thing, you should make both calls as instructed.
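
As a minimal sketch of making both calls (the file name is illustrative):

import os

with open('important.log', 'a') as f:
    f.write('transaction committed\n')
    f.flush()              # program buffer -> operating system buffer
    os.fsync(f.fileno())   # operating system buffer -> storage device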


Addendum in 2018.

Note that disks with cache mechanisms are now much more common than back in 2013, so there are now even more levels of caching and buffering involved. I assume these caches will be handled by the sync/flush calls as well, but I don't really know.

Is it necessary to call the flush method of a file handle in Python?

The file objects in the io module (the ones you get from open) and everywhere else you'd expect in the stdlib always flush when they close, or rely on platform APIs that are guaranteed to do so.

Even third-party libraries are required to "close and flush the stream" in their close methods if they want their objects to be file objects.[1]


The main reason to call flush is when you're not closing the file yet, but some other program might want to see the contents.


For example, a lot of people write code like this:

import time

with open('dump.txt', 'w') as f:
    while True:
        buf = read_off_some_thingy()  # placeholder for whatever produces the bytes
        f.write(buf.decode())
        time.sleep(5)

… and then they wonder why when they cat dump.txt or open it in Notepad or whatever, it's empty, or missing the last 3 lines, or cuts off in the middle of a line. That's the problem flush solves:

import time

with open('dump.txt', 'w') as f:
    while True:
        buf = read_off_some_thingy()  # placeholder for whatever produces the bytes
        f.write(buf.decode())
        f.flush()
        time.sleep(5)

Or, alternatively, they're running the same code, but the problem is that someone might pull the plug on the computer (or, more likely nowadays, kill your container), and then after a restart they'll have a corrupt file that cuts off mid-line, and now the Perl script that scans the output won't run, and nobody wants to debug Perl code. Different problem, same solution.


But if you know for a fact that the file is going to be closed by some point (say, because there's a with statement that ends before then), and you don't need the file to be complete before that point, you don't need to call flush.


You didn't mention fsync, which is a whole other issue, and a whole lot more complicated than most people think, so I won't get into it. But the question you linked already covers the basics.


[1] There's always the chance that you're using some third-party library with a file-like object that duck-types close enough to a file object for your needs, but isn't one. And such a type might have a close that doesn't flush. But I honestly don't think I've ever seen an object that had a flush method but didn't call it on close.

Does close() imply flush() in Python?

Yes. It uses the underlying close() function, which flushes for you before the file handle is released.
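
A quick check (assumes a writable working directory; the file name is arbitrary):

f = open('demo.txt', 'w')
f.write('hello')
f.close()  # flushes, so the data is visible to any reader from here on

with open('demo.txt') as g:
    print(g.read())  # prints: hello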

Why are the contents of the file written only after it is closed?

You are seeing the effects of buffering. Disk I/O uses buffers to improve performance, and you have not written enough data to the buffer for it to flush.

Writing more data or closing the file will both cause the buffer to be flushed. Alternatively, set the buffer size to a very small number (the number of bytes the buffer will hold):

with open('test.txt', 'w', 2) as ffile:
    ffile.write('hello')  # with a ~2-byte buffer, writes reach the operating system almost immediately

The options 0 and 1 have special meanings: 0 disables buffering altogether (only available for binary-mode files), and 1 means line buffering (text mode only), where writing a newline flushes the buffer. Line buffering is also the default for text files attached to a terminal.

That also means that if you have a line-buffered text file, you can write a newline to trigger a flush:

ffile.write('\n')

Last but not least, you could flush explicitly by using the file.flush() method:

ffile.flush()
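
Putting the three options together in one sketch (the file name is arbitrary; assumes a text-mode file):

with open('test.txt', 'w', buffering=1) as ffile:  # line buffered
    ffile.write('partial')   # may sit in the buffer for now
    ffile.write(' line\n')   # the newline flushes a line-buffered file
    ffile.write('more data')
    ffile.flush()            # explicit flush, regardless of buffering
# leaving the with block closes the file, flushing whatever is left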

How to flush the file write buffer in a running Python process with no file.close() statement?

1) Get the PID of your Python process

pgrep python

2) List file descriptors

ls -l /proc/{PID}/fd

3) Open gdb and attach to the process

$ gdb
(gdb) attach {PID}
(gdb) call fflush(0)
(gdb) detach

Note that fflush expects a FILE* stream pointer, not a file descriptor number; passing 0 (i.e. NULL) flushes every open C stdio stream. This works on C-level stdio buffers (as used by Python 2 file objects); buffers kept inside Python 3's io module are not guaranteed to be flushed this way.

4) Check your file
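
If you control the source and can plan ahead, an alternative to the gdb recipe is to build a flush hook into the program. This sketch (names are illustrative, Unix only) flushes a log file whenever the process receives SIGUSR1, so kill -USR1 <pid> forces a flush without attaching a debugger:

import signal

log = open('out.log', 'w')  # the file you want to be able to flush on demand

def flush_on_signal(signum, frame):
    log.flush()  # push buffered data out to the operating system

signal.signal(signal.SIGUSR1, flush_on_signal)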

What does it mean to flush file contents in Python?

Python buffers writes to files. That is, file.write returns before the data is actually written to your hard drive. The main motivation for this is that a few large writes are much faster than many tiny writes, so by saving up the output of file.write until a bit has accumulated, Python can maintain good writing speeds.

file.flush forces the data to be written out at that moment. This is handy when you know that it might be a while before you have more data to write out, but you want other processes to be able to view the data you've already written. Imagine a log file that grows slowly. You don't want to have to wait ages before enough entries have built up to cause the data to be written out in one big chunk.

In either case, file.close causes the remaining data to be flushed, so "quux" in your code will be written out as soon as file (which is a really bad name as it shadows the builtin file constructor) falls out of scope of the with context manager and gets closed.

Note: your OS does some buffering of its own, but I believe every OS where Python is implemented will honor file.flush's request to write data out to the drive. Someone please correct me if I'm wrong.

By the way, "no-op" means "no operation", as in it won't actually do anything. For example, StringIO objects manipulate strings in memory, not files on your hard drive. StringIO.flush probably just immediately returns because there's not really anything for it to do.
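
A tiny illustration with the standard library's io.StringIO:

import io

buf = io.StringIO()
buf.write('hello')
buf.flush()            # effectively a no-op: the data is already in memory
print(buf.getvalue())  # prints: hello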

Side-effect of flush in writing file

Side effects? I don't exactly understand what you mean, but let me have a go at it anyway.

As noted above, Python uses the operating system's default buffering unless you configure it to do otherwise. So if you are constantly calling flush, there is constant I/O going on, and if you are flushing out large amounts of data (i.e. the buffer is big), this can slow down other running programs, which may end up waiting for I/O.

Fast and frequent I/O operations are not good for the life of a hard disk; they increase the chances of a disk crash.

Typically, the pattern I follow is: once all the writing to the file object is done, flush once at the end, just before closing the file.

Something for you to think about: are there other threads or programs reading from the same file while you are writing it? If so, you might get into trouble, and corrupt files are very much a possibility. If you are considering using a file as a persistent data store, that's the wrong way to do it; consider a persistent database (like MySQL or even SQLite) instead, as sketched below.
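
For instance, the standard library's sqlite3 module gives you durable writes without any manual flush bookkeeping (the file and table names are illustrative):

import sqlite3

conn = sqlite3.connect('store.db')
conn.execute('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)')
conn.execute('INSERT OR REPLACE INTO kv VALUES (?, ?)', ('key', 'value'))
conn.commit()  # the database handles flushing and syncing for you
conn.close()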


