Two Processes Write to One File, Prevent Mixing The Output

Now I want to make sure that this solution is bulletproof. I cannot find any relation between the buffer size and what happens now!

For a fully buffered output stream, the buffer size determines the amount of data written with a single write(2) call. For a line buffered output stream, a line is written with a single write(2) call as long as it doesn't exceed the buffer size.
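
As an illustration, here is a minimal Python sketch (the file name and message are placeholders) that requests line buffering so each complete line is handed to the OS with a single write(2):

import os

# buffering=1 requests line buffering (text mode only); each complete line
# is flushed to the OS with a single write(2), provided it fits in the buffer.
with open("shared.log", "a", buffering=1) as f:
    f.write(f"pid {os.getpid()}: one complete line per write\n")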

If the file was open(2)ed with O_APPEND, the file offset is first set to the end of the file before writing. The adjustment of the file offset and the write operation are performed as an atomic step.
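
A minimal sketch of relying on O_APPEND from Python (the file name is a placeholder); on a local filesystem each os.write() then appends one record atomically with respect to other appenders:

import os

fd = os.open("shared.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
try:
    # With O_APPEND the seek-to-end and the write happen as one atomic step,
    # so this record lands at the end even if other processes append too.
    os.write(fd, f"pid {os.getpid()}: appended record\n".encode())
finally:
    os.close(fd)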

See also these answers:

  • Atomicity of write(2) to a local filesystem
  • Understanding concurrent file writes from multiple processes

Multiple processes write to the same CSV file, how to avoid conflict?

There is no direct way that I know of.

One common workaround is to split the responsibility between "producers" and an "outputter".

Add one extra process that is responsible for writing the CSV, reading from a multiprocessing queue, and have all the "producer" processes push their rows to that queue.

I'd advise looking at Python's multiprocessing module, and especially the part about queues. If you get stuck trying it, raise new questions here, as this can become tricky.
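
A minimal sketch of this producer/outputter split (the file name, row contents, and process counts are placeholders, not from the original answer):

import csv
import multiprocessing as mp

def producer(queue, worker_id):
    # Each producer pushes rows to the queue instead of touching the file.
    for i in range(3):
        queue.put([f"worker-{worker_id}", str(i)])

def outputter(queue, path):
    # The single outputter is the only process that writes the CSV.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        while True:
            row = queue.get()
            if row is None:  # sentinel: all producers are done
                break
            writer.writerow(row)

if __name__ == "__main__":
    q = mp.Queue()
    out = mp.Process(target=outputter, args=(q, "combined.csv"))
    out.start()
    workers = [mp.Process(target=producer, args=(q, i)) for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    q.put(None)   # tell the outputter to stop
    out.join()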

An alternative is to use a "giant lock", which requires each process to wait for the resource to become available (using a system-wide mutex, for example). This makes the code simpler but less scalable.
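
One way to sketch such a lock, assuming a POSIX system, is an advisory file lock via fcntl.flock (the file name and row are placeholders):

import csv
import fcntl
import os

def append_row(path, row):
    with open(path, "a", newline="") as f:
        fcntl.flock(f, fcntl.LOCK_EX)      # block until we own the lock
        try:
            csv.writer(f).writerow(row)
            f.flush()                       # push the row out before unlocking
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)

append_row("combined.csv", [str(os.getpid()), "some value"])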

Understanding concurrent file writes from multiple processes

Atomicity of writes less than PIPE_BUF applies only to pipes and FIFOs. For file writes, POSIX says:

This volume of POSIX.1-2008 does not specify behavior of concurrent writes to a file from multiple processes. Applications should use some form of concurrency control.

...which means that you're on your own - different UNIX-likes will give different guarantees.

What happens if two Python scripts want to write to the same file?

In general, this is not a good idea and will take a lot of care to get right. Since the writes will have to be serialized, it might also adversely affect scalability.

I'd recommend writing to separate files and merging (or just leaving them as separate files).
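
For example, a minimal merge step might look like this (the "part-*.txt" naming scheme and output name are assumptions for illustration):

import glob

# Concatenate the per-process part files into one output file.
with open("merged.txt", "w") as out:
    for part in sorted(glob.glob("part-*.txt")):
        with open(part) as f:
            out.write(f.read())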

C - Multiple processes writing to the same log file

Q: "Do I have to introduce any synchronization (locking) logic for the log file?"

A: Yes. Writing simultaneously to the same file can produce race conditions and undesired behaviour.

Q: "Given that the "critical section" is that single write() call: can multiple write() calls on the same file descriptor be "mixed" because of the OS scheduling? Is that a real risk?"

A: Yes, it is a real risk, and the interleaving in your example can happen.

To improve your code, open the log file once and keep the resulting FILE pointer around. Use a mutex lock inside writetolog.

I wrote a new version of writetolog with multiple parameter support (like printf):

See "Share condition variable & mutex between processes: does mutex have to locked before?" for how to initialize pthread_mutex_t _mutex_log_file.

#include <stdarg.h>
#include <stdio.h>
#include <pthread.h>

#define MAX_LEN_LOG_ENTRY 1024

// _log_fd is a FILE * opened once (e.g. with fopen) before any logging
static FILE *_log_fd;

// _mutex_log_file must be initialized before use; for multiple processes it
// has to live in shared memory with the PTHREAD_PROCESS_SHARED attribute
// (see the linked question above)
static pthread_mutex_t _mutex_log_file;

void writetolog (char *fmt, ...)
{
    va_list ap;
    char msg[MAX_LEN_LOG_ENTRY];

    // Format the caller's message into a local buffer
    va_start(ap, fmt);
    vsnprintf(msg, sizeof(msg), fmt, ap);
    va_end(ap);

    // Serialize access to the log file
    pthread_mutex_lock(&_mutex_log_file);

    fprintf(_log_fd, "[ LOG ] %s\n", msg);
    fflush(_log_fd);

    pthread_mutex_unlock(&_mutex_log_file);
}

An example call of writetolog:

writetolog("Testing log function: %s %s %s", "hello", "world", "good");

Can multiple programs write to STDOUT at the same time?

In a shell environment, starting various jobs in the background, all writing to stdout, has a high chance of interleaving that output, as there is no lock on stdout.

However, GNU Parallel can redirect the stdout of the various jobs it starts and prevent this interleaving. There are several command-line switches and options for this.

By default output is grouped:

--group
Group output. Output from each job is grouped together and is only printed when the command is finished: stderr (standard error) first, followed by stdout (standard output). This takes some CPU time. In rare situations GNU parallel takes up lots of CPU time, and if it is acceptable that the outputs from different commands are mixed together, then disabling grouping with -u can speed up GNU parallel by a factor of 10.

--group is the default. Can be reversed with -u.

But other options, including redirecting output to files, are also available.

Write to one output file from a few parallel LSF bsub jobs, avoiding writing at the same time

There are a couple of ways I can think of to go about this:

  1. Have each job write its output to a different file (use $LSB_JOBID inside each job to name the file). Then use another "cleanup" job to concatenate all of the output into a single file. You can use job dependencies (bsub -w) to make sure the cleanup job runs after all the other jobs are done.
  2. Implement a lock inside your "internal" job to make sure only one of them writes to the file at a time. This is a lot simpler than it might sound: one way to do it is to have each job try to create the same directory with mkdir before writing to the file, and then delete the directory after it's done. If a job fails to create the directory, it's because another one of the jobs got to it first and is currently writing to the file.

Here's a snippet illustrating #2 in bash:

# Try to get the lock, retrying every second
while ! mkdir lock &> /dev/null ; do
    sleep 1
done

# Got the lock, write to the logfile
echo blahblahblah >> "$logfile"

# Release the lock
rmdir lock

I should mention an important caveat here though: if one of your jobs dies while it's "holding the lock" (say someone sends it a kill signal at the wrong time) then it'll never remove the directory and all the other jobs won't be able to create it, so they'll just keep sleeping forever.


