How to Log While Using Multiprocessing in Python

How should I log while using multiprocessing in Python?

The only way to deal with this non-intrusively is to:

  1. Spawn each worker process such that its log goes to a different file descriptor (to disk or to a pipe). Ideally, all log entries should be timestamped.
  2. Your controller process can then do one of the following:

    • If using disk files: Coalesce the log files at the end of the run, sorted by timestamp (see the sketch after this list)
    • If using pipes (recommended): Coalesce log entries on-the-fly from all pipes, into a central log file. (E.g., Periodically select from the pipes' file descriptors, perform merge-sort on the available log entries, and flush to centralized log. Repeat.)
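As a concrete illustration of option 1 (the disk-file variant), here is a minimal sketch; the helper names (worker_log_path, do_work), the four-worker count, and the file names are just assumptions for the example, not part of any prescribed API:

import logging
import multiprocessing as mp

def worker_log_path(worker_id):
    # Hypothetical naming scheme: one file per worker.
    return f"worker-{worker_id}.log"

def do_work(worker_id):
    # Each worker logs only to its own file, every entry timestamped.
    handler = logging.FileHandler(worker_log_path(worker_id))
    handler.setFormatter(logging.Formatter('%(asctime)s %(processName)s %(message)s'))
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.info("working on something")

if __name__ == '__main__':
    workers = [mp.Process(target=do_work, args=(i,)) for i in range(4)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
    # Coalesce at the end of the run: each line begins with a timestamp,
    # so a plain lexicographic sort yields chronological order.
    lines = []
    for i in range(4):
        with open(worker_log_path(i)) as f:
            lines.extend(f.readlines())
    with open("combined.log", "w") as out:
        out.writelines(sorted(lines))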

Does python logging support multiprocessing?

As Matino correctly explained: logging in a multiprocessing setup is not safe, as multiple processes (which know nothing about each other) are writing into the same file and can potentially interfere with each other.

Now what happens is that every process holds an open file handle and does an "append write" into that file. The question is under what circumstances the append write is "atomic" (that is, cannot be interrupted by, e.g., another process writing to the same file and intermingling its output). This problem applies to every programming language, as in the end they'll make a syscall to the kernel. This answer explains under which circumstances a shared log file is OK.

It comes down to checking your pipe buffer size. On Linux it is defined in /usr/include/linux/limits.h and is 4096 bytes. For other OSes you can find a good list here.

That means: if your log line is less than 4096 bytes (on Linux), then the append is safe, provided the disk is directly attached (i.e. no network in between). For more details please check the first link in my answer. To test this you can do logger.info('proc name %s id %s %s' % (proc.name, proc.pid, str(proc.name)*5000)) with different lengths. With 5000, for instance, I already got mixed-up log lines in /tmp/test.log.
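If you want to reproduce that experiment, a rough sketch along those lines follows; here the padding is simply a repeated character of varying length, the /tmp/test.log path matches the example above, and the exact threshold and interleaving behaviour depend on your OS and filesystem:

import logging
import multiprocessing as mp

def writer(n_chars):
    # Every process appends to the same file through its own handler.
    logging.basicConfig(filename='/tmp/test.log', level=logging.INFO,
                        format='%(process)d %(message)s')
    proc = mp.current_process()
    # Lines well below ~4096 bytes usually stay intact; much longer ones may interleave.
    logging.info('proc name %s id %s %s' % (proc.name, proc.pid, 'x' * n_chars))

if __name__ == '__main__':
    procs = [mp.Process(target=writer, args=(n,)) for n in (100, 5000, 10000)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()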

In this question there are already quite a few solutions to this, so I won't add my own solution here.

Update: Flask and multiprocessing

Web frameworks like Flask will run in multiple workers if hosted by uWSGI or nginx. In that case, multiple processes may write into one log file. Will that cause problems?

Error handling in Flask is done via stdout/stderr, which is then caught by the webserver (uWSGI, nginx, etc.), which in turn needs to take care that logs are written in the correct fashion (see e.g. [this Flask+nginx example](http://flaviusim.com/blog/Deploying-Flask-with-nginx-uWSGI-and-Supervisor/)), probably also adding process information so you can associate error lines with processes. From the Flask docs:

By default as of Flask 0.11, errors are logged to your webserver’s log automatically. Warnings however are not.

So you'd still have the issue of intermingled log lines if you use warn and the message exceeds the pipe buffer size.
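If you want the process information in the lines Flask sends to stderr (so the webserver's captured log can be traced back to a worker), a minimal sketch could attach a handler whose format includes the pid; the handler and format chosen here are assumptions for illustration, not something Flask or uWSGI mandates:

import logging
import sys
from flask import Flask

app = Flask(__name__)

# Log to stderr so uWSGI/nginx capture it; include pid and process name per line.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(logging.Formatter(
    '%(asctime)s pid=%(process)d %(processName)s %(levelname)s %(message)s'))
app.logger.addHandler(handler)
app.logger.setLevel(logging.WARNING)

@app.route('/')
def index():
    app.logger.warning("a warning that should be attributable to one worker")
    return "ok"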

Python: How to use a different logfile for each item in multiprocessing.Pool?

To use a separate logfile for every processed item (of the iterable you pass to pool.map()) from inside the workers, nothing multiprocessing-specific is really needed. But you can use Pool's initializer() to set up the root logger inside the workers and store some metadata for logging. Under the hood, the arguments to Pool(initializer) and Pool(initargs) end up being passed to Process(target) and Process(args) when the new worker processes are created.
Then you only have to exchange the FileHandler for every processed item, which log_on_disk() does inside the workers.

import logging
import multiprocessing as mp


def log_on_disk(x):
    logger = _init_logger(file_id=x)
    logger.info("hello world")


def _init_logging(level=logging.INFO, mode='a'):
    fmt = logging.Formatter(
        '%(asctime)s %(processName)-10s %(name)s %(levelname)-8s --- %(message)s'
    )
    logger = logging.getLogger()
    logger.setLevel(level)
    globals()['_log_meta'] = {'mode': mode, 'fmt': fmt}


def _init_logger(file_id):
    logger = logging.getLogger()
    logger.handlers = []  # remove all handlers
    fh = logging.FileHandler(f"{file_id}.log", mode=_log_meta['mode'])
    fh.setFormatter(_log_meta['fmt'])
    logger.addHandler(fh)
    return logger


if __name__ == '__main__':

    with mp.Pool(5, initializer=_init_logging, initargs=(logging.DEBUG,)) as pool:
        pool.map(log_on_disk, ["ping", "pong", "foo", "bar"])

Python: How to use different logfiles for processes in multiprocessing.Pool?

You'll need to use Pool's initializer() to set up and register the separate loggers immediately after the workers start up. Under the hood, the arguments to Pool(initializer) and Pool(initargs) end up being passed to Process(target) and Process(args) when the new worker processes are created...

Pool workers get named in the format {start_method}PoolWorker-{number}, e.g. SpawnPoolWorker-1 if you use spawn as the start method for new processes.
The file number for the logfiles can then be extracted from the assigned worker names with mp.current_process().name.split('-')[1].

import logging
import multiprocessing as mp


def f(x):
    logger.info(f"x*x={x*x}")
    return x*x


def _init_logging(level=logging.INFO, mode='a'):
    worker_no = mp.current_process().name.split('-')[1]
    filename = f"log{worker_no}.log"
    fh = logging.FileHandler(filename, mode=mode)
    fmt = logging.Formatter(
        '%(asctime)s %(processName)-10s %(name)s %(levelname)-8s --- %(message)s'
    )
    fh.setFormatter(fmt)
    logger = logging.getLogger()
    logger.addHandler(fh)
    logger.setLevel(level)
    globals()['logger'] = logger


if __name__ == '__main__':

    with mp.Pool(5, initializer=_init_logging, initargs=(logging.DEBUG,)) as pool:
        print(pool.map(f, range(10)))

Note that due to the nature of multiprocessing, there's no guarantee of the exact number of files you end up with in your small example.
Since multiprocessing.Pool (contrary to concurrent.futures.ProcessPoolExecutor) starts workers as soon as you create the instance, you're bound to get the specified Pool(processes) number of files, so in your case 5. Actual thread/process scheduling by your OS might cut this number short here, though.
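If you want to see the difference in worker start-up for yourself, a rough check is sketched below, using multiprocessing.active_children() as a proxy for how many worker processes exist; the printed counts are typical values and can vary by platform and timing, and pow is just a stand-in task:

import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    # multiprocessing.Pool starts its workers eagerly, at construction time.
    pool = mp.Pool(5)
    print("Pool workers right after creation:", len(mp.active_children()))      # typically 5
    pool.terminate()
    pool.join()

    # ProcessPoolExecutor spawns workers lazily, as tasks are submitted.
    with ProcessPoolExecutor(5) as executor:
        print("Executor workers before any task:", len(mp.active_children()))   # typically 0
        executor.submit(pow, 2, 10).result()
        print("Executor workers after one task:", len(mp.active_children()))    # at least 1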

Logging using multiprocessing

From the Python docs: "logging to a single file from multiple processes is not supported, because there is no standard way to serialize access to a single file across multiple processes in Python."

See: http://docs.python.org/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
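The cookbook linked above suggests funnelling all records through a single process, e.g. with a QueueHandler in each worker and a QueueListener doing the actual writing. A minimal sketch of that idea (using a Manager queue so it can be handed to Pool workers; the file name and format are arbitrary choices for the example):

import logging
import logging.handlers
import multiprocessing as mp

def worker_init(queue):
    # Workers only enqueue records; nothing in a worker writes to the file.
    root = logging.getLogger()
    root.handlers = []
    root.addHandler(logging.handlers.QueueHandler(queue))
    root.setLevel(logging.INFO)

def work(x):
    logging.getLogger(__name__).info("processing %s", x)
    return x * x

if __name__ == '__main__':
    manager = mp.Manager()
    queue = manager.Queue()
    # The listener lives in the main process and is the only writer to the file.
    file_handler = logging.FileHandler("combined.log")
    file_handler.setFormatter(logging.Formatter(
        '%(asctime)s %(processName)-10s %(levelname)-8s %(message)s'))
    listener = logging.handlers.QueueListener(queue, file_handler)
    listener.start()
    try:
        with mp.Pool(4, initializer=worker_init, initargs=(queue,)) as pool:
            pool.map(work, range(10))
    finally:
        listener.stop()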

BTW: What I do in this situation is use Scribe, a distributed logging aggregator that I log to via TCP. This lets me log from all of my servers to the same place, not just from all processes.

See this project: http://pypi.python.org/pypi/ScribeHandler

Python Multiprocessing returning results with Logging and running frozen on Windows

The basics

On Linux, a child process is created with the fork method by default. That means the child process inherits almost everything from the parent process.

On Windows, the child process is created with the spawn method.
That means a child process starts almost from scratch: it re-imports and re-executes any code that is outside of the if __name__ == '__main__' guard clause.
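A tiny script to observe this difference; the names here are only illustrative, and which behaviour you see depends on the start method your platform uses:

import multiprocessing as mp

# Module level: under spawn this line runs again in every child,
# under fork it runs only once, in the parent.
print(f"module imported in process {mp.current_process().name}")

logger = None  # stand-in for a logger that only gets configured in the parent

def child():
    # Under fork, the child inherits whatever the parent set up;
    # under spawn, it only sees the re-imported module state (logger is None here).
    print("child sees logger =", logger)

if __name__ == '__main__':
    logger = "configured-in-parent"   # stand-in for a real logging setup
    p = mp.Process(target=child)
    p.start()
    p.join()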

Why it worked or failed

On Linux, because the logger object is inherited, your program will start logging.
But it is far from perfect, since each process logs directly to the file.
Sooner or later, log lines will overlap or file I/O errors will occur due to race conditions between processes.

On Windows, since you didn't pass the logger object to the child process and it re-imports your pymp_global module, logger is None. So when you try to log with a None object, it crashes.

The solution

Logging with multiprocessing is not an easy task.
For it to work on Windows, you must pass a logger object to child processes and/or log with a QueueHandler. A similar solution for inter-process communication is to use a SocketHandler.

The idea is that only one thread or process does the logging. The other processes just send their log records. This prevents race conditions and ensures the log is written out once the process responsible for logging gets time to do its job.

So how to implement it?

I have encountered this logging problem before and have already written the code.

You can just use the logger-tt package.

#pymp.py
from logging import getLogger
from logger_tt import setup_logging

setup_logging(use_multiprocessing=True)
logger = getLogger(__name__)
# other code below

For other modules

#pymp_common.py
from logging import getLogger

logger = getLogger(__name__)
# other code below

This saves you from writing all the logging config code everywhere manually.
You may consider changing the log_config file to suit your needs.


