What Is the Use of Join() in Python Threading

What is the use of join() in Python threading?

Here is a somewhat clumsy piece of ASCII art to demonstrate the mechanism. The join() is presumably called by the main thread. It could also be called by another thread, but that would needlessly complicate the diagram.

Strictly speaking, the join() call belongs on the main thread's track, but to show the relationship between the threads and keep the diagram as simple as possible, I have placed it on the child thread's track instead.

without join:
+---+---+------------------                     main-thread
    |   |
    |   +...........                            child-thread(short)
    +..................................         child-thread(long)

with join:
+---+---+------------------***********+###      main-thread
    |   |                             |
    |   +...........join()            |         child-thread(short)
    +......................join()......         child-thread(long)

with join and daemon thread:
+-+--+---+------------------***********+###     parent-thread
  |  |   |                             |
  |  |   +...........join()            |        child-thread(short)
  |  +......................join()......        child-thread(long)
  +,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,     child-thread(long + daemonized)

'-' main-thread/parent-thread/main-program execution
'.' child-thread execution
'#' optional parent-thread execution once the join() calls have returned and
    the parent-thread can continue
'*' main-thread 'sleeping' in the join() method, waiting for the child-thread
    to finish
',' daemonized thread: 'ignores' the lifetime of other threads, terminates
    when the main program exits, and is normally meant for join-independent
    tasks

So the reason you don't see any difference is that your main thread does nothing after your join().
You could say join() is (only) relevant for the execution flow of the thread that calls it, here the main thread.

If, for example, you want to concurrently download a bunch of pages to concatenate them into a single large page, you may start concurrent downloads using threads, but need to wait until the last page/thread is finished before you start assembling a single page out of many. That's when you use join().
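The download scenario above can be sketched roughly as follows. This is a minimal illustration, not a real downloader: fetch_page is a hypothetical stand-in that sleeps instead of touching the network, and the page IDs and result format are invented for the example.

```python
import threading
import time

def fetch_page(page_id, results):
    """Hypothetical stand-in for a real download: sleeps, then stores text."""
    time.sleep(0.1)  # simulate network latency
    results[page_id] = f"<page {page_id}>"

def download_and_concatenate(page_ids):
    results = {}
    threads = [
        threading.Thread(target=fetch_page, args=(page_id, results))
        for page_id in page_ids
    ]
    for t in threads:
        t.start()          # all downloads now run concurrently
    for t in threads:
        t.join()           # wait until every page has arrived
    # Only after the joins is it safe to assemble the single large page.
    return "".join(results[page_id] for page_id in page_ids)

combined = download_and_concatenate([1, 2, 3])
print(combined)
```

Without the second loop, the final "".join(...) could run while some entries are still missing from results.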

When, why, and how to call thread.join() in Python?

Short answer: this one:

for t in ts:
    t.join()

is generally the idiomatic way to wait for a small number of threads. Calling .join() means that your main thread waits until the given thread finishes before proceeding. You generally do this after you've started all of the threads.

Longer answer:

len(list(range(500001, 500000*2, 100)))
Out[1]: 5000

You're trying to start 5000 threads at once. It's miraculous your computer is still in one piece!

Your method of calling .join() inside the loop that dispatches workers is never going to have more than 2 threads (i.e. only one worker thread) going at once. Your main thread has to wait for each worker thread to finish before moving on to the next one. You've prevented a computer meltdown, but your code is going to be no faster, and probably slower, than if you'd never used threading in the first place!
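To make the difference concrete, here is a small timing sketch. The work function is a hypothetical placeholder that just sleeps, and the thread count of 4 is arbitrary; the point is only to compare joining inside the dispatch loop with joining after all threads have been started.

```python
import threading
import time

def work():
    time.sleep(0.2)  # placeholder for a blocking task

def run_join_inside_loop(n):
    start = time.monotonic()
    for _ in range(n):
        t = threading.Thread(target=work)
        t.start()
        t.join()  # blocks here, so the next thread cannot start yet
    return time.monotonic() - start

def run_join_after_loop(n):
    start = time.monotonic()
    threads = [threading.Thread(target=work) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # all threads are already running while we wait
    return time.monotonic() - start

serial_time = run_join_inside_loop(4)
concurrent_time = run_join_after_loop(4)
print(f"join inside loop: {serial_time:.2f}s")
print(f"join after loop:  {concurrent_time:.2f}s")
```

The first variant takes roughly the sum of the individual sleeps, while the second takes roughly the duration of the longest one.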

At this point I'd talk about the GIL, but I'll put that aside for the moment. What you need in order to keep your thread creation within a reasonable limit (i.e. more than one, fewer than 5000) is a thread pool. There are various ways to do this. You could roll your own, which is fairly simple with a threading.Semaphore. You could use the concurrent.futures package (Python 3.2+). You could use some third-party solution. It's up to you; each has a different API, so I can't really discuss that further.
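As one possible illustration of the concurrent.futures route, the sketch below caps the number of worker threads at 8. The process function and the pool size are assumptions made up for the example; the input range is the one from the question.

```python
from concurrent.futures import ThreadPoolExecutor

def process(n):
    """Hypothetical worker; stands in for whatever each thread was doing."""
    return n * 2

inputs = range(500001, 500000 * 2, 100)  # the same 5000 work items

# At most 8 threads exist at any time; the pool reuses them for all items.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process, inputs))

# Leaving the with-block waits for all submitted work, much like join().
print(len(results))
```

Here the pool takes over both the starting and the joining, so no explicit .join() calls are needed.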


Obligatory GIL Discussion

CPython programmers have to live with the GIL. The Global Interpreter Lock, in short, means that only one thread can be executing Python bytecode at once. This means that on processor-bound tasks (like adding a bunch of numbers), threading will not result in any speed-up. In fact, the overhead involved in setting up and tearing down threads (not to mention context switching) will result in a slowdown. Threading is better positioned to provide gains on I/O-bound tasks, such as retrieving a bunch of URLs.

multiprocessing and friends sidestep the GIL limitation by, well, using multiple processes. This isn't free: data transfer between processes is expensive, so a lot of care needs to be taken not to write workers that depend on shared state.

Use of threading.Thread.join()

A call to thread1.join() blocks the thread in which you're making the call, until thread1 is finished. It's like wait_until_finished(thread1).

For example:

import time
from threading import Thread

def printer():
    for _ in range(3):
        time.sleep(1.0)
        print("hello")

thread = Thread(target=printer)
thread.start()
thread.join()  # block until printer() has finished
print("goodbye")

prints

hello
hello
hello
goodbye

Without the .join() call, goodbye would come first, followed by the three hellos.

Also, note that threads in Python do not provide any additional performance (in terms of CPU processing power) because of the Global Interpreter Lock. They are useful for spawning off potentially blocking (e.g. I/O, network) and time-consuming tasks (e.g. number crunching) so that the main thread stays free for other work, but they do not allow you to leverage multiple cores or CPUs. For that, look at multiprocessing, which uses separate processes but exposes an API similar to that of threading.

PLUG: for the same reason, if you're interested in concurrency, you might also want to look into a fine library called gevent, which essentially makes concurrency much easier to use, much faster (when you have many concurrent activities) and less prone to concurrency-related bugs, while allowing you to keep coding much the same way as with "real" threads. Twisted, Eventlet, Tornado and many others are either equivalent or comparable. In any case, I'd strongly suggest reading these classics:

  • Generator Tricks for Systems Programmers
  • A Curious Course on Coroutines and Concurrency

What is the purpose of using Join with threading

You're right, if you call join() immediately after starting a thread, it defeats the purpose of having a thread, since now you've got a child-thread running but your main-thread is blocked until the child thread returns and therefore you still don't have any parallelism.

However, join() was not intended to be used that way. Instead, it's expected that you'll start() one or more threads, and then the main thread will either continue on doing (whatever it usually does) or alternatively it will then call join() on each of the launched threads in order to block until all of the threads have exited. In either of those two cases, you have still achieved effective parallelism (Python GIL notwithstanding).

The real purpose of join(), however, is to allow you to free up resources safely. For one thing, in lower-level threading APIs (POSIX threads, for example) some underlying resources associated with each thread, such as its return value, must be retained in memory until join() (or detach()) is called, in case the parent thread wants to use them. Python's threading.Thread has no detach() method and discards the target function's return value, so this point matters less there. More importantly, if the parent thread has allocated some resource that the child thread has access to, then it's generally not safe for the parent thread to free that resource until after the child thread has exited, since destroying it while the child thread is in the middle of using it would cause big problems for the child thread.

Similarly, if the child thread is working on preparing some data for the parent thread to use, it's not safe for the parent thread to try to use that data until after the child thread has finished preparing it -- there's no point in trying to use half-constructed data.

Given that, it's common for the parent thread to call join() to wait until the child thread has exited before doing any cleanup work that would affect the child thread.

If the child thread isn't designed to exit on its own within a finite period of time, the main thread can ask it to exit before making the join() call, e.g. by setting a boolean variable or writing a byte to a pipe. The child thread reacts to that request by exiting, so that the join() call doesn't block indefinitely.
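One way to sketch this request-then-join pattern in Python is with a threading.Event in place of the plain boolean mentioned above (a substitution on my part, chosen because an Event can also be waited on). The worker body and the timings are invented for the example.

```python
import threading
import time

stop_requested = threading.Event()
ticks = []

def worker():
    # Loop until the main thread asks us to stop.
    while not stop_requested.is_set():
        ticks.append(time.monotonic())
        stop_requested.wait(0.05)  # sleep, but wake early if stop is requested

t = threading.Thread(target=worker)
t.start()

time.sleep(0.2)        # the main thread does other work for a while
stop_requested.set()   # ask the worker to exit ...
t.join(timeout=2.0)    # ... and only then wait for it
print("worker still alive:", t.is_alive())
```

Because the worker checks the event on every iteration, the join() returns shortly after set() is called instead of blocking forever.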

What exactly does Thread.join() do in Python? Is this incorrect usage of Thread.join()?

would the "calling thread" be whichever thread called the loadFunction function?

The calling thread is whichever thread called join(). So t1.join(), t2.join() and t3.join() cause the main thread to block, and the join inside loadFunction would cause t3 to block if map were not lazily evaluated.

how would I fix it?

Your joins inside loadFunction aren't executing because map does not execute any code until you iterate over it. As MaxNoe suggests, you should use an ordinary for loop instead.

def loadFunction(name, start, end, wait=()):
    """ wait should be a sequence of threads to wait for """
    # Join eagerly with a plain loop; a bare map() would never run the joins.
    for t in wait:
        t.join()
    for number in range(start, end):
        print("%s : %d" % (name, number))
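To see why the map-based joins never ran, the following sketch uses a hypothetical FakeThread class that merely records calls to join(). It is an illustration of Python 3's lazy map, not the code from the question.

```python
joined = []

class FakeThread:
    """Hypothetical stand-in for a thread; records when join() is called."""
    def __init__(self, name):
        self.name = name

    def join(self):
        joined.append(self.name)

wait = [FakeThread("t1"), FakeThread("t2")]

lazy = map(lambda t: t.join(), wait)  # Python 3: builds an iterator, runs nothing
calls_after_map = list(joined)

for t in wait:                        # the plain loop calls join() immediately
    t.join()
calls_after_loop = list(joined)

print(calls_after_map)
print(calls_after_loop)
```

The first list stays empty because nothing ever iterates over the map object, while the loop records both joins.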

What happens if I don't join() a python thread?

A Python thread is just a regular OS thread. If you don't join it, it still keeps running concurrently with the current thread. It will eventually die, when the target function completes or raises an exception. There is no such thing as "thread reuse": once a thread is dead, it rests in peace.
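A small sketch of that lifecycle, with an invented task function and arbitrary sleep durations: the thread is never joined, finishes on its own, and cannot be started a second time.

```python
import threading
import time

def task():
    time.sleep(0.1)

t = threading.Thread(target=task)
t.start()
alive_while_running = t.is_alive()  # checked right after start()

time.sleep(0.5)                     # never joined; just give it time to finish
alive_after_finish = t.is_alive()

try:
    t.start()                       # a finished thread cannot be restarted
    restarted = True
except RuntimeError:
    restarted = False

print(alive_while_running, alive_after_finish, restarted)
```

Calling start() twice on the same Thread object raises RuntimeError, which is why a new Thread has to be created for each run.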

Unless the thread is a "daemon thread" (set via the daemon constructor argument or by assigning the daemon property), it will be implicitly joined before the program exits. A daemon thread, by contrast, is killed abruptly at exit.

One thing to remember when writing multithreaded programs in Python is that they are of limited use because of the infamous Global Interpreter Lock. In short, using threads won't make your CPU-intensive program any faster. They are useful only when you do something that involves waiting (e.g. a thread that waits for a certain file system event to happen).

Python multi-thread join with timeout

Yes, you can calculate an absolute timeout, and recompute your remaining relative timeout before every join:

import time

# Join the_threads, waiting no more than some_relative_timeout seconds
# in total to attempt to join all of them.
# (the_threads and some_relative_timeout are supplied by the caller.)

absolute_timeout = time.monotonic() + some_relative_timeout

for thr in the_threads:
    timeout = absolute_timeout - time.monotonic()
    if timeout < 0:
        break
    thr.join(timeout)
    if thr.is_alive():
        pass  # we timed out
    else:
        pass  # we didn't
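The same idea can be wrapped in a helper and exercised end to end. This is a sketch under assumptions: join_all is a name invented here, and the two sleeping threads only exist to show one thread finishing in time and one missing the deadline.

```python
import threading
import time

def join_all(threads, total_timeout):
    """Join every thread, spending at most total_timeout seconds overall.

    Returns the list of threads that were still running afterwards.
    """
    deadline = time.monotonic() + total_timeout
    for thr in threads:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        thr.join(remaining)
    return [thr for thr in threads if thr.is_alive()]

fast = threading.Thread(target=time.sleep, args=(0.05,))
slow = threading.Thread(target=time.sleep, args=(1.0,), daemon=True)
fast.start()
slow.start()

unfinished = join_all([fast, slow], total_timeout=0.3)
print(len(unfinished), "thread(s) missed the deadline")
```

Recomputing the remaining time before each join() keeps the total wait bounded no matter how many threads are in the list.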

Now, whether or not you *should* do this is a bit opaque. It may be better to have "daemon" worker threads communicate their completion by other means: a global state table, pushes of "done" messages to a queue, etc.
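As a rough sketch of the queue-based alternative, each daemon worker below pushes a "done" message (here simply its name) onto a queue.Queue, and the main thread collects messages until an overall deadline. The worker names, durations and the 0.5 second deadline are all made up for the example.

```python
import queue
import threading
import time

done = queue.Queue()

def worker(name, duration):
    time.sleep(duration)
    done.put(name)  # announce completion instead of relying on join()

workers = {"a": 0.05, "b": 0.1, "c": 5.0}
for name, duration in workers.items():
    threading.Thread(target=worker, args=(name, duration), daemon=True).start()

finished = set()
deadline = time.monotonic() + 0.5
while len(finished) < len(workers):
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        break
    try:
        finished.add(done.get(timeout=remaining))
    except queue.Empty:
        break  # deadline reached with workers still running

print(sorted(finished))
```

Unlike a sequence of joins, this reports workers in the order they actually finish, and the slow worker never holds up the results of the fast ones.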
