Which Os/Platforms Implement Wait Morphing Optimization

Which OS / platforms implement wait morphing optimization?

It's not supported on Linux. Mark Mossberg investigated it here, and it still holds true as of glibc master today (June 9, 2022).

unlock the mutex after condition_variable::notify_all() or before?

There are no advantages to unlocking the mutex before signaling the condition variable unless your implementation is unusual. There are two disadvantages to unlocking before signaling:

If you unlock before you signal, the signal may wake a thread that choose to block on the condition variable after you unlocked. This can lead to a deadlock if you use the same condition variable to signal more than one logical condition. This kind of bug is hard to create, hard to diagnose, and hard to understand. It is trivially avoided by always signaling before unlocking. This ensures that the change of shared state and the signal are an atomic operation and that race conditions and deadlocks are impossible.
There is a performance penalty for unlocking before signaling that is avoided by unlocking after signaling. If you signal before you unlock, a good implementation will know that your signal cannot possibly render any thread ready-to-run because the mutex is held by the calling thread and any thread affects by the condition variable necessarily cannot make forward progress without the mutex. This permits a significant optimization (often called "wait morphing") that is not possible if you unlock first.

So signal while holding the lock unless you have some unusual reason to do otherwise.

Why does Python threading.Condition() notify() require a lock?

This is not a definitive answer, but it's supposed to cover the relevant details I've managed to gather about this problem.

First, Python's threading implementation is based on Java's. Java's Condition.signal() documentation reads:

An implementation may (and typically does) require that the current thread hold the lock associated with this Condition when this method is called.

Now, the question was why enforce this behavior in Python in particular. But first I want to cover the pros and cons of each approach.

As to why some think it's often a better idea to hold the lock, I found two main arguments:

From the minute a waiter acquire()s the lock—that is, before releasing it on wait()—it is guaranteed to be notified of signals. If the corresponding release() happened prior to signalling, this would allow the sequence(where P=Producer and C=Consumer) P: release(); C: acquire(); P: notify(); C: wait() in which case the wait() corresponding to the acquire() of the same flow would miss the signal. There are cases where this doesn't matter (and could even be considered to be more accurate), but there are cases where that's undesirable. This is one argument.
When you notify() outside a lock, this may cause a scheduling priority inversion; that is, a low-priority thread might end up taking priority over a high-priority thread. Consider a work queue with one producer and two consumers (LC=Low-priority consumer and HC=High-priority consumer), where LC is currently executing a work item and HC is blocked in wait().

The following sequence may occur:

P                    LC                    HC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                     execute(item)                   (in wait())
lock()                                  
wq.push(item)
release()
                     acquire()
                     item = wq.pop()
                     release();
notify()
                                                     (wake-up)
                                                     while (wq.empty())
                                                       wait();

Whereas if the notify() happened before release(), LC wouldn't have been able to acquire() before HC had been woken-up. This is where the priority inversion occurred. This is the second argument.

The argument in favor of notifying outside of the lock is for high-performance threading, where a thread need not go back to sleep just to wake-up again the very next time-slice it gets—which was already explained how it might happen in my question.

Python's `threading` Module

In Python, as I said, you must hold the lock while notifying. The irony is that the internal implementation does not allow the underlying OS to avoid priority inversion, because it enforces a FIFO order on the waiters. Of course, the fact that the order of waiters is deterministic could come in handy, but the question remains why enforce such a thing when it could be argued that it would be more precise to differentiate between the lock and the condition variable, for that in some flows that require optimized concurrency and minimal blocking, acquire() should not by itself register a preceding waiting state, but only the wait() call itself.

Arguably, Python programmers would not care about performance to this extent anyway—although that still doesn't answer the question of why, when implementing a standard library, one should not allow several standard behaviors to be possible.

One thing which remains to be said is that the developers of the threading module might have specifically wanted a FIFO order for some reason, and found that this was somehow the best way of achieving it, and wanted to establish that as a Condition at the expense of the other (probably more prevalent) approaches. For this, they deserve the benefit of the doubt until they might account for it themselves.

Which Os/Platforms Implement Wait Morphing Optimization