On Linux, How to Make Sure to Unlock a Mutex Which Was Locked in a Thread That Dies/Terminates


A robust mutex can be used to handle the case where the owner of the mutex is terminated while holding the mutex lock, so that a deadlock does not occur. These have more overhead than a regular mutex, and require that all clients locking the mutex be prepared to handle the error code EOWNERDEAD. This indicates that the former owner has died and that the client receiving this error code is the new owner and is responsible for cleaning up any inconsistent state.

A robust mutex is a mutex with the robust attribute set. It is set using the POSIX.1-2008 standard function pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST).

Further details and example code can be found on the Linux manual page for pthread_mutexattr_getrobust.

What to do about a global mutex locked from a thread that then is killed

As @meaning-matters has suggested, this should be possible, at least on Linux, by intercepting the real-time signal 32 and using pthread_cancel, which is implemented using signals.

Force unlock a mutex that was locked by a different thread

I have come up with a workable method to deal with this situation. As I mentioned before, FreeBSD does not support robust mutexes, so that option is out. Also, once a thread has locked a mutex, no other thread can unlock it by any means.

So what I have done to solve the problem is to abandon the mutex and place its pointer onto a list. Since the lock wrapper code uses pthread_mutex_trylock and then relinquishes the CPU if it fails, no thread can get stuck waiting for a permanently locked mutex. In the case of a robust mutex, the thread locking the mutex will be able to recover it if it gets EOWNERDEAD as the return code.

Here are some of the definitions:

/* Checks to see if we have access to robust mutexes. */
#ifndef PTHREAD_MUTEX_ROBUST
#define TSRA__ALTERNATE
#define TSRA_MAX_MUTEXABANDON   (TSRA_MAX_MUTEX * 4)
#endif

/* Mutex: Mutex Data Table Datatype */
typedef struct mutex_lock_table_tag__ mutexlock_t;
struct mutex_lock_table_tag__
  {
    pthread_mutex_t *mutex;     /* PThread Mutex */
    tsra_daclbk audcallbk;      /* Audit Callback Function Pointer */
    tsra_daclbk reicallbk;      /* Reinit Callback Function Pointer */
    int acbkstat;               /* Audit Callback Status */
    int rcbkstat;               /* Reinit Callback Status */
    pthread_t owner;            /* Owner TID */
    #ifdef TSRA__OVERRIDE
    tsra_clnup_t *cleanup;      /* PThread Cleanup */
    #endif
  };

/* ******** ******** Global Variables */

pthread_rwlock_t tab_lock;              /* RW lock for mutex table */
pthread_mutexattr_t mtx_attrib;         /* Mutex attributes */
mutexlock_t *mutex_table;               /* Mutex Table */
int tabsizeentry;                       /* Table Size (Entries) */
int tabsizebyte;                        /* Table Size (Bytes) */
int initialized = 0;                    /* Modules Initialized 0=no, 1=yes */
#ifdef TSRA__ALTERNATE
pthread_mutex_t *mutex_abandon[TSRA_MAX_MUTEXABANDON];
pthread_mutex_t mtx_abandon;            /* Abandoned Mutex Lock */
int mtx_abandon_count;                  /* Abandoned Mutex Count */
int mtx_abandon_init = 0;               /* Initialization Flag */
#endif
pthread_mutex_t mtx_recover;            /* Mutex Recovery Lock */

And here's some code for the lock recovery:

/* Attempts to recover a broken mutex. */
int tsra_mutex_recover(int lockid, pthread_t tid)
  {
    int result;

    /* Check Prerequisites */
    if (initialized == 0) return(EDOOFUS);
    if (lockid < 0 || lockid >= tabsizeentry) return(EINVAL);

    /* Check Mutex Owner */
    result = pthread_equal(tid, mutex_table[lockid].owner);
    if (result != 0) return(0);

    /* Lock Recovery Mutex */
    result = pthread_mutex_lock(&mtx_recover);
    if (result != 0) return(result);

    /* Check Mutex Owner, Again */
    result = pthread_equal(tid, mutex_table[lockid].owner);
    if (result != 0)
      {
        pthread_mutex_unlock(&mtx_recover);
        return(0);
      }

    /* Unless the system supports robust mutexes, there is
       really no way to recover a mutex that is being held
       by a thread that has terminated.  At least in FreeBSD,
       trying to destroy a mutex that is held will result
       in EBUSY.  Trying to overwrite a held mutex results
       in a memory fault and core dump.  The only way to
       recover is to abandon the mutex and create a new one. */
    #ifdef TSRA__ALTERNATE      /* Abandon Mutex */
    pthread_mutex_t *ptr;

    /* Too many abandoned mutexes? */
    if (mtx_abandon_count >= TSRA_MAX_MUTEXABANDON)
      {
        result = TSRA_PROGRAM_ABORT;
        goto error_1;
      }

    /* Get a read lock on the mutex table. */
    result = pthread_rwlock_rdlock(&tab_lock);
    if (result != 0) goto error_1;

    /* Perform associated data audit. */
    if (mutex_table[lockid].acbkstat != 0)
      {
        result = mutex_table[lockid].audcallbk();
        if (result != 0)
          {
            result = TSRA_PROGRAM_ABORT;
            goto error_2;
          }
      }

    /* Allocate New Mutex */
    ptr = malloc(sizeof(pthread_mutex_t));
    if (ptr == NULL)
      {
        result = errno;
        goto error_2;
      }

    /* Init new mutex and abandon the old one. */
    result = pthread_mutex_init(ptr, &mtx_attrib);
    if (result != 0) goto error_3;
    mutex_abandon[mtx_abandon_count] = mutex_table[lockid].mutex;
    mtx_abandon_count++;
    mutex_table[lockid].mutex = ptr;

    /* Release the table lock taken above. */
    pthread_rwlock_unlock(&tab_lock);

    #else       /* Recover Mutex */

    /* Try locking the mutex and see what we get. */
    result = pthread_mutex_trylock(mutex_table[lockid].mutex);
    switch (result)
      {
        case 0:                 /* No error, unlock and return */
          pthread_mutex_unlock(mutex_table[lockid].mutex);
          result = 0;
          goto error_1;         /* Releases mtx_recover, returns 0 */
        case EBUSY:             /* No error, return */
          result = 0;
          goto error_1;         /* Releases mtx_recover, returns 0 */
        case EOWNERDEAD:        /* Error, try to recover mutex. */
          if (mutex_table[lockid].acbkstat != 0)
              {
                result = mutex_table[lockid].audcallbk();
                if (result != 0)
                  {
                    if (mutex_table[lockid].rcbkstat != 0)
                        {
                          result = mutex_table[lockid].reicallbk();
                          if (result != 0)
                            {
                              result = TSRA_PROGRAM_ABORT;
                              goto error_2;
                            }
                        }
                      else
                        {
                          result = TSRA_PROGRAM_ABORT;
                          goto error_2;
                        }
                  }
              }
            else
              {
                result = TSRA_PROGRAM_ABORT;
                goto error_2;
              }
          break;
        case EDEADLK:           /* Error, deadlock avoided, abort */
        case ENOTRECOVERABLE:   /* Error, recovery failed, abort */
          /* NOTE: We shouldn't get this, but if we do... */
          abort();
          break;
        default:
          /* Ambiguous situation, best to abort. */
          abort();
          break;
      }
    pthread_mutex_consistent(mutex_table[lockid].mutex);
    pthread_mutex_unlock(mutex_table[lockid].mutex);
    #endif

    /* Housekeeping */
    mutex_table[lockid].owner = pthread_self();
    pthread_mutex_unlock(&mtx_recover);

    /* Return */
    return(0);

    /* We only get here on errors. */
    #ifdef TSRA__ALTERNATE
    error_3:
    free(ptr);
    error_2:
    pthread_rwlock_unlock(&tab_lock);
    #else
    error_2:
    pthread_mutex_unlock(mutex_table[lockid].mutex);
    #endif
    error_1:
    pthread_mutex_unlock(&mtx_recover);
    return(result);
  }

Because FreeBSD is an evolving operating system, like Linux is, I have made provisions to allow for the use of robust mutexes in the future. Without robust mutexes, there really is no way to do the enhanced error checking that is available when they are supported.

For a robust mutex, enhanced error checking is performed to verify the need to recover the mutex. For systems that do not support robust mutexes, we have to trust the caller to verify that the mutex in question needs to be recovered. In addition, there is some checking to make sure that only one thread performs the recovery; all other threads contending for the mutex keep waiting in the meantime. I have given some thought to how to signal other threads that a recovery is in progress, but that aspect of the routine still needs work. In a recovery situation, I'm thinking about comparing pointer values to see if the mutex was replaced.

In both cases, an audit routine can be set as a callback function. The purpose of the audit routine is to verify and correct any data discrepancies in the protected data. If the audit fails to correct the data, then another callback routine, the data reinitialize routine, is invoked. The purpose of this is to reinitialize the data that is protected by the mutex. If that fails, then abort() is called to terminate program execution and drop a core file for debugging purposes.

For the abandoned mutex case, the pointer is not thrown away, but is placed on a list. If too many mutexes are abandoned, then the program is aborted. As mentioned above, in the mutex lock routine, pthread_mutex_trylock is used instead of pthread_mutex_lock. This way, no thread can be permanently blocked on a dead mutex. So once the pointer is switched in the mutex table to point to the new mutex, all threads waiting on the mutex will immediately switch to the new mutex.
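The trylock-and-yield idea can be sketched roughly as follows. This is not the author's wrapper: slot_mutex is a hypothetical single-entry stand-in for one mutex_table slot, and slot_lock, waiter and demo_abandon are names invented for the illustration. The key point is that the pointer is re-read on every pass, so a waiter picks up the replacement mutex once the old one has been abandoned.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical single-slot stand-in for one mutex_table entry. */
static pthread_mutex_t *_Atomic slot_mutex;
static pthread_mutex_t old_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t new_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Spin with trylock, re-reading the slot pointer on each pass.
   On success returns 0 and stores the mutex actually locked in *held. */
int slot_lock(pthread_mutex_t **held)
{
    for (;;) {
        pthread_mutex_t *m = atomic_load(&slot_mutex);
        int rc = pthread_mutex_trylock(m);
        if (rc == 0) {
            /* The slot may have been swapped between the load and the lock. */
            if (m == atomic_load(&slot_mutex)) { *held = m; return 0; }
            pthread_mutex_unlock(m);
            continue;
        }
        if (rc != EBUSY) return rc;
        sched_yield();                      /* relinquish the CPU and retry */
    }
}

static void *waiter(void *arg)
{
    pthread_mutex_t *held = NULL;
    (void)arg;
    if (slot_lock(&held) == 0)
        pthread_mutex_unlock(held);
    return held;                            /* which mutex we ended up with */
}

/* Returns 1 if the waiter ended up on the replacement mutex.
   Intended to be called once: old_mutex is deliberately left locked. */
int demo_abandon(void)
{
    pthread_t tid;
    void *got = NULL;
    struct timespec delay = {0, 50 * 1000 * 1000};

    atomic_store(&slot_mutex, &old_mutex);
    pthread_mutex_lock(&old_mutex);         /* simulates an owner that never unlocks */
    if (pthread_create(&tid, NULL, waiter, NULL) != 0) return -1;
    nanosleep(&delay, NULL);                /* waiter spins on old_mutex meanwhile */
    atomic_store(&slot_mutex, &new_mutex);  /* "abandon" the old mutex */
    pthread_join(tid, &got);
    return got == (void *)&new_mutex;
}
```

A real table would also need the abandon list and the table lock from the code above; the sketch only shows why a trylock loop does not get stuck on the dead mutex.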

I am sure there are bugs/errors in this code, but this is a work in progress. Although not quite finished and debugged, I feel that there is enough here to warrant an answer to this question.

Cancelling a thread that has a mutex locked does not unlock the mutex

It's correct that cancelled threads do not unlock mutexes they hold. You need to arrange for that to happen manually, which can be tricky, as you need to be very careful to use the right cleanup handlers around every possible cancellation point. Assuming you're using pthread_cancel to cancel the thread and setting cleanup handlers with pthread_cleanup_push to unlock the mutexes, there are a couple of alternatives you could try which might be simpler to get right and so may be more reliable.
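For reference, the cleanup-handler arrangement mentioned here looks roughly like the sketch below. The names data_lock, unlock_on_cancel, worker and cancel_releases_lock are invented for the example; the worker blocks in sleep(), which is a cancellation point, while holding the mutex.

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;

/* Cleanup handler: runs if the thread is cancelled inside the push/pop pair. */
static void unlock_on_cancel(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&data_lock);
    pthread_cleanup_push(unlock_on_cancel, &data_lock);
    for (;;)
        sleep(1);                   /* cancellation point while holding the lock */
    pthread_cleanup_pop(1);         /* never reached; required to pair with push */
    return NULL;
}

/* Returns 0 if the mutex is free again after the worker was cancelled. */
int cancel_releases_lock(void)
{
    pthread_t tid;
    void *res = NULL;
    int rc;

    if (pthread_create(&tid, NULL, worker, NULL) != 0) return -1;
    pthread_cancel(tid);
    pthread_join(tid, &res);
    if (res != PTHREAD_CANCELED) return -1;

    rc = pthread_mutex_trylock(&data_lock);
    if (rc == 0)
        pthread_mutex_unlock(&data_lock);
    return rc;
}
```

Every cancellation point reached while the lock is held needs to sit inside such a push/pop pair, which is what makes this approach easy to get wrong in larger code.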

Using RAII to unlock the mutex will be more reliable. On GNU/Linux, pthread_cancel is implemented with a special exception of type abi::__forced_unwind, so when a thread is cancelled an exception is thrown and the stack is unwound. If a mutex is locked through an RAII type, its destructor is guaranteed to run when the stack is unwound by a __forced_unwind exception.

Boost Thread provides a portable C++ library that wraps Pthreads and is much easier to use. It provides boost::mutex together with RAII lock types such as boost::lock_guard and boost::mutex::scoped_lock, and other useful abstractions. Boost Thread also provides its own "thread interruption" mechanism, which is similar to Pthread cancellation but not the same: Pthread cancellation points (such as connect) are not Boost Thread interruption points, which can be helpful for some applications. However, in your client's case, since the point of cancellation is to interrupt the connect call, they probably do want to stick with Pthread cancellation. The (non-portable) way GNU/Linux implements cancellation as an exception means it will work well with RAII locks on a boost::mutex.

There is really no excuse for explicitly locking and unlocking mutexes when you're writing in C++. IMHO the most important and most useful feature of C++ is destructors, which are ideal for automatically releasing resources such as mutex locks.

Another option would be to use a robust mutex, which is created by calling pthread_mutexattr_setrobust on a pthread_mutexattr_t before initializing the mutex. If a thread dies while holding a robust mutex, the kernel will make a note of it so that the next thread which tries to lock the mutex gets the special error code EOWNERDEAD. If possible, the new thread can make the data protected by the mutex consistent again and take ownership of the mutex. This is much harder to use correctly than simply using an RAII type to lock and unlock the mutex.

A completely different approach would be to decide if you really need to hold the mutex lock while calling connect. Holding mutexes during slow operations is not a good idea. Can't you call connect and then, if successful, lock the mutex and update whatever shared data is being protected by the mutex?
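A minimal sketch of that ordering, under the assumption that only the result of the slow call needs protecting. slow_connect is a hypothetical stand-in for connect(), and state_lock, shared_fd, connect_then_publish and current_fd are likewise invented names.

```c
#include <pthread.h>
#include <stddef.h>
#include <time.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_fd = -1;                  /* hypothetical shared state */

/* Hypothetical stand-in for a slow, cancellable call such as connect(). */
static int slow_connect(void)
{
    struct timespec delay = {0, 10 * 1000 * 1000};
    nanosleep(&delay, NULL);
    return 7;                               /* pretend descriptor */
}

/* Do the slow work first, then hold the mutex only for the update. */
int connect_then_publish(void)
{
    int fd = slow_connect();                /* no mutex held while blocking */
    if (fd < 0) return -1;

    pthread_mutex_lock(&state_lock);        /* short critical section with   */
    shared_fd = fd;                         /* no cancellation points inside */
    pthread_mutex_unlock(&state_lock);
    return 0;
}

/* Reads the shared value under the same lock. */
int current_fd(void)
{
    int fd;
    pthread_mutex_lock(&state_lock);
    fd = shared_fd;
    pthread_mutex_unlock(&state_lock);
    return fd;
}
```

With this layout a cancellation request that arrives during the slow call finds the thread holding no mutex at all, so nothing needs to be unlocked on its behalf.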

My preference would be to both use Boost Thread and avoid holding the mutex for long periods.

How to come out of a deadlock in linux

You can set the ROBUST attribute on a mutex. With a robust mutex, if the thread that acquired it exits for some reason without unlocking it, the mutex enters a special state where the next thread that attempts to lock it will get EOWNERDEAD.

It is then the responsibility of that thread to clean up any inconsistent state. If recovery is possible, the thread shall call pthread_mutex_consistent(3) any time before pthread_mutex_unlock(3), so that the other threads can use it as before. If recovery is not possible, the mutex should be unlocked without calling pthread_mutex_consistent(3), causing it to enter an unusable state where the only permissible operation is to destroy it.

Note that the mutex is locked even if EOWNERDEAD was returned (I think it's the only condition under which pthread_mutex_lock(3) returns with an error but locks the mutex).

To set the ROBUST attribute, use pthread_mutexattr_setrobust(3) after initializing the mutex attributes instance. Remember that this must be done before initializing the mutex. So, something like:

pthread_mutex_t mutex;
pthread_mutexattr_t mutex_attrs;

if (pthread_mutexattr_init(&mutex_attrs) != 0) {
    /* Handle error... */
}
if (pthread_mutexattr_setrobust(&mutex_attrs, PTHREAD_MUTEX_ROBUST) != 0) {
    /* Handle error... */
}
if (pthread_mutex_init(&mutex, &mutex_attrs) != 0) {
    /* Handle error... */
}

Then you can use it like:

int lock_res = pthread_mutex_lock(&mutex);

if (lock_res == EOWNERDEAD) {
    /* Someone died before unlocking the mutex
     * We assume there's no cleanup to do
     */
    if (pthread_mutex_consistent(&mutex) != 0) {
        /* Handle error... */
    }
} else if (lock_res != 0) {
    /* Some other error, handle it here */
}

/* mutex is locked here, do stuff... */

if (pthread_mutex_unlock(&mutex) != 0) {
    /* Handle error */
}

For more info, see the man pages for pthread_mutex_consistent(3) and pthread_mutexattr_getrobust(3) / pthread_mutexattr_setrobust(3).

Kill thread while it's waiting for pthread_mutex_lock

Do not use thread cancellation if you can possibly avoid it. There are quite a few issues and gotchas to contend with.

One class of issues has to do with resource cleanup. Although POSIX defines a mechanism for registering cleanup handlers to address this issue, it takes a great deal of painstaking work to ensure that all resources -- allocated memory, open files, mutexes and semaphores, general shared state -- are properly cleaned up by those means. It is very easy for a thread to, say, fail to unlock a mutex it holds locked when it is canceled, thus deadlocking each and every thread that subsequently attempts unconditionally to lock it.

Another, more subtle class of issues has to do with at what points a thread actually can be cancelled. By default, a thread with a pending cancellation request will terminate when it next reaches a cancellation point or if it is already blocked at a cancellation point. A fair number of POSIX functions either definitely are or may be cancellation points, but not all.

In particular, pthread_mutex_lock() is not a cancellation point. Thus, if you cancel a thread that is blocked in pthread_mutex_lock, it will not immediately be canceled. In principle, it might successfully lock the mutex and then proceed until it reaches a cancellation point, or it might return without locking the mutex (with a non-zero return code to indicate the nature of the error). Either could cause trouble for you, but the former seems especially poised to set you up for a deadlock. In practice, pthread_mutex_lock() is documented to not return EINTR, leading me to expect that the former alternative will be exhibited: cancellation requests will not cause a thread blocked in pthread_mutex_lock() to terminate without acquiring the mutex and returning.
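The behaviour expected above can be checked with a small experiment, sketched here with invented names (gate, acquired, blocked_worker, cancel_while_blocked). The worker is cancelled while it is blocked in pthread_mutex_lock(); it records whether it got the mutex before reaching an explicit cancellation point.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;
static atomic_int acquired;         /* set once the worker owns the mutex */

static void unlock_gate(void *arg)
{
    (void)arg;
    pthread_mutex_unlock(&gate);
}

static void *blocked_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&gate);      /* not a cancellation point: blocks here */
    atomic_store(&acquired, 1);
    pthread_cleanup_push(unlock_gate, NULL);
    pthread_testcancel();           /* the pending request is acted on here */
    pthread_cleanup_pop(1);
    return NULL;
}

/* Returns 1 if the cancelled worker first acquired the mutex and only
   then terminated as cancelled. */
int cancel_while_blocked(void)
{
    pthread_t tid;
    void *res = NULL;
    struct timespec delay = {0, 100 * 1000 * 1000};

    pthread_mutex_lock(&gate);      /* make the worker block */
    if (pthread_create(&tid, NULL, blocked_worker, NULL) != 0) return -1;
    pthread_cancel(tid);
    nanosleep(&delay, NULL);        /* request stays pending while it blocks */
    pthread_mutex_unlock(&gate);
    pthread_join(tid, &res);

    return atomic_load(&acquired) == 1 && res == PTHREAD_CANCELED;
}
```

Without the cleanup handler around pthread_testcancel(), the worker would terminate still owning gate, which is exactly the deadlock scenario described above.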

thread exit but still hold mutex

You are perhaps thinking of the robust attribute of mutexes (pthread_mutexattr_setrobust()), rather than of the errorcheck type of mutex. A robust mutex would have notified your main thread, by returning EOWNERDEAD from the lock call, that the holder of the mutex's lock had terminated.

The PTHREAD_MUTEX_ERRORCHECK type, on the other hand, simply guards against three kinds of errors:

attempting to recursively lock one's own locked mutex (not applicable here)
attempting to unlock a mutex locked by another thread (not applicable here)
attempting to unlock an unlocked mutex (not applicable here)

Pthreads dying in the middle of a mutex lock

Killing threads is never very useful (unless you can afford to SIGKILL/abort the whole process anyway).

Instead unwind the stack with an exception and use RAII. If your process/OS has become so unstable that random thread aborts happen, I think you'll have other worries and the resulting mess is not the process' responsibility.

Just Don't `pthread_kill`

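Also, all of this might become slightly more interesting when using inter-process sync primitives. In that case, though, I think Linux kernels guarantee that kernel-managed locks held by a process, such as fcntl()/flock() file locks, will be released when that process is terminated, whatever the cause. A process-shared pthread mutex is the exception: it stays locked unless it was created with the robust attribute.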

Can exiting from a process that is locking a mutex cause a deadlock?

If the mutex object was owned by the exiting process (either by means of create or open), its handle will be closed upon termination of the process.

The other process's wait operation will return on such an occasion:

For Windows, i.e. WaitForSingleObject(...) returns WAIT_ABANDONED which means:

The specified object is a mutex object that was not released by the thread that owned the mutex object before the owning thread terminated. Ownership of the mutex object is granted to the calling thread and the mutex state is set to nonsignaled. If the mutex was protecting persistent state information, you should check it for consistency.

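For Linux, i.e. pthread_mutex_lock(...) on a robust mutex returns EOWNERDEAD which means:

The mutex is a robust mutex and the process containing the previous owning thread terminated while holding the mutex lock. The mutex lock shall be acquired by the calling thread and it is up to the new owner to make the state consistent.

For a non-robust mutex there is no such notification: the mutex stays locked and other lockers block indefinitely.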
