How pthread_mutex_lock Is Implemented

What is the `pthread_mutex_lock()` wake order with multiple threads waiting?

When the mutex becomes available, does the first thread that called pthread_mutex_lock() get the lock?

No. One of the waiting threads gets the lock, but POSIX does not specify which one: the choice is left to the scheduling policy, so arrival order is not guaranteed.

FIFO order?

A FIFO mutex is an established pattern that you build on top of the pthreads primitives. See "Implementing a FIFO mutex in pthreads".
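One common way to get FIFO hand-off is a ticket lock layered on a plain mutex and a condition variable. The following is only a minimal sketch of that idea, not the code from the linked answer; `fifo_mutex_t`, `fifo_mutex_lock()`, `fifo_mutex_unlock()` and `FIFO_MUTEX_INITIALIZER` are hypothetical names, and error checking is omitted.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical type: a ticket lock built from a mutex and a condition variable. */
typedef struct {
    pthread_mutex_t mutex;        /* protects the two counters below */
    pthread_cond_t  cond;         /* signalled whenever now_serving changes */
    unsigned long   next_ticket;  /* next ticket to hand out */
    unsigned long   now_serving;  /* ticket that currently owns the lock */
} fifo_mutex_t;

#define FIFO_MUTEX_INITIALIZER \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

void fifo_mutex_lock(fifo_mutex_t *f)
{
    pthread_mutex_lock(&f->mutex);
    unsigned long my_ticket = f->next_ticket++;   /* take a place in the queue */
    while (my_ticket != f->now_serving)           /* loop guards against spurious wakeups */
        pthread_cond_wait(&f->cond, &f->mutex);
    pthread_mutex_unlock(&f->mutex);
}

void fifo_mutex_unlock(fifo_mutex_t *f)
{
    pthread_mutex_lock(&f->mutex);
    assert(f->now_serving != f->next_ticket);     /* somebody must hold the lock */
    f->now_serving++;                             /* admit the next ticket holder */
    pthread_cond_broadcast(&f->cond);             /* wake all; only the matching ticket proceeds */
    pthread_mutex_unlock(&f->mutex);
}
```

Threads are served in the order in which they obtained their tickets. The broadcast wakes every waiter on each unlock, which is simple but not cheap when many threads are queued.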

What's the difference between pthread_mutex_lock and kernel mutex_lock in Linux?

There's no direct relationship.

pthread_mutex_lock() is a userspace API, implemented in the C library. On Linux, it's usually based on the kernel futex() system call.

mutex_lock() is an internal kernel API, implemented within the kernel itself and only available there. It's based around spinlocks and direct manipulation of the current task's schedulable state, usually with architecture-optimised fast paths.

It makes no sense to compare the performance because they are not interchangeable - where you can use one, you cannot use the other and vice-versa.
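To make the userspace side more concrete, here is a rough, Linux-only sketch of a mutex built directly on futex(), following the well-known three-state design (0 = unlocked, 1 = locked, 2 = locked with possible waiters). It is an illustration under those assumptions, not glibc's actual implementation; `futex_mutex_t`, `futex_call()`, `futex_mutex_lock()` and `futex_mutex_unlock()` are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical type. state: 0 = unlocked, 1 = locked, 2 = locked, waiters possible. */
typedef struct { atomic_int state; } futex_mutex_t;

/* Thin wrapper: glibc exposes futex only through syscall(). */
static long futex_call(atomic_int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

void futex_mutex_lock(futex_mutex_t *m)
{
    int c = 0;
    /* Fast path: uncontended, one compare-and-swap and no system call. */
    if (atomic_compare_exchange_strong(&m->state, &c, 1))
        return;
    /* Slow path: mark the mutex as contended, then sleep in the kernel. */
    if (c != 2)
        c = atomic_exchange(&m->state, 2);
    while (c != 0) {
        futex_call(&m->state, FUTEX_WAIT_PRIVATE, 2); /* sleeps only while state == 2 */
        c = atomic_exchange(&m->state, 2);
    }
}

void futex_mutex_unlock(futex_mutex_t *m)
{
    int prev = atomic_fetch_sub(&m->state, 1);
    assert(prev != 0);                                /* must have been locked */
    if (prev == 2) {                                  /* someone may be sleeping */
        atomic_store(&m->state, 0);
        futex_call(&m->state, FUTEX_WAKE_PRIVATE, 1); /* wake one waiter */
    }
}
```

The kernel is entered only on the contended path, which is why the userspace and in-kernel primitives end up structured so differently.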

Does this implementation of mutex locks result in undefined behavior?

Answering your questions,

"If main tries to lock lock[0] twice it should deadlock."

Yes, it would. Unless you use recursive mutexes, but then your child thread would never be able to lock the mutex as main would always have it locked.

"extra unlocking lock[0], which was locked by main, should be undefined behavior."

Per the POSIX documentation for pthread_mutex_unlock(), this is undefined behavior for a NORMAL and non-robust mutex. However, the DEFAULT mutex does not have to be NORMAL and non-robust so there is this caveat:

If the mutex type is PTHREAD_MUTEX_DEFAULT, the behavior of pthread_mutex_lock() [and pthread_mutex_unlock()] may correspond to one of the three other standard mutex types as described in the table above. If it does not correspond to one of those three, the behavior is undefined for the cases marked.

(Note my addition of pthread_mutex_unlock(). The table of mutex behavior clearly shows that unlock behavior for a non-owner varies between different types of mutexes and even uses the same "dagger" mark in the "Unlock When Not Owner" column as used in the "Relock" column, and the "dagger" mark refers to the footnote I quoted.)

A robust NORMAL mutex, or any ERRORCHECK or RECURSIVE mutex (robust or not), will return an error if a non-owning thread attempts to unlock it, and the mutex remains locked.
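If you want these mistakes reported instead of left undefined, you can request an ERRORCHECK mutex explicitly. The sketch below assumes a POSIX system; `checked`, `unlock_as_non_owner()` and `errorcheck_demo()` are hypothetical names, and most error checking is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t checked;   /* hypothetical name */

/* Runs in a second thread, which does not own the mutex. */
static void *unlock_as_non_owner(void *result)
{
    *(int *)result = pthread_mutex_unlock(&checked);
    return NULL;
}

/* Returns 0 if both misuse cases were reported as errors. */
int errorcheck_demo(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&checked, &attr);
    pthread_mutexattr_destroy(&attr);

    int rc = pthread_mutex_lock(&checked);
    assert(rc == 0);
    int relock = pthread_mutex_lock(&checked);        /* EDEADLK instead of a deadlock */

    int foreign_unlock = 0;
    pthread_t tid;
    pthread_create(&tid, NULL, unlock_as_non_owner, &foreign_unlock);
    pthread_join(tid, NULL);                          /* EPERM, mutex stays locked */

    int own_unlock = pthread_mutex_unlock(&checked);  /* 0: the owner may unlock */
    pthread_mutex_destroy(&checked);

    return (relock == EDEADLK && foreign_unlock == EPERM && own_unlock == 0) ? 0 : -1;
}
```

The final unlock by the owner succeeding is what shows that the failed unlock from the other thread left the mutex locked.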

A simpler solution is to use a pair of semaphores. The following code deliberately omits error checking, along with the empty lines that would otherwise improve readability, in order to eliminate or reduce any vertical scroll bar:

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>   // sleep()
sem_t main_sem;
sem_t child_sem;
void *child( void *arg )
{
    for ( ;; )
    {
        sem_wait( &child_sem );
        sleep( 2 );
        sem_post( &main_sem );
    }
    return( NULL );
}
int main( int argc, char **argv )
{
    pthread_t child_tid;
    sem_init( &main_sem, 0, 0 );
    sem_init( &child_sem, 0, 0 );
    pthread_create( &child_tid, NULL, child, NULL );
    int x = 0;
    for ( ;; )
    {
        // tell the child thread to go
        sem_post( &child_sem );
        // wait for the child thread to finish one iteration
        sem_wait( &main_sem );
        x++;
        printf("%d\n", x);
    }
    pthread_join( child_tid, NULL );
}

How is pthread_once() implemented internally?

The spec does not define how pthread_once and pthread_mutex_lock must be implemented, but only how they must behave, so different platforms will have different implementations.

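It is generally possible to make the common path of pthread_once cheaper than a mutex: once initialization has completed, a single atomic check of the control variable is enough, with no blocking. (Callers that arrive while the init routine is still running do have to wait for it to finish.) On the other hand, I would suspect that pthread_mutex_lock has received more optimization because it is much more widely used.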

If you care about performance, you will have to write a benchmark and run it on the platform you are targeting, and choose the one that's faster.
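Whatever the implementation, the behavior you can rely on is that the init routine runs exactly once, no matter how many threads race to call pthread_once(). A small sketch of that guarantee follows; `once_control`, `init_calls`, `init_routine()`, `worker()` and `once_demo()` are hypothetical names, and error checking is omitted.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t once_control = PTHREAD_ONCE_INIT;
static int init_calls = 0;   /* hypothetical counter, written only by init_routine */

static void init_routine(void)
{
    init_calls++;            /* safe: pthread_once runs this in one thread only */
}

static void *worker(void *arg)
{
    (void)arg;
    pthread_once(&once_control, init_routine);   /* returns after init has completed */
    return NULL;
}

/* Starts n threads (n <= 16) that all call pthread_once();
   returns how many times init_routine has run in total. */
int once_demo(int n)
{
    pthread_t tids[16];
    assert(n >= 0 && n <= 16);
    for (int i = 0; i < n; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(tids[i], NULL);
    return init_calls;
}
```

Calling `once_demo()` again with more threads still reports a single run, because `once_control` remembers that initialization is done.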

C Confused on how to initialize and implement a pthread mutex and condition variable

If you want to use the PTHREAD_XXX_INITIALIZER macros you should use them in the variable declaration. Also use PTHREAD_COND_INITIALIZER for condition variables:

// Locks & Condition Variables
pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER; // Lock shared resources among threads
pthread_cond_t  full  = PTHREAD_COND_INITIALIZER;  // Condition indicating queue is full
pthread_cond_t  empty = PTHREAD_COND_INITIALIZER;  // Condition indicating queue is empty

Don't use those macros to initialize the mutex or condition variable later. If you need to do it later (for example if the object is dynamically allocated), use the appropriate init function:

pthread_mutex_init( &lock, NULL);
pthread_cond_init( &full, NULL);
pthread_cond_init( &empty, NULL);

To check a condition variable you must use a loop, to guard against spurious wakeups, and you must hold the mutex when:

- checking the condition
- changing the state that indicates the current condition
- calling pthread_cond_wait()

So whatever is waiting for an is-empty condition might look like:

pthread_mutex_lock(&lock);
while (!isEmpty) {
    pthread_cond_wait(&empty, &lock);
}

// isEmpty is non-zero and the lock is held

Whatever is signalling that something is-empty might look like:

pthread_mutex_lock(&lock);

// ...
// we have emptied the queue while holding the lock

isEmpty = 1;
pthread_mutex_unlock(&lock);
pthread_cond_signal(&empty);
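Putting the waiting side and the signalling side together gives something like the following self-contained sketch. `lock`, `empty` and `isEmpty` are taken from the snippets above; `items`, `drain()` and `wait_until_empty()` are hypothetical, and error checking is omitted.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  empty = PTHREAD_COND_INITIALIZER;
static int isEmpty = 0;   /* predicate, protected by lock */
static int items   = 3;   /* hypothetical queue length, protected by lock */

/* Signalling side: empties the queue, then announces the is-empty condition. */
static void *drain(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (items > 0)
        items--;                      /* state changes only while the lock is held */
    isEmpty = 1;
    pthread_mutex_unlock(&lock);
    pthread_cond_signal(&empty);
    return NULL;
}

/* Waiting side: blocks until the queue is empty; returns the items left. */
int wait_until_empty(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, drain, NULL);

    pthread_mutex_lock(&lock);
    while (!isEmpty)                  /* loop, never a plain if */
        pthread_cond_wait(&empty, &lock);
    assert(isEmpty);                  /* predicate holds and the lock is held */
    int left = items;
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return left;
}
```

Because the predicate is checked under the lock, it does not matter whether the signal is sent before or after the waiter reaches pthread_cond_wait().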

overhead of pthread_mutex_lock and pthread_mutex_unlock

There are surely similar questions and answers here on SO, but I will provide a couple of info points here.

First, a mutex is usually expensive only when it is contended, that is, when at least two threads are hammering it hard. An uncontended mutex is cheap: essentially, it can be implemented as an atomic operation on a flag.

An additional fact is that mutexes come with memory barriers: roughly, locking behaves as an acquire operation and unlocking as a release operation. If another thread running on another CPU core reads data written by one thread in the critical section, that data must be published at mutex unlock, so that the other core sees it once it has acquired the mutex rather than a stale value from its own cache.
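If you want a number for the uncontended case on your own machine, a rough timing loop is enough to get an order of magnitude. This is only a sketch: `uncontended_ns_per_pair()` is a hypothetical helper, it measures a single thread with a default mutex, and the result will vary with hardware, C library and compiler flags.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Returns the average cost, in nanoseconds, of one uncontended
   pthread_mutex_lock()/pthread_mutex_unlock() pair. */
double uncontended_ns_per_pair(long iterations)
{
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec start, end;
    assert(iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&m);       /* never blocks: no other thread uses m */
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    return ns / iterations;
}
```

To see the cost of contention, run the same loop from two or more threads on one shared mutex and compare the per-pair time.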