The Difference Between Wait_Queue_Head and Wait_Queue in Linux Kernel


From Linux Device Drivers:

The wait_queue_head_t type is a fairly simple structure, defined in
<linux/wait.h>. It contains only a lock variable and a linked list
of sleeping processes. The individual data items in the list are of
type wait_queue_t, and the list is the generic list defined in
<linux/list.h>.

Normally the wait_queue_t structures are allocated on the stack by
functions like interruptible_sleep_on; the structures end up in the
stack because they are simply declared as automatic variables in the
relevant functions. In general, the programmer need not deal with
them.

Take a look at the "A Deeper Look at Wait Queues" section.

Some advanced applications, however, can require dealing with
wait_queue_t variables directly. For these, it's worth a quick look at
what actually goes on inside a function like interruptible_sleep_on.
The following is a simplified version of the implementation of
interruptible_sleep_on to put a process to sleep:

void simplified_sleep_on(wait_queue_head_t *queue)
{
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    current->state = TASK_INTERRUPTIBLE;

    add_wait_queue(queue, &wait);
    schedule();
    remove_wait_queue(queue, &wait);
}

The code here creates a new wait_queue_t variable (wait, which gets
allocated on the stack) and initializes it. The state of the task is
set to TASK_INTERRUPTIBLE, meaning that it is in an interruptible
sleep. The wait queue entry is then added to the queue (the
wait_queue_head_t * argument). Then schedule is called, which
relinquishes the processor to somebody else. schedule returns only
when somebody else has woken up the process and set its state to
TASK_RUNNING. At that point, the wait queue entry is removed from the
queue, and the sleep is done.
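For completeness, here is a minimal sketch of the other side of the handshake, the wake-up. The queue name my_queue is illustrative; it stands for whatever wait_queue_head_t the sleeper passed to simplified_sleep_on:

#include <linux/wait.h>
#include <linux/sched.h>

/* Illustrative: 'my_queue' stands for the queue passed to simplified_sleep_on. */
static DECLARE_WAIT_QUEUE_HEAD(my_queue);

static void simplified_wake_up(void)
{
    /* Sets every task sleeping on the queue back to TASK_RUNNING;
     * the sleeper's schedule() call then returns and the sleeper
     * removes its wait_queue_t entry from the queue itself. */
    wake_up(&my_queue);
}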

The internals of the data structures involved in wait queues:

[Image: wait queue data structures, from Linux Device Drivers]

Update:
for the readers who think the image is my own: once again, the image is taken from the Linux Device Drivers book quoted above.

Difference between Semaphore and wait queue

Wait queues are an event-based mechanism: you wait for a particular condition to become true.

They are not locks.

Semaphores are locks. You don't wait for a certain condition to become true; the pattern is:

    Take lock
    Process data
    Release lock
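A minimal sketch of the contrast, assuming a shared flag data_ready that some other context sets before calling wake_up() (DEFINE_SEMAPHORE is used in its single-argument, count-of-one form found in kernels before 6.4):

#include <linux/wait.h>
#include <linux/semaphore.h>
#include <linux/errno.h>

static DEFINE_SEMAPHORE(my_lock);       /* semaphore used as a lock */
static DECLARE_WAIT_QUEUE_HEAD(my_wq);  /* wait queue: wait for an event */
static int data_ready;                  /* the condition being waited for */

static int consumer(void)
{
    /* Wait-queue style: sleep until the condition becomes true
     * (another context sets data_ready and calls wake_up(&my_wq)). */
    if (wait_event_interruptible(my_wq, data_ready))
        return -ERESTARTSYS;

    /* Semaphore style: take lock, process data, release lock. */
    down(&my_lock);
    /* ... process data ... */
    up(&my_lock);
    return 0;
}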

wait queues and work queues, do they always go together?

As part of the probe callback, this driver initializes a work queue which is part of the driver’s private structure and adds itself to the Q. But I do not see any blocking of any kind anywhere.

I think you meant the wait queue head, not the work queue. I do not see any evidence of the probe adding itself to the queue; it is merely initializing the queue.

The queue is used by the calls to the wait_event_timeout() macro in the bcmgenet_mii_read() and bcmgenet_mii_write() functions in bcmmii.c. These calls will block until either the condition they are waiting for becomes true or the timeout period elapses. They are woken up by the wake_up(&priv->wq); call in the ISR0 interrupt handler.
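A rough sketch of that pattern, not the driver's actual code; the structure layout, the mdio_done flag, and the one-second timeout are illustrative assumptions:

#include <linux/wait.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

struct my_priv {
    wait_queue_head_t wq;
    bool mdio_done;
};

static int my_mii_read(struct my_priv *priv)
{
    priv->mdio_done = false;
    /* ... start the MDIO transaction in hardware ... */

    /* Block until the interrupt handler sets mdio_done and wakes us,
     * or until the timeout (HZ jiffies, i.e. one second) elapses. */
    if (!wait_event_timeout(priv->wq, priv->mdio_done, HZ))
        return -ETIMEDOUT;
    return 0;
}

/* Called from the interrupt handler once the MDIO operation completes. */
static void my_mdio_done(struct my_priv *priv)
{
    priv->mdio_done = true;
    wake_up(&priv->wq);
}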

Then it goes on to initialize a work queue with a function to call when woken up.

It is initializing a work item, not a work queue. The function will be called from a kernel thread as a result of the work item being added to the system work queue.

Now coming to the ISR0 for the driver, within that is an explicit call to the scheduler as part of the ISR (bcmgenet_isr0) if certain conditions are met. Now AFAIK, this call is used to defer work to a later time, much like a tasklet does.

You are referring to the schedule_work(&priv->bcmgenet_irq_work); call in the ISR0 interrupt handler. This is adding the previously mentioned work item to the system work queue. It is similar to a tasklet, but tasklets run in softirq context whereas work items run in process context.
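A minimal sketch of that deferral pattern, with illustrative names rather than the driver's own:

#include <linux/workqueue.h>
#include <linux/interrupt.h>

static struct work_struct my_work;

static void my_work_handler(struct work_struct *work)
{
    /* Runs later from a kernel worker thread, in process context,
     * so it may sleep (a tasklet, by contrast, runs in softirq context). */
}

static irqreturn_t my_isr(int irq, void *dev_id)
{
    /* Defer the heavy lifting: add the work item to the system work queue. */
    schedule_work(&my_work);
    return IRQ_HANDLED;
}

static void my_setup(void)
{
    /* Done once, e.g. at probe time: bind the work item to its handler. */
    INIT_WORK(&my_work, my_work_handler);
}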

Post this we check some MDIO status flags and if the conditions are met, we wake up the process which was blocked in process context. But where exactly is the process blocked?

As mentioned above, the process is blocked in the bcmgenet_mii_read() and bcmgenet_mii_write() functions, although they use a timeout to avoid blocking for long periods. (This timeout is especially important for those versions of GENET that do not support MDIO-related interrupts!)

Also, most of the time, wait queues seem to be used in conjunction with work queues. Is that the typical way to use them?

Not especially. This particular driver uses both a wait queue and a work item, but I wouldn't describe them as being used "in conjunction" since they are being used to handle different interrupt conditions.

Linux Kernel v5, linux/wait.h, WAIT_QUEUE_HEAD

In the Linux 3.5 source code, we can see that these functions were already deprecated. Look at the comment above their declarations:

/*
 * These are the old interfaces to sleep waiting for an event.
 * They are racy. DO NOT use them, use the wait_event* interfaces above.
 * We plan to remove these interfaces.
 */
extern void sleep_on(wait_queue_head_t *q);
extern long sleep_on_timeout(wait_queue_head_t *q,
                             signed long timeout);
extern void interruptible_sleep_on(wait_queue_head_t *q);
extern long interruptible_sleep_on_timeout(wait_queue_head_t *q,
                                           signed long timeout);

The functions to use instead are the wait_event* macros: wait_event_interruptible(), wait_event_killable(), wait_event_timeout(), and so on.
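A minimal sketch of the replacement, assuming a flag named condition that the waker sets before calling wake_up():

#include <linux/wait.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(q);
static int condition;

static int wait_for_it(void)
{
    /* Old, racy style (now removed): interruptible_sleep_on(&q);
     * Current style: the condition is re-checked under the queue's
     * lock, so a wake-up cannot be lost between the test and the sleep. */
    if (wait_event_interruptible(q, condition))
        return -ERESTARTSYS;    /* interrupted by a signal */
    return 0;
}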

When you call select(2) how does the kernel figure out a socket is ready?

https://eklitzke.org/how-tcp-sockets-work answered my question

When a new data packet comes in on the network interface (NIC), the kernel is notified either by being interrupted by the NIC, or by polling the NIC for data. Typically whether the kernel is interrupt driven or in polling mode depends on how much network traffic is happening; when the NIC is very busy it’s more efficient for the kernel to poll, but if the NIC is not busy CPU cycles and power can be saved by using interrupts. Linux calls this technique NAPI, literally “New API”.

When the kernel gets a packet from the NIC it decodes the packet and figures out what TCP connection the packet is associated with based on the source IP, source port, destination IP, and destination port. This information is used to look up the struct sock in memory associated with that connection. Assuming the packet is in sequence, the data payload is then copied into the socket’s receive buffer. At this point the kernel will wake up any processes doing a blocking read(2), or that are using an I/O multiplexing system call like select(2) or epoll_wait(2) to wait on the socket.

Why do we need to call poll_wait in poll?

poll_wait adds your device (represented by the "struct file") to the list of those that can wake the process up.

The idea is that the process can use poll (or select or epoll etc) to add a bunch of file descriptors to the list on which it wishes to wait. The poll entry for each driver gets called. Each one adds itself (via poll_wait) to the waiter list.

Then the core kernel blocks the process in one place. That way, any one of the devices can wake up the process. If you return non-zero mask bits, that means those "ready" attributes (readable/writable/etc) apply now.

So, in pseudo-code, it's roughly like this:

foreach fd:
    find device corresponding to fd
    call device poll function to set up wait queues (with poll_wait)
        and to collect its "ready-now" mask

while time remaining in timeout and no device is ready:
    sleep

return from system call (either due to timeout or to ready devices)
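On the driver side, a poll method following this scheme might look roughly like the sketch below. The struct my_dev layout, the have_data/have_space flags, and the __poll_t/EPOLL* names (used since kernel 4.16; older code uses unsigned int and POLLIN/POLLOUT) are assumptions for illustration:

#include <linux/poll.h>
#include <linux/fs.h>
#include <linux/wait.h>

struct my_dev {
    wait_queue_head_t readq, writeq;
    bool have_data, have_space;
};

static __poll_t my_poll(struct file *file, poll_table *wait)
{
    struct my_dev *dev = file->private_data;
    __poll_t mask = 0;

    /* Register this device's wait queues with the poll table so the core
     * can sleep on them and any polled device can wake the process later.
     * poll_wait() itself never blocks. */
    poll_wait(file, &dev->readq, wait);
    poll_wait(file, &dev->writeq, wait);

    /* Report which "ready" conditions apply right now. */
    if (dev->have_data)
        mask |= EPOLLIN | EPOLLRDNORM;
    if (dev->have_space)
        mask |= EPOLLOUT | EPOLLWRNORM;

    return mask;
}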

Non-exclusive wait queue adds the process at the head while exclusive adds at the tail, why?

In the exclusive case, only the first process is woken up, so it should be the one that has been waiting longest.

In the non-exclusive case all processes are going to be woken up, so the order does not matter and inserting at the head is the simplest choice.
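The two insertion behaviours correspond to the two entry-adding helpers sketched below (wait_queue_t was renamed wait_queue_entry_t in kernel 4.13; the older name is kept here to match the rest of this page):

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(q);

static void add_waiters(wait_queue_t *nonexcl, wait_queue_t *excl)
{
    /* Non-exclusive entry: inserted at the head of the list;
     * wake_up() wakes every such waiter. */
    add_wait_queue(&q, nonexcl);

    /* Exclusive entry: flagged WQ_FLAG_EXCLUSIVE and inserted at the
     * tail; wake_up() stops after the first exclusive waiter it wakes. */
    add_wait_queue_exclusive(&q, excl);
}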

Waiting for a periodic event with wait_event_interruptible

The interruptible version of sleep_on is interruptible_sleep_on. Note that the sleep_on family of functions has been removed since kernel 3.15.

As for wait_event_interruptible, the requirement "I want it to always sleep when the ioctl is invoked" is uncommon for it. You may use a flag, but this flag should be per-process (or per-schedule slot). Or you may require the count being waited for to be at least current_count + 1.

In such an uncommon scenario, instead of the wait_event_interruptible macro you may use the building blocks it consists of and arrange them the way you need; see the sketch below. Generally, any waiting pattern can be achieved that way.
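A minimal sketch of that open-coded approach, built from the same blocks wait_event_interruptible uses (the count variable and the "wait for the next event" policy are illustrative; locking around count is omitted for brevity):

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/errno.h>

static DECLARE_WAIT_QUEUE_HEAD(q);
static unsigned long count;     /* incremented by the waker before wake_up(&q) */

static int wait_for_next_event(void)
{
    unsigned long target = count + 1;   /* always sleep for the *next* event */
    DEFINE_WAIT(wait);
    int ret = 0;

    for (;;) {
        prepare_to_wait(&q, &wait, TASK_INTERRUPTIBLE);
        if (count >= target)
            break;
        if (signal_pending(current)) {
            ret = -ERESTARTSYS;
            break;
        }
        schedule();
    }
    finish_wait(&q, &wait);
    return ret;
}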


