Queues in the Linux Kernel

You're right: the Linux kernel typically uses linked lists to implement queues. This makes sense, because the kernel's doubly linked list gives cheap insertion at one end and removal at the other, which is exactly the behavior a queue needs. See this example from kernel/workqueue.c:

        INIT_LIST_HEAD(&wq->list);
        // ...
        case CPU_UP_CANCELED:
                list_for_each_entry(wq, &workqueues, list) {
                        if (!per_cpu_ptr(wq->cpu_wq, hotcpu)->thread)
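
To give a feel for the general pattern, here is a minimal sketch (the names my_item, my_queue, my_enqueue, and my_walk are made up for illustration, not taken from the kernel) of declaring a list-based queue and walking it the same way the fragment above walks the list of workqueues:

#include <linux/list.h>
#include <linux/printk.h>
#include <linux/slab.h>

struct my_item {
        int value;
        struct list_head list;          /* links this item into the queue */
};

static LIST_HEAD(my_queue);             /* the queue itself is just a list head */

static void my_enqueue(int value)
{
        struct my_item *item = kmalloc(sizeof(*item), GFP_KERNEL);

        if (!item)
                return;
        item->value = value;
        list_add_tail(&item->list, &my_queue); /* FIFO: insert at the tail */
}

static void my_walk(void)
{
        struct my_item *item;

        list_for_each_entry(item, &my_queue, list)
                pr_info("queued value: %d\n", item->value);
}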

wait queues and work queues, do they always go together?

As part of the probe callback, this driver initializes a work queue which is part of the driver’s private structure and adds itself to the Q. But I do not see any blocking of any kind anywhere.

I think you meant the wait queue head, not the work queue. I do not see any evidence of the probe adding itself to the queue; it is merely initializing the queue.

The queue is used by the calls to the wait_event_timeout() macro in the bcmgenet_mii_read() and bcmgenet_mii_write() functions in bcmmii.c. These calls will block until either the condition they are waiting for becomes true or the timeout period elapses. They are woken up by the wake_up(&priv->wq); call in the ISR0 interrupt handler.
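
In other words, the pattern is roughly the following. This is a hedged sketch with made-up names (my_priv, mdio_done, my_mdio_wait), not the actual bcmgenet code; the real driver checks its MDIO status registers rather than a simple flag, and priv->wq is set up with init_waitqueue_head() at probe time:

#include <linux/errno.h>
#include <linux/interrupt.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

struct my_priv {
        wait_queue_head_t wq;           /* initialized with init_waitqueue_head() in probe */
        bool mdio_done;
};

static int my_mdio_wait(struct my_priv *priv)
{
        /* Sleep until the ISR sets mdio_done, or give up after ~10 ms. */
        long left = wait_event_timeout(priv->wq, priv->mdio_done, HZ / 100);

        return left ? 0 : -ETIMEDOUT;
}

static irqreturn_t my_isr(int irq, void *dev_id)
{
        struct my_priv *priv = dev_id;

        priv->mdio_done = true;
        wake_up(&priv->wq);             /* unblocks my_mdio_wait() */
        return IRQ_HANDLED;
}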

Then it goes on to initialize a work queue with a function to call when woken up.

It is initializing a work item, not a work queue. The function will be called from a kernel thread as a result of the work item being added to the system work queue.

Now coming to the ISR0 for the driver, within that is an explicit call to the scheduler as part of the ISR (bcmgenet_isr0) if certain conditions are met. Now AFAIK, this call is used to defer work to a later time, much like a tasklet does.

You are referring to the schedule_work(&priv->bcmgenet_irq_work); call in the ISR0 interrupt handler. This adds the previously mentioned work item to the system work queue. It is similar to a tasklet, but tasklets run in softirq context whereas work items run in process context.
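
To make that division of labour concrete, here is a hedged sketch of the pattern with made-up names (my_priv, my_deferred_work, my_probe, my_isr0); the real driver initializes priv->bcmgenet_irq_work in its probe routine and queues it from bcmgenet_isr0 in the same way:

#include <linux/interrupt.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

struct my_priv {
        struct work_struct irq_work;
};

static void my_deferred_work(struct work_struct *work)
{
        struct my_priv *priv = container_of(work, struct my_priv, irq_work);

        /* Runs later, in process context, on a system-workqueue kernel
         * thread: it may sleep, take mutexes, etc., unlike a tasklet,
         * which runs in softirq context. */
        pr_info("deferred work for %p\n", priv);
}

static int my_probe(struct my_priv *priv)
{
        INIT_WORK(&priv->irq_work, my_deferred_work); /* just initializes the work item */
        return 0;
}

static irqreturn_t my_isr0(int irq, void *dev_id)
{
        struct my_priv *priv = dev_id;

        schedule_work(&priv->irq_work); /* add the work item to the system work queue */
        return IRQ_HANDLED;
}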

Post this we check some MDIO status flags and if the conditions are met, we wake up the process which was blocked in process context. But where exactly is the process blocked?

As mentioned above, the process is blocked in the bcmgenet_mii_read() and bcmgenet_mii_write() functions, although they use a timeout to avoid blocking for long periods. (This timeout is especially important for those versions of GENET that do not support MDIO-related interrupts!)

Also, most of the time, wait queues seem to be used in conjunction with work queues. Is that the typical way to use them?

Not especially. This particular driver uses both a wait queue and a work item, but I wouldn't describe them as being used "in conjunction" since they are being used to handle different interrupt conditions.

Why are message queues in kernel address space, but not shared memory?

I would say message queues are maintained in kernel space for (a) historical reasons and (b) architectural reasons: they are modeled as a kernel-managed resource, which is only created, modified, and deleted according to the defined API. That means, for example, that once a process sends a message, it cannot be modified in flight; it can only be received. Access controls are also imposed on objects in the queue. Managing and enforcing the details of the API would be difficult if the queue were maintained in user-space memory.

That being said, apart from the security/assurance aspects, you probably could actually implement message queues with the same API using a shared memory area and have it be completely transparent to consuming applications.

For shared memory itself, the key is that it's shared. To fulfill its function, it must be accessible in the virtual address spaces of process A and process B at the same time. If process A stores a byte at a given offset in the shared memory, process B should (ideally) see that modification near-instantaneously, though there will always be the potential for cache delays and so forth on multi-processor systems. And since user-space processes are never allowed to directly modify kernel virtual addresses, the shared mapping must be created in user virtual address space (though there's no reason the kernel could not also map the same region into kernel virtual address space).
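
For illustration, here is a hedged user-space sketch (the object name "/demo_shm" is made up): two cooperating processes map the same shared memory object, each at its own user virtual address, and a store by one becomes visible to the other.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Both process A and process B run this; the first to arrive
         * creates the object, the other one simply opens it. */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
                return 1;
        ftruncate(fd, 4096);

        /* Each process gets its own user-space virtual address for the
         * same underlying pages. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        strcpy(p, "hello from the other side"); /* visible to the peer process */
        printf("shared buffer says: %s\n", p);

        munmap(p, 4096);
        close(fd);
        return 0;
}

(On older glibc you may need to link with -lrt for shm_open().)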

Wait queue and race condition

Yes, that code is indeed racy: between the prepare_to_wait() and schedule() calls there should be a check of the condition (breaking out of the loop if it is satisfied).

Funnily enough, in the discussion that follows, the book refers (on page 60) to the implementation of the inotify_read() function in fs/notify/inotify/inotify_user.c:

DEFINE_WAIT(wait);
...
while (1) {
        prepare_to_wait(&group->notification_waitq,
                        &wait,
                        TASK_INTERRUPTIBLE);

        if (<condition>) // very simplified form of checks
                break;
        if (signal_pending(current))
                break;

        schedule();
}
finish_wait(&group->notification_waitq, &wait);
...

which according to the author "follows the pattern":

This function follows the pattern laid out in our example. The main difference is that it checks for the condition in the body of the while() loop, instead of in the while() statement itself. This is because checking the condition is complicated and requires grabbing locks. The loop is terminated via break.

However, that code exhibits another pattern, which checks the condition between the prepare_to_wait() and schedule() calls. That code is actually correct (race-free).
Also, that code doesn't use add_wait_queue(), which is superfluous in the presence of prepare_to_wait().

In another book by the same author, Linux Driver Development (3rd edition), the usage of wait queues seems to be more accurate. See, e.g., chapter 6, Advanced Char Driver Operations.

Linux kernel: how to wait in multiple wait queues?

One possible scenario for waiting on several waitqueues:

int ret = 0; // result of the wait, in the form 0 / -err

// Define wait objects, one object per waitqueue.
DEFINE_WAIT_FUNC(wait1, default_wake_function);
DEFINE_WAIT_FUNC(wait2, default_wake_function);

// Add ourselves to all waitqueues.
add_wait_queue(wq1, &wait1);
add_wait_queue(wq2, &wait2);

// Waiting cycle
while (1) {
        // Change the task state for waiting.
        // NOTE: this must come **before** the condition check to avoid races.
        set_current_state(TASK_INTERRUPTIBLE);
        // Check the condition(s) we are waiting for.
        if (cond)
                break;
        // Bail out if a signal is already pending; checking this before
        // schedule() avoids sleeping on a signal that has already arrived.
        if (signal_pending(current)) {
                ret = -ERESTARTSYS;
                break;
        }
        // Nothing to do yet: give up the CPU until a wake_up() or a signal.
        schedule();
}
// Remove ourselves from all waitqueues.
remove_wait_queue(wq1, &wait1);
remove_wait_queue(wq2, &wait2);
// Restore the task state.
__set_current_state(TASK_RUNNING);
// 'ret' now contains the result of the wait.

Note that this scenario is slightly different from the one used by wait_event:

wait_event uses autoremove_wake_function for its wait object (created with DEFINE_WAIT). This function, called from wake_up(), removes the wait object from the queue, so the wait object has to be re-added to the queue on every iteration.

But in the case of multiple waitqueues it is impossible to know which waitqueue has fired, so following that strategy would require re-adding every wait object on every iteration, which is inefficient.

Instead, our scenario uses default_wake_function for the wait objects, so they are not removed from their waitqueues by the wake_up() call, and it is sufficient to add each wait object to its queue only once, before the loop.
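
For completeness, here is a hedged sketch of how the two waitqueue heads used above might be declared and woken (the names queue_a, queue_b, and my_event_on_queue_* are made up). Either wake_up call is enough to let the loop re-check its condition, because default_wake_function wakes the task without removing the wait objects from the queues:

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(queue_a);
static DECLARE_WAIT_QUEUE_HEAD(queue_b);
static wait_queue_head_t *wq1 = &queue_a;
static wait_queue_head_t *wq2 = &queue_b;

/* Waker side: called from whatever event source owns each queue. */
void my_event_on_queue_a(void)
{
        wake_up_interruptible(wq1);     /* wakes tasks sleeping in TASK_INTERRUPTIBLE */
}

void my_event_on_queue_b(void)
{
        wake_up_interruptible(wq2);
}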

How to implement queues using linux kernel list?

See the list_add() example at the following link:
https://isis.poly.edu/kulesh/stuff/src/klist/

Assuming struct proc has a struct list_head list member, your code would look like this:

void insertProcQ(struct list_head *q, struct proc *p)
{
        list_add(&p->list, q);
}
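
Note that list_add() inserts right after the head, i.e. at the front of the list. If insertProcQ() is meant to give FIFO (queue) semantics together with a dequeue from the front, a hedged sketch would look like the following (removeProcQ is a made-up companion function, not part of the original question):

#include <linux/list.h>

void insertProcQ(struct list_head *q, struct proc *p)
{
        list_add_tail(&p->list, q);     /* enqueue at the tail */
}

struct proc *removeProcQ(struct list_head *q)
{
        struct proc *p = list_first_entry_or_null(q, struct proc, list);

        if (p)
                list_del(&p->list);     /* dequeue from the head */
        return p;
}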

What is the difference between message queues and files in Linux? Also, what is the significance of priorities in message queues?

Files are mostly stored on the hard disk and are persistent across multiple system reboots. Message queues are stored in main memory and are available only while the system is running; they are not preserved across system reboots.

A file is a sequence of bytes, and by default the system does not enforce any structure on those bytes. Using a file for IPC is therefore difficult, because extra logic would be required to delimit and identify messages, and there are additional problems such as multiple programs writing to the file at the same time, deleting old messages, and so on.

Message queues are fast, because the messages are stored in main memory, and the kernel takes care of problems like concurrent access to a message queue by multiple processes and the addition and deletion of messages. So files are used for storing information on secondary storage, whereas message queues are used for inter-process communication.
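
Regarding priorities: with POSIX message queues, for example, each message is sent with a priority, and mq_receive() always returns the oldest of the highest-priority messages first. A hedged user-space sketch (the queue name "/demo_mq" is made up):

#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>

int main(void)
{
        struct mq_attr attr = { .mq_maxmsg = 10, .mq_msgsize = 64 };
        mqd_t mq = mq_open("/demo_mq", O_CREAT | O_RDWR, 0600, &attr);
        if (mq == (mqd_t)-1)
                return 1;

        mq_send(mq, "low",  4, 1);      /* priority 1 */
        mq_send(mq, "high", 5, 9);      /* priority 9 */

        char buf[64];
        unsigned int prio;
        mq_receive(mq, buf, sizeof(buf), &prio);
        printf("received \"%s\" with priority %u\n", buf, prio); /* prints "high" */

        mq_close(mq);
        mq_unlink("/demo_mq");
        return 0;
}

(Link with -lrt on older glibc. System V message queues achieve a similar effect differently, by letting the receiver select messages by type.)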

Are running processes ever put into a sleep/wait queue automatically by the kernel?

So my question is: Are any processes ever put into the wait queue without the code explicitly entering itself into it?

There isn't one wait queue ("the wait queue" vs. "a wait queue").

Instead, you can expect some kind of data structure (e.g. a wait queue) for each resource that a task could be waiting for: maybe one for each mutex, one for each pipe, one for each network socket, etc. Also, for modern kernels (designed for lots of CPUs) it is relatively common to have even more wait queues to improve scalability and reduce lock contention; e.g. for something like time ("sleep()") you might have one queue per CPU instead of one global queue that all CPUs share.

Of course it's not that simple either. A wait queue might not be a queue and could be something else (e.g. a tree); and (within reason) there can be multiple resources associated with a wait queue (e.g. instead of one for each separate pipe you might use a hash table approach and always have 4096 queues regardless of how many pipes there are).

So my question is: Are any processes ever put into the wait queue without the code explicitly entering itself into it?

Yes. There are 2 cases where this is extremely likely:

a) The scheduler (when tasks are waiting for CPU time) will have some kind of "wait something" (a tree? a list per thread priority?). This isn't much different from any other case: a CPU is just another resource, and waiting for CPU time is just waiting for a resource.

b) Virtual memory tricks. For example, if a process needs to wait until data is fetched from swap space (or from a memory-mapped file, or ...), then the process will be put on some kind of "wait something".


