Why Do We Need to Call Poll_Wait in Poll

Why do we need to call poll_wait in poll?

poll_wait adds your device (represented by the "struct file") to the list of those that can wake the process up.

The idea is that the process can use poll (or select, epoll, etc.) to add a bunch of file descriptors to the list on which it wishes to wait. The poll entry of each driver gets called, and each one adds itself (via poll_wait) to the waiter list.

Then the core kernel blocks the process in one place; that way, any one of the devices can wake the process up. If you return non-zero mask bits, those "ready" attributes (readable/writable/etc.) apply now.

So, in pseudo-code, it's roughly like this:

    foreach fd:
        find device corresponding to fd
        call device poll function to set up wait queues (with poll_wait)
            and to collect its "ready-now" mask

    while time remaining in timeout and no devices are ready:
        sleep

    return from system call (either due to timeout or to ready devices)

How do poll_wait() and wake_up_interruptible() work in sync?

I see that the process is put to sleep by poll_wait() in poll file operation ...

No, you got it wrong.

A call to poll_wait merely marks the current process as non-runnable and adds it to the wait queue. Neither operation sleeps, so poll_wait returns immediately. After that, the poll file operation runs to completion and returns a mask of the operations that are currently available.

It is the caller of the file's poll operation that calls schedule() and actually puts the current process to sleep. And schedule() is called only when the returned mask of available operations does not intersect the mask of requested operations.

As you can see, the poll method first calls poll_wait and only then computes the mask:

    poll_wait(filep, &idev->wait, wait);
    if (listener->event_count != atomic_read(&idev->event))
        return EPOLLIN | EPOLLRDNORM;
    return 0;

So if wake_up_interruptible is called before poll_wait, the poll operation returns EPOLLIN | EPOLLRDNORM and no sleeping is performed (assuming the file is polled for reading).

If wake_up_interruptible is called after poll_wait, it returns the process to the runnable state, so schedule() won't put it to sleep. After the call to schedule(), the poll operation is re-run, and this time it returns a non-zero mask.
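Why no wake-up can be lost in either order can be sketched as pseudo-code for the caller of the poll operation. This is a rough sketch of the idea, not the literal code in fs/select.c:

```
repeat:
    mask = file->f_op->poll(file, table)   # poll_wait: mark non-runnable,
                                           # enqueue on the wait queue
    if mask & requested_events:
        mark process runnable again
        return mask         # wake-up arrived before poll_wait: no sleep
    schedule()              # wake-up arrived after poll_wait: the process
                            # is already runnable, schedule() returns soon
```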

poll exiting immediately from driver

When someone calls the poll() system call on a device file, the VFS layer handles it this way.

It calls your driver's poll handler (henceforth referred to as my_poll).

  1. If you do not call poll_wait, the upper layer will just return the mask without analyzing its value.
  2. If you add your driver/calling process to a wait queue using poll_wait(), the calling process will be added to that wait queue.

Do not be deceived by the API's name, poll_wait(). It doesn't actually wait or sleep; it merely adds your process to the list of processes waiting to receive event notifications.

After poll_wait() returns, the entire behavior depends on the mask you return, whose value is used by the VFS layer. If the returned mask is zero, the VFS layer puts the calling process to sleep; the process is now waiting for the event to happen.

When someone wakes up this wait queue, all waiting processes get notified. The VFS layer calls my_poll() again and checks the returned mask.

This process continues until the VFS layer receives a non-zero mask.
It means my_poll() will be called multiple times, so anyone implementing poll/read in a device driver should check whether the device is actually ready for the read/write operation before returning a readiness mask.
The important thing to note here: do not assume that poll_wait() will sleep until someone wakes up that queue.

How to add poll function to the kernel module code?

You can find some good examples in the kernel itself. Take a look at these files:

  • drivers/rtc/dev.c, drivers/rtc/interface.c
  • kernel/printk/printk.c
  • drivers/char/random.c

To add a poll() function to your code, follow these steps.

  1. Include the needed headers:

         #include <linux/wait.h>
         #include <linux/poll.h>

  2. Declare a waitqueue variable:

         static DECLARE_WAIT_QUEUE_HEAD(fortune_wait);

  3. Add a fortune_poll() function and set it (as the .poll callback) in your file operations structure:

         static unsigned int fortune_poll(struct file *file, poll_table *wait)
         {
             poll_wait(file, &fortune_wait, wait);
             if (new_data_is_ready)  /* replace with your actual readiness check */
                 return POLLIN | POLLRDNORM;
             return 0;
         }

         static const struct file_operations proc_test_fops = {
             ....
             .poll = fortune_poll,
         };

     Note that you should return POLLIN | POLLRDNORM if you have some new data to read, and 0 when there is no new data to read (the poll() call timed out). See man 2 poll for details.

  4. Notify your waitqueue once you have new data:

         wake_up_interruptible(&fortune_wait);

That's the basic stuff about implementing the poll() operation. Depending on your task, you may need to use some of the waitqueue API in your .read function as well (like wait_event_interruptible()).


See also related question: Implementing poll in a Linux kernel module.

Calls are getting routed to the driver when the application uses poll(), but not with epoll(), in Linux

The implementations of poll and epoll are different.

Before anything else: we know the driver's poll always calls poll_wait(). When that happens is the most important difference between these 2 system calls.

poll/select

The driver's poll is called every time poll/select is called from userspace. It adds the current process to the wait queue and adds the wait queue to the poll_table.

  1. Userspace polled 2 different file descriptors.
  2. The kernel called every file descriptor's poll driver.
  3. The poll drivers called poll_wait, adding the current process to the poll_table.
  4. Assume neither was ready, so there were now 2 wait queues in the poll table.
  5. When 1 of the file descriptors became ready, it woke the process up.
  6. The woken process then called every file descriptor's poll driver again to check which file descriptor was ready.
  7. At last, it returned to userspace.

epoll

The driver's poll is only called by epoll_ctl.

  1. Userspace called epoll_ctl to set up 2 different file descriptors.
  2. The kernel called every file descriptor's poll driver.
  3. The poll drivers called poll_wait. But this time poll_wait behaved differently from poll/select: it not only added an entry to the wait queue but also registered ep_poll_callback as the callback to run on wake-up.
  4. Assume neither was ready, so now there were 2 wait queues in the poll table.
  5. When 1 of the file descriptors became ready, the wake-up fired.
  6. So ep_poll_callback was called, which added the corresponding file descriptor to epoll's ready queue.
  7. epoll_wait checked the ready queue and found the ready one.
  8. At last, it returned to userspace.

in linux char device driver, what does the poll_queue_proc function do?

drivers/char/random.c:random_poll() is called when userspace calls
select() (or poll() or epoll_wait() for that matter) with a file
descriptor referring to /dev/random.

These system calls are the basis of event multiplexing. In the following program, userspace opens a number of input sources (say /dev/random and /dev/ttyS4) and calls select() on both of them to block until any of them has input data to be read. (There are other event sources than input; input is just the simplest.)

#include <sys/select.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define _SYSE(ret, msg) do { \
        if (ret == -1) { \
            perror(msg); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

static int /*bool: EOF detected*/ consume_fd(int fd, const char* msg)
{
    char tmp[64];
    ssize_t nread;

    nread = read(fd, tmp, sizeof(tmp));
    _SYSE(nread, "read");
    if (nread == 0 /*EOF*/)
        return 1;

    printf("%s: consumed %zd bytes\n", msg, nread);
    return 0;
}

int main(void)
{
    int random_fd, tty_fd, nfds = 0;

    random_fd = open("/dev/random", O_RDONLY);
    _SYSE(random_fd, "open random");
    if (random_fd > nfds)
        nfds = random_fd + 1;

    tty_fd = open("/dev/ttyS4", O_RDONLY);
    _SYSE(tty_fd, "open tty");
    if (tty_fd > nfds)
        nfds = tty_fd + 1;

    while (1) {
        fd_set in_fds;
        int ret;

        FD_ZERO(&in_fds);
        FD_SET(random_fd, &in_fds);
        FD_SET(tty_fd, &in_fds);

        ret = select(nfds, &in_fds, NULL, NULL, NULL);
        _SYSE(ret, "select");

        if (FD_ISSET(random_fd, &in_fds)) {
            int eof_detected = consume_fd(random_fd, "random");
            if (eof_detected)
                break;
        }
        if (FD_ISSET(tty_fd, &in_fds)) {
            int eof_detected = consume_fd(tty_fd, "tty");
            if (eof_detected)
                break;
        }
    }
    return 0;
}

Output will appear once either random numbers are available, or the
serial line has data. (Note that nowadays /dev/random does not
block, but rather generates pseudo random numbers, so output is really
fast.)

It is when the select() call enters the kernel that random_poll()
is called, and another, comparable, function somewhere in the TTY
layer - simply because select() passes those file descriptors as
parameters. Those functions are supposed to simply enqueue the caller
into a poll_table that is maintained out of your reach (it
represents the calling task for that purpose).

In a second stage the implementation of select() then suspends the
caller until any of the events become true. (See fs/select.c.)

Emitting a poll/select event from a timer handler through a wait queue

Calling wake_up_interruptible() on the polled queue just forces the .poll method to be called again. The user space process receives a notification only when the .poll method returns a mask that has the polled bits set.

Check that your .poll method actually returns a non-zero mask.


