How Does epoll's EPOLLEXCLUSIVE Mode Interact with Level-Triggering

What is the purpose of epoll's edge triggered option?

When an FD becomes read or write ready, you might not necessarily want to read (or write) all the data immediately.

Level-triggered epoll will keep nagging you as long as the FD remains ready, whereas edge-triggered notifies you only when the FD's readiness changes. In practice that typically means it won't bother you again until you have hit EAGAIN and the FD becomes ready once more (so it's more complicated to code around, but can be more efficient depending on what you need to do).

Say you're writing from a resource to an FD. If you register your interest for that FD becoming write ready as level-triggered, you'll get constant notification that the FD is still ready for writing. If the resource isn't yet available, that's a waste of a wake-up, because you can't write any more anyway.

If you were to add it as edge-triggered instead, you'd get notification that the FD was write ready once, then when the other resource becomes ready you write as much as you can. Then if write(2) returns EAGAIN, you stop writing and wait for the next notification.

The same applies for reading, because you might not want to pull all the data into user-space before you're ready to do whatever you want to do with it (thus having to buffer it, etc etc). With edge-triggered epoll you get told when it's ready to read, and then can remember that and do the actual reading "as and when".
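A minimal sketch of the read side, assuming a non-blocking pipe stands in for the real FD. The helper names drain_fd and edge_triggered_demo are hypothetical, not part of any API. The demo registers the read end with EPOLLET, writes once, and polls twice without reading: the second epoll_wait is expected to report nothing, because no new edge has occurred even though the data is still there.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Read from a non-blocking fd until read(2) reports EAGAIN (or EOF).
 * Returns the number of bytes drained, or -1 on any other error. */
static long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            total += n;
        else if (n == 0)
            return total;                 /* EOF: writer closed */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;                 /* drained: wait for the next edge */
        else if (errno != EINTR)
            return -1;                    /* real error */
    }
}

/* Hypothetical demo: stores the results of two back-to-back epoll_wait
 * calls and the number of bytes drained afterwards. Returns 0 on success. */
static int edge_triggered_demo(int *first, int *second, long *drained)
{
    int p[2];
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET };
    struct epoll_event out;
    int ep, rc = -1;

    if (pipe(p) < 0)
        return -1;
    ep = epoll_create1(0);
    ev.data.fd = p[0];
    if (ep >= 0 &&
        fcntl(p[0], F_SETFL, O_NONBLOCK) == 0 &&
        epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "hello", 5) == 5) {
        *first = epoll_wait(ep, &out, 1, 0);   /* reports the edge */
        *second = epoll_wait(ep, &out, 1, 0);  /* no new edge since then */
        *drained = drain_fd(p[0]);             /* the data was there all along */
        rc = 0;
    }
    if (ep >= 0)
        close(ep);
    close(p[0]);
    close(p[1]);
    return rc;
}
```

With a level-triggered registration (drop EPOLLET), the second epoll_wait would report the FD again, because the unread data keeps it ready.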

How to use an eventfd with level triggered behaviour on epoll?

When you write to an eventfd, a function eventfd_signal is called. It contains the following line which does the wake up:

wake_up_locked_poll(&ctx->wqh, EPOLLIN);

With wake_up_locked_poll being a macro:

#define wake_up_locked_poll(x, m)                       \
    __wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))

With __wake_up_locked_key being defined as:

void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
    __wake_up_common(wq_head, mode, 1, 0, key, NULL);
}

And finally, __wake_up_common is declared as:

/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
            int nr_exclusive, int wake_flags, void *key,
            wait_queue_entry_t *bookmark)

Note the nr_exclusive argument: __wake_up_locked_key passes 1 for it, so writing to an eventfd wakes only one exclusive waiter.

What does exclusive mean? Reading epoll_ctl man page gives us some insight:

EPOLLEXCLUSIVE (since Linux 4.5):

Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2).
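For illustration, here is a rough sketch of what such a registration might look like, assuming a kernel and libc that support EPOLLEXCLUSIVE (Linux 4.5 or later). The helpers attach_exclusive and exclusive_demo are hypothetical names. Two epoll instances are attached to one eventfd in exclusive, level-triggered mode, and the demo counts how many of them report the event after a single write. As the man page says, that can be one or more, so the sketch makes no claim about the exact count.

```c
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28)   /* value from the Linux UAPI headers */
#endif

/* Create an epoll instance and attach `target` to it in exclusive,
 * level-triggered mode. Returns the epoll fd, or -1 on error.
 * EPOLLEXCLUSIVE is only accepted with EPOLL_CTL_ADD. */
static int attach_exclusive(int target)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    int ep = epoll_create1(0);

    if (ep < 0)
        return -1;
    ev.data.fd = target;
    if (epoll_ctl(ep, EPOLL_CTL_ADD, target, &ev) < 0) {
        close(ep);
        return -1;
    }
    return ep;
}

/* Hypothetical demo: returns how many of two epoll instances report
 * the eventfd as readable after one write, or -1 on setup failure. */
static int exclusive_demo(void)
{
    struct epoll_event out;
    int ready = -1;
    int efd = eventfd(0, EFD_NONBLOCK);
    int ep1 = efd >= 0 ? attach_exclusive(efd) : -1;
    int ep2 = efd >= 0 ? attach_exclusive(efd) : -1;

    if (ep1 >= 0 && ep2 >= 0 && eventfd_write(efd, 1) == 0)
        ready = (epoll_wait(ep1, &out, 1, 0) == 1) +
                (epoll_wait(ep2, &out, 1, 0) == 1);

    if (ep1 >= 0)
        close(ep1);
    if (ep2 >= 0)
        close(ep2);
    if (efd >= 0)
        close(efd);
    return ready;
}
```

Because the registration is level-triggered and nobody reads the eventfd, any epoll instance that saw the event keeps reporting it until the counter is consumed.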

You do not use EPOLLEXCLUSIVE when adding your event, but to wait with epoll_wait every thread has to put itself on a wait queue. The function do_epoll_wait performs the wait by calling ep_poll. By following the code you can see that it adds the current thread to a wait queue at line #1903:

__add_wait_queue_exclusive(&ep->wq, &wait);

This explains what is going on: epoll waiters are exclusive, so only a single thread is woken up. This behavior was introduced in v2.6.22-rc1, and the relevant change has been discussed here.

To me this looks like a bug in the eventfd_signal function: in semaphore mode it should perform a wake-up with nr_exclusive equal to the value written.

So your options are:

Create a separate epoll descriptor for each thread (might not work with your design, and has scaling problems).
Put a mutex around it (scaling problems).
Use poll, probably on both the eventfd and the epoll descriptor.
Wake each thread separately by writing 1 with eventfd_write 4 times (probably the best you can do).
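The last option can be sketched roughly as follows, assuming the eventfd was created with EFD_SEMAPHORE. The helper names post_wakeups, consume_tokens, and semaphore_demo are hypothetical. To keep it short, the sketch is single-threaded: it only shows that n separate writes of 1 produce n separate tokens. In a real program each worker thread would block in epoll_wait and then call eventfd_read once.

```c
#include <sys/eventfd.h>
#include <unistd.h>

/* Post `n` separate wake-ups by writing 1 to the eventfd n times,
 * so that each write triggers its own wake-up of one waiter. */
static int post_wakeups(int efd, int n)
{
    for (int i = 0; i < n; i++)
        if (eventfd_write(efd, 1) < 0)
            return -1;
    return 0;
}

/* Read tokens until the non-blocking eventfd is empty.
 * Returns the number of successful reads, or -1 if a read
 * returned something other than 1 (i.e. not semaphore mode). */
static int consume_tokens(int efd)
{
    eventfd_t v;
    int reads = 0;

    while (eventfd_read(efd, &v) == 0) {
        if (v != 1)
            return -1;
        reads++;                /* EFD_SEMAPHORE: each read takes exactly 1 */
    }
    return reads;               /* loop ends when read(2) fails with EAGAIN */
}

/* Hypothetical demo: post n tokens, then count how many can be consumed. */
static int semaphore_demo(int n)
{
    int got = -1;
    int efd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);

    if (efd < 0)
        return -1;
    if (post_wakeups(efd, n) == 0)
        got = consume_tokens(efd);
    close(efd);
    return got;
}
```

Calling semaphore_demo(4) mirrors the "write 1 four times" suggestion: four writes, four tokens, one per worker.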

How could a recv() call block when epoll has signalled activity?

From the Linux select man-page:

Under Linux, select() may report a socket file descriptor as "ready
for reading", while nevertheless a subsequent read blocks. This
could for example happen when data has arrived but upon examination
has wrong checksum and is discarded. There may be other
circumstances in which a file descriptor is spuriously reported as
ready. Thus it may be safer to use O_NONBLOCK on sockets that should
not block.

(yeah, I know epoll() is not the same as select(), but I suspect the same underlying conditions apply to both)

I think if you really want to avoid blocking, the only safe way to accomplish that is to set your socket to non-blocking mode.
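A small sketch of that advice, assuming a UNIX-domain socketpair stands in for the real connection. The helpers set_nonblocking and recv_would_block_demo are hypothetical names. Once O_NONBLOCK is set, a recv() on a socket with nothing to read fails with EAGAIN or EWOULDBLOCK instead of blocking, so a spurious readiness report costs one failed call rather than a stalled thread.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Add O_NONBLOCK to the fd's existing status flags. Returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical demo: returns 1 if recv() on an empty non-blocking socket
 * fails with EAGAIN/EWOULDBLOCK, 0 if it behaves otherwise, and -1 on
 * setup failure. */
static int recv_would_block_demo(void)
{
    int sv[2];
    char buf[16];
    int would_block = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (set_nonblocking(sv[0]) == 0) {
        ssize_t n = recv(sv[0], buf, sizeof buf, 0);   /* nothing was sent */
        would_block = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    }
    close(sv[0]);
    close(sv[1]);
    return would_block;
}
```

In an event loop, the EAGAIN case is simply treated as "nothing to do yet" and the code goes back to epoll_wait.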
