What is the purpose of epoll's edge triggered option?
When an FD becomes read or write ready, you might not necessarily want to read (or write) all the data immediately.
Level-triggered epoll will keep nagging you as long as the FD remains ready, whereas edge-triggered only reports a change in readiness: once notified, you should not count on hearing about that FD again until you have hit EAGAIN and it becomes ready once more (so it's more complicated to code around, but can be more efficient depending on what you need to do).
Say you're writing from a resource to an FD. If you register your interest for that FD becoming write ready as level-triggered, you'll get constant notification that the FD is still ready for writing. If the resource isn't yet available, that's a waste of a wake-up, because you can't write any more anyway.
If you were to add it as edge-triggered instead, you'd get notification that the FD was write ready once, then when the other resource becomes ready you write as much as you can. Then if write(2) returns EAGAIN, you stop writing and wait for the next notification.
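The "write as much as you can, stop on EAGAIN" step might look roughly like the sketch below. It assumes the FD is already non-blocking; write_until_eagain and make_nonblocking_pipe are hypothetical helper names used only for illustration, and the zero-filled chunk stands in for whatever your resource actually produces.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write zero-filled chunks to a non-blocking fd until
 * the kernel buffer is full. Returns the number of bytes accepted, or -1 on
 * an unexpected error. On a normal return errno is EAGAIN/EWOULDBLOCK. */
static long write_until_eagain(int fd)
{
    char chunk[4096];
    long total = 0;

    assert(fd >= 0);
    memset(chunk, 0, sizeof chunk);
    for (;;) {
        ssize_t n = write(fd, chunk, sizeof chunk);
        if (n > 0) {
            total += n;               /* accepted, keep going */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                 /* interrupted, just retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;             /* full: wait for the next EPOLLOUT */
        return -1;                    /* real error */
    }
}

/* Hypothetical helper: create a pipe whose write end is non-blocking. */
static int make_nonblocking_pipe(int fds[2])
{
    if (pipe(fds) < 0)
        return -1;
    int flags = fcntl(fds[1], F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fds[1], F_SETFL, flags | O_NONBLOCK);
}
```

With an edge-triggered registration you would call something like write_until_eagain() after each EPOLLOUT notification, and only go back to epoll_wait once it returns.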
The same applies for reading, because you might not want to pull all the data into user-space before you're ready to do whatever you want to do with it (thus having to buffer it, etc etc). With edge-triggered epoll you get told when it's ready to read, and then can remember that and do the actual reading "as and when".
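To see the difference between the two modes, here is a small sketch that leaves one unread byte in a pipe and polls several times without reading it. count_reports is a hypothetical name; the zero timeout is just so the demo never blocks.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register the read end of a pipe with EPOLLIN plus extra_flags (0 for
 * level-triggered, EPOLLET for edge-triggered), write one byte, then poll
 * `rounds` times WITHOUT reading. Returns how many polls reported the fd,
 * or -1 on setup failure. */
static int count_reports(unsigned int extra_flags, int rounds)
{
    int fds[2];
    int reports = 0;

    if (pipe(fds) < 0)
        return -1;
    int ep = epoll_create1(0);
    if (ep < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | extra_flags,
                              .data.fd = fds[0] };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) < 0 ||
        write(fds[1], "x", 1) != 1) {
        reports = -1;
    } else {
        for (int i = 0; i < rounds; i++) {
            struct epoll_event out;
            /* timeout 0: return immediately if nothing is pending */
            if (epoll_wait(ep, &out, 1, 0) == 1)
                reports++;
        }
    }

    close(ep);
    close(fds[0]);
    close(fds[1]);
    return reports;
}
```

Calling count_reports(0, 3) should report the FD on every poll, while count_reports(EPOLLET, 3) should report it only on the first one, since nothing changes afterwards.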
How to use an eventfd with level triggered behaviour on epoll?
When you write to an eventfd, a function eventfd_signal is called. It contains the following line which does the wake up:
wake_up_locked_poll(&ctx->wqh, EPOLLIN);
With wake_up_locked_poll being a macro:
#define wake_up_locked_poll(x, m) \
	__wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
With __wake_up_locked_key being defined as:
void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key)
{
	__wake_up_common(wq_head, mode, 1, 0, key, NULL);
}
And finally, __wake_up_common is declared as:
/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key,
			wait_queue_entry_t *bookmark)
Note the nr_exclusive argument, which __wake_up_locked_key passes as 1, and you will see that writing to an eventfd wakes only one exclusive waiter.
What does exclusive mean? Reading the epoll_ctl man page gives us some insight:
EPOLLEXCLUSIVE (since Linux 4.5):
Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2).
You do not use EPOLLEXCLUSIVE when adding your event, but to wait with epoll_wait every thread has to put itself on a wait queue. The function do_epoll_wait performs the wait by calling ep_poll. By following the code you can see that it adds the current thread to a wait queue at line #1903:
__add_wait_queue_exclusive(&ep->wq, &wait);
This explains what is going on: epoll waiters are exclusive, so only a single thread is woken up. This behavior was introduced in v2.6.22-rc1 and the relevant change has been discussed here.
To me this looks like a bug in the eventfd_signal function: in semaphore mode it should perform a wake-up with nr_exclusive equal to the value written.
So your options are:
- Create a separate epoll descriptor for each thread (might not work with your design - scaling problems)
- Put a mutex around it (scaling problems)
- Use poll, probably on both eventfd and epoll
- Wake each thread separately by writing 1 with eventfd_write 4 times (probably the best you can do).
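The last option leans on semaphore-mode accounting: each write of 1 adds one unit and each read takes exactly one. Here is a single-threaded sketch of that accounting, assuming the eventfd was created with EFD_SEMAPHORE and registered level-triggered; count_wakeups is a hypothetical name, and a real program would have one waiting thread per unit instead of a loop.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Write 1 to a semaphore-mode eventfd `writes` times, then count how many
 * separate "epoll says readable, read takes one unit" rounds it yields.
 * Returns that count, or -1 on setup failure. */
static int count_wakeups(int writes)
{
    int woken = -1;
    int efd = eventfd(0, EFD_SEMAPHORE | EFD_NONBLOCK);
    int ep = epoll_create1(0);

    if (efd < 0 || ep < 0)
        goto out;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev) < 0)
        goto out;

    for (int i = 0; i < writes; i++)
        if (eventfd_write(efd, 1) < 0)    /* one unit per intended wake-up */
            goto out;

    woken = 0;
    for (;;) {
        struct epoll_event got;
        eventfd_t value;

        if (epoll_wait(ep, &got, 1, 0) != 1)
            break;                        /* counter is back to zero */
        if (eventfd_read(efd, &value) < 0)
            break;
        assert(value == 1);               /* semaphore mode reads 1 at a time */
        woken++;
    }

out:
    if (ep >= 0)
        close(ep);
    if (efd >= 0)
        close(efd);
    return woken;
}
```

In a threaded design, each worker would do the epoll_wait plus eventfd_read pair itself, and treat EAGAIN from the read as "another thread got there first".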
How could a recv() call block when epoll has signalled activity?
From the Linux select man-page:
Under Linux, select() may report a socket file descriptor as "ready
for reading", while nevertheless a subsequent read blocks. This
could for example happen when data has arrived but upon examination
has wrong checksum and is discarded. There may be other
circumstances in which a file descriptor is spuriously reported as
ready. Thus it may be safer to use O_NONBLOCK on sockets that should
not block.
(yeah, I know epoll() is not the same as select(), but I suspect the same underlying conditions apply to both)
I think if you really want to avoid blocking, the only safe way to accomplish that is to set your socket to non-blocking mode.
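A minimal sketch of that approach, assuming a plain stream socket: set O_NONBLOCK with fcntl, then treat EAGAIN from recv() as "false alarm, go back to waiting". set_nonblocking and try_recv are hypothetical helper names.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: switch an fd to non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: one recv() attempt that never blocks.
 * Returns 1 and stores the byte count in *got on success (0 bytes means the
 * peer closed), 0 if nothing is available after all, -1 on a real error. */
static int try_recv(int fd, void *buf, size_t len, ssize_t *got)
{
    for (;;) {
        ssize_t n = recv(fd, buf, len, 0);
        if (n >= 0) {
            *got = n;
            return 1;
        }
        if (errno == EINTR)
            continue;                     /* interrupted, retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 0;                     /* spurious readiness: wait again */
        return -1;
    }
}
```

After epoll reports the socket, you would call try_recv() and simply return to epoll_wait when it gives 0.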