Linux Device Driver File Operations: It Is Possible to Have Race Conditions

linux device driver file operations: it is possible to have race conditions?

Device driver code runs in the process which invoked the system calls. The kernel does not have an implicit "module lock" it locks before invoking module code. It is definitely possible for concurrent driver calls to happen when separate processes invoke system calls that end up in your driver code.

As you might expect, the kernel favors simplicity and performance over ease of implementation. It is up to you to grab the necessary spinlocks and semaphores when accessing shared state.

See Chapter 5 of Linux Device Drivers, which talks about concurrency and race conditions in great detail.

Concurrency and Its Management

In a modern Linux system, there are numerous sources of concurrency and, therefore, possible race conditions. Multiple user-space processes are running, and they can access your code in surprising combinations of ways. SMP systems can be executing your code simultaneously on different processors. Kernel code is preemptible; your driver’s code can lose the processor at any time, and the process that replaces it could also be running in your driver. Device interrupts are asynchronous events that can cause concurrent execution of your code. The kernel also provides various mechanisms for delayed code execution, such as workqueues, tasklets, and timers, which can cause your code to run at any time in ways unrelated to what the current process is doing. In the modern, hot-pluggable world, your device could simply disappear while you are in the middle of working with it.

My attributes are way too racy, what should I do?

Quoting (emphasis mine) Greg Kroah-Hartman from his comment to a merge request (that was merged by Linus as a part of 3.11 development cycle):

Here are some driver core patches for 3.11-rc2. They aren't really
bugfixes, but a bunch of new helper macros for drivers to properly
create attribute groups, which drivers and subsystems need to fix up a
ton of race issues with incorrectly creating sysfs files (binary and
normal) after userspace has been told that the device is present.

Also here is the ability to create binary files as attribute groups, to solve that race condition, which was impossible to do before this, so that's my fault the drivers were broken.

So it looks like there really is no way to solve this problem on old kernels.

Linux Device Driver Access Control

The example you point to is indeed flawed. The decrement is absolutely not guaranteed to be atomic and almost certainly will not be.

But realistically, I don't think there is a compiler/CPU combination that would ever produce code that could fail. The worst that could happen is one CPU core could finish the close, then another core could call open and get back busy because it has a stale cached value of the flag.

Linux provides atomic_* functions for this and also *_bit atomic bit flag operations. See core_api/atomic_ops.rst in the kernel documentation.

A example of a correct and simple pattern to do his would look something like:

unsigned long mydriver_flags;
#define IN_USE_BIT 0

static int mydriver_open(struct inode *inode, struct file *filp)
{
        if (test_and_set_bit(IN_USE_BIT, &mydriver_flags))
                return -EBUSY;
        /* continue with open */
        return 0;
}

static int mydriver_close(struct inode *inode, struct file *filp)
{
        /* do close stuff first */
        smp_mb__before_atomic();
        clear_bit(IN_USE_BIT, &mydriver_flags);
        return 0;
}

A real driver should have a device state struct, for each device, with mydriver_flags in it. Rather than using a single global for the whole driver as shown in the example.

That said, what you are trying to do is probably not a good idea. Even if only one process can open the device at a time, a process' open file descriptors are shared amongst all threads in the process. Multiple threads can make simultaneous read() and write() calls to the same file descriptor.

If a process has a file descriptor open and calls fork(), that descriptor will be inherited into the new process. This is a way multiple processes can have the device open at once despite the above "single open" restriction.

So you still have to be thread safe in your driver's file operations since a user can still have multiple threads/processes open the device at once and make simultaneous calls. And if you've made it safe, why prevent the user from doing it? Maybe they know what they're doing and will insure their multiple openers of the driver will "take turns" and not make calls that conflict?

Also consider the possibility using the O_EXCL flag in the open call to make single open optional.

Does a kernel driver's `release` file-operations handler wait for other fops to finish?

could 1 thread of a process be inside the ioctl handler for the file (or fd), while another thread of the same process is inside of the release handler?

No. The release entry point is called when the reference counter on the
file entry is 0. ioctl() increments the reference counter on the file. So, the release entry point will not be called while an ioctl() is on tracks.

Foreword

The source code discussed below is:

GLIBC 2.31
Linux 5.4

GLIBC's pthread management

The GLIBC's pthread_create() actually involves a clone() system call with
the following flags:

CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID

According to the manual of clone(), the CLONE_FILES flag makes the threads of a process

share the same file descriptor table. Any file descriptor created by

one thread is also valid in the other threads. Similarly, if one thread closes a file descriptor, or changes its associated flags (using the fcntl() F_SETFD operation), the other threads are also affected.

clone() on the kernel side

When clone() is passed CLONE_FILES, the files_struct is not duplicated but a reference counter is incremented. As a consequence, the task structures of both threads point on the same files_struct (files field):

. The task structure is defined in include/linux/sched.h:

struct task_struct {
[...]
    /* Open file information: */
    struct files_struct     *files; /// <==== Table of open files shared between thread
[...]

. In kernel/fork.c, the clone() service calls copy_files() to increment the reference counter on the files_struct

static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
    struct files_struct *oldf, *newf;
    int error = 0;

    /*
     * A background process may not have any files ...
     */
    oldf = current->files;
    if (!oldf)
        goto out;

    if (clone_flags & CLONE_FILES) {
      atomic_inc(&oldf->count);  // <==== Ref counter incremented: files_struct is shared
        goto out;
    }

    newf = dup_fd(oldf, &error);
    if (!newf)
        goto out;

    tsk->files = newf;
    error = 0;
out:
    return error;
}

. The files_struct is defined in include/linux/fdtable.h:

/*
 * Open file table structure
 */
struct files_struct {
  /*
   * read mostly part
   */
        atomic_t count;  // <==== Reference counter
    bool resize_in_progress;
    wait_queue_head_t resize_wait;

    struct fdtable __rcu *fdt;
    struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    unsigned int next_fd;
    unsigned long close_on_exec_init[1];
    unsigned long open_fds_init[1];
    unsigned long full_fds_bits_init[1];
    struct file __rcu * fd_array[NR_OPEN_DEFAULT];

ioctl() operation

ioctl() system call is defined fs/ioctl.c. It calls fdget() first to increment the reference counter on the file entry, do the requested operation and then call fdput()

int ksys_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
{
    int error;
    struct fd f = fdget(fd);

    if (!f.file)
        return -EBADF;
    error = security_file_ioctl(f.file, cmd, arg);
    if (!error)
        error = do_vfs_ioctl(f.file, fd, cmd, arg);
    fdput(f);
    return error;
}

SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{
    return ksys_ioctl(fd, cmd, arg);
}

The file entry is defined in include/linux/fs.h. Its reference counter is the f_count field:

struct file {
    union {
        struct llist_node   fu_llist;
        struct rcu_head     fu_rcuhead;
    } f_u;
    struct path     f_path;
    struct inode        *f_inode;   /* cached value */
    const struct file_operations    *f_op;

    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t      f_lock;
    enum rw_hint        f_write_hint;
        atomic_long_t       f_count;  // <===== Reference counter
    unsigned int        f_flags;
[...]
} __randomize_layout
  __attribute__((aligned(4)));

Example

Here is a simple device driver into which the file operations merely display a message when they are triggered. The ioctl() entry makes the caller sleep 5 seconds:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/kdev_t.h>
#include <linux/cdev.h>
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/delay.h>

MODULE_LICENSE("GPL");

#define DEVICE_NAME "device"

static int device_open(struct inode *, struct file *);
static int device_release(struct inode *, struct file *);
static ssize_t device_read(struct file *, char *, size_t, loff_t *);
static ssize_t device_write(struct file *, const char *, size_t, loff_t *);
static long int device_ioctl(struct file *, unsigned int, unsigned long);
static int device_flush(struct file *, fl_owner_t);

static const struct file_operations fops = {
    .owner = THIS_MODULE,
    .read = device_read,
    .write = device_write,
    .unlocked_ioctl = device_ioctl,
    .open = device_open,
    .flush = device_flush,
    .release = device_release
};

struct cdev *device_cdev;
dev_t deviceNumbers;

static  int __init init(void)
{
  // This returns the major number chosen dynamically in deviceNumbers
  int ret = alloc_chrdev_region(&deviceNumbers, 0, 1, DEVICE_NAME);

  if (ret < 0) {
    printk(KERN_ALERT "Error registering: %d\n", ret);
    return -1;
  }

  device_cdev = cdev_alloc();

  cdev_init(device_cdev, &fops);

  ret = cdev_add(device_cdev, deviceNumbers, 1);

  printk(KERN_INFO "Device initialized (major number is %d)\n", MAJOR(deviceNumbers));

  return 0;
}

static void __exit cleanup(void)
{
  unregister_chrdev_region(deviceNumbers, 1);

  cdev_del(device_cdev);

  printk(KERN_INFO "Device unloaded\n");
}

static int device_open(struct inode *inode, struct file *file)
{
  printk(KERN_INFO "Device open\n");
  return 0;
}

static int device_flush(struct file *file, fl_owner_t id)
{
  printk(KERN_INFO "Device flush\n");
  return 0;
}

static int device_release(struct inode *inode, struct file *file)
{
  printk(KERN_INFO "Device released\n");
  return 0;
}

static ssize_t device_write(struct file *filp, const char *buff, size_t len, loff_t * off)
{
  printk(KERN_INFO "Device write\n");
  return len;
}

static ssize_t device_read(struct file *filp, char *buff, size_t len, loff_t * off)
{
  printk(KERN_INFO "Device read\n");
  return 0;
}

static long int device_ioctl(struct file *file, unsigned int ioctl_num, unsigned long ioctl_param)
{
  printk(KERN_INFO "Device ioctl enter\n");
  msleep_interruptible(5000);
  printk(KERN_INFO "Device ioctl out\n");
  return 0;
}

module_init(init);
module_exit(cleanup);

Here is a user space program which involves the main thread and a secondary one. The main thread opens the above device and waits for the secondary thread to start (barrier) before closing the device after 1 second. Meanwhile, the secondary thread calls ioctl() on the above device which makes it sleep 5 seconds. Then it calls ioctl() a second time before exiting.

The expected behavior is to make the main thread close the device file while the secondary thread is running the ioctl().

#include <stdio.h>
#include <pthread.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <errno.h>

static int dev_fd;

static pthread_barrier_t barrier;

void *entry(void *arg)
{
  int rc;

  printf("Thread running...\n");

  // Rendez-vous with main thread
  pthread_barrier_wait(&barrier);

  rc = ioctl(dev_fd, 0);
  printf("rc = %d, errno = %d\n", rc, errno);
  
  rc = ioctl(dev_fd, 0);
  printf("rc = %d, errno = %d\n", rc, errno);

  return NULL;
}

int main(void)
{
  pthread_t tid;

  dev_fd = open("/dev/device", O_RDWR);

  pthread_barrier_init(&barrier, NULL, 2);

  pthread_create(&tid,NULL, entry, NULL);

  pthread_barrier_wait(&barrier);

  sleep(1);

  close(dev_fd);

  pthread_join(tid,NULL);

  return 0;
}

Installation of the kernel module:

$ sudo insmod ./device.ko
$ dmesg
[13270.589766] Device initialized (major number is 237)
$ sudo mknod /dev/device c 237 0
$ sudo chmod 666 /dev/device 
$ ls -l /dev/device 
crw-rw-rw- 1 root root 237, 0 janv.  27 10:55 /dev/device

The execution of the program shows that the first ioctl() makes the thread wait 5 seconds. But the second returns in error with EBADF (9) because meanwhile the device file has been closed by the main thread:

$ gcc p1.c -lpthread
$ ./a.out
Thread running...
rc = 0, errno = 0
rc = -1, errno = 9

In the kernel log, we can see that the close() in the main thread merely triggered a flush() operation on the device while the first ioctl() was on tracks in the secondary thread. Then, once the first ioctl() returned, the internals of the kernel freed the file entry (reference counter dropped to 0) and so, the second ioctl() did not reach the device as the file descriptor no longer referenced an opened file. Hence, the EBADF error on the second call:

[13270.589766] Device initialized (major number is 237)
[13656.862951] Device open        <==== Open() in the main thread
[13656.863315] Device ioctl enter <==== 1st ioctl() in secondary thread
[13657.863523] Device flush       <==== 1 s later, flush() = close() in the main thread
[13661.941238] Device ioctl out   <==== 5 s later, the 1st ioctl() returns
[13661.941244] Device released    <==== The file is released because the reference counter reached 0

Linux kernel, I2C SPI buses, and time slices?

If I get your question right, the whole problem is about locking the device that is used by multiple programs (threads, etc.) and this post may have an answer to it.

Your professor was partially right. The problem was discovered a long time ago, it's called the race conditions. But, the solution isn't provided by the kernel itself in a fancy way. It is the user's responsibility to implement a mechanism that can prevent race conditions on devices or any other resources.

select/poll and one write buffer

This is a type of race condition and is not unique to your device. The application must be prepared for this eventuality. That is, just because select returns a file descriptor as being writable (or readable or whatever) does not guarantee that a subsequent system call on the file will not block.

The usual way of handling this is to open the file descriptor in non-blocking mode (O_NONBLOCK or O_NDELAY). Then when the situation you described happens, UserA will get a write error (with errno EWOULDBLOCK/EAGAIN), and should then return to select to await the device becoming writable again.

Wait queue and race condition

Yes, this code is actually raced, between prepare_to_wait and schedule calls it should be a check for the condition (and breaking if it is satisfied).

Funny enough, in the following description the book refers (at page 60) to the implementation of function inotify_read() in fs/notify/inotify/inotify_user.c file:

DEFINE_WAIT(wait);
...
while (1) {
  prepare_to_wait(&group->notification_waitq,
    &wait,
    TASK_INTERRUPTIBLE);

  if (<condition>) // very simplified form of checks
    break;
  if (signal_pending(current))
    break;

  schedule();
}
finish_wait(&group->notification_waitq, &wait);
...

which according to the author "follows the pattern":

This function follows the pattern laid out in our example. The main difference is that
it checks for the condition in the body of the while() loop, instead of in the while()
statement itself. This is because checking the condition is complicated and requires grabbing locks. The loop is terminated via break.

However, that code exhibits another pattern, which checks the condition between prepare_to_wait and schedule call. And that code is actually correct (race-free).
Also that code doesn't use add_wait_queue, which is superfluous in presence of prepare_to_wait.

In another book of the same author, Linux Driver Development (3d revision), usage of wait queues seems to be more accurate. See e.g. chapter 6, Advanced Char Driver Operations.

race condition between wait_event and wake_up

Since the interrupt can occur between the call to wait_event_interruptible and flag = 0, it would affect the flag variable in an unwanted way.

Note that even on a UP machine, the kernel could be preemptive depending on the configuration, and that code would be affected as a result.

Also, I advice to not use a simple 'int' flag. Instead, you should use atomic_t and atomic_dec/inc_* operations. See the implementation of completions inside the kernel, it does something similar to what you are doing here.

About the question itself:

If you'll look in the code of wait_event_interruptible you'll see that the sleep doesn't take place if the condition is true - so your problem is a non-problem.

Linux Device Driver File Operations: It Is Possible to Have Race Conditions