What Is the Use of __iomem in Linux While Writing Device Drivers

What is the use of __iomem in linux while writing device drivers?

Lots of type casts are going to just "work". However, this is not very strict: nothing stops you from casting a u32 to a u32 * and dereferencing it, but that bypasses the kernel API and is prone to errors.

__iomem is a cookie used by Sparse, a tool used to find possible coding faults in the kernel. If you don't compile your kernel code with Sparse enabled, __iomem will be ignored anyway.

Use Sparse by first installing it, and then adding C=1 to your make call. For example, when building a module, use:

make -C $KPATH M=$PWD C=1 modules

__iomem is defined like this:

# define __iomem        __attribute__((noderef, address_space(2)))

Adding (and requiring) a cookie like __iomem for all I/O accesses is a way to be stricter and avoid programming errors. You don't want to read/write from/to I/O memory regions with absolute addresses because you're usually using virtual memory. Thus,

void __iomem *ioremap(phys_addr_t offset, unsigned long size);

is usually called to get the virtual address of an I/O physical address offset, for a specified length size in bytes. ioremap() returns a pointer with an __iomem cookie, so this may now be used with inline functions like readl()/writel() (although it's now preferable to use the more explicit macros ioread32()/iowrite32(), for example), which accept __iomem addresses.

Also, the noderef attribute is used by Sparse to make sure you don't dereference an __iomem pointer. Dereferencing may work on architectures where the I/O really is memory-mapped, but other architectures use special instructions for accessing I/O, and in that case dereferencing won't work.

Let's look at an example:

void *io = ioremap(42, 4);

Sparse is not happy:

warning: incorrect type in initializer (different address spaces)
expected void *io
got void [noderef] <asn:2>*

Or:

u32 __iomem* io = ioremap(42, 4);
pr_info("%x\n", *io);

Sparse is not happy either:

warning: dereference of noderef expression

In the last example, the first line is correct, because ioremap() returns its value into an __iomem variable. But then we dereference it, and we're not supposed to.

This makes Sparse happy:

void __iomem* io = ioremap(42, 4);
pr_info("%x\n", ioread32(io));

Bottom line: always use __iomem where it's required (as a return type or as a parameter type), and use Sparse to make sure you did so. Also: do not dereference an __iomem pointer.

Edit: Here's a great LWN article about the inception of __iomem and functions using it.

How is /proc/io* populated?

If you pick up some book on Linux device drivers, it will state something about /proc/iomem being populated by the driver calling request_region() or something like that.

The information in /proc/iomem comes from drivers calling request_mem_region().

See Content of /proc/iomem.

how does the device driver know where the hardware register is located

The address of a device register is typically specified by either the board designer (for an external peripheral) or the SoC designer (for an integrated peripheral), and then conveyed in the board or SoC documentation. Some boards (e.g. PC ISA adapter boards) may allow the address to be set via DIP switches.

The writer of the device driver then can

(a) hardcode the device address in the driver itself or a board file, or

(b) retrieve the device address using some bus configuration method (e.g. PCI configuration space), or

(c) retrieve the device address using a (handwritten) configuration list (e.g. Device Tree, FEX, ATAGs), or

(d) try to probe the device at runtime.
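As a concrete illustration of method (b), a PCI driver never hardcodes its register base; it takes whatever address firmware and the PCI core assigned to a BAR. A minimal hypothetical sketch (the driver name and the use of BAR 0 are assumptions, not taken from any real driver):

```c
/* Hypothetical sketch of method (b): the register base comes from
 * PCI configuration space (BAR 0), not from the driver source. */
static int my_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    void __iomem *regs;
    int ret;

    ret = pci_enable_device(pdev);
    if (ret)
        return ret;

    /* Claim the BARs; they then show up in /proc/iomem. */
    ret = pci_request_regions(pdev, "my_driver");
    if (ret)
        return ret;

    /* Map BAR 0 (length 0 means "the whole BAR"). */
    regs = pci_iomap(pdev, 0, 0);
    if (!regs)
        return -ENOMEM;

    /* regs can now be used with ioread32()/iowrite32(). */
    return 0;
}
```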

Note that conveying the system configuration and device addresses to device drivers is a longstanding issue.

The IBM PC's method of assigned addresses that were then hardcoded eventually led to the plug and play initiative for x86 PCs.

Issues with unique Linux kernel builds for each and every ARM board led to the adoption of Device Tree (from PowerPC) for that architecture.

Content of /proc/iomem

1) Is it possible to access a physical address which is not defined in /proc/iomem?

Yes.

Assuming an ARM processor which memory maps all directly-connected peripherals, the driver could perform an ioremap() operation to map the physical memory to virtual memory for access.

But a properly written driver would first call request_mem_region() to ensure that it can use (and lay claim to) that physical address space.

The information in /proc/iomem comes from drivers calling request_mem_region().
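To make that ordering concrete, a well-behaved driver claims the region first and only then maps it. A minimal hypothetical sketch (the base address, size, and device name are made up for illustration):

```c
#define MY_DEV_BASE 0x09050000  /* hypothetical physical base */
#define MY_DEV_SIZE 0x1000

static void __iomem *regs;

static int __init my_init(void)
{
    /* Claim the physical range first; this is what appears in /proc/iomem. */
    if (!request_mem_region(MY_DEV_BASE, MY_DEV_SIZE, "my_device"))
        return -EBUSY;

    /* Only then map it into kernel virtual address space. */
    regs = ioremap(MY_DEV_BASE, MY_DEV_SIZE);
    if (!regs) {
        release_mem_region(MY_DEV_BASE, MY_DEV_SIZE);
        return -ENOMEM;
    }
    return 0;
}
```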

2) If the physical address range of a device does not appear in /proc/iomem, does it mean that the device has not been utilized/initialized yet?

You would have to inspect the driver code to determine how well written the driver is.

Is there a request_mem_region() before the ioremap()?

Check the system log using the dmesg command; perhaps driver initialization failed.

Assuming that this is a statically linked driver rather than a loadable module: as each kernel device driver has its init() routine called, you can get trace output by adding the option "initcall_debug" to the kernel command line. If you are using U-Boot, this option should be added to the "bootargs" variable (which is used for the kernel command line).

How to avoid high cpu usage while reading/writing character device?

As pointed out by @0andriy, you are not supposed to access iomem directly. There are functions such as memcpy_toio() and memcpy_fromio() that can copy between iomem and normal memory, but they only work on kernel virtual addresses.

NOTE: The use of get_user_pages_fast(), set_page_dirty_lock() and put_page() described below should be changed for Linux kernel version 5.6 onwards. The required changes are described later.

In order to copy from userspace addresses to iomem without using an intermediate data buffer, the userspace memory pages need to be "pinned" into physical memory. That can be done using get_user_pages_fast(). However, the pinned pages may be in "high memory" (highmem) which is outside the permanently mapped memory in the kernel. Such pages need to be temporarily mapped into kernel virtual address space for a short duration using kmap_atomic(). (There are rules governing the use of kmap_atomic(), and there are other functions for longer term mapping of highmem. Check the highmem documentation for details.)

Once a userspace page has been mapped to kernel virtual address space, memcpy_toio() and memcpy_fromio() can be used to copy between that page and iomem.

A page temporarily mapped by kmap_atomic() needs to be unmapped by kunmap_atomic().

User memory pages pinned by get_user_pages_fast() need to be unpinned individually by calling put_page(), but if the page memory has been written to (e.g. by memcpy_fromio()), it must first be flagged as "dirty" by set_page_dirty_lock() before calling put_page().

Note: Change for kernel version 5.6 onwards.

  1. The call to get_user_pages_fast() should be changed to pin_user_pages_fast().
  2. Dirty pages pinned by pin_user_pages_fast() should be unpinned by unpin_user_pages_dirty_lock() with the last argument set true.
  3. Clean pages pinned by pin_user_pages_fast() should be unpinned by unpin_user_page(), unpin_user_pages(), or unpin_user_pages_dirty_lock() with the last argument set false.
  4. put_page() must not be used to unpin pages pinned by pin_user_pages_fast().
  5. For code to be compatible with earlier kernel versions, the availability of pin_user_pages_fast(), unpin_user_page(), etc. can be determined by whether the FOLL_PIN macro has been defined by #include <linux/mm.h>.

Putting all that together, the following functions may be used to copy between user memory and iomem:

#include <linux/kernel.h>
#include <linux/uaccess.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/io.h>

/**
 * my_copy_to_user_from_iomem - copy to user memory from MMIO
 * @to: destination in user memory
 * @from: source in remapped MMIO
 * @n: number of bytes to copy
 * Context: process
 *
 * Returns number of uncopied bytes.
 */
long my_copy_to_user_from_iomem(void __user *to, const void __iomem *from,
                                unsigned long n)
{
    might_fault();
    if (!access_ok(to, n))
        return n;
    while (n) {
        enum { PAGE_LIST_LEN = 32 };
        struct page *page_list[PAGE_LIST_LEN];
        unsigned long start;
        unsigned int p_off;
        unsigned int part_len;
        int nr_pages;
        int i;

        /* Determine pages to do this iteration. */
        p_off = offset_in_page(to);
        start = (unsigned long)to - p_off;
        nr_pages = min_t(int, PAGE_ALIGN(p_off + n) >> PAGE_SHIFT,
                         PAGE_LIST_LEN);
        /* Lock down (for write) user pages. */
#ifdef FOLL_PIN
        nr_pages = pin_user_pages_fast(start, nr_pages, FOLL_WRITE,
                                       page_list);
#else
        nr_pages = get_user_pages_fast(start, nr_pages, FOLL_WRITE,
                                       page_list);
#endif
        if (nr_pages <= 0)
            break;

        /* Limit number of bytes to end of locked-down pages. */
        part_len = min(n, ((unsigned long)nr_pages << PAGE_SHIFT) - p_off);

        /* Copy from iomem to locked-down user memory pages. */
        for (i = 0; i < nr_pages; i++) {
            struct page *page = page_list[i];
            unsigned char *p_va;
            unsigned int plen;

            plen = min((unsigned int)PAGE_SIZE - p_off, part_len);
            p_va = kmap_atomic(page);
            memcpy_fromio(p_va + p_off, from, plen);
            kunmap_atomic(p_va);
#ifndef FOLL_PIN
            set_page_dirty_lock(page);
            put_page(page);
#endif
            to = (char __user *)to + plen;
            from = (const char __iomem *)from + plen;
            n -= plen;
            part_len -= plen;
            p_off = 0;
        }
#ifdef FOLL_PIN
        unpin_user_pages_dirty_lock(page_list, nr_pages, true);
#endif
    }
    return n;
}

/**
 * my_copy_from_user_to_iomem - copy from user memory to MMIO
 * @to: destination in remapped MMIO
 * @from: source in user memory
 * @n: number of bytes to copy
 * Context: process
 *
 * Returns number of uncopied bytes.
 */
long my_copy_from_user_to_iomem(void __iomem *to, const void __user *from,
                                unsigned long n)
{
    might_fault();
    if (!access_ok(from, n))
        return n;
    while (n) {
        enum { PAGE_LIST_LEN = 32 };
        struct page *page_list[PAGE_LIST_LEN];
        unsigned long start;
        unsigned int p_off;
        unsigned int part_len;
        int nr_pages;
        int i;

        /* Determine pages to do this iteration. */
        p_off = offset_in_page(from);
        start = (unsigned long)from - p_off;
        nr_pages = min_t(int, PAGE_ALIGN(p_off + n) >> PAGE_SHIFT,
                         PAGE_LIST_LEN);
        /* Lock down (for read) user pages. */
#ifdef FOLL_PIN
        nr_pages = pin_user_pages_fast(start, nr_pages, 0, page_list);
#else
        nr_pages = get_user_pages_fast(start, nr_pages, 0, page_list);
#endif
        if (nr_pages <= 0)
            break;

        /* Limit number of bytes to end of locked-down pages. */
        part_len = min(n, ((unsigned long)nr_pages << PAGE_SHIFT) - p_off);

        /* Copy from locked-down user memory pages to iomem. */
        for (i = 0; i < nr_pages; i++) {
            struct page *page = page_list[i];
            unsigned char *p_va;
            unsigned int plen;

            plen = min((unsigned int)PAGE_SIZE - p_off, part_len);
            p_va = kmap_atomic(page);
            memcpy_toio(to, p_va + p_off, plen);
            kunmap_atomic(p_va);
#ifndef FOLL_PIN
            put_page(page);
#endif
            to = (char __iomem *)to + plen;
            from = (const char __user *)from + plen;
            n -= plen;
            part_len -= plen;
            p_off = 0;
        }
#ifdef FOLL_PIN
        unpin_user_pages(page_list, nr_pages);
#endif
    }
    return n;
}

Secondly, you might be able to speed up memory access by mapping the iomem as "write combined" by replacing pci_iomap() with pci_iomap_wc().

Thirdly, the only real way to avoid wait-stating the CPU when accessing slow memory is to not use the CPU and use DMA transfers instead. The details of that very much depend on your PCIe device's bus-mastering DMA capabilities (if it has any at all). User memory pages still need to be pinned (e.g. by get_user_pages_fast() or pin_user_pages_fast() as appropriate) during the DMA transfer, but do not need to be temporarily mapped by kmap_atomic().

How procfs outputs /proc/iomem?

  1. struct resource iomem_resource is what you're looking for; it is defined and initialized in kernel/resource.c, and its /proc entry is created via proc_create_seq_data(). In the same file, the instance struct seq_operations resource_op defines what happens when you, for example, cat the file from userland.
  2. iomem_resource is a globally exported symbol, and is used throughout the kernel, drivers included, to request resources. You can find instances scattered across the kernel of devm_/request_resource() which take either iomem_resource or its sibling ioport_resource based on either fixed settings, or based on configurations. Examples of methods that take configurations are a) device trees which is prevalent in embedded settings, and b) E820 or UEFI, which can be found more on x86.

Starting with b) which was asked in the question, the file arch/x86/kernel/e820.c shows examples of how reserved memory gets inserted into /proc/iomem via insert_resource().
This excellent link has more details on the dynamics of requesting memory map details from the BIOS.

Another alternative sequence (which relies on CONFIG_OF) for how a device driver requests the needed resources is:

  1. The Open Firmware API traverses the device tree and finds a matching driver, for example via a struct of_device_id.
  2. The driver defines a struct platform_driver which contains both the struct of_device_id table and a probe function. This probe function is thus called.
  3. Inside the probe function, a call to platform_get_resource() is made, which reads the reg property from the device tree. This property defines the physical memory map for the specific device.
  4. A call to devm_request_mem_region() is made (a managed wrapper around request_mem_region()) to actually claim the resource and add it to /proc/iomem.
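The steps above can be sketched roughly as follows. This is a hypothetical skeleton (the compatible string and all "my_" names are made up); note also that recent kernels provide devm_platform_ioremap_resource(), which combines the resource lookup, claim, and mapping into one call:

```c
static const struct of_device_id my_of_match[] = {
    { .compatible = "acme,my-device" },   /* matched against the device tree */
    { }
};

static int my_probe(struct platform_device *pdev)
{
    struct resource *res;
    void __iomem *regs;

    /* Step 3: read the "reg" property via the platform resource. */
    res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
    if (!res)
        return -ENODEV;

    /* Step 4: claim the range (it appears in /proc/iomem) and map it. */
    if (!devm_request_mem_region(&pdev->dev, res->start,
                                 resource_size(res), pdev->name))
        return -EBUSY;
    regs = devm_ioremap(&pdev->dev, res->start, resource_size(res));
    if (!regs)
        return -ENOMEM;
    return 0;
}

static struct platform_driver my_driver = {
    .probe = my_probe,
    .driver = {
        .name = "my-device",
        .of_match_table = my_of_match,
    },
};
```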

What is the benefit of calling ioread functions when using memory mapped IO

You need ioread8 / iowrite8 (or whichever width) to at least cast to volatile * to make sure optimization still results in exactly one access (not zero, or more than one). In fact they do more than that: they handle endianness (accessing device memory as little-endian; use ioread32be for big-endian), include some compile-time reordering memory-barrier semantics that Linux chooses to put in these functions, and even a runtime barrier after reads, because of DMA. Use the _rep versions to copy a chunk from device memory with only one barrier.


In C, data races are UB (Undefined Behaviour). This means the compiler is allowed to assume that memory accessed through a non-volatile pointer doesn't change between accesses. And that if (x) y = *ptr; can be transformed into tmp = *ptr; if (x) y = tmp; i.e. compile-time speculative loads, if *ptr is known to not fault. (Related: Who's afraid of a big bad optimizing compiler? re: why the Linux kernel need volatile for rolling its own atomics.)

MMIO registers may have side effects even for reading so you must stop the compiler from doing loads that aren't in the source, and must force it to do all the loads that are in the source exactly once.

Same deal for stores. (Compilers aren't allowed to invent writes even to non-volatile objects, but they can remove dead stores. E.g. *ioreg = 1; *ioreg = 2; would typically compile the same as just *ioreg = 2;. The first store gets removed as "dead" because it's not considered to have a visible side effect.)

C volatile semantics are ideal for MMIO, but Linux wraps more stuff around them than just volatile.


From a quick look after googling ioread8 and poking around in https://elixir.bootlin.com/linux/latest/source/lib/iomap.c#L11 we see that Linux I/O addresses can encode IO address space (port I/O, aka PIO; in / out instructions on x86) vs. memory address space (normal load/store to special addresses). And ioread* functions actually check that and dispatch accordingly.

/*
 * Read/write from/to an (offsettable) iomem cookie. It might be a PIO
 * access or a MMIO access, these functions don't care. The info is
 * encoded in the hardware mapping set up by the mapping functions
 * (or the cookie itself, depending on implementation and hw).
 *
 * The generic routines don't assume any hardware mappings, and just
 * encode the PIO/MMIO as part of the cookie. They coldly assume that
 * the MMIO IO mappings are not in the low address range.
 *
 * Architectures for which this is not true can't use this generic
 * implementation and should do their own copy.
 */

For example implementation, here's ioread16. (IO_COND is a macro that checks the address against a predefined constant: low addresses are PIO addresses).

unsigned int ioread16(void __iomem *addr)
{
    IO_COND(addr, return inw(port), return readw(addr));
    return 0xffff;
}


What would break if you just cast the ioremap result to volatile uint32_t*?

e.g. if you used READ_ONCE / WRITE_ONCE which just cast to volatile unsigned char* or whatever, and are used for atomic access to shared variables. (In Linux's hand-rolled volatile + inline asm implementation of atomics which it uses instead of C11 _Atomic).

That might actually work on some little-endian ISAs like x86 if compile-time reordering wasn't a problem, but others need more barriers. If you look at the definition of readl (which ioread32 uses for MMIO, as opposed to inl for PIO), it uses barriers around a dereference of a volatile pointer.

(This and the macros this uses are defined in the same io.h as this, or you can navigate using the LXR links: every identifier is a hyperlink.)

static inline u32 readl(const volatile void __iomem *addr)
{
    u32 val;

    __io_br();
    val = __le32_to_cpu(__raw_readl(addr));
    __io_ar(val);
    return val;
}

The generic __raw_readl is just the volatile dereference; some ISAs may provide their own.

__io_ar() uses rmb() or barrier() After Read. /* prevent prefetching of coherent DMA data ahead of a dma-complete */. The Before Read barrier is just barrier() - blocking compile-time reordering without asm instructions.




Old answer to the wrong question: the text below answers why you need to call ioremap.


Because it's a physical address and kernel memory isn't identity-mapped (virt = phys) to physical addresses.

And returning a virtual address isn't an option: not all systems have enough virtual address space to even direct-map all of physical address space as a contiguous range of virtual addresses. (But when there is enough space, Linux does do this; e.g. x86-64 Linux's virtual address-space layout is documented in x86_64/mm.txt.)

This notably affects 32-bit x86 kernels on systems with more than 1 or 2 GB of RAM (depending on how the kernel is configured: a 2:2 or 1:3 kernel:user split of virtual address space). With PAE for a 36-bit physical address space, a 32-bit x86 kernel can use much more physical memory than it can map at once. (This is pretty horrible and makes life difficult for a kernel; some random blog reposted Linus Torvalds' comments about how much PAE really sucks.)


Other ISAs may have this too, and IDK what Alpha does about IO memory when byte accesses are needed; maybe the region of physical address space that maps word loads/stores to byte loads/stores is handled earlier so you request the right physical address. (http://www.tldp.org/HOWTO/Alpha-HOWTO-8.html)

But 32-bit x86 PAE is obviously an ISA that Linux cares a lot about, even quite early in the history of Linux.

address_space() definition as a Sparse annotation in the linux kernel

address_space() and noderef are attributes. What may be confusing is that they are not GCC attributes. They are Sparse attributes and so they only are meaningful to Sparse when __CHECKER__ is defined and Sparse is enabled.

address_space() attribute puts a specific restriction on a pointer marking it as such. My understanding is that the numbers that are arguments were chosen arbitrarily and denote that a pointer belongs to a certain class. Thus, if you follow the rules, you should clearly annotate a pointer as belonging to a specific class (such as __user or __iomem, etc) and not mix pointers from different classes. Using these annotations, Sparse as a static checker helps you to spot cases of incorrect usage.

For address_space(2) aka __iomem I have found a good description here: What is the use of __iomem in linux while writing device drivers? The post from Linus is an excellent description as well.

Besides {0,1,2,3} (0 being normal kernel space, 1 being __user, 2 being __iomem, and 3 being __percpu), there's also {4}, which marks __rcu pointers as a separate class. I don't think there are more at this time.

how to read a register in device driver?

I gave a fixed value 0x78789a9a to the SMMU IDR[2] register (at offset 0x8, a 32-bit register; this is possible because it's QEMU).
The SMMU starts at 0x09050000 and it has an address space of 0x20000.

static uint32_t __iomem *addr1;

static int __init my_driver_init(void)
{
    ...
    addr1 = ioremap(0x09050000, 0x20000); // smmuv3
    printk("SMMU_IDR[2] : 0x%X\n", readl(addr1 + 0x08/4));
    ...
}

This is the output when the driver is initialized.(The value is read ok)

[  453.207261] SMMU_IDR[2]     : 0x78789A9A

The first problem was that the access width was wrong for that address. Before, addr1 was defined as uint8_t *addr1; and I used printk("SMMU_AIDR : 0x%X\n", *(addr1 + 0x1c)), so it was reading a byte where byte access was not allowed by the SMMU model.

Second problem (I think this didn't cause the trap, because arm64 provides memory-mapped I/O) was that I used a plain memory access (pointer dereference) for memory-mapped I/O registers. As people commented, I should have used the readl() function, mainly to make the code portable: readl() also works on ioport platforms like x86_64, where using the MMIO address as a pointer will not work. I later found that readl() takes care of the memory-barrier problem too.
ADD: I changed volatile to __iomem for the variable addr1 (thanks @0andriy).
