How Are Percpu Pointers Implemented in the Linux Kernel

How are percpu pointers implemented in the Linux kernel?

Normal global variables are not per CPU. Automatic variables are on the stack, and different CPUs use different stacks, so naturally they get separate variables.

I guess you're referring to Linux's per-CPU variable infrastructure.

Most of the magic is here (asm-generic/percpu.h):

extern unsigned long __per_cpu_offset[NR_CPUS];

#define per_cpu_offset(x) (__per_cpu_offset[x])

/* Separate out the type, so (int[3], foo) works. */
#define DEFINE_PER_CPU(type, name) \
    __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name

/* var is in discarded region: offset to particular copy we want */
#define per_cpu(var, cpu) (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
#define __get_cpu_var(var) per_cpu(var, smp_processor_id())

The macro RELOC_HIDE(ptr, offset) simply advances ptr by the given offset in bytes (regardless of the pointer type).

What does it do?

When defining DEFINE_PER_CPU(int, x), an integer per_cpu__x is created in the special .data.percpu section.
When the kernel is loaded, this section is loaded multiple times - once per CPU (this part of the magic isn't in the code above).
The __per_cpu_offset array is filled with the distances between the copies. Supposing 1000 bytes of per cpu data are used, __per_cpu_offset[n] would contain 1000*n.
The symbol per_cpu__x will be relocated, during load, to CPU 0's per_cpu__x.
__get_cpu_var(x), when running on CPU 3, will translate to *RELOC_HIDE(&per_cpu__x, __per_cpu_offset[3]). This starts with CPU 0's x, adds the offset between CPU 0's data and CPU 3's, and eventually dereferences the resulting pointer.

Address of per-cpu variable

Okay, I figured out the offset for this particular symbol. This one is exported by the kernel, hence there is an entry in /proc/kallsyms:

000000000000cbc0 D per_cpu__current_task

So the offset is 0xcbc0 for this particular variable. Of course the offset would vary for other versions.

The implementation of Linux kernel current macro

The correct header to use is asm/current.h; do not use asm-generic. This applies to anything under asm, really. Headers in the asm-generic folder are provided (as the name suggests) as a "generic" default implementation of macros/functions. Each architecture under /arch/xxx then has its own asm include folder where, if needed, it can define the same macros/functions in an architecture-specific way.

This is done both because it could be actually needed (some archs might have an implementation that is not compatible with the generic one) and for performance since there might be a better and more optimized way of achieving the same result under a specific arch.

Indeed, if we look at how each arch defines get_current() or get_current_thread_info(), we can see that some of them (e.g. alpha, sparc) keep a reference to the current task in the thread_info struct and keep a pointer to the current thread_info in a register for performance. Others directly keep a pointer to current in a register (e.g. 32-bit powerpc), and others define a global per-cpu variable (e.g. x86). On x86 in particular, the thread_info struct doesn't even have a pointer to the current task; it's a very simple 16-byte structure made to fit in a cache line for performance.

// example from /arch/powerpc/include/asm/current.h

/*
 * We keep `current' in r2 for speed.
 */
register struct task_struct *current asm ("r2");

How could I make sure which version the Linux kernel really use?

Well, let's just take a simple look:

$ rg '#include.+current\.h' | cat
security/landlock/ptrace.c:#include <asm/current.h>       
security/landlock/syscalls.c:#include <asm/current.h>     
sound/pci/rme9652/hdsp.c:#include <asm/current.h>         
sound/pci/rme9652/rme9652.c:#include <asm/current.h>      
net/ipv4/raw.c:#include <asm/current.h>                 
net/core/dev.c:#include <asm/current.h>                   
ipc/msg.c:#include <asm/current.h>                        
fs/quota/quota.c:#include <asm/current.h>                 
drivers/staging/media/atomisp/pci/hmm/hmm_bo.c:#include <asm/current.h>                                             
fs/jfs/ioctl.c:#include <asm/current.h>                   
fs/hugetlbfs/inode.c:#include <asm/current.h>             
drivers/parport/daisy.c:#include <asm/current.h>          
...

As you can see asm/current.h is the only header actually used.

We can also see that (as of v5.14 at least) only arc seems to be using the "generic" version:

$ rg '#include.+generic.+current\.h' | cat     
arch/arc/include/asm/current.h:#include <asm-generic/current.h>

Many blogs and books say x86 uses the asm-generic version to implement the current macro, including Linux Kernel Development, 3rd edition

I can only speculate that these resources were written a long time ago and based on pretty old kernel versions, which at the time of writing might have used a different include system (maybe x86 used to use the generic version as well). If not, then those resources are most probably wrong.

How is for_each_possible_cpu expanded in cpufreq.c file?

How does it expand:

Keep in mind anytime you ask something like that wrt the Linux kernel, the answer is never easy... so... here we go:

#define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)

You can see that this macro is really just a for loop, called with cpu as the iterator. for_each_cpu is another macro, which is the looping part. The variant below is the one used when the kernel is built for a single CPU (NR_CPUS == 1):

#define for_each_cpu(cpu, mask)                 \
     for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)mask)

And the cpu_possible_mask is a pointer to a struct:

extern const struct cpumask *const cpu_possible_mask;

Which is seen here (consisting of another macro):

typedef struct cpumask { DECLARE_BITMAP(bits, NR_CPUS); } cpumask_t;

That contains another macro (DECLARE_BITMAP) and another #define, NR_CPUS, which is the maximum number of CPUs the kernel is built to support; it is system dependent and set in Kconfig. The macro in there is really just an array declaration:

#define DECLARE_BITMAP(name,bits) \
      unsigned long name[BITS_TO_LONGS(bits)]

So you can see that's the array, and its length of course comes from another #define:

#define BITS_TO_LONGS(nr)       DIV_ROUND_UP(nr, BITS_PER_BYTE * sizeof(long))

...which consists of two more #defines:

#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
#define BITS_PER_BYTE 8

Anyway... you can see that (A) this is a mess and (B) it ends up being a for loop that increments the CPU number and also evaluates a second expression, (void)mask, via the comma operator. In the definition shown above that second expression does nothing beyond referencing mask, and how the bitmap is sized is system dependent. (What's the sizeof a long on your system? What's the number of CPUs on your system?)

2. Why are two other #defines called inside?

That's kind of answered by #1. Since it expands to a for loop, you need that loop to do something.

3. Why is the per_cpu output equated to -1?

The per_cpu macro is giving a pointer to the CPU frequency policy of each CPU in the system, that is being initialized to -1. I'd have to do more research to be sure, but presumably they picked that because of the define:

#define CPUFREQ_ETERNAL                 (-1)

And the __init_rwsem is an architecture defined way of initializing the read/write semaphore used for each CPU's policy.

I don't know if that explanation helped much, but at least maybe it helps point you in a better direction. Good luck exploring the kernel.

the use of double pointer (type **)ptr in __register_chrdev_region (linux kernel, 4.14.13, x86_64)

Because several lines below there is an assignment to *cp:

cd->next = *cp;
*cp = cd;

Access current_task pointer of another cpu in a SMP based linux system

I finally figured out what was wrong.
The difference between __preempt_count and current_task is that the first is defined as an int variable, whereas the second is defined as a pointer to a structure.

Now, looking deeper into per-CPU variables, they are just variables allocated by the compiler in separate memory locations, like an array. When per_cpu_ptr is called for a variable Foo, the macro computes something like Foo[cpu]. That means per_cpu_ptr needs the actual base address of the variable (hence the &) so that it can compute the relative address starting from it.

When declaring: foo = per_cpu_ptr(&__preempt_count,cpu) , this address is already given = &__preempt_count

When declaring: bar = per_cpu_ptr(current_task,cpu), this address is not given, as the & is missing here. The current_task is a pointer but not the base address of the current_task array.

In both cases above the argument to per_cpu_ptr is a pointer, but my understanding of which pointer to pass was wrong. Now it's clear: I have to pass the base address of the per-CPU variable itself (whether that variable is an int or a pointer doesn't matter) so that the macro can compute the relative address for that CPU.

Therefore the right approaches that work are:

bar = per_cpu(current_task, cpu), which translates into *per_cpu_ptr(&current_task, cpu)

or directly

bar = *per_cpu_ptr(&current_task, cpu);

Understanding the getting of task_struct pointer from process kernel stack

The kernel stack area contains a special struct at its lowest address (the end the stack grows toward), thread_info:

struct thread_info {
        struct task_struct      *task;          /* main task structure */
        struct exec_domain      *exec_domain;   /* execution domain */
        __u32                   flags;          /* low level flags */
        __u32                   status;         /* thread synchronous flags */
        __u32                   cpu;            /* current CPU */
        int                     preempt_count;  /* 0 => preemptable,
                                                   <0 => BUG */
        mm_segment_t            addr_limit;
        struct restart_block    restart_block;
        void __user             *sysenter_return;
#ifdef CONFIG_X86_32
        unsigned long           previous_esp;   /* ESP of the previous stack in
                                                   case of nested (IRQ) stacks
                                                */
        __u8                    supervisor_stack[0];
#endif
        unsigned int            sig_on_uaccess_error:1;
        unsigned int            uaccess_err:1;  /* uaccess failed */
};

So, to get the task_struct you first need a thread_info pointer (its task member then gives you the task_struct). From assembly code you get it with GET_THREAD_INFO:

/* how to get the thread information struct from ASM */
#define GET_THREAD_INFO(reg)     \
        movl $-THREAD_SIZE, reg; \
        andl %esp, reg
... or with current_thread_info from the C-code:

/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
        return (struct thread_info *)
                (current_stack_pointer & ~(THREAD_SIZE - 1));
}

Note that THREAD_SIZE is defined as (PAGE_SIZE << THREAD_SIZE_ORDER) and THREAD_SIZE_ORDER equals 1 for both x86_32 and x86_64, so with 4096-byte pages THREAD_SIZE evaluates to 8192 (2^13, or 1 << 13).

Can't find where preempt count in the stack is declared for a percpu variable access. (linux kernel)

this_cpu_read(printk_context) expands to:

⇒ __pcpu_size_call_return(this_cpu_read_, printk_context)

⇒

({
    typeof(printk_context) pscr_ret__;
    __verify_pcpu_ptr(&(printk_context));
    switch(sizeof(printk_context)) {
    case 1: pscr_ret__ = this_cpu_read_1(printk_context); break;
    case 2: pscr_ret__ = this_cpu_read_2(printk_context); break;
    case 4: pscr_ret__ = this_cpu_read_4(printk_context); break;
    case 8: pscr_ret__ = this_cpu_read_8(printk_context); break;
    default:
        __bad_size_call_parameter(); break;
    }
    pscr_ret__;
})

sizeof(printk_context) is 4, so pscr_ret__ = this_cpu_read_4(printk_context);.

The this_cpu_read_4() macro is defined by #include <asm/percpu.h>:

==== arch/arm64/include/asm/percpu.h ====

#define this_cpu_read_4(pcp)        \
    _pcp_protect_return(__percpu_read_32, pcp)

#define _pcp_protect_return(op, pcp, args...)               \
({                                  \
    typeof(pcp) __retval;                       \
    preempt_disable_notrace();                  \
    __retval = (typeof(pcp))op(raw_cpu_ptr(&(pcp)), ##args);    \
    preempt_enable_notrace();                   \
    __retval;                           \
})

That is where the preempt count manipulation occurs.

The preempt_disable_notrace() and preempt_enable_notrace() macros are defined by #include <linux/preempt.h>.

==== include/linux/preempt.h ====

#define preempt_enable_notrace() \
do { \
    barrier(); \
    __preempt_count_dec(); \
} while (0)

#define preempt_disable_notrace() \
do { \
    __preempt_count_inc(); \
    barrier(); \
} while (0)

#define __preempt_count_inc() __preempt_count_add(1)
#define __preempt_count_dec() __preempt_count_sub(1)

__preempt_count_add() and __preempt_count_sub() are defined by #include <asm/preempt.h>.

==== arch/arm64/include/asm/preempt.h ====

static inline void __preempt_count_add(int val)
{
    u32 pc = READ_ONCE(current_thread_info()->preempt.count);
    pc += val;
    WRITE_ONCE(current_thread_info()->preempt.count, pc);
}

static inline void __preempt_count_sub(int val)
{
    u32 pc = READ_ONCE(current_thread_info()->preempt.count);
    pc -= val;
    WRITE_ONCE(current_thread_info()->preempt.count, pc);
}

For arm64, CONFIG_THREAD_INFO_IN_TASK is enabled so current_thread_info() is defined as a macro by #include <linux/thread_info.h>.

==== include/linux/thread_info.h ====

#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
 * For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
 * definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
 * including <asm/current.h> can cause a circular dependency on some platforms.
 */
#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif

The current macro is defined by #include <asm/current.h>.

==== arch/arm64/include/asm/current.h ====

#define current get_current()

/*
 * We don't use read_sysreg() as we want the compiler to cache the value where
 * possible.
 */
static __always_inline struct task_struct *get_current(void)
{
    unsigned long sp_el0;

    asm ("mrs %0, sp_el0" : "=r" (sp_el0));

    return (struct task_struct *)sp_el0;
}

There is some magic in arch/arm64/kernel/entry.S relating to the use of the sp_el0 stack pointer to point to the current thread_info / task_struct. Sorry, I do not have time to study the gory details, but it was introduced by commit 6cdf9c7ca687 ("arm64: Store struct thread_info in sp_el0").

The key thing is that the sp_el0 register is not the same as sp. The kernel does not run in EL0 mode, so sp_el0 is available as a "scratch" register. The kernel uses it to point to the current thread_info / task_struct.

struct task_struct is defined by #include <linux/sched.h>.

==== include/linux/sched.h ====

struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
    /*
     * For reasons of header soup (see current_thread_info()), this
     * must be the first element of task_struct.
     */
    struct thread_info      thread_info;
#endif
    /* -1 unrunnable, 0 runnable, >0 stopped: */
    volatile long           state;

Since CONFIG_THREAD_INFO_IN_TASK is selected, the first member is struct thread_info thread_info. current_thread_info() points to that member in the current task.

struct thread_info is defined by #include <asm/thread_info.h>.

==== arch/arm64/include/asm/thread_info.h ====

/*
 * low level task data that entry.S needs immediate access to.
 */
struct thread_info {
    unsigned long       flags;      /* low level flags */
    mm_segment_t        addr_limit; /* address limit */
#ifdef CONFIG_ARM64_SW_TTBR0_PAN
    u64         ttbr0;      /* saved TTBR0_EL1 */
#endif
    union {
        u64     preempt_count;  /* 0 => preemptible, <0 => bug */
        struct {
#ifdef CONFIG_CPU_BIG_ENDIAN
            u32 need_resched;
            u32 count;
#else
            u32 count;
            u32 need_resched;
#endif
        } preempt;
    };
};

When CONFIG_ARM64_SW_TTBR0_PAN is not selected and the CPU is little-endian, the preempt.count member will be at offset 16 from the start of the structure.
