Access Processor Interrupts with C++ and X86 and X64 Architectures


For user-level processes, interrupts are replaced by signals. You can arrange to have a signal sent to your process by calling setitimer. But most likely, the best way to do what you're trying to do is one of two things:

  1. Use an event loop. Basically, write your program as a giant loop that periodically checks to see if there's anything it needs to do. In the loop, check the time, check for keypresses, and so on. Do a little bit of work on whatever you need to do, and loop again.

  2. Use threads. You can have a thread just to watch the time and trigger timer jobs. You can have a thread that blocks on a read to act like an interrupt when data arrives.

Likely it was drilled into your head that you should do minimal work in the interrupt handler itself, typically just passing on information to other code that runs in a normal context. Well, the OS already does that part for you. You just have to write the code that waits for the interrupt handler (or whatever is needed) to detect and begin processing the event.

I want to have a thread in C++ running code that constantly checks which key is being pressed at the moment, without requiring the Enter key (cin, getchar, etc).

So do that. That requires a thread and it requires changing the terminal's input mode to one that doesn't require an enter key. That has nothing to do with interrupts.

Location of interrupt handling code in Linux kernel for x86 architecture

I am looking for the code that pushes all of the general purpose registers on the stack

Hardware stores the current state (which includes registers) before executing an interrupt handler. Code is not involved. And when the interrupt exits, the hardware reads the state back from where it was stored.

Now, code inside the interrupt handler may read and write the saved copies of registers, causing different values to be restored as the interrupt exits. That's how a context switch works.


On x86, the hardware only saves those registers that change before the interrupt handler starts running. On most embedded architectures, the hardware saves all registers. The reason for the difference is that x86 has a large amount of register state (especially the FPU/SIMD registers), and saving and restoring registers not modified by the interrupt handler would be a waste. So the interrupt handler is responsible for saving and restoring any registers it voluntarily uses.

See Intel® 64 and IA-32 Architectures
Software Developer’s Manual, starting on page 6-15.

Intel x86 - Interrupt Service Routine responsibility

You're responsible for knowing how and why the processor will call your interrupt service routines, and for writing your ISRs accordingly. You're trying to treat an exception generated by a divide-by-zero error as if it were generated by a hardware interrupt. However, this is not how Intel x86 processors handle this kind of exception.

How x86 processors handle interrupt and exceptions

There are several different kinds of events that cause the processor to invoke an interrupt service routine given in the interrupt vector table. Collectively these are called interrupts and exceptions, and there are three ways the processor can handle them: as a fault, as a trap, or as an abort. Your divide instruction generates a Divide Error (#DE) exception, which is handled as a fault. Hardware and software interrupts are handled as traps, while other kinds of exceptions are handled in one of these three ways depending on the source of the exception.

Faults

The processor handles an exception as a fault if the nature of the exception allows it to be corrected in some way. Because of this, the return address pushed on the stack points at the instruction that generated the exception, so the fault handler knows exactly which instruction caused the fault, and so that the faulting instruction can be resumed after the problem is fixed. A Page Fault (#PF) exception is a good example. It can be used to implement virtual memory by having the fault handler provide a valid virtual mapping for the address the faulting instruction tried to access. With a valid page mapping in place, the instruction can be resumed and executed without generating another page fault.

Traps

Interrupts and certain kinds of exceptions (all of the latter software-generated) are handled as traps. Traps don't imply an error in the execution of an instruction. Hardware interrupts occur between the execution of instructions, and software interrupts and certain software exceptions effectively mimic this behaviour. Traps are handled by pushing the address of the next instruction that would normally have been executed, which allows the trap handler to resume the normal execution of the interrupted code.

Aborts

Serious and unrecoverable errors are handled as aborts. There are only two exceptions that generate aborts: the Machine Check (#MC) exception and the Double Fault (#DF) exception. Machine check exceptions are the result of a hardware failure detected in the processor itself; this can't be fixed, and normal execution can't reliably resume. Double fault exceptions happen when an exception occurs during the handling of an interrupt or another exception. This leaves the CPU in an inconsistent state, somewhere in the middle of the many steps necessary to invoke an ISR, one that cannot be resumed. The return address pushed on the stack may or may not have anything to do with whatever caused the abort.

How divide error exceptions are normally handled

Normally, most operating systems handle divide error exceptions by passing them along to a handler in the executing process, or failing that, by terminating the process, indicating that it has crashed. For example, most Unix systems send a SIGFPE signal to the process, while Windows does something similar using its Structured Exception Handling mechanism. This is so the process's programming language runtime can set up its own handler to implement whatever behaviour is appropriate for the language being used. Since division by zero results in undefined behaviour in C and C++, crashing is an acceptable behaviour, so these languages don't normally install a divide-by-zero handler.

Note that while you could handle divide error exceptions by "incrementing EIP", this is harder than you might think and doesn't produce a very useful result. You can't just add one or some other constant value to EIP; you need to skip over the entire instruction, which could be anywhere from 2 to 15 bytes long. There are three instructions that can cause this exception, AAM, DIV and IDIV, and these can be encoded with various prefixes and operand bytes. You'll need to decode the instruction to figure out how long it is. The result of performing this increment will be as if the instruction were never executed: the faulting instruction won't calculate a meaningful value, and you'll have no indication why the program isn't behaving correctly.

Read the documentation

If you're writing your own operating system then you'll need to have the Intel Software Developer's Manual available so you can consult it often. In particular you'll need to read and learn pretty much everything in Volume 3: System Programming Guide, excluding the Virtual Machine Extension chapters and everything after them. Everything you need to know about how interrupts and exceptions work is covered in detail there, plus a lot of other things you'll need to know.

Why didn't x86 implement direct core-to-core messaging assembly/cpu instructions?

CPUs evolved for a very different programming model than GPUs, to run multiple separate threads, potentially of different processes, so you'd also need software and OS infrastructure to let threads know which other core (if any) some other thread was running on. Or they'd have to pin each thread to a specific core. But even then it would need some way to virtualize the architectural message-passing register, the same way context switches virtualize the standard registers for multi-tasking on each core.

So there's an extra hurdle before anything like this could even be usable at all under a normal OS, where a single process doesn't take full ownership of the physical cores. The OS is still potentially scheduling other threads of other processes onto cores, and running interrupt handlers, unlike a GPU where cores don't have anything else to do and are all built to work together on a massively parallel problem.

Intel did introduce user-space interrupts in Sapphire Rapids. Including user IPI (inter-processor interrupt), but that doesn't involve a receive queue that would have to get saved/restored on context switch. The OS still has to manage some stuff (like the User Interrupt Target table), but it's not as problematic for context switches, I think. It's solving a different problem than what you're suggesting, since it's an interrupt not a message queue.

Notifying another thread when to look in shared memory for data is the hard part of the problem that needed solving, more so than getting data between cores. Shared memory is still fine for that (especially with new instructions like cldemote that let the writer ask for a recently-stored cache line to be written back to shared L3, where other cores can read it more efficiently). See the section below about UIPIs.


A task that wants something like this is usually best done on a GPU anyway, not on a few separate, deeply pipelined, out-of-order exec CPU cores that are trying to do speculative execution; GPUs, by contrast, are simple in-order pipelines.

You couldn't actually push a result to another core until it retires on the core executing it, because you don't want to have to roll back the other core as well if you discover a mis-speculation (such as a branch mispredict) earlier in the path of execution leading to this. That could conceivably still allow for something lower-latency than bouncing a cache line between cores through shared memory, but it's a pretty narrow class of applications that could use it.

However, high-performance computing is a known use-case for modern CPUs, so if it was really a game-changer it would be worth considering as a design choice, perhaps.

Some things aren't easy or possible to do efficiently given the architecture of existing CPUs. Small units of work with fine-grained cooperation between threads is a problem. Your idea might help if it was practical to implement, but there are major challenges.



Inter-processor Interrupts (IPI), including user-IPI

For OS use, there is the IPI mechanism. But that triggers an interrupt so it doesn't line up data to be read when the receive side's out-of-order exec pipeline gets to it, so it's very different from the mechanism you're suggesting, for different use-cases.

It's quite low-performance, except as a way to avoid polling by the other side, and as a way to wake a core from a power-saving sleep state: if more threads are now ready to run, the core should wake up and call schedule() to figure out which one to run.

Any core can send an IPI to any other, if it's running in kernel mode.

New in Sapphire Rapids, there is hardware support for the OS letting a user-space process handle some interrupts fully in user-space.

https://lwn.net/Articles/869140/ is an LKML post explaining it and how Linux could support it. Apparently it's about 10x faster than "eventfd" for ping-ponging a tiny message between two user-space threads a million times. Or 15x faster than doing the same with a POSIX signal handler.

Kernel managed architectural data structures

  • UPID: User Posted Interrupt Descriptor - Holds receiver interrupt vector
    information and notification state (like an ongoing notification, suppressed
    notifications).
  • UITT: User Interrupt Target Table - Stores UPID pointer and vector information
    for interrupt routing on the sender side. Referenced by the senduipi instruction.

The interrupt state of each task is referenced via MSRs which are saved and
restored by the kernel during context switch.

Instructions

  • senduipi <index> - send a user IPI to a target task based on the UITT index.
  • clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
  • stui - Unmask user interrupts by setting UIF.
  • testui - Test current value of UIF.
  • uiret - return from a user interrupt handler.

So it does have new state to be saved/restored on context switch. I suspect it's smaller than the queues you might have been picturing, though. And critically, doesn't need a receive queue anywhere for threads that aren't running, because the state involves tables in memory and there's no data, just a pending or not flag I guess. So in the not-running receiver case, it can just set a bit in a table that senders need to be able to see anyway, for the HW to know where to direct the UIPI. Unlike needing to find the kernel's saved register-state or some other space and append to a variable-size(?) buffer for your idea.

If the receiver is running (CPL=3), then the user interrupt is delivered directly without a kernel transition. If the receiver isn't running, the interrupt is delivered when the receiver gets context-switched back in. If the receiver is blocked in the kernel, the user interrupt is delivered to the kernel, which then unblocks the intended receiver to deliver the interrupt.

So data still has to go through memory, this is just about telling another thread when to look so it doesn't have to poll / spin.

I think UIPIs are useful for different use-cases than your proposed message queue.

So you still wouldn't generally use this when the receiving thread knows that specific data is coming soon. Except maybe to let a thread be working on something independent instead of spin-waiting or sleeping.

It's also usable if the thread doesn't specifically expect data soon, unlike your queue idea. So you can have it working on something low priority, but then start work that's part of the critical path as soon as more of that is ready.

It's still an interrupt, so significant overhead is still expected, just a lot less than going through the kernel for a signal handler or similar. I think, like any interrupt, it would need to drain the out-of-order back-end. Or maybe it's not even that bad: maybe it can be treated like a branch mispredict, since it doesn't have to change privilege level. Discarding the instructions in the ROB would give lower interrupt latency but worse throughput than just re-steering the front-end to the interrupt-handler address while letting the in-flight instructions drain.



Incorrect scalability assumptions in your question

scaling for various algorithms that are bottlenecked on core-to-core bandwidth (and/or latency)?

Mesh interconnects (like Intel since Skylake Xeon) allow pretty large aggregate bandwidth between cores. There isn't a single shared bus they all have to compete for. Even the ring bus Intel used before Skylake-Xeon, and still uses in client chips, is pipelined and has pretty decent aggregate bandwidth.

Data can be moving between every pair of cores at the same time. (I mean, 128 pairs of cores can each have data in flight in both directions. With some memory-level parallelism, a pipelined interconnect can have multiple cache lines in flight requested by each core.)

This involves the shared L3 cache, but typically not DRAM, even across sockets. (Or, on AMD, where clusters of cores are tightly connected in a CCX core complex, it stays between cores within the same die.)

See also some Anandtech articles with good benchmarks of inter-core latency (cache-line ping-pong)

  • https://www.anandtech.com/show/16529/amd-epyc-milan-review/4 AMD Zen3 Epyc (server)
  • https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-dive-review-5950x-5900x-5800x-and-5700x-tested/5 AMD Zen 3 desktop
  • https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/4 Intel Ice Lake Xeon (vs. AMD and vs. a big AArch64 Ampere Altra)
  • https://www.anandtech.com/show/16315/the-ampere-altra-review/3 Ampere Altra (2x 80 AArch64 cores)
  • https://www.anandtech.com/show/15578/cloud-clash-amazon-graviton2-arm-against-intel-and-amd/2 Amazon's Graviton2 AArch64 vs. AMD and Intel.


GPUs also support parallel atomic update acceleration for local arrays.

I think I've heard of some CPUs (at least in theory, maybe practice) allowing fast memory_order_relaxed atomics by putting a simple ALU into the shared cache. So a core can send an atomic-increment request to shared L3 where it happens on the data there, instead of having to get exclusive ownership of the line temporarily. (With the old value being returned read-only to allow a fetch_add or exchange return value).

This wouldn't be easily ordered wrt. other loads or stores on other locations done by the core that sent the atomic operation request out to be done by the cache.

Anandtech's Graviton2 review shows a slide that mentions "Dynamic near vs. far atomic execution". That might be a form of this idea! Perhaps allowing it to execute remotely (perhaps at the core owning the cache line?) if memory ordering requirements are weak enough to allow it for this instruction? That's just a guess, that's a separate question which I won't dig into further here.

(ARMv8.1 with "large system extensions" provides x86-style single-instruction atomic RMWs, as well as the traditional LL/SC that requires a retry loop in case of spurious failure but can synthesize any atomic RMW operation, like you can with a CAS retry loop.)

Intel x86 vs x64 system call

General part

EDIT: Linux-irrelevant parts removed

While not totally wrong, narrowing the question down to int 0x80 vs. syscall oversimplifies it: with sysenter there is at least a third option.

Using int 0x80, with eax for the syscall number and ebx, ecx, edx, esi, edi, and ebp to pass parameters, is just one of many possible choices for implementing a system call, but those registers are the ones the 32-bit Linux ABI chose.

Before taking a closer look at the techniques involved, it should be stated that they all circle around the problem of escaping the privilege prison every process runs in.

Another choice to the ones presented here offered by the x86 architecture would have been the use of a call gate (see: http://en.wikipedia.org/wiki/Call_gate)

The only other possibility present on all i386 machines is using a software interrupt, which allows the ISR (Interrupt Service Routine or simply an interrupt handler) to run at a different privilege level than before.

(Fun fact: some i386 OSes have used an invalid-instruction exception to enter the kernel for system calls, because that was actually faster than an int instruction on 386 CPUs. See OsDev syscall/sysret and sysenter/sysexit instructions enabling for a summary of possible system-call mechanisms.)

Software Interrupt

What exactly happens once an interrupt is triggered depends on whether switching to the ISR requires a privilege change or not:

(Intel® 64 and IA-32 Architectures Software Developer’s Manual)

6.4.1 Call and Return Operation for Interrupt or Exception Handling Procedures

...

If the code segment for the handler procedure has the
same privilege level as the currently executing program or task, the
handler procedure uses the current stack; if the handler executes at a
more privileged level, the processor switches to the stack for the
handler’s privilege level.

....

If a stack switch does occur, the processor does the following:

  1. Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and EIP registers.

  2. Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being
    called) from the TSS into the SS and ESP registers and switches to the new stack.

  3. Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure’s stack onto
    the new stack.

  4. Pushes an error code on the new stack (if appropriate).

  5. Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate
    or trap gate) into the CS and EIP registers, respectively.

  6. If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.

  7. Begins execution of the handler procedure at the new privilege level.

... sigh, this seems to be a lot to do, and even once we're done it doesn't get much better:

(excerpt taken from the same source as mentioned above: Intel® 64 and IA-32 Architectures Software Developer’s Manual)

When executing a return from an interrupt or exception handler from a
different privilege level than the interrupted procedure, the
processor performs these actions:

  1. Performs a privilege check.

  2. Restores the CS and EIP registers to their values prior to the interrupt or exception.

  3. Restores the EFLAGS register.

  4. Restores the SS and ESP registers to their values prior to the interrupt or exception, resulting in a stack switch back to the
    stack of the interrupted procedure.

  5. Resumes execution of the interrupted procedure.

Sysenter

Another option on the 32-bit platform, not mentioned in your question at all but nevertheless utilized by the Linux kernel, is the sysenter instruction.

(Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z)

Description Executes a fast call to a level 0 system procedure or
routine. SYSENTER is a companion instruction to SYSEXIT. The
instruction is optimized to provide the maximum performance for system
calls from user code running at privilege level 3 to operating system
or executive procedures running at privilege level 0.

One disadvantage of this solution is that it is not present on all 32-bit machines, so the int 0x80 method still has to be provided in case the CPU doesn't support it.

The SYSENTER and SYSEXIT instructions were introduced into the IA-32
architecture in the Pentium II processor. The availability of these
instructions on a processor is indicated with the SYSENTER/SYSEXIT
present (SEP) feature flag returned to the EDX register by the CPUID
instruction. An operating system that qualifies the SEP flag must also
qualify the processor family and model to ensure that the
SYSENTER/SYSEXIT instructions are actually present

Syscall

The last possibility, the syscall instruction, provides pretty much the same functionality as the sysenter instruction. The existence of both is due to the fact that one (sysenter) was introduced by Intel, while the other (syscall) was introduced by AMD.

Linux specific

In the Linux kernel any of the three possibilities mentioned above may be chosen to realize a system call.

See also The Definitive Guide to Linux System Calls.

As already stated above, the int 0x80 method is the only one of the three chosen implementations that can run on any i386 CPU, so it is the only one that is always available for 32-bit user-space.

(syscall is the only one that's always available for 64-bit user-space, and the only one you should ever use in 64-bit code; x86-64 kernels can be built without CONFIG_IA32_EMULATION, and int 0x80 still invokes the 32-bit ABI which truncates pointers to 32-bit.)

To allow switching between all three choices, every process is given access to a special shared object that exposes the system call implementation chosen for the running system. This is the strange-looking linux-gate.so.1 you might already have encountered as an unresolved library when using ldd or the like.

(arch/x86/vdso/vdso32-setup.c)

 if (vdso32_syscall()) {
         vsyscall = &vdso32_syscall_start;
         vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
 } else if (vdso32_sysenter()) {
         vsyscall = &vdso32_sysenter_start;
         vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
 } else {
         vsyscall = &vdso32_int80_start;
         vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
 }

To utilize it, all you have to do is load your registers (system call number in eax, parameters in ebx, ecx, edx, esi, edi, just as with the int 0x80 implementation) and call the routine's entry point.

Unfortunately it is not all that easy: to minimize the security risk of a fixed, predefined address, the location at which the vdso (virtual dynamic shared object) is visible in a process is randomized, so you have to figure out the correct location first.

This address is individual to each process and is passed to the process once it is started.

In case you didn't know: when a process is started on Linux, it receives on its stack pointers to the parameters it was started with and pointers to a description of the environment variables it runs under, each list terminated by NULL.

In addition to these, a third block of so-called ELF auxiliary vectors is passed, following the ones mentioned before. The correct location is encoded in the entry carrying the type identifier AT_SYSINFO.

So stack layout looks like this (addresses grow downwards):

  • parameter-0
  • ...
  • parameter-m
  • NULL
  • environment-0
  • ....
  • environment-n
  • NULL
  • ...
  • auxiliary elf vector: AT_SYSINFO
  • ...
  • auxiliary elf vector: AT_NULL

Usage example

To find the correct address you will have to first skip all arguments and all environment pointers and then start scanning for AT_SYSINFO as shown in the example below:

#include <stdio.h>
#include <elf.h>

void putc_1 (char c) {
    __asm__ ("movl $0x04, %%eax\n"
             "movl $0x01, %%ebx\n"
             "movl $0x01, %%edx\n"
             "int $0x80"
             :: "c" (&c)
             : "eax", "ebx", "edx");
}

void putc_2 (char c, void *addr) {
    __asm__ ("movl $0x04, %%eax\n"
             "movl $0x01, %%ebx\n"
             "movl $0x01, %%edx\n"
             "call *%%esi"
             :: "c" (&c), "S" (addr)
             : "eax", "ebx", "edx");
}

int main (int argc, char *argv[]) {

    /* using int 0x80 */
    putc_1 ('1');

    /* rather nasty search for jump address */
    argv += argc + 1;                /* skip args */
    while (*argv != NULL)            /* skip env */
        ++argv;

    Elf32_auxv_t *aux = (Elf32_auxv_t*) ++argv;  /* aux vector start */

    while (aux->a_type != AT_SYSINFO) {
        if (aux->a_type == AT_NULL)
            return 1;
        ++aux;
    }

    putc_2 ('2', (void*) aux->a_un.a_val);

    return 0;
}

As you will see by taking a look at the following snippet of /usr/include/asm/unistd_32.h on my system:

#define __NR_restart_syscall 0
#define __NR_exit 1
#define __NR_fork 2
#define __NR_read 3
#define __NR_write 4
#define __NR_open 5
#define __NR_close 6

The syscall I used is the one numbered 4 (write) as passed in the eax register.
Taking filedescriptor (ebx = 1), data-pointer (ecx = &c) and size (edx = 1) as its arguments, each passed in the corresponding register.

To put a long story short

Comparing a supposedly slow int 0x80 system call on an Intel CPU with a (hopefully) much faster implementation using the (originally AMD-invented) syscall instruction is comparing apples to oranges.

IMHO: most probably the sysenter instruction, rather than int 0x80, should be put to the test here.

Can an x86_64 and/or armv7-m mov instruction be interrupted mid-operation?

You have to read the documentation for each separate core and/or chip. x86 is a completely separate thing from ARM, and within both families each instance may vary from any other instance; each can be, and should be expected to be, a completely new design. It might not be, but from time to time it is.

Things to watch out for as noted in the comments.

typedef unsigned int uint32_t;

uint32_t uptime = 0;

void ISR ( void )
{
    ++uptime;
}
void some_func ( void )
{
    uint32_t now = uptime;
}

On my machine with the tool I am using today:

Disassembly of section .text:

00000000 <ISR>:
0: e59f200c ldr r2, [pc, #12] ; 14 <ISR+0x14>
4: e5923000 ldr r3, [r2]
8: e2833001 add r3, r3, #1
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0

00000018 <some_func>:
18: e12fff1e bx lr

Disassembly of section .bss:

00000000 <uptime>:
0: 00000000 andeq r0, r0, r0

This could vary, but if you find a tool on one machine one day that builds a problem, then you can assume it is a problem. So far we are actually okay: because some_func is dead code, the read is optimized out.

typedef unsigned int uint32_t;

uint32_t uptime = 0;

void ISR ( void )
{
    ++uptime;
}
uint32_t some_func ( void )
{
    uint32_t now = uptime;
    return(now);
}

Fixed:

00000000 <ISR>:
0: e59f200c ldr r2, [pc, #12] ; 14 <ISR+0x14>
4: e5923000 ldr r3, [r2]
8: e2833001 add r3, r3, #1
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0

00000018 <some_func>:
18: e59f3004 ldr r3, [pc, #4] ; 24 <some_func+0xc>
1c: e5930000 ldr r0, [r3]
20: e12fff1e bx lr
24: 00000000 andeq r0, r0, r0

