Kernel Oops Page Fault Error Codes for Arm

Kernel oops Oops: 80000005 on arm embedded system

According to the error message, the code that caused this kernel panic resides at virtual address 0x7eb52754. Judging from the address (just below 0x80000000), I'm guessing this is the code segment of a kernel module - probably one of your own kernel modules.

To do a root cause analysis, load your (and all other) kernel modules in the same order as they were loaded when this panic occurred, and note their load addresses as printed by lsmod (or cat /proc/modules, which is almost the same).

Using their code size and load address, work out which module's text segment contains virtual address 0x7eb52754. Then subtract that module's load address from 0x7eb52754.

What you will get is the offset into the module binary of the instruction that caused the panic.
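As a rough sketch of the calculation above, the following Python parses /proc/modules-style lines, finds the module whose address range contains the faulting address, and returns the offset. The sample module names, sizes and load addresses are made up for illustration; on a real system you would read /proc/modules as root (otherwise the addresses may be shown as zeros). It also assumes the size column covers the module's text, which is an approximation.

```python
# Hypothetical /proc/modules content: name, size, refcount, deps, state, address.
SAMPLE = """\
netdrv 20480 0 - Live 0x7eb48000 (O)
mydriver 16384 1 - Live 0x7eb50000 (O)
"""


def parse_modules(text):
    """Return a list of (name, load_address, size) tuples."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        name, size, address = fields[0], int(fields[1]), int(fields[5], 16)
        modules.append((name, address, size))
    return modules


def locate(fault_addr, modules):
    """Return (module_name, offset) for the module containing fault_addr, or None."""
    for name, base, size in modules:
        if base <= fault_addr < base + size:
            return name, fault_addr - base
    return None


if __name__ == "__main__":
    hit = locate(0x7EB52754, parse_modules(SAMPLE))
    if hit is None:
        print("address is not inside any listed module")
    else:
        print(f"{hit[0]} + {hit[1]:#x}")
```

Note the direction of the subtraction: faulting address minus load address, so the result is a small positive offset into the module.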

Now use objdump on the kernel module binary, look for that offset, and check which function it belongs to (this can also be done with addr2line, if you have that too). This should point you to the function, and even the line number (if you have debug information), of the instruction that caused this panic.
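If you want to script the lookup, here is a minimal Python sketch that builds and runs an addr2line command for a given offset. The module file name and the offset 0x2754 are hypothetical. It also assumes the offset is relative to the module's .text section, which only holds if .text sits at the start of the loaded module. For a cross-compiled ARM module you would pass your toolchain's addr2line as the tool argument.

```python
import subprocess


def addr2line_command(module_path, offset, tool="addr2line"):
    """Build the argument list: -f prints the function name, -e names the
    object file, -j selects the section the offset is relative to."""
    return [tool, "-f", "-e", module_path, "-j", ".text", hex(offset)]


def resolve(module_path, offset, tool="addr2line"):
    """Run addr2line and return its output (function name, then file:line)."""
    result = subprocess.run(
        addr2line_command(module_path, offset, tool),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical module file and offset; only prints the command.
    print(" ".join(addr2line_command("mydriver.ko", 0x2754)))
```

Without debug information in the module, addr2line can still report the function name but prints ??:0 for the source line.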

good luck.

What does code 0x017 signify in an unhandled page fault?

This is not the si_code, but the value of ARM's FSR (Fault Status Register):

 0x17 = 0b0001 0111

According to the ARM manual:

[Bits 7:4] Specifies which of the 16 domains (D15-D0) was being accessed
when a data fault occurred.

[Bits 3:0] Type of fault generated

So the domain is 1, which is DOMAIN_USER in the kernel (user memory only). The fault type is 0b0111, a page translation fault: the second-level (page) descriptor for the address was invalid, i.e. the page is not mapped.
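The bit-field split described above can be sketched in Python as follows. The table of fault types is a partial list from the ARM short-descriptor encoding, not the full set, and the function only looks at bits 7:0 as quoted from the manual (newer cores add further status bits that this sketch ignores).

```python
# Partial table of FSR fault status values (bits 3:0); not exhaustive.
FAULT_TYPES = {
    0b0001: "alignment fault",
    0b0101: "translation fault, section",
    0b0111: "translation fault, page",
    0b1001: "domain fault, section",
    0b1011: "domain fault, page",
    0b1101: "permission fault, section",
    0b1111: "permission fault, page",
}


def decode_fsr(fsr):
    """Split an FSR value into (domain, fault type description)."""
    domain = (fsr >> 4) & 0xF  # bits 7:4: which of the 16 domains
    status = fsr & 0xF         # bits 3:0: type of fault
    description = FAULT_TYPES.get(status, f"other/unknown ({status:#06b})")
    return domain, description


if __name__ == "__main__":
    domain, description = decode_fsr(0x17)
    print(f"domain {domain}: {description}")
```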

Assembly page fault handler cannot be called due to invalid stack pointer

How should I resolve this situation?

I'd resolve the situation by using avoidance - don't let the kernel have a dodgy stack pointer in the first place (and don't let the kernel stack be sent to swap space, don't use page faults for "auto-growing kernel stack", etc). Note that the CPU will automatically switch to the kernel stack if a page fault happens in user-space (at CPL=3), so it doesn't matter if user-space has a dodgy stack pointer.

Alternatives are:

  • force a kernel stack switch when kernel code (CPL=0) causes a page fault. This can be done using hardware task switch (protected mode) or the IST mechanism (long mode) for the page fault exception handler. This would be the best option for recovery (e.g. makes it easier to figure out what the problem was, fix it, then return).

  • force a kernel stack switch when kernel code (CPL=0) causes a double fault. This can be done using hardware task switch (protected mode) or the IST mechanism (long mode) for the double fault exception handler. This would be the best option for performance (no added overhead for normal page faults).

Note 1: Be warned that neither hardware task switching/task gates nor IST are re-entrant. For hardware task switching, if a second page fault occurs while you're handling the first page fault you'll get a general protection fault (because the "page fault task" is busy); and for IST, if a second page fault occurs while you're handling the first page fault the second page fault will trash/overwrite the first page fault's stack and make it impossible to recover. In theory, you can mitigate these problems by switching to a different task or different stack as soon as possible, but that's complicated/messy and likely to cause even more problems.

Note 2: You'll probably end up with a combination of avoidance and double fault using hardware task switch or IST; with the double fault handler doing "freeze system and dump info/panic" as a generic fallback for catastrophic kernel failures (that were supposed to be avoided but weren't).

Note 3: If you want to support "auto-growing kernel stacks", you can use "stack probes" instead - basically, just do dummy reads (in function prologues) from the "future stack" before using the memory for stack, so that the page fault occurs when there's still enough kernel stack left for the page fault handler.

What do these kernel panic errors mean?

The values in parentheses are the contents of the IFSR (instruction fault status register). There are many causes for aborts, and these values give the specific cause. There are tables in the kernel that dispatch on the particular fault cause; some entries handle the fault, and others have a handler which does a printk and then aborts the task or can panic() the kernel. See: arch/arm/mm/fault.c. The value is probably not valuable unless you are developing a fault handler. Although it can give an idea of what the fault is about, it is better just to get the PC and look at the code at that address (which I think was already printed?).

These faults can occur anywhere: in a user task, a kernel task, an interrupt handler, etc. Since your interrupt handler has crashed, Linux decides to stop everything and not bother proceeding. Otherwise, you could corrupt disks (even more), etc.

Note: The code that reads the fault status register lives in an abort-*.S file, which is different for each particular ARM CPU. For example, see v7_early_abort in abort-ev7.S. This is put in a processor table, which is matched at boot time.

  1. Unhandled fault - trying to read memory that is not mapped (via MMU).
  2. Kernel panic - an unhandled fault occurred in code deemed un-recoverable.

