PTE structure in the Linux kernel

The pteval_t type just treats the page table entry as an opaque blob - on the architecture you're looking at, it's just a 32-bit unsigned value.

The fields within the PTE are accessed using bitwise operators and masks - in the source I have handy (Linux 2.6.24), these are defined in include/asm-x86/pgtable_32.h. The fields you see in the Intel Reference Manual (most of which are single-bit flags) are defined here - for example:

#define _PAGE_PRESENT   0x001
#define _PAGE_RW        0x002
#define _PAGE_USER      0x004
#define _PAGE_PWT       0x008
#define _PAGE_PCD       0x010
#define _PAGE_ACCESSED  0x020
#define _PAGE_DIRTY     0x040
#define _PAGE_PSE       0x080   /* 4 MB (or 2MB) page, Pentium+, if present.. */
#define _PAGE_GLOBAL    0x100   /* Global TLB entry PPro+ */
#define _PAGE_UNUSED1   0x200   /* available for programmer */
#define _PAGE_UNUSED2   0x400
#define _PAGE_UNUSED3   0x800
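
As a minimal sketch of how such masks are typically used, here is a userspace illustration (not kernel code). The pteval_t typedef and the mask values mirror the definitions above; the helper names pte_is_present, pte_is_user_writable and pte_clear_dirty_accessed are made up for this example and are not the kernel's accessors.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the 32-bit pteval_t described above. */
typedef uint32_t pteval_t;

#define _PAGE_PRESENT   0x001
#define _PAGE_RW        0x002
#define _PAGE_USER      0x004
#define _PAGE_ACCESSED  0x020
#define _PAGE_DIRTY     0x040

/* Hypothetical helper: true if the present bit is set. */
static bool pte_is_present(pteval_t pte)
{
    return (pte & _PAGE_PRESENT) != 0;
}

/* Hypothetical helper: true only if present, RW and user bits are all set. */
static bool pte_is_user_writable(pteval_t pte)
{
    const pteval_t need = _PAGE_PRESENT | _PAGE_RW | _PAGE_USER;

    return (pte & need) == need;
}

/* Hypothetical helper: copy of the entry with dirty and accessed cleared. */
static pteval_t pte_clear_dirty_accessed(pteval_t pte)
{
    return pte & ~(pteval_t)(_PAGE_DIRTY | _PAGE_ACCESSED);
}
```

For instance, an entry whose low 12 bits are 0x067 has the present, RW, user, accessed and dirty bits set; clearing dirty and accessed leaves 0x007.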

Details for PTE and struct page

When a userspace process accesses some virtual address, the MMU tries to find the PTE for the requested virtual address.

Yes, this is correct. The page table is walked down to the specific PTE (if any).

In the PTE there is the encoded struct page's PFN and some flags.

Not really. The PTE contains the PFN (Page Frame Number) of the actual physical memory page that the virtual address translates to. In other words, it points to the actual page in physical memory, not to the corresponding struct page.
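
To make the translation concrete, here is a small sketch that assumes the 32-bit, non-PAE x86 layout with 4 KiB pages, where the upper 20 bits of the entry hold the PFN and the lower 12 bits hold the flags. PAGE_SHIFT and PAGE_MASK follow the kernel's naming, but pte_to_pfn and pte_to_phys are hypothetical helpers written for this illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))   /* 0xfffff000 with 4 KiB pages */

/* Hypothetical helper: the page frame number sits above the flag bits. */
static uint32_t pte_to_pfn(uint32_t pte)
{
    return pte >> PAGE_SHIFT;
}

/*
 * Hypothetical helper: physical address = base of the frame named by the
 * PTE, plus the offset within the page taken from the virtual address.
 */
static uint32_t pte_to_phys(uint32_t pte, uint32_t vaddr)
{
    return (pte & PAGE_MASK) | (vaddr & ~PAGE_MASK);
}
```

With a PTE of 0x12345067 and a virtual address of 0x08048abc, the PFN is 0x12345 and the resulting physical address is 0x12345abc. No struct page is involved in this step.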

If the translated address points to a struct page, how exactly is physical memory accessed? I think struct page is just a page descriptor, not an empty physical memory region.

The translated address does not point to a struct page, it points to physical memory. Indeed, the struct page is just a "descriptor" used by the system to keep track of the nature and state of a page, and is stored somewhere else.

All the struct page structures are stored in some specific memory area which depends on the underlying architecture. You can read more about it in Chapter 2: Describing Physical Memory of Mel Gorman's book "Understanding the Linux Virtual Memory Manager".

Once you have a PTE (pte_t), the pte_page() macro can be used to get the address of the corresponding struct page. This address is calculated using a set of macros (e.g. __pfn_to_page()) which, at least under the SPARSEMEM memory model, end up indexing a mem_section that holds a pointer to an array of struct page (.section_mem_map). There is a global array of struct mem_section, and the upper bits of the PFN stored in the PTE give the section index, which is used to select the correct section.
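
The lookup described above can be modelled roughly as follows. This is a simplified userspace toy, not the kernel's implementation: the struct definitions are stand-ins, the PFN_SECTION_SHIFT value of 4 is chosen arbitrarily for illustration, all toy_* names are hypothetical, and section_mem_map is stored here as a plain pointer whereas the kernel's own encoding of that field is more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real structures carry far more state. */
struct page { unsigned long flags; };
struct mem_section { struct page *section_mem_map; };

#define PFN_SECTION_SHIFT 4                      /* assumed: 16 pages per section */
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS   4                      /* assumed: tiny toy machine */

static struct page toy_pages[NR_MEM_SECTIONS][PAGES_PER_SECTION];
static struct mem_section toy_sections[NR_MEM_SECTIONS];

/* Hypothetical setup: point each section at its array of descriptors. */
static void toy_sections_init(void)
{
    for (size_t i = 0; i < NR_MEM_SECTIONS; i++)
        toy_sections[i].section_mem_map = toy_pages[i];
}

/*
 * Simplified PFN -> struct page lookup: the upper PFN bits select the
 * section, the lower bits index into that section's descriptor array.
 */
static struct page *toy_pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn >> PFN_SECTION_SHIFT;

    assert(nr < NR_MEM_SECTIONS);
    return toy_sections[nr].section_mem_map + (pfn & (PAGES_PER_SECTION - 1));
}
```

In this toy layout, PFN 0x23 resolves to section 2, entry 3. The point is only that the descriptor lives in a separate array indexed by PFN, not in the page it describes.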

How does fork() mark the parent's PTEs as read-only?

Linux implements the fork syscall by iterating over all memory ranges (mmaps, stack and heap) of the parent process. Copying of those ranges (VMAs - virtual memory areas) is done in the function copy_page_range (mm/memory.c), which loops over the page table entries:

  • copy_page_range will iterate over pgd and call
  • copy_pud_range to iterate over pud and call
  • copy_pmd_range to iterate over pmd and call
  • copy_pte_range to iterate over pte and call
  • copy_one_pte which does memory usage accounting (RSS) and has several code segments to handle COW case:
    /*
     * If it's a COW mapping, write protect it both
     * in the parent and the child
     */
    if (is_cow_mapping(vm_flags)) {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }

where is_cow_mapping is true for private, potentially writable mappings: the flags bitfield is checked for the shared and maywrite bits, and only the maywrite bit may be set.

#define VM_SHARED   0x00000008
#define VM_MAYWRITE 0x00000020

static inline bool is_cow_mapping(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
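
To see how the check and the write-protect step fit together, here is a userspace sketch. is_cow_mapping and the two VM_* values are copied from above; toy_pte_wrprotect and toy_child_pte are hypothetical helpers that model only the "clear the RW bit" part of the copy_one_pte snippet, assuming the 32-bit x86 _PAGE_RW value of 0x002. Everything else the kernel does in that function is left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef unsigned long vm_flags_t;

#define VM_SHARED   0x00000008
#define VM_MAYWRITE 0x00000020
#define _PAGE_RW    0x002

static bool is_cow_mapping(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Toy model of write-protecting an entry: clear RW, keep all other bits. */
static uint32_t toy_pte_wrprotect(uint32_t pte)
{
    return pte & ~(uint32_t)_PAGE_RW;
}

/*
 * Hypothetical helper: the entry value a child would receive in this toy
 * model. COW mappings lose the RW bit; other mappings are copied as-is.
 */
static uint32_t toy_child_pte(uint32_t parent_pte, vm_flags_t vm_flags)
{
    if (is_cow_mapping(vm_flags))
        return toy_pte_wrprotect(parent_pte);
    return parent_pte;
}
```

A private writable mapping (VM_MAYWRITE without VM_SHARED) turns 0x12345067 into 0x12345065, while a shared mapping keeps its RW bit, so writes through it stay visible to both processes.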

PUD, PMD, and PTE are described in books like https://www.kernel.org/doc/gorman/html/understand/understand006.html and in articles like LWN 2005: "Four-level page tables merged".

How the fork implementation calls copy_page_range:

  • the fork syscall implementation (sys_fork? or SYSCALL_DEFINE0(fork)) is do_fork (kernel/fork.c), which will call
  • copy_process which will call many copy_* functions, including
  • copy_mm which calls
  • dup_mm to allocate and fill new mm struct, where most work is done by
  • dup_mmap (still kernel/fork.c), which checks what was mmapped and how. (I could not find the exact path to the COW implementation here, so I searched the web for something like "fork+COW+dup_mm" and got hints like [1], [2] or [3].) After checking the mmap types, the line retval = copy_page_range(mm, oldmm, mpnt); does the real work.
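
The effect of this whole path can be observed from userspace. The sketch below assumes a POSIX system; fork_write_is_private and the counter variable are names invented for this example. It only demonstrates the visible semantics (the child's write does not reach the parent), not the page-table mechanics themselves.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 1;

/*
 * Returns 1 if the parent's copy of counter is unchanged after the child
 * wrote to it, 0 if it changed, and -1 on any error.
 */
static int fork_write_is_private(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the kernel gives us our own copy of the page on this write. */
        counter = 42;
        _exit(counter == 42 ? 0 : 1);
    }

    /* Parent: wait for the child, then inspect our own copy. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return counter == 1;
}
```

Calling fork_write_is_private() from main should return 1 on Linux: both processes start out sharing the page that holds counter, and the first write triggers a private copy instead of modifying the shared one.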

