dma_map_single internals on ARM architecture
Not quite.
For the streaming DMA API (i.e. dma_map_*()/dma_unmap_*()), nothing is actually remapped. Only addresses from the kernel linear mapping (i.e. normal kmalloc() memory) are valid for streaming DMA, so since the CPU mapping is cacheable, the dma_map_*() operations for a non-coherent device will clean/invalidate the caches as appropriate for the extent of the buffer and rely on the CPU not accessing it until the corresponding dma_unmap_*(). That will then (if appropriate) invalidate the caches again, in case of any speculative fetches in the meantime, before the CPU may read any data written to memory by the device. For cache-coherent devices, none of that is needed, so it's skipped.
Since the buffer is in the linear map, the DMA address is a simple case of a virt_to_phys() offset, minus any device-specific offset to convert between physical memory and bus addresses in certain cases of funky hardware (e.g. Raspberry Pi 2/3 or TI Keystone 2) - see e.g. the ARM implementation of dma_map_page() (of which dma_map_single() is merely a special case). Where an IOMMU is involved, there is the additional step of creating an IOVA mapping for that physical address, and returning that IOVA instead of the underlying bus address.
Note that for the coherent DMA API (i.e. dma_alloc_coherent()), when a device is not cache-coherent itself, we do create a separate non-cacheable mapping of the allocated pages in the vmalloc area, and then use that non-cacheable alias for all CPU accesses to that buffer (after some initial cache maintenance to clean the linear-map alias), since unlike streaming DMA, both the CPU and the device are allowed to access a coherent buffer at any time.
When are page frame specific cache management policies useful?
Having multiple cache management policies is widely used, whether by assigning whole regions using MTRRs (fixed/dynamic, as explained in Intel's PRM), MMIO regions, or through special instructions (e.g. streaming loads/stores, non-temporal prefetches, etc.). The use cases also vary a lot, whether you're trying to map an external I/O device into virtual memory (and don't want CPU caching to impact its coherence), or whether you want to define a writethrough region for better integrity management of some database, or just want plain writeback to maximize the cache-hierarchy capacity and replacement efficiency (which means performance).
These usages often overlap (especially when multiple applications are running), so the flexibility is very much needed, as you said - you don't want data with little to no spatial/temporal locality to thrash out other lines you use all the time.
By the way, caches are never going to be big enough in the foreseeable future (with any known technology), since increasing them requires locating them further away from the core and paying in latency. So cache management is still, and will be for a long while, one of the most important concerns for performance-critical systems and applications.