DMA Cache Coherence Management

dma_map_single() internals on the ARM architecture

Not quite.

For the streaming DMA API (i.e. dma_map_*()/dma_unmap_*()), nothing is actually remapped. Only addresses from the kernel linear mapping (i.e. normal kmalloc() memory) are valid for streaming DMA, and since that CPU mapping is cacheable, the dma_map_*() operations for a non-coherent device will clean/invalidate the caches as appropriate for the extent of the buffer and rely on the CPU not accessing it until the corresponding dma_unmap_*(). That unmap will then (if appropriate) invalidate the caches again, in case of any speculative fetches in the meantime, before the CPU may read any data written to memory by the device. For cache-coherent devices, none of that is needed, so it is skipped.
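
As a rough illustration of that usage pattern (not taken from any particular driver), a device-to-memory transfer with the streaming API looks something like the sketch below; start_dma_rx() and wait_for_dma_rx() are hypothetical stand-ins for however the device is actually programmed:

    #include <linux/dma-mapping.h>
    #include <linux/slab.h>

    /* Hypothetical device-specific helpers, declared only for this sketch. */
    extern void start_dma_rx(struct device *dev, dma_addr_t addr, size_t len);
    extern int wait_for_dma_rx(struct device *dev);

    /*
     * Minimal sketch of a streaming device-to-memory transfer. The buffer
     * must come from the linear mapping (kmalloc), and the CPU must not
     * touch it between dma_map_single() and dma_unmap_single().
     */
    static int rx_one_buffer(struct device *dev, size_t len)
    {
    	void *buf = kmalloc(len, GFP_KERNEL);
    	dma_addr_t dma;
    	int ret;

    	if (!buf)
    		return -ENOMEM;

    	/* For a non-coherent device this cleans/invalidates the CPU caches. */
    	dma = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    	if (dma_mapping_error(dev, dma)) {
    		kfree(buf);
    		return -ENOMEM;
    	}

    	start_dma_rx(dev, dma, len);	/* hypothetical: program the device */
    	ret = wait_for_dma_rx(dev);	/* hypothetical: wait for completion */

    	/* Invalidates again (speculative fetches) before the CPU reads. */
    	dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);

    	/* Only now is it safe to read the data the device wrote. */
    	kfree(buf);
    	return ret;
    }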

Since the buffer is in the linear map, the DMA address is a simple case of a virt_to_phys() offset, minus any device-specific offset to convert between physical memory and bus addresses in certain cases of funky hardware (e.g. Raspberry Pi 2/3 or TI Keystone 2) - see e.g. the ARM implementation of dma_map_page() (of which dma_map_single() is merely a special case). Where an IOMMU is involved, there is the additional step of creating an IOVA mapping for that physical address, and returning that IOVA instead of the underlying bus address.
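
For illustration only, the non-IOMMU map path essentially boils down to something like the following sketch of the generic dma-direct behaviour. The real ARM implementation differs in detail, and helpers such as phys_to_dma(), dev_is_dma_coherent() and arch_sync_dma_for_device() are kernel-internal and have moved between headers across versions:

    #include <linux/dma-direct.h>
    #include <linux/dma-map-ops.h>
    #include <asm/io.h>

    /*
     * Illustrative sketch of what mapping a linear-map buffer amounts to
     * without an IOMMU: cache maintenance for a non-coherent device, then
     * a simple physical-to-bus address conversion.
     */
    static dma_addr_t sketch_map_single(struct device *dev, void *cpu_addr,
    				    size_t size, enum dma_data_direction dir)
    {
    	phys_addr_t phys = virt_to_phys(cpu_addr);

    	if (!dev_is_dma_coherent(dev))
    		arch_sync_dma_for_device(phys, size, dir); /* clean/invalidate */

    	/* Apply any bus offset (e.g. Raspberry Pi 2/3, TI Keystone 2). */
    	return phys_to_dma(dev, phys);
    }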

Note that for the coherent DMA API (i.e. dma_alloc_coherent()), when a device is not cache-coherent itself, we do create a separate non-cacheable mapping of the allocated pages in the vmalloc area, then use that non-cacheable alias for all CPU accesses to that buffer (after some initial cache maintenance to clean the linear map alias), since unlike streaming DMA, both the CPU and the device are allowed to access a coherent buffer at any time.
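
A minimal sketch of typical coherent-API usage, e.g. for a descriptor ring (the function and ring names here are made up for the example): both the CPU and the device may access the buffer at any time with no explicit cache maintenance, whichever kind of CPU mapping was set up behind the scenes.

    #include <linux/dma-mapping.h>
    #include <linux/string.h>

    static int setup_ring(struct device *dev, size_t ring_bytes)
    {
    	dma_addr_t ring_dma;
    	void *ring;

    	ring = dma_alloc_coherent(dev, ring_bytes, &ring_dma, GFP_KERNEL);
    	if (!ring)
    		return -ENOMEM;

    	/*
    	 * The CPU writes descriptors through 'ring'; the device sees them
    	 * via 'ring_dma', with no dma_map/unmap calls in between.
    	 */
    	memset(ring, 0, ring_bytes);

    	/* ... program the device with ring_dma ... */

    	dma_free_coherent(dev, ring_bytes, ring, ring_dma);
    	return 0;
    }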

When are page-frame-specific cache management policies useful?

Having multiple cache management policies is widely used, whether by assigning whole regions using MTRRs (fixed/dynamic, as explained in Intel's PRM), marking MMIO regions, or through special instructions (e.g. streaming loads/stores, non-temporal prefetches, etc.). The use cases also vary a lot: you may be mapping an external I/O device into virtual memory (and not want CPU caching to affect its coherence), defining a write-through region for better integrity management of some database, or simply wanting plain write-back to maximize cache-hierarchy capacity and replacement efficiency (which means performance).
These usages often overlap (especially when multiple applications are running), so the flexibility is very much needed, as you said - you don't want data with little to no spatial/temporal locality to thrash out other lines you use all the time. A small example of one such special instruction follows below.
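
As one userspace illustration, here is a sketch of a non-temporal store loop (assuming SSE2, a 16-byte-aligned destination and a length that is a multiple of 16), which writes a large buffer without pulling its lines into the cache hierarchy and evicting the working set:

    #include <emmintrin.h>	/* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stdint.h>
    #include <stddef.h>

    /*
     * Fill a buffer using streaming (non-temporal) stores, so data with no
     * reuse does not displace cache lines that are accessed all the time.
     */
    void fill_nontemporal(void *dst, uint8_t value, size_t len)
    {
    	__m128i v = _mm_set1_epi8((char)value);
    	char *p = (char *)dst;
    	size_t i;

    	for (i = 0; i < len; i += 16)
    		_mm_stream_si128((__m128i *)(p + i), v);

    	/* Order the streaming stores before any subsequent accesses. */
    	_mm_sfence();
    }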

By the way, caches are never going to be big enough in the foreseeable future (with any known technology), since enlarging them requires locating them further away from the core and paying in latency. So cache management is still, and will remain for a long while, one of the most important concerns for performance-critical systems and applications.


