Direct Memory Access in Linux

You can find plenty of documentation about the kmalloc + mmap approach.
However, I am not sure you can kmalloc that much memory contiguously, or that it will always end up at the same place. Sure, if everything stays identical, you might get a constant address. However, each time you change the kernel code you will get a different address, so I would not go with the kmalloc solution.

I think you should reserve some memory at boot time, i.e. reserve some physical memory so that it is not touched by the kernel. You can then ioremap this memory, which gives you
a kernel virtual address, and then you can mmap it and write a nice device driver.

This takes us back to Linux Device Drivers (available in PDF format). Have a look at chapter 15, which describes this technique starting on page 443.

Edit: ioremap and mmap.
I think this might be easier to debug by doing things in two steps: first get the ioremap
right, and test it using character device operations, i.e. read/write. Once you know you can safely access the whole ioremapped memory using read/write, then try to mmap the whole ioremapped range.
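Step one might look like the following sketch: a minimal character device whose read() copies out of the ioremapped region. FOO_MEM_OFFSET and FOO_MEM_SIZE are placeholders for the physical base and size of the RAM block reserved at boot (e.g. with the memmap= kernel parameter); adjust them to your setup.

```c
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/module.h>
#include <linux/uaccess.h>

#define FOO_MEM_OFFSET 0x10000000UL  /* assumed physical base of reserved RAM */
#define FOO_MEM_SIZE   0x00100000UL  /* assumed size: 1 MiB */

static void __iomem *foo_base;
static int foo_major;

static ssize_t foo_read(struct file *file, char __user *buf,
                        size_t count, loff_t *ppos)
{
    u8 tmp[256];

    if (*ppos >= FOO_MEM_SIZE)
        return 0;
    if (count > sizeof(tmp))
        count = sizeof(tmp);
    if (count > FOO_MEM_SIZE - *ppos)
        count = FOO_MEM_SIZE - *ppos;

    /* ioremapped memory must not be dereferenced directly;
     * memcpy_fromio() bounces it through a normal buffer. */
    memcpy_fromio(tmp, foo_base + *ppos, count);
    if (copy_to_user(buf, tmp, count))
        return -EFAULT;
    *ppos += count;
    return count;
}

static const struct file_operations foo_fops = {
    .owner = THIS_MODULE,
    .read  = foo_read,
};

static int __init foo_init(void)
{
    foo_base = ioremap(FOO_MEM_OFFSET, FOO_MEM_SIZE);
    if (!foo_base)
        return -ENOMEM;
    foo_major = register_chrdev(0, "foo_mem", &foo_fops);
    if (foo_major < 0) {
        iounmap(foo_base);
        return foo_major;
    }
    return 0;
}

static void __exit foo_exit(void)
{
    unregister_chrdev(foo_major, "foo_mem");
    iounmap(foo_base);
}

module_init(foo_init);
module_exit(foo_exit);
MODULE_LICENSE("GPL");
```

Once reading (and a symmetric write) behaves over the whole range, you know the ioremap is sound before touching mmap.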

And if you get in trouble, maybe post another question about mmaping.

Edit: remap_pfn_range
ioremap returns a virtual address, which you must convert to a pfn for remap_pfn_range.
Now, I don't understand exactly what a pfn (Page Frame Number) is, but I think you can get one by calling

virt_to_phys(pt) >> PAGE_SHIFT

This is probably not the Right Way (tm) to do it, but you should try it.

You should also check that FOO_MEM_OFFSET is the physical address of your RAM block, i.e. that before anything happens with the MMU, your memory is available at 0 in the memory map of your processor.
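Step two could then be sketched as an mmap file operation that hands the whole reserved block to user space. FOO_MEM_OFFSET and FOO_MEM_SIZE are the same placeholders as before; since the physical address of the region is known, the pfn can be derived from it directly rather than going through virt_to_phys().

```c
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

#define FOO_MEM_OFFSET 0x10000000UL  /* assumed physical base of reserved RAM */
#define FOO_MEM_SIZE   0x00100000UL  /* assumed size: 1 MiB */

static int foo_mmap(struct file *file, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > FOO_MEM_SIZE)
        return -EINVAL;

    /* remap_pfn_range() wants a page frame number:
     * the physical address divided by the page size. */
    return remap_pfn_range(vma, vma->vm_start,
                           FOO_MEM_OFFSET >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations foo_fops = {
    .owner = THIS_MODULE,
    .mmap  = foo_mmap,
};
```

A user-space process can then open the device and mmap it to get a direct window onto the reserved physical memory.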

What is the difference between DMA and IOMMU?

DMA (direct memory access) is a hardware feature that allows memory accesses to occur independently of the program the microprocessor is currently running. It can be used by I/O devices to read from or write to memory directly, without executing any microprocessor instructions, or it can be used to efficiently copy blocks of memory. During DMA transfers, the microprocessor can execute an unrelated program at the same time.

IOMMU (input–output memory management unit) is a hardware feature that extends the MMU concept to I/O devices. An MMU maps virtual memory addresses to physical memory addresses. While the normal MMU is used to give each process its own virtual address space, the IOMMU is used to give each I/O device its own virtual address space. That way, the I/O device sees a simple contiguous address space, possibly accessible with 32-bit addresses, while in reality the physical address space is fragmented and extends beyond 32 bits.

DMA without an IOMMU requires the I/O devices to use real physical addresses. The physical addresses must also be used by the processor when setting up the DMA transfer. Additionally, DMA without an IOMMU can be used for memory copies (as that involves no I/O devices).
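In Linux driver code this distinction is hidden behind the streaming DMA API: the driver asks for a dma_addr_t and hands that to the device, and whether it is a raw physical address (no IOMMU) or an I/O virtual address (IOMMU present) is transparent. A rough sketch, assuming dev and buf come from the driver's probe path:

```c
#include <linux/dma-mapping.h>

static int foo_start_tx(struct device *dev, void *buf, size_t len)
{
    dma_addr_t bus_addr;

    /* Without an IOMMU, bus_addr is simply the buffer's physical
     * address; with one, it is an address the IOMMU translates. */
    bus_addr = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, bus_addr))
        return -ENOMEM;

    /* ...program bus_addr into the device's DMA address register
     * and start the transfer here... */

    dma_unmap_single(dev, bus_addr, len, DMA_TO_DEVICE);
    return 0;
}
```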

An IOMMU is only available on more powerful microprocessors. You will not find one on microcontrollers or most embedded systems.

What is the difference between DMA and memory-mapped IO?

Memory-mapped I/O allows the CPU to control hardware by reading and writing specific memory addresses. Usually, this would be used for low-bandwidth operations such as changing control bits.

DMA allows hardware to directly read and write memory without involving the CPU. Usually, this would be used for high-bandwidth operations such as disk I/O or camera video input.
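The MMIO side of that contrast can be sketched in a few lines of driver code: the CPU itself performs a load and a store to an ioremapped register. FOO_REG_BASE, FOO_REG_CTRL, and FOO_CTRL_ENABLE are hypothetical; real values come from the device datasheet.

```c
#include <linux/io.h>

#define FOO_REG_BASE    0x40000000UL  /* assumed register block address */
#define FOO_REG_CTRL    0x00          /* assumed control register offset */
#define FOO_CTRL_ENABLE 0x1           /* assumed enable bit */

static void foo_enable(void)
{
    void __iomem *regs = ioremap(FOO_REG_BASE, 0x100);

    if (!regs)
        return;
    /* read-modify-write one control bit: a low-bandwidth,
     * CPU-driven access, unlike a DMA bulk transfer */
    writel(readl(regs + FOO_REG_CTRL) | FOO_CTRL_ENABLE,
           regs + FOO_REG_CTRL);
    iounmap(regs);
}
```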

Here is a paper with a thorough comparison between MMIO and DMA.

Design Guidelines for High Performance RDMA Systems

Direct Memory Access (DMA) Scheduling in a Multithreaded Application

The network driver will already be using DMA to accelerate transfers. When you issue a write, the kernel allocates a contiguous block of physical memory and copies the data from your userspace buffer into this memory. During this phase the kernel attaches all the necessary Ethernet and TCP/IP headers.

The kernel then issues a DMA request to the network card, asking it to take the data from that physical memory location and load it into its internal buffers. At this point your write system call returns. When the network card is finished (and the data is on its way out of the adapter), it signals completion to the kernel.

In Linux, network drivers are normally single-threaded (there are some exceptions to this, but it gets complicated), so if you try to write some data while the driver is already active, it will still be copied into kernel space, but the DMA request will not be issued until the network driver is free again (it will be triggered when the kernel is next notified that a DMA is complete).

The moral of the story is that this already works and is rather fast; there is nothing you need to do to accelerate an application using DMA, it has already been taken care of. The only piece you could speed up would be the copy into the kernel-space buffer, but as this is so much quicker than the actual network transfer (and can be done simultaneously), it makes no difference to throughput, only latency.

N.B. The above is a gross simplification in places; if you want more detail about a specific part, edit your question and I'll do what I can.

What is the difference between DMA-Engine and DMA-Controller?

DMA - Direct memory access. The operation of your driver reading from or writing to your hardware's memory without the CPU being involved (freeing it to do other things).

DMA Controller - reading and writing can't be done by magic; if the CPU doesn't do it, we need other hardware to do it. Many years ago (at the time of ISA/EISA) it was common to use shared hardware on the motherboard that performed this operation. In recent years, each piece of hardware has its own DMA mechanism.
But in all cases this specific hardware gets the source address and the destination address and transfers the data, usually triggering an interrupt when done.

DMA Engine - Now here I am not sure what you mean. I believe you are probably referring to the software side that handles the DMA.
DMA is a little more complicated than usual I/O, since all the SRC and DST memory has to be physically present at all times during the DMA operation. If the DST address is swapped out to disk, the hardware will write to a bad address and the system will crash.
This and other aspects of DMA are handled by the driver, in the code sections you probably refer to as the "DMA Engine".

*Another interpretation of what "DMA Engine" is may be the part of the firmware (or hardware) that drives the DMA controller on the hardware side.
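For what it's worth, Linux itself uses the name for its "dmaengine" framework, which is the software side described above: a driver borrows a channel from whatever DMA controller hardware is present and asks it to move data. A rough sketch of a memory-to-memory copy through that framework, assuming dst and src are already DMA-mapped addresses of len bytes:

```c
#include <linux/dmaengine.h>

static int foo_memcpy(dma_addr_t dst, dma_addr_t src, size_t len)
{
    struct dma_chan *chan;
    struct dma_async_tx_descriptor *tx;
    dma_cookie_t cookie;
    dma_cap_mask_t mask;

    /* ask the framework for any channel capable of memcpy */
    dma_cap_zero(mask);
    dma_cap_set(DMA_MEMCPY, mask);
    chan = dma_request_chan_by_mask(&mask);
    if (IS_ERR(chan))
        return PTR_ERR(chan);

    tx = dmaengine_prep_dma_memcpy(chan, dst, src, len, DMA_CTRL_ACK);
    if (!tx) {
        dma_release_channel(chan);
        return -EIO;
    }

    cookie = dmaengine_submit(tx);
    dma_async_issue_pending(chan);  /* the controller does the copy */
    dma_sync_wait(chan, cookie);    /* simple blocking wait for completion */
    dma_release_channel(chan);
    return 0;
}
```

A real driver would typically use a completion callback rather than the blocking wait, but this shows the division of labor: the controller hardware moves the bytes, the engine code sets up and tracks the transfer.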

Direct memory access DMA - how does it work?

First of all, DMA (per se) is almost entirely obsolete. As originally defined, DMA controllers depended on the fact that the bus had separate lines to assert for memory read/write, and I/O read/write. The DMA controller took advantage of that by asserting both a memory read and I/O write (or vice versa) at the same time. The DMA controller then generated successive addresses on the bus, and data was read from memory and written to an output port (or vice versa) each bus cycle.

The PCI bus, however, does not have separate lines for memory read/write and I/O read/write. Instead, it encodes one (and only one) command for any given transaction. Instead of using DMA, PCI normally does bus-mastering transfers. This means instead of a DMA controller that transfers memory between the I/O device and memory, the I/O device itself transfers data directly to or from memory.

As for what else the CPU can do at the time, it all depends. Back when DMA was common, the answer was usually "not much" -- for example, under early versions of Windows, reading or writing a floppy disk (which did use the DMA controller) pretty much locked up the system for the duration.

Nowadays, however, the memory typically has considerably greater bandwidth than the I/O bus, so even while a peripheral is reading or writing memory, there's usually a fair amount of bandwidth left over for the CPU to use. In addition, a modern CPU typically has a fairly large cache, so it can often execute instructions without using main memory at all.
