Poor memcpy Performance in User Space for mmap'ed Physical Memory in Linux

Poor memcpy Performance on Linux

[I would make this a comment, but do not have enough reputation to do so.]

I have a similar system and see similar results, but can add a few data points:

  • If you reverse the direction of your naive memcpy (i.e. convert it to *p_dest-- = *p_src--), you may get much worse performance than in the forward direction (~637 ms for me; see the sketch after this list). There was a change to memcpy() in glibc 2.12 that exposed several bugs in callers that use memcpy on overlapping buffers (http://lwn.net/Articles/414467/), and I believe the issue was caused by switching to a version of memcpy that operates backwards. So, backward versus forward copies may explain the memcpy()/memmove() disparity.
  • It seems better not to use non-temporal stores. Many optimized memcpy() implementations switch to non-temporal stores (which bypass the cache) for large buffers (i.e. larger than the last-level cache). I tested Agner Fog's version of memcpy (http://www.agner.org/optimize/#asmlib) and found that it was approximately the same speed as the version in glibc. However, asmlib has a function (SetMemcpyCacheLimit) that allows setting the threshold above which non-temporal stores are used. Setting that limit to 8 GiB (or just anything larger than the 1 GiB buffer) to avoid the non-temporal stores doubled performance in my case (time down to 176 ms). Of course, that only matches the forward-direction naive performance, so it is not stellar.
  • The BIOS on those systems allows four different hardware prefetchers to be enabled/disabled (MLC Streamer Prefetcher, MLC Spatial Prefetcher, DCU Streamer Prefetcher, and DCU IP Prefetcher). I tried disabling each of them; at best this maintained performance parity, and a few of the settings reduced performance.
  • Disabling the running average power limit (RAPL) DRAM mode has no impact.
  • I have access to other Supermicro systems running Fedora 19 (glibc 2.17). With a Supermicro X9DRG-HF board, Fedora 19, and Xeon E5-2670 CPUs, I see performance similar to the above. On a Supermicro X10SLM-F single-socket board running a Xeon E3-1275 v3 (Haswell) and Fedora 19, I see 9.6 GB/s for memcpy (104 ms). The RAM on the Haswell system is DDR3-1600 (the same as on the other systems).
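
For reference, here is a minimal sketch of the naive forward and backward byte-at-a-time copies referred to in the first bullet. The 1 GiB buffer size matches the question; the timing harness, the memset warm-up, and the function names are my own assumptions, not the original test code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE (1UL << 30)   /* assume a 1 GiB buffer, as in the question */

    /* Naive forward copy: one byte at a time, ascending addresses. */
    static void copy_forward(char *dest, const char *src, size_t n)
    {
        while (n--)
            *dest++ = *src++;
    }

    /* Naive backward copy: one byte at a time, descending addresses. */
    static void copy_backward(char *dest, const char *src, size_t n)
    {
        dest += n;
        src  += n;
        while (n--)
            *--dest = *--src;
    }

    static double elapsed_ms(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
    }

    int main(void)
    {
        char *src  = malloc(BUF_SIZE);
        char *dest = malloc(BUF_SIZE);
        struct timespec t0, t1;

        if (!src || !dest)
            return 1;

        /* Touch both buffers first so page faults are not part of the timing. */
        memset(src, 1, BUF_SIZE);
        memset(dest, 0, BUF_SIZE);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        copy_forward(dest, src, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("forward:  %.1f ms\n", elapsed_ms(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        copy_backward(dest, src, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("backward: %.1f ms\n", elapsed_ms(t0, t1));

        free(src);
        free(dest);
        return 0;
    }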

UPDATES

  • I set the CPU power management to Max Performance and disabled hyperthreading in the BIOS. Based on /proc/cpuinfo, the cores were then clocked at 3 GHz. However, this oddly decreased memory performance by around 10%.
  • memtest86+ 4.10 reports bandwidth to main memory of 9091 MB/s. I could not determine whether this corresponds to read, write, or copy bandwidth.
  • The STREAM benchmark reports 13422 MB/s for copy, but they count bytes as both read and written, so that corresponds to ~6.5 GB/s if we want to compare to the above results.

memcpy performance on /dev/mem outside kernel RAM

Note: A lot of this is prefaced by my top comments, so I'll try to avoid repeating them verbatim.

Regarding buffers for DMA, kernel access, and userspace access: the buffers can be allocated by any suitable mechanism.

As mentioned, when using mem=512M and mmap'ing /dev/mem from userspace, the mem driver may not set an optimal caching policy. Also, mem=512M is more typically used to tell the kernel to simply never use that memory (e.g. to test with less system memory), not to carve out an upper 512M that a driver will later use.

A better way may be to leave off mem=512M and use CMA, as you mentioned. Another way may be to bind the driver into the kernel and have it reserve the full memory block during system startup [possibly using CMA].

The memory area might be chosen via kernel command line parameters [from grub.cfg] such as mydev.area= and mydev.size=. That is useful for the "bound" driver that must know these values during the "early" phases of system startup.
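
As a rough illustration of that approach, a built-in driver could pick up mydev.area= and mydev.size= as module parameters and claim the region early. This is only a hedged sketch: the parameter handling, the ioremap choice, and the assumption that the object is named mydev (which determines the mydev. prefix) are mine, and reserving system RAM this way has more subtleties than shown here.

    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/ioport.h>
    #include <linux/io.h>

    /* Hypothetical parameters; with the driver built in (and the object
     * named mydev), these can be set from the kernel command line in
     * grub.cfg as mydev.area=... mydev.size=... */
    static unsigned long area;
    static unsigned long size;
    module_param(area, ulong, 0444);
    module_param(size, ulong, 0444);

    static void __iomem *mydev_base;

    static int __init mydev_init(void)
    {
        if (!area || !size)
            return -EINVAL;

        /* Claim the physical range so nothing else grabs it. */
        if (!request_mem_region(area, size, "mydev"))
            return -EBUSY;

        /* Map it with a caching policy the driver chooses; for real RAM,
         * memremap() with an appropriate flag may be more suitable. */
        mydev_base = ioremap(area, size);
        if (!mydev_base) {
            release_mem_region(area, size);
            return -ENOMEM;
        }
        return 0;
    }

    static void __exit mydev_exit(void)
    {
        iounmap(mydev_base);
        release_mem_region(area, size);
    }

    module_init(mydev_init);
    module_exit(mydev_exit);
    MODULE_LICENSE("GPL");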

So, now we have the "big" area. Next, we need a way for the device to get access and for the application to get it mapped. The kernel driver can do this: when the device is opened, an ioctl can set up the mappings with the correct kernel policy.

So, depending on the allocation mechanism, the ioctl can be given address/length by the application, or it can pass them back to the application [suitably mapped].

When I had to do this, I created a struct that described a memory area/buffer. It could be the whole area, or the large area could be subdivided as needed. Rather than using a variable-length, dynamic scheme equivalent to malloc [like what you were writing], I've found that fixed-size subpools work better. In the kernel, this is called a "slab" allocator.

The struct had an "id" number for the given area. It also had three addresses: the address the app could use, the address the kernel driver could use, and the address that would be given to the H/W device. Also, in the case of multiple devices, it might have an id for whichever particular device it is currently associated with.
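
A hedged sketch of what such a descriptor might look like as a struct shared between the driver and the application (the field names, the flags field, and the fixed-width types are my assumptions; only the id, the device id, and the three addresses come from the description above):

    #include <linux/types.h>

    /* One fixed-size buffer carved out of the "big" reserved area. */
    struct mydev_buf_desc {
        __u32 id;          /* which buffer in the subpool */
        __u32 dev_id;      /* which device it is currently associated with */
        __u64 user_addr;   /* address the application can use (mmap'ed) */
        __u64 kern_addr;   /* address the kernel driver can use */
        __u64 hw_addr;     /* bus/DMA address handed to the H/W device */
        __u32 length;      /* size of this buffer */
        __u32 flags;       /* e.g. owned-by-app vs. owned-by-device */
    };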

So, you take the large area and subdivide it like this: say there are 5 devices, where Dev0 needs ten 1K buffers, Dev1 needs ten 20K buffers, Dev3 needs ten 2K buffers, ...

The application and kernel driver would keep lists of these descriptor structs. The application would start DMA with another ioctl that would take a descriptor id number. Repeat this for all devices.

The application could then issue an ioctl that waits for completion. The driver fills in the descriptor of the just-completed operation. The app processes the data and loops. It does this "in-place"; see below.
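
For illustration, the ioctl interface described above might look roughly like this in a shared header. The magic number, command names, and argument choices are assumptions, not an existing API, and the header assumes the descriptor struct sketched above is visible.

    #include <linux/ioctl.h>
    #include <linux/types.h>

    /* Requires the struct mydev_buf_desc definition from the sketch above. */

    #define MYDEV_IOC_MAGIC  'M'

    /* Hand a descriptor to the driver, or have it pass one back [suitably mapped]. */
    #define MYDEV_IOC_GET_DESC   _IOWR(MYDEV_IOC_MAGIC, 1, struct mydev_buf_desc)

    /* Start DMA into the buffer identified by a descriptor id. */
    #define MYDEV_IOC_START_DMA  _IOW(MYDEV_IOC_MAGIC, 2, __u32)

    /* Block until some buffer completes; the driver fills in the descriptor
     * of the just-completed operation. */
    #define MYDEV_IOC_WAIT_DONE  _IOR(MYDEV_IOC_MAGIC, 3, struct mydev_buf_desc)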

You're concerned about memcpy speed being slow. As we've discussed, that may be due to the way you were using mmap on /dev/mem.

But, if you're DMAing from a device into memory, the CPU cache may hold stale data for that region, so you have to account for that. A real device driver has plenty of in-kernel support routines (the DMA mapping API) to handle this.
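
For example, with the kernel's streaming DMA mapping API (dma_sync_single_for_cpu/for_device), the driver can hand buffer ownership back and forth so the CPU never reads stale cache lines. A minimal sketch, assuming hw_addr is a dma_addr_t obtained earlier from dma_map_single() (or a similar streaming mapping) for this buffer:

    #include <linux/dma-mapping.h>

    /* Before the CPU/application reads a buffer the device just DMA'd into:
     * make sure the CPU sees what the device wrote, not stale cache lines. */
    static void mydev_buf_to_cpu(struct device *dev, dma_addr_t hw_addr, size_t len)
    {
        dma_sync_single_for_cpu(dev, hw_addr, len, DMA_FROM_DEVICE);
    }

    /* Before handing the buffer back to the device for the next transfer. */
    static void mydev_buf_to_device(struct device *dev, dma_addr_t hw_addr, size_t len)
    {
        dma_sync_single_for_device(dev, hw_addr, len, DMA_FROM_DEVICE);
    }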

Here's a big one: Why do you need to do a memcpy at all? If things are set up properly, the application can operate directly on the data without needing to copy it. That is, the DMA operation puts the data in exactly the place the app needs it.

At a guess, right now, you've got your memcpy "racing" against the device. That is, you've got to copy off the data fast, so you can start the next DMA without losing any data.

The "big" area should be subdivided [as mentioned above] and the kernel driver should know about the sections. So, the driver starts DMA to id 0. When that completes, it immediately [in the ISR] starts DMA to id 1. When that completes, it goes onto the next one in its subpool. This can be done in a similar manner for each device. The application could poll for completion with an ioctl

That way, the driver can keep all devices running at maximum speed and the application can have plenty of time to process a given buffer. And, once again, it doesn't need to copy it.
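
A hedged sketch of that interrupt-driven rotation (the register-programming helper, the flag value, the wait queue, and the ring bookkeeping are all hypothetical; a real ISR would also check the device's status/error bits):

    #include <linux/interrupt.h>
    #include <linux/wait.h>

    #define MYDEV_BUF_DONE 0x1                  /* hypothetical "app may read this" flag */

    static DECLARE_WAIT_QUEUE_HEAD(mydev_waitq);

    /* Per-device state: a ring of descriptors from that device's subpool
     * (struct mydev_buf_desc is the descriptor sketched earlier). */
    struct mydev_chan {
        struct mydev_buf_desc *ring;
        unsigned int           count;   /* number of buffers in the subpool */
        unsigned int           cur;     /* index the device is filling right now */
    };

    /* Hypothetical helper: program the device's DMA registers and hit "start". */
    static void mydev_hw_start_dma(struct mydev_buf_desc *d)
    {
        /* device-specific register writes go here */
    }

    /* Completion interrupt: mark the finished buffer and immediately start DMA
     * into the next one, so the device never sits idle waiting on the app. */
    static irqreturn_t mydev_isr(int irq, void *data)
    {
        struct mydev_chan *ch = data;
        struct mydev_buf_desc *done = &ch->ring[ch->cur];

        done->flags |= MYDEV_BUF_DONE;            /* app can now process it in place */

        ch->cur = (ch->cur + 1) % ch->count;      /* advance around the subpool */
        mydev_hw_start_dma(&ch->ring[ch->cur]);

        wake_up_interruptible(&mydev_waitq);      /* satisfy the "wait for completion" ioctl */
        return IRQ_HANDLED;
    }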

Another thing to talk about. Are the DMA registers on your devices double buffered or not? I'm assuming that your devices don't support sophisticated scatter/gather lists and are relatively simple.

In my particular case, in rev 1 of the H/W the DMA registers were not double buffered.

So, after starting DMA on buffer 0, the driver had to wait until the completion interrupt for buffer 0 before setting the DMA registers up for the next transfer to buffer 1. Thus, the driver had to "race" to do the setup for the next DMA [and had a very short window of time to do so]. After starting buffer 0, if the driver had changed the DMA registers on the device, it would have disrupted the already active request.

We fixed this in rev 2 with double buffering. When the driver set up the DMA regs, it would hit the "start" port. All the DMA ports were immediately latched by the device. At this point, the driver was free to do the full setup for buffer 1, and the device would automatically switch to it [without driver intervention] when buffer 0 was complete. The driver would get an interrupt, but could take almost the entire transfer time to set up the next request.

So, with a rev 1 style system, a uio approach could not have worked--it would be way too slow. With rev 2, uio might be possible, but I'm not a fan, even if it is possible.

Note: In my case, we did not use read(2) or write(2) via the device read/write callbacks at all. Everything was handled through special ioctl calls that took various structs like the one mentioned above. At an early point, we did use read/write in a manner similar to the way uio uses them. But we found the mapping to be artificial and limiting [and troublesome], so we converted over to the "only ioctl" approach.

More to the point, what are the requirements? How much data is transferred per second? How many devices, and what do they do? Are they all input, or are there output ones as well?

In my case [which did R/T processing of broadcast-quality hidef H.264 video], we were able to do processing in the driver and application space as well as in the custom FPGA logic. But we used a full [non-uio] driver approach, even though, architecturally, it looked like uio in places.

We had stringent requirements for reliability, R/T predictability, guaranteed latency. We had to process 60 video frames / second. If we ran over, even by a fraction, our customers started screaming. uio could not have done this for us.

So, you started this with a simple approach. But I might take a step back and look at the requirements, device capabilities/restrictions, alternative ways to get contiguous buffers, R/T throughput and latency, and reassess things. Does your current solution really address all the needs? Currently, you're already running into hot spots [data races between app and device] and/or limitations. Or would you be better off with a native driver that gives you more flexibility (i.e. there may be an as-yet-unknown requirement that will force a native driver anyway)?

Xilinx probably provides a suitable skeleton full driver in their SDK that you could hack up pretty quickly.

Using DMA Memory Transfer in User-Space

  • Linux's API for DMA doesn't permit memory to memory transfers. It's only for communication between devices and memory. Look in Documentation/DMA-API.txt for more details.

  • At the hardware level, the x86 DMA controller doesn't allow memory-to-memory transfers. This has been discussed here: DMA transfer RAM-to-RAM

  • Given that the memory bus is usually slower than the CPU, what benefit would there be in launching a kernel-driven memory copy? You'd still have to wait for the transfer to finish, and its duration would still be determined by the memory bandwidth, exactly as with a CPU-driven copy.

  • If your program's performance depends solely on memory-to-memory copy performance, it can probably be improved substantially by avoiding copies as much as possible, or by implementing a smarter scheme such as copy-on-write (see the sketch below).
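
As a trivial sketch of the "avoid the copy" idea: if the producer and consumer agree on buffer ownership, you can hand a filled buffer over by swapping pointers instead of duplicating its contents (the function and its calling convention are illustrative assumptions):

    /* Instead of memcpy(consumer_buf, producer_buf, len), transfer ownership:
     * give the consumer the filled buffer and hand the producer an empty one. */
    static void swap_buffers(void **producer_buf, void **consumer_buf)
    {
        void *tmp = *producer_buf;

        *producer_buf = *consumer_buf;
        *consumer_buf = tmp;
    }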


