How to Improve Performance of Linux Pipes

Is there a way to improve performance of linux pipes?

Have you tried with smaller blocks?

When I try it on my own workstation, I notice a consistent improvement when lowering the block size. It is only in the realm of 10% in my test, but still an improvement; you are looking for 100%.

As it turns out on further testing, really small block sizes seem to do the trick:

I tried

dd if=/dev/zero bs=32k count=256000 | dd of=/dev/null bs=32k
256000+0 records in
256000+0 records out
256000+0 records in
256000+0 records out
8388608000 bytes (8.4 GB) copied, 1.67965 s, 5.0 GB/s
8388608000 bytes (8.4 GB) copied, 1.68052 s, 5.0 GB/s

And with your original

dd if=/dev/zero bs=8M count=1000 | dd of=/dev/null bs=8M
1000+0 records in
1000+0 records out
1000+0 records in
1000+0 records out
8388608000 bytes (8.4 GB) copied, 6.25782 s, 1.3 GB/s
8388608000 bytes (8.4 GB) copied, 6.25203 s, 1.3 GB/s

5.0 / 1.3 ≈ 3.8, so that is a sizable factor.
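If you want to reproduce the effect without dd, here is a minimal C sketch under the same assumptions as the test above (32 KiB blocks, the same 8388608000 bytes of zeroes); the block size is a #define so you can swap in 8 MiB and compare:

/* Minimal pipe-throughput test: the parent writes BLOCK-sized buffers,
 * the child reads and discards them. Compile with: cc -O2 -o pipebench pipebench.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

#define BLOCK (32 * 1024)            /* try 8 * 1024 * 1024 to compare */
#define TOTAL 8388608000ULL          /* same 8.4 GB as the dd test */

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {                  /* child: drain the pipe until EOF */
        char *rbuf = malloc(BLOCK);
        close(fd[1]);
        while (read(fd[0], rbuf, BLOCK) > 0)
            ;
        _exit(0);
    }

    close(fd[0]);
    char *wbuf = calloc(1, BLOCK);   /* zero-filled, like /dev/zero */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    unsigned long long sent = 0;
    while (sent < TOTAL) {
        ssize_t n = write(fd[1], wbuf, BLOCK);
        if (n <= 0) { perror("write"); break; }
        sent += (unsigned long long)n;
    }
    close(fd[1]);                    /* EOF for the child */
    wait(NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%llu bytes in %.3f s = %.2f GB/s\n", sent, secs, sent / secs / 1e9);
    free(wbuf);
    return 0;
}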

How to output a fixed buffer as fast as possible?

Well, it seems that the Linux scheduler and I/O priorities had a big role in the slowdown.

Also, Spectre and other CPU vulnerability mitigations came into play.

After further optimization, to achieve a faster speed I had to tune these things:

1) program nice level (nice -n -20)
2) program ionice level (ionice -c 1 -n 7)
3) pipe size increased 8 times (see the fcntl() sketch after the scheduler settings below)
4) disable CPU mitigations by adding "pti=off spectre_v2=off l1tf=off" to the kernel command line
5) tuning the Linux scheduler:

echo -n -1 >/proc/sys/kernel/sched_rt_runtime_us
echo -n -1 >/proc/sys/kernel/sched_rt_period_us
echo -n -1 >/proc/sys/kernel/sched_rr_timeslice_ms
echo -n 0 >/proc/sys/kernel/sched_tunable_scaling
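For item 3, a minimal sketch of growing the pipe's kernel buffer with fcntl(F_SETPIPE_SZ), available on Linux 2.6.35 and later; the 512 KiB request assumes the usual 64 KiB default capacity, so 8x that:

/* Grow a pipe's kernel buffer with F_SETPIPE_SZ (Linux-specific).
 * Compile with: cc -O2 -o pipesz pipesz.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); return 1; }

    int before = fcntl(fd[1], F_GETPIPE_SZ);

    /* The default is usually 64 KiB; request 8x that (512 KiB).
     * The kernel rounds up to a power of two and caps unprivileged
     * processes at /proc/sys/fs/pipe-max-size. */
    if (fcntl(fd[1], F_SETPIPE_SZ, 512 * 1024) == -1)
        perror("F_SETPIPE_SZ");

    int after = fcntl(fd[1], F_GETPIPE_SZ);
    printf("pipe buffer: %d -> %d bytes\n", before, after);
    return 0;
}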

Now the program outputs (on the same pc) 8.00 GB/sec!

If you have other ideas you're welcome to contribute.

Can I get a faster output pipe than /dev/null?

Output to /dev/null is implemented in the kernel, which is pretty bloody fast. The output pipe isn't your problem now, it's the time it takes to build the strings that are getting sent to /dev/null. I would recommend you go through the program and comment out (or guard with if $be_verbose) all the lines that are useless print statements. I'm pretty sure that'll give you a noticeable speedup.

Performance of sockets vs pipes

Ken is right. Named pipes are definitely faster on Windows. On UNIX and Linux, you'd want a Unix domain socket (UDS) or a local pipe. Same thing, different name.

Anything other than network sockets will be faster for local communication. This includes memory-mapped files, local pipes, shared memory, COM, etc.

Achieving shell-like pipeline performance in Python

You're timing it wrong. Your perf_counter() calls don't start and stop a timer; they just return a number of seconds since some arbitrary starting point. That starting point probably happens to be the first perf_counter() call here, but it could be any point, even one in the future.

The actual time taken by the subprocess.PIPE method is 4.862174164 - 2.412427189 = 2.449746975 seconds, not 4.862174164 seconds. This timing does not show a measurable performance penalty from subprocess.PIPE.

Linux Pipes as Input and Output

You need to be quite careful with the plumbing:

  1. Call pipe() twice, once for the pipe-to-child and once for the pipe-from-child, yielding 4 file descriptors.
  2. Call fork().
  3. In child:

    • Call close() on standard input (file descriptor 0).
    • Call dup() - or dup2() - to make read end of pipe-to-child into standard input.
    • Call close() on read end of pipe-to-child.
    • Call close() on write end of pipe-to-child.
    • Call close() on standard output (file descriptor 1).
    • Call dup() - or dup2() - to make the write end of pipe-from-child into standard output.
    • Call close() on write end of pipe-from-child.
    • Call close() on read end of pipe-from-child.
    • Execute the required program.
  4. In parent:

    • Call close on read end of pipe-to-child.
    • Call close on write end of pipe-from-child.
    • Loop sending data to the child on the write end of pipe-to-child and reading data from the child on the read end of pipe-from-child.
    • When no more to send to child, close write end of pipe-to-child.
    • When all data received, close read end of pipe-from-child.

Note how many closes there are, especially in the child. If you use dup2(), you don't have to close standard input and output explicitly first; dup() only works correctly here if you do the explicit closes, because it reuses the lowest unused file descriptor. Also note that neither dup() nor dup2() closes the file descriptor that is duplicated. If you omit closing the pipes, then neither program can detect EOF correctly; the fact that the current process can still write to a pipe means that there is no EOF on the pipe, and the program will hang indefinitely.

Note that this solution does not alter standard error for the child; it goes to the same place as standard error for the parent. Often, this is correct. If you have a specific requirement that error messages from the child are handled differently, then take appropriate action on the child's standard error too.
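Put together, a minimal C sketch of the steps above, using dup2(); the child program (tr) and the test string are just placeholders:

/* Bidirectional pipe plumbing: parent <-> child, following the steps above.
 * The child program (tr) and the test data are placeholders. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    int to_child[2], from_child[2];           /* [0] = read end, [1] = write end */
    if (pipe(to_child) == -1 || pipe(from_child) == -1) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid == -1) { perror("fork"); return 1; }

    if (pid == 0) {                           /* child */
        dup2(to_child[0], STDIN_FILENO);      /* read end of pipe-to-child -> stdin */
        dup2(from_child[1], STDOUT_FILENO);   /* write end of pipe-from-child -> stdout */
        close(to_child[0]);  close(to_child[1]);
        close(from_child[0]); close(from_child[1]);
        execlp("tr", "tr", "a-z", "A-Z", (char *)NULL);
        perror("execlp");                     /* only reached if exec fails */
        _exit(127);
    }

    /* parent */
    close(to_child[0]);                       /* not reading from pipe-to-child */
    close(from_child[1]);                     /* not writing to pipe-from-child */

    const char msg[] = "hello through the pipe\n";
    write(to_child[1], msg, sizeof msg - 1);
    close(to_child[1]);                       /* EOF for the child's stdin */

    char buf[256];
    ssize_t n;
    while ((n = read(from_child[0], buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);
    close(from_child[0]);

    waitpid(pid, NULL, 0);
    return 0;
}

Note that this sketch writes everything and closes the pipe-to-child before reading; if you keep both directions open and loop as described in step 4, use poll()/select() or non-blocking I/O so that a full pipe in one direction cannot deadlock both processes.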

pipes vs tmpfiles. What is better and why?

Use a pipe where possible, unless you expect large amounts of input to build up in the stream without being read. A pipe keeps data in RAM, whereas a temporary file requires filesystem operations; an fsync() on a file is much more expensive than anything you do with a pipe. A pipe is also less vulnerable to security issues caused by race conditions.

If your application cannot use pipe semantics (requires a filesystem path for its output or a similar problem), try using a "named pipe" (also called a FIFO).
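A minimal sketch of creating and using a FIFO from C; the path /tmp/demo.fifo is just an example:

/* Create a named pipe (FIFO) and hand its path to code that expects a file.
 * The path is arbitrary; clean it up with unlink() when done. */
#include <stdio.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/demo.fifo";

    if (mkfifo(path, 0600) == -1) {           /* shows up in the filesystem */
        perror("mkfifo");
        return 1;
    }

    /* Opening for writing blocks until some reader opens the FIFO,
     * e.g. `cat /tmp/demo.fifo` in another terminal. */
    int fd = open(path, O_WRONLY);
    if (fd == -1) { perror("open"); unlink(path); return 1; }

    const char msg[] = "data through a named pipe\n";
    write(fd, msg, sizeof msg - 1);

    close(fd);
    unlink(path);                             /* remove the filesystem entry */
    return 0;
}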

When to use Pipes vs When to use Shared Memory

Essentially, pipes - whether named or anonymous - are used like message passing. Someone sends a piece of information to the recipient and the recipient can receive it. Shared memory is more like publishing data - someone puts data in shared memory and the readers (potentially many) must use synchronization e.g. via semaphores to learn about the fact that there is new data and must know how to read the memory region to find the information.

With pipes the synchronization is simple and built into the pipe mechanism itself: your reads and writes block and unblock the app when something interesting happens. With shared memory it is easier to work asynchronously and check for new data only once in a while, but at the cost of much more complex code. You can also get many-to-many communication, but it requires more work again. For the same reasons, debugging pipe-based communication is easier than debugging shared memory.

A minor difference is that FIFOs are visible directly in the filesystem, while shared memory regions need special tools like ipcs for their management in case you, e.g., create a shared memory segment but your app dies and doesn't clean up after itself (the same goes for semaphores and many other synchronization mechanisms which you might need to use together with shared memory).

Shared memory also gives you more control over buffering and resource use: within the limits allowed by the OS, it is you who decides how much memory to allocate and how to use it. With pipes, the OS controls things automatically, so once again you lose some flexibility but are relieved of much work.

Summary of the most important points: pipes for one-to-one communication, less coding, and letting the OS handle things; shared memory for many-to-many, more manual control, but at the cost of more work and harder debugging.
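For comparison, here is a minimal sketch of the writer side of a POSIX shared-memory exchange in C; the names /demo_shm and /demo_sem are placeholders, and the point is that the signalling a pipe gives you for free has to be added by hand:

/* Publisher side of a shared-memory exchange: put data in a POSIX shared
 * memory object and post a semaphore so readers know there is new data.
 * Link with -lrt -lpthread on older glibc. The names are placeholders. */
#include <fcntl.h>
#include <semaphore.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/demo_shm"
#define SEM_NAME "/demo_sem"
#define SHM_SIZE 4096

int main(void)
{
    /* Create (or open) the shared memory region and size it explicitly:
     * unlike a pipe, you decide how much memory to use. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return 1; }
    if (ftruncate(fd, SHM_SIZE) == -1) { perror("ftruncate"); return 1; }

    char *region = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    /* A pipe blocks readers for you; with shared memory you need an
     * explicit signal, here a named semaphore. */
    sem_t *sem = sem_open(SEM_NAME, O_CREAT, 0600, 0);
    if (sem == SEM_FAILED) { perror("sem_open"); return 1; }

    snprintf(region, SHM_SIZE, "new data is ready");
    sem_post(sem);                            /* wake up a waiting reader */

    /* A reader would sem_wait(sem) and then read from the same region.
     * If nobody calls shm_unlink()/sem_unlink(), the objects linger under
     * /dev/shm after the process dies, which is the cleanup issue noted above. */
    munmap(region, SHM_SIZE);
    close(fd);
    sem_close(sem);
    return 0;
}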


