Why Using Pipe for Sort (Linux Command) Is Slow

Why using pipe for sort (linux command) is slow?

When reading from a pipe, sort assumes that the file is small, and for small files parallelism isn't helpful. To get sort to utilize parallelism you need to tell it to allocate a large main memory buffer using -S. In this case the data file is about 8GB, so you can use -S8G. However, at least on your system with 128GB of main memory, method 2 may still be faster.
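As a sketch with generated stand-in data (shuf and --parallel are GNU extensions; myBigFile.tmp in the question would be the real input, with -S sized accordingly, e.g. -S8G):

```shell
# Generate sample data, then sort it through a pipe as in method 1.
seq 100000 | shuf > data.tmp
# -S64M gives sort a 64 MiB main-memory buffer so the piped input can be
# sorted in memory; --parallel sets the number of sorting threads.
awk '{print $1}' data.tmp | sort -n -S64M --parallel=4 > sorted.tmp
head -n 3 sorted.tmp
```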

This is because sort in method 2 can know from the size of the file that it is huge, and it can seek in the file (neither of which is possible with a pipe). Further, since you have so much memory compared to these file sizes, the data for myBigFile.tmp need not be written to disc before awk exits, and sort will be able to read the file from cache rather than disc. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that sort in method 2 knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 sort has to discover the data is huge, and it cannot use any parallelism in reading the input since it can't seek the pipe.

Fork Process / Read Write through pipe SLOW

Oops! Did you check your LAME output?

Looking at your code, in particular

static char * const k_lame_args[] = {
"--decode",
"--mp3input",
"-",
"-",
NULL
};

and

if (execv("/usr/local/bin/lame", k_lame_args) == -1) {

mean you are accidentally omitting the --decode flag: LAME sees it as argv[0] (the program name) instead of as the first real argument (argv[1]). You should use

static char * const k_lame_args[] = {
/* argv[0] */ "lame",
/* argv[1] */ "--decode",
/* argv[2] */ "--mp3input",
/* argv[3] */ "-",
/* argv[4] */ "-",
NULL
};

instead.
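You can see the same argv[0] convention from the shell: with sh -c, the word after the script becomes $0 (the program name), not a normal argument, exactly as with execv:

```shell
# In execv-style argument arrays the first element is argv[0], the program
# name; real options start at argv[1]. Here "lame" lands in $0 and the
# first actual argument sh sees is "--decode".
sh -c 'echo "argv0=$0 argv1=$1"' lame --decode
```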

I think you are seeing a slowdown because you're accidentally recompressing the MP3 audio. (I noticed this just a minute ago, so haven't checked if LAME does that if you omit the --decode flag, but I believe it does.)

How could the UNIX sort command sort a very large file?

The Algorithmic Details of UNIX Sort Command page says Unix sort uses an external R-way merge sorting algorithm. The link goes into more detail, but in essence it divides the input into smaller portions that fit into memory, sorts each portion, and then merges the sorted portions together at the end.
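The same idea can be sketched by hand with split and sort -m (chunk size and filenames here are arbitrary; shuf is GNU coreutils):

```shell
# External R-way merge by hand: split the input into runs that would fit
# in memory, sort each run, then merge the sorted runs with sort -m.
seq 1000 | shuf > big.txt
split -l 250 big.txt chunk.          # four 250-line runs: chunk.aa, chunk.ab, ...
for f in chunk.*; do sort -n "$f" -o "$f"; done
sort -n -m chunk.* > merged.txt      # -m only merges; inputs must already be sorted
```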

Pipelining cut sort uniq

Use

cut -f 2 practice.sam | sort | uniq -c

In your original code, you're redirecting the output of cut to field2.txt and at the same time, trying to pipe the output into sort. That won't work (unless you use tee). Either separate the commands as individual commands (e.g., use ;) or don't redirect the output to a file.

Ditto the second half, where you write the output to sortedfield2.txt and thus end up with nothing going to stdout, and nothing being piped into uniq.

So an alternative could be:

cut -f 2 practice.sam > field2.txt ; sort -o sortedfield2.txt field2.txt ; uniq -c sortedfield2.txt

which is the same as

cut -f 2 practice.sam > field2.txt 
sort -o sortedfield2.txt field2.txt
uniq -c sortedfield2.txt
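If you actually want both the intermediate file and the pipeline output in one command, tee duplicates the stream. A sketch with a stand-in for practice.sam:

```shell
# Stand-in input: tab-separated lines where field 2 is a flag value.
printf 'r1\t16\nr2\t0\nr3\t16\n' > practice.sam
# tee writes its stdin to field2.txt AND passes it through to stdout,
# so the file is kept while the data continues into sort and uniq.
cut -f 2 practice.sam | tee field2.txt | sort | uniq -c
```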

Linux Shell Command: Find. How to Sort and Exec without using Pipes?

With BSD find

A -s argument is available to request lexicographic sort order.

find . -s -type f -exec md5sum -- '{}' +

With GNU find

Use NUL delimiters to allow filenames to be processed unambiguously. Assuming you have GNU tools:

find . -type f -print0 | sort -z | xargs -0 md5sum

Aren't Named Pipes in the Filesystem slow?

Aren't Named Pipes in the Filesystem slow?

They're no slower than any other sort of pipe.

isn't it better to create a "buffered" pipe in the memory

If you aren't memory constrained, then yes (see older OS link below).

[...] or isn't it writing onto the disk?

Your guess is correct - on many modern operating systems data going into a named pipe is not being written to the disk; the filesystem is just the namespace that holds something that tells you where the ends of the pipe can be found. From the Linux man page for pipe:

Note: although FIFOs have a pathname in the filesystem, I/O on FIFOs does not involve operations on the underlying device (if there is one).

There are older operating systems that buffer pipe data within a filesystem, but on such systems all pipes go through the filesystem, not just named ones, so given your question's phrasing I suspect this is a tangent.
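A quick way to see the rendezvous behavior (a sketch; any writable directory works):

```shell
# The FIFO's path is only a rendezvous point; the bytes travel through a
# kernel pipe buffer, not the underlying device.
mkfifo demo.fifo
cat demo.fifo > out.txt &       # reader blocks until a writer opens the FIFO
echo "hello through a pipe" > demo.fifo
wait                            # let the background cat drain the pipe and exit
cat out.txt
```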

Turn off buffering in pipe

You can use the unbuffer command (which comes as part of the expect package), e.g.

unbuffer long_running_command | print_progress

unbuffer connects to long_running_command via a pseudoterminal (pty), which makes the system treat it as an interactive process, so its output is line-buffered instead of block-buffered (stdio blocks are typically 4 kiB), which is the likely cause of the delay.

For longer pipelines, you may have to unbuffer each command (except the final one), e.g.

unbuffer x | unbuffer -p y | z
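If unbuffer isn't available but you have GNU coreutils, stdbuf is an alternative; note it only affects programs that use default C stdio buffering. In this sketch, seq stands in for the real long-running producer:

```shell
# stdbuf -oL forces line-buffered stdout in the launched program, so each
# line reaches the pipe as soon as it is printed instead of waiting for a
# full stdio block to fill up.
stdbuf -oL seq 3 | cat
```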

Linux piping & loop

You could use a named pipe/FIFO:

mkfifo cmd3-to-cmd1
cmd1 < cmd3-to-cmd1 | cmd2 | cmd3 >> cmd3-to-cmd1

