Sort & Uniq in Linux Shell

Sort & uniq in Linux shell

Using sort -u does less I/O than sort | uniq, but the end result is the same. In particular, if the file is big enough that sort has to create intermediate files, there's a decent chance that sort -u will use slightly fewer or slightly smaller intermediate files, since it can eliminate duplicates while sorting each batch. If the data is highly duplicative, this can be beneficial; if in fact there are few duplicates, it won't make much difference (definitely a second-order performance effect, compared to the first-order effect of the pipe).
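
As a rough sketch of the equivalence (big.txt and out.txt are just placeholder names here):

sort big.txt | uniq > out.txt   # sort everything, then a second process strips adjacent duplicates
sort -u big.txt > out.txt       # same result; duplicates are dropped while sorting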

Note that there are times when piping is appropriate. For example:

sort FILE | uniq -c | sort -n

This sorts the file by the number of occurrences of each line, with the most frequently repeated lines appearing last. (It wouldn't surprise me to find that this combination, which is idiomatic for Unix or POSIX, can be squished into one complex sort command with GNU sort.)
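
For instance, with a small made-up input, the pipeline prints each distinct line prefixed with its count, least frequent first:

$ printf 'apple\nbanana\napple\napple\nbanana\ncherry\n' | sort | uniq -c | sort -n
      1 cherry
      2 banana
      3 apple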

There are times when not using the pipe is important. For example:

sort -u -o FILE FILE

This sorts the file 'in situ'; that is, the output file is specified by -o FILE, and this operation is guaranteed safe (the file is read before being overwritten for output).
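
Contrast that with plain shell redirection, which is not safe here, because the shell truncates the output file before sort ever reads it:

sort -u FILE > FILE     # wrong: the shell truncates FILE before sort can read it
sort -u -o FILE FILE    # safe: sort reads FILE completely before writing the output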

Difference between using the uniq command with sort or without it in Linux

uniq removes adjacent duplicates. If you want to omit duplicates that are not adjacent, you'll have to sort the data first.
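
A quick illustration with a few made-up lines:

$ printf 'a\nb\na\na\n' | uniq          # only adjacent duplicates are collapsed
a
b
a
$ printf 'a\nb\na\na\n' | sort | uniq   # sorting first makes all duplicates adjacent
a
b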

Using Linux cut, sort and uniq

You can add a delimiter, which is a comma in your case:

cut -f 3 -d, list.txt | sort | uniq

Also note that -c selects by character position rather than by field; fields are selected with -f.

To strip the leading spaces, you can pipe it all through awk '{print $1}', i.e.

cut -f 3 -d, list.txt | awk '{print $1}' | sort | uniq

[edit]

Aaaaand: if you cut the 3rd field out, you are left with only one field after the pipe, so telling sort to sort on the 3rd field won't work; that's why I omitted it in my example. You get one field, you sort on it, and then apply uniq.
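
Putting it together with an invented list.txt (comma-separated, with a space after each comma):

$ cat list.txt
1, alpha, red
2, beta, blue
3, gamma, red
$ cut -f 3 -d, list.txt | awk '{print $1}' | sort | uniq
blue
red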

How to apply sort and uniq having the same input and output file in Unix?

Edit --

I wrote the below when I was half asleep ;-) and there's a great shortcut for your case:

  sort -u -o file file

The -u option makes the sorted output unique, and as mentioned below, -o file will save the output to any file you care to name, including the same name as the input.

If you want to do something like

  sort < file | uniq -c > uniqFileWithCounts

Then the first idea won't help you.


Don't kid yourself: even when you use sort -o file file to reuse the same filename for the sorted output, behind the scenes the system has to write all of the data to a temporary file and then rename it to the file specified by -o file. (sort also writes intermediate sort data to the /tmp directory and deletes it when the final output is complete.)

So your best bet is something like

sort <reuniune1.txt | uniq > uniqd.txt && mv uniqd.txt reuniune1.txt 

This will only overwrite reuniune1.txt if the sort | uniq process exits without error.

IHTH

Calling uniq and sort in different orders in shell

The only correct order is to call uniq after sort, since the man page for uniq says:

Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output).

Therefore it should be

grep 'somePattern' | sort | uniq
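
A small made-up input shows why the other order fails: uniq run before sort only collapses duplicates that happen to be adjacent:

$ printf 'x\ny\nx\n' | uniq | sort    # wrong order: the duplicate 'x' survives
x
x
y
$ printf 'x\ny\nx\n' | sort | uniq    # right order: duplicates become adjacent, then collapse
x
y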

What is the difference between 'sort -u' and 'uniq'?

I'm not sure that it's about availability. Most systems I've ever seen have sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001 and its sort has the -u option.

Technically, using a Linux pipe (|) launches a subshell and is going to be more resource intensive, as it requests multiple PIDs from the OS.

If you go to the source code for sort, which comes in the coreutils package, you can see that it actually just skips printing duplicates as it's printing its own sorted list and doesn't make use of the independent uniq code.

To see how it works, look at sort's source and find the functions below this comment:

 /* If uniquified output is turned on, output only the first of
    an identical series of lines.  */

Although I believe sort -u should be faster, the performance gains are really going to be minimal unless you're running sort | uniq on huge files, since the separate uniq process has to read through the entire sorted output again.
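
If you want to measure the difference on your own data, a rough comparison looks like this (big.txt is a placeholder for whatever large file you have; timings will vary by system):

time sort big.txt | uniq > /dev/null
time sort -u big.txt > /dev/null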

Sorting by unique values of multiple fields in UNIX shell script

Since you are apparently O.K. with randomly choosing among the values for dir, day, TI, and stn, you can write:

sort -u -t ';' -k 1,1 -k 6,6 -s < input_file > output_file

Explanation:

  • The sort utility, "sort lines of text files", lets you sort/compare/merge lines from files. (See the GNU Coreutils documentation.)
  • The -u or --unique option, "output only the first of an equal run", tells sort that if two input-lines are equal, then you only want one of them.
  • The -k POS1[,POS2] or --key=POS1[,POS2] option, "start a key at POS1 (origin 1), end it at POS2 (default end of line)", tells sort where the "keys" are that we want to sort by. In our case, -k 1,1 means that one key consists of the first field (from field 1 through field 1), and -k 6,6 means that one key consists of the sixth field (from field 6 through field 6).
  • The -t SEP or --field-separator=SEP option tells sort that we want to use SEP — in our case, ';' — to separate and count fields. (Otherwise, it would think that fields are separated by whitespace, and in our case, it would treat the entire line as a single field.)
  • The -s or --stable option, "stabilize sort by disabling last-resort comparison", tells sort that we only want to compare lines in the way that we've specified; if two lines have the same above-defined "keys", then they're considered equivalent, even if they differ in other respects. Since we're using -u, that means that one of them will be discarded. (If we weren't using -u, it would just mean that sort wouldn't reorder them with respect to each other.)
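
A small worked example (the data is invented; only fields 1 and 6 act as keys, so the second line is discarded as a duplicate of the first):

$ cat input_file
a;n;mon;1;s1;X
a;s;tue;2;s2;X
b;e;wed;3;s3;Y
$ sort -u -t ';' -k 1,1 -k 6,6 -s < input_file
a;n;mon;1;s1;X
b;e;wed;3;s3;Y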

