Sort & uniq in Linux shell
Using sort -u does less I/O than sort | uniq, but the end result is the same. In particular, if the file is big enough that sort has to create intermediate files, there's a decent chance that sort -u will use slightly fewer or slightly smaller intermediate files, since it can eliminate duplicates as it sorts each set. If the data is highly duplicative, this could be beneficial; if there are in fact few duplicates, it won't make much difference (definitely a second-order performance effect, compared to the first-order effect of the pipe).
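As a quick check that the two forms agree, here is a minimal sketch. The file name fruit.txt and its contents are made up for illustration; any text file would do.

```shell
# Hypothetical sample data in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' pear apple pear fig apple > "$tmpdir/fruit.txt"

# Two processes connected by a pipe.
with_pipe=$(sort "$tmpdir/fruit.txt" | uniq)

# One process that drops duplicates while sorting.
with_flag=$(sort -u "$tmpdir/fruit.txt")

if [ "$with_pipe" = "$with_flag" ]; then
    echo "same output"
fi
```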
Note that there are times when the piping is appropriate. For example:
sort FILE | uniq -c | sort -n
This sorts the file into order of the number of occurrences of each line in the file, with the most repeated lines appearing last. (It wouldn't surprise me to find that this combination, which is idiomatic for Unix or POSIX, can be squished into one complex 'sort' command with GNU sort.)
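To see the counting idiom at work, here is a small sketch on made-up data (the file letters.txt is an assumption for the example, not something from the question):

```shell
# Hypothetical input: "b" occurs three times, "a" twice, "c" once.
tmpdir=$(mktemp -d)
printf '%s\n' b a c b a b > "$tmpdir/letters.txt"

# sort groups identical lines together, uniq -c prefixes each distinct
# line with its count, and sort -n orders the result by that count.
counts=$(sort "$tmpdir/letters.txt" | uniq -c | sort -n)
printf '%s\n' "$counts"
```

The line for "c" comes out first and the line for "b" last. The exact padding in front of the counts can differ between uniq implementations.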
There are times when not using the pipe is important. For example:
sort -u -o FILE FILE
This sorts the file 'in situ'; that is, the output file is specified by -o FILE, and this operation is guaranteed safe (the file is read before being overwritten for output).
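A minimal sketch of the in-place form, using a made-up file names.txt in a scratch directory:

```shell
# Hypothetical sample file.
tmpdir=$(mktemp -d)
file="$tmpdir/names.txt"
printf '%s\n' zoe adam zoe bob adam > "$file"

# Sort and de-duplicate, writing the result back over the input file.
sort -u -o "$file" "$file"
cat "$file"
```

By contrast, sort -u FILE > FILE does not work: the shell truncates FILE for the redirection before sort gets to read it.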
Difference between using the uniq command with sort or without it in linux
uniq removes adjacent duplicates. If you want to omit duplicates that are not adjacent, you'll have to sort the data first.
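A small sketch of the difference, on made-up input where the last "a" is separated from the first two by a "b":

```shell
# Without sorting, only the adjacent pair of "a" lines is collapsed,
# so the later "a" survives: the result is a, b, a.
unsorted=$(printf '%s\n' a a b a | uniq)

# After sorting, all "a" lines are adjacent, so the result is a, b.
sorted=$(printf '%s\n' a a b a | sort | uniq)

printf 'uniq alone:\n%s\n' "$unsorted"
printf 'sort | uniq:\n%s\n' "$sorted"
```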
using Linux cut, sort and uniq
You can add a delimiter, which is a comma in your case:
cut -f 3 -d, list.txt | sort | uniq
Note that -c specifies character positions rather than fields; fields are selected with -f.
To strip leading spaces, you can pipe the result through, e.g., awk '{print $1}':
cut -f 3 -d, list.txt | awk '{print $1}' | sort | uniq
[edit]
Aaaaand: if you cut the 3rd field out, you are left with only one field after the pipe, so sorting on the 3rd field won't work, which is why I omitted it in my example. You get one field, you just sort on it and then apply uniq.
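Here is a sketch of that pipeline on invented data. The contents of list.txt below are an assumption (comma-separated, with a space after each comma), since the original file isn't shown:

```shell
# Hypothetical list.txt: three comma-separated fields per line.
tmpdir=$(mktemp -d)
cat > "$tmpdir/list.txt" <<'EOF'
1, alice, red
2, bob, blue
3, carol, red
EOF

# cut keeps only field 3 (" red", " blue", " red"), awk drops the
# leading space, and sort | uniq leaves one line per distinct value.
colours=$(cut -f 3 -d, "$tmpdir/list.txt" | awk '{print $1}' | sort | uniq)
printf '%s\n' "$colours"
```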
How to apply sort and uniq having the same input and output file in Unix?
Edit --
I wrote the text below when I was half asleep ;-) and there's a great shortcut for your case:
sort -u -o file file
The -u option makes the sorted data unique, and as mentioned below, -o file will save the output to any file you care to name, including the same name as the input.
If you want to do something like
sort < file | uniq -c > uniqFileWithCounts
Then the first idea won't help you.
Don't kid yourself: even when you use sort -o file file to reuse the same filename for the sorted -o(utput), behind the scenes sort still has to read all of the data, holding it in memory or in temporary files, before it writes to the file specified by -o file. (For large inputs, sort also writes intermediate sort data to a temporary directory, typically /tmp, and deletes it when the final output is complete.)
So your best bet is something like
sort <reuniune1.txt | uniq > uniqd.txt && mv uniqd.txt reuniune1.txt
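This will only overwrite reuniune1.txt if the sort | uniq pipeline exits without error (strictly speaking, if uniq, the last command in the pipeline, does, since by default that is the status the shell reports).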
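A variation on the same idea, as a sketch rather than the one true way: using sort -u as a single command means the exit status checked is that of sort itself, and mktemp picks a scratch name in the same directory so that mv is a plain rename. The directory and sample contents are made up for illustration.

```shell
# Hypothetical working directory and input file.
workdir=$(mktemp -d)
target="$workdir/reuniune1.txt"
printf '%s\n' delta alpha delta charlie > "$target"

# Temporary output file next to the target.
tmpfile=$(mktemp "$workdir/uniqd.XXXXXX")

# Replace the original only if sort succeeded; otherwise clean up.
if sort -u "$target" > "$tmpfile"; then
    mv "$tmpfile" "$target"
else
    rm -f "$tmpfile"
fi
```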
IHTH
calling uniq and sort in different orders in shell
The only correct order is to call uniq after sort, since the man page for uniq says:
Discard all but one of successive identical lines from INPUT (or standard input), writing to OUTPUT (or standard output).
Therefore it should be
grep 'somePattern' | sort | uniq
What is the difference between 'sort -u' and 'uniq'?
I'm not sure that it's about availability. Most systems I've ever seen have sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001 and its sort has the -u option.
Technically, using a Linux pipe (|) launches an additional process and is going to be more resource intensive, as it requests multiple PIDs from the OS.
If you go to the source code for sort, which comes in the coreutils package, you can see that it actually just skips printing duplicates as it's printing its own sorted list and doesn't make use of the independent uniq code.
To see how it works follow the link to sort's source and see the functions below this comment:
/* If uniquified output is turned on, output only the first of
an identical series of lines. */
Although I believe sort -u should be faster, the performance gains are really going to be minimal unless you're running sort | uniq on huge files, as uniq will have to read through the entire sorted output again.
Sorting by unique values of multiple fields in UNIX shell script
Since you are apparently O.K. with randomly choosing among the values for dir, day, TI, and stn, you can write:
sort -u -t ';' -k 1,1 -k 6,6 -s < input_file > output_file
Explanation:
- The sort utility, "sort lines of text files", lets you sort/compare/merge lines from files. (See the GNU Coreutils documentation.)
- The -u or --unique option, "output only the first of an equal run", tells sort that if two input lines are equal, then you only want one of them.
- The -k POS1[,POS2] or --key=POS1[,POS2] option, "start a key at POS1 (origin 1), end it at POS2 (default end of line)", tells sort where the "keys" are that we want to sort by. In our case, -k 1,1 means that one key consists of the first field (from field 1 through field 1), and -k 6,6 means that one key consists of the sixth field (from field 6 through field 6).
- The -t SEP or --field-separator=SEP option tells sort that we want to use SEP (in our case, ';') to separate and count fields. (Otherwise, it would think that fields are separated by whitespace, and in our case, it would treat the entire line as a single field.)
- The -s or --stable option, "stabilize sort by disabling last-resort comparison", tells sort that we only want to compare lines in the way that we've specified; if two lines have the same above-defined "keys", then they're considered equivalent, even if they differ in other respects. Since we're using -u, that means that one of them will be discarded. (If we weren't using -u, it would just mean that sort wouldn't reorder them with respect to each other.)
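To make this concrete, here is a sketch on invented data. The six-column layout below is an assumption; only the first and sixth fields matter to the command, and the middle columns are filler.

```shell
# Hypothetical input: the first two lines share the same field 1 and field 6.
tmpdir=$(mktemp -d)
cat > "$tmpdir/input_file" <<'EOF'
A;n;mon;1;s1;10
A;s;tue;2;s2;10
A;n;wed;3;s3;20
B;e;thu;4;s4;10
EOF

# Keep one line per distinct (field 1, field 6) pair.
sort -u -t ';' -k 1,1 -k 6,6 -s < "$tmpdir/input_file" > "$tmpdir/output_file"

# Show just the key fields of what survived.
keys=$(cut -d ';' -f 1,6 "$tmpdir/output_file")
printf '%s\n' "$keys"
```

Three lines remain, one for each of the key pairs A;10, A;20 and B;10. The sketch deliberately does not rely on which of the two A;10 lines is kept.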