"Sort Filename | Uniq" Does Not Work on Large Files

sort filename | uniq does not work on large files

You can normalize line delimiters (convert CR+LF to LF):

sed 's/\r//' big_list.txt | sort -u

Deduplicating lines in a large file fails with sort and uniq

As per Kamil Cuk, let's try this solution:

sort -u myfile.json 

Is the file really JSON? Sorting a JSON file can lead to dubious results. You may also try splitting the file, as suggested by Mark Setchell: sort each split file, then merge the sorted pieces. All sorts should be done with sort -u.

Please provide a sample from myfile.json if it is indeed a JSON file, and let us hear about your results when you just use sort -u.
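
A minimal sketch of that split-then-sort-then-merge route, assuming GNU split and sort (the chunk size and filenames are illustrative):

split -l 1000000 myfile.json chunk_              # write 1M-line pieces: chunk_aa, chunk_ab, ...
for f in chunk_*; do sort -u "$f" -o "$f.sorted"; done
sort -m -u chunk_*.sorted > deduplicated.txt     # -m merges the already-sorted pieces
rm chunk_*                                       # clean up once the merged output exists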

Why does sort -u give different output from sort filename | uniq -u?

The file I use:

zsh/6 31167 % cat do_sortowania  
Marcin
Tomek
Marcin
Wojtek
Zosia
Zosia
Marcin
Krzysiek

using sort:

zsh/6 31168 % sort -u do_sortowania 
Krzysiek
Marcin
Tomek
Wojtek
Zosia

but using sort + uniq -u:

zsh/6 31170 % sort do_sortowania|uniq -u
Krzysiek
Tomek
Wojtek

Now, two answers:
Short:

zsh/6 31171 % sort do_sortowania|uniq -c
1 Krzysiek
3 Marcin
1 Tomek
1 Wojtek
2 Zosia

Long:
As you can see, uniq -u returns only the lines that appear exactly once: Krzysiek, Tomek, Wojtek.

Marcin and Zosia appear 3 and 2 times respectively, so uniq -u omits them.

P.S.

zsh/6 31172 % cat do_sortowania|uniq -u
Marcin
Tomek
Marcin
Wojtek
Marcin
Krzysiek

because uniq works correctly only on sorted input, so:

Marcin
Marcin
Tomek

will be uniqued to

Marcin
Tomek

but

Marcin
Tomek
Marcin

won't, because uniq compares each row only to the next one; it assumes the input is already sorted.
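
A quick way to see this in the shell, using the names from the example above:

printf 'Marcin\nTomek\nMarcin\n' | uniq          # prints all three lines: no adjacent duplicates
printf 'Marcin\nTomek\nMarcin\n' | sort | uniq   # prints Marcin and Tomek once each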

Finding a uniq -c substitute for big files

Use awk

awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile.txt

This is O(n) in time, and O(unique lines) in space.
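
Note that the awk hash emits the lines in effectively arbitrary order; if you want the counts ordered, pipe through sort, for example:

awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile.txt | sort -rn | head    # most frequent lines first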

How to get unique lines from a very large file in Linux?

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
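
With GNU sort you can also tune how it handles a big file; the values below are only illustrative:

sort -u -S 2G -T /tmp --parallel=4 bigfile.txt > unique.txt    # -S memory buffer, -T directory for temporary spill files, --parallel number of sort threads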

How to get unique rows by column in a large file via the sort command?

$ awk -F, '!array[$2]++' input_file
CONT,000-00-0000,GRAM
BEVE,507-66-6876,IGHT
CONT,111-11-1111,GRAM
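
If you specifically want a sort-based route as in the title, a rough equivalent is shown below; note that, unlike the awk one-liner, it reorders the output and does not guarantee which of the duplicate rows is kept:

sort -t, -k2,2 -u input_file    # -t, sets the comma delimiter, -k2,2 restricts comparison (and uniqueness) to column 2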

Why isn't sort -u or uniq removing duplicates in concatenated text files?

Many thanks to @Andrew Henle - I knew it would be something simple!

Indeed, using hexdump -c combined2.txt I saw that some lines ended with a \n and some with \r\n.
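
A quick way to confirm the mixed line endings before converting, assuming bash or zsh for the $'\r' quoting:

grep -c $'\r' combined2.txt    # counts the lines that contain a carriage return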

So I downloaded dos2unix and ran

dos2unix combined2.txt
sort -u combined2.txt > combined3.txt

and it's all good!

Thanks again, Andrew!

Efficient sort | uniq for the case of a large number of duplicates

I'm not sure what the performance difference will be, but you can replace the sort | uniq -c with a simple awk script. Since you have many duplicates and it hashes instead of sorting, I'd imagine it's faster:

 awk '{c[$0]++}END{for(l in c){print c[l], l}}' input.txt | sort -n

