sort filename | uniq does not work on large files
You can normalize line endings (convert CR+LF to LF):
sed 's/\r//' big_list.txt | sort -u
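Before normalizing, you can confirm that carriage returns are actually the problem by dumping a few lines as raw bytes (a quick check using od from coreutils; big_list.txt follows the example above):
head -n 3 big_list.txt | od -c | head
If \r \n pairs show up in the output, the file has DOS line endings.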
Deduplicating lines in a large file fails with sort and uniq
As Kamil Cuk suggested, let's try this solution first:
sort -u myfile.json
Is the file really JSON? Sorting a JSON file can lead to dubious results. You may also try splitting the file as suggested by Mark Setchell; you can then sort each split file and merge the sorted results. All sorts should be done with sort -u.
Please provide a sample from myfile.json if it is indeed a JSON file, and let us hear about your results when you just use sort -u.
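A rough sketch of that split-and-merge approach, assuming GNU coreutils; the chunk prefix piece_ and the output name are illustrative:
split -l 1000000 myfile.json piece_                 # split into 1M-line chunks
for f in piece_*; do sort -u "$f" -o "$f"; done     # dedupe each chunk in place
sort -m -u piece_* > myfile.sorted                  # merge the already-sorted chunks
rm piece_*
sort -m merges the sorted chunks without re-sorting them, and -u drops duplicates that span chunks.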
Why does sort -u give different output from sort filename | uniq -u?
The file I use:
zsh/6 31167 % cat do_sortowania
Marcin
Tomek
Marcin
Wojtek
Zosia
Zosia
Marcin
Krzysiek
using sort -u:
zsh/6 31168 % sort -u do_sortowania
Krzysiek
Marcin
Tomek
Wojtek
Zosia
but using sort + uniq -u:
zsh/6 31170 % sort do_sortowania|uniq -u
Krzysiek
Tomek
Wojtek
Now, two answers:
Short:
zsh/6 31171 % sort do_sortowania|uniq -c
1 Krzysiek
3 Marcin
1 Tomek
1 Wojtek
2 Zosia
Long:
As you can see, uniq -u returns only the lines that appear exactly once: Krzysiek, Tomek, Wojtek. Marcin and Zosia appear 3 and 2 times respectively, so uniq -u omits them.
P.S.
zsh/6 31172 % cat do_sortowania|uniq -u
Marcin
Tomek
Marcin
Wojtek
Marcin
Krzysiek
because uniq works correctly only on sorted input, so:
Marcin
Marcin
Tomek
will be uniq'ed to
Marcin
Tomek
but
Marcin
Tomek
Marcin
won't, because uniq compares each row only to the next one, since it assumes the file is sorted.
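Note that if the goal is deduplication rather than keeping only the singletons, the pipeline equivalent of sort -u is plain uniq with no flag:
sort do_sortowania | uniq
which prints each distinct line once: Krzysiek, Marcin, Tomek, Wojtek, Zosia.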
Finding a uniq -c substitute for big files
Use awk
awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile.txt
This is O(n) in time, and O(unique lines) in space.
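Since awk's for (line in c) loop visits keys in arbitrary order, pipe the result through sort when you want ordered output, e.g. most frequent lines first (bigfile.txt as above):
awk '{c[$0]++} END {for (line in c) print c[line], line}' bigfile.txt | sort -rn | head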
How to get unique lines from a very large file in Linux?
Use sort -u instead of sort | uniq. This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
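If the file is larger than available memory, GNU sort spills to temporary files, and you can tune how it does so (a sketch; the buffer size, paths, and filenames are illustrative):
sort -u -S 4G -T /mnt/bigdisk/tmp huge.txt > unique.txt
Here -S sets the in-memory buffer size and -T puts the temporary merge files on a disk with enough free space.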
How to unique a large file by column via the sort command?
$ awk -F, '!array[$2]++' input_file
CONT,000-00-0000,GRAM
BEVE,507-66-6876,IGHT
CONT,111-11-1111,GRAM
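Since the title asks about the sort command specifically: GNU sort can also deduplicate on a single column, though unlike the awk one-liner it reorders the lines and does not guarantee which line per key survives (delimiter and key number follow the example above):
sort -t, -k2,2 -u input_file
Here -t, sets the field separator, -k2,2 compares only the second field, and -u keeps one line per distinct key.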
Why isn't sort -u or uniq removing duplicates in concatenated text files?
Many thanks to @Andrew Henle - I knew it would be something simple!
Indeed, using hexdump -c combined2.txt I saw that some lines ended with \n and some with \r\n.
So I downloaded dos2unix and ran
dos2unix combined2.txt
sort -u combined2.txt > combined3.txt
and it's all good!
Thanks again, Andrew!
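As an aside, if installing dos2unix is not an option, tr can strip the carriage returns on the way into sort (filenames follow the example above):
tr -d '\r' < combined2.txt | sort -u > combined3.txt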
efficient sort | uniq for the case of a large number of duplicates
I'm not sure what the performance difference will be, but you can replace the sort | uniq -c with a simple awk script. Since you have many duplicates and awk hashes instead of sorting, I'd imagine it's faster:
awk '{c[$0]++}END{for(l in c){print c[l], l}}' input.txt | sort -n
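Note that the trailing sort -n only has to order one line per unique value plus its count, not the whole input, so with heavy duplication it sorts far less data than sort | uniq -c would.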