How to get unique lines from a very large file in Linux?
Use sort -u instead of sort | uniq. This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
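A minimal sketch comparing the two forms on a tiny sample. The sample data and the temporary file are made up for illustration. Setting LC_ALL=C is my own assumption, not part of the answer above: it makes sort compare raw bytes, which is usually faster on large files but changes the ordering of non-ASCII text.

```shell
# Hypothetical sample data; a real input would be far larger.
tmp=$(mktemp)
printf '%s\n' pear apple pear fig apple > "$tmp"

# One process: sort drops duplicates while it sorts and merges.
with_u=$(LC_ALL=C sort -u "$tmp")

# Two processes: every duplicate line still travels through the pipe to uniq.
with_uniq=$(LC_ALL=C sort "$tmp" | uniq)

echo "$with_u"
rm -f "$tmp"
```

Both variables should hold the same three lines, so the choice between the two forms is about speed and memory, not about the result.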
Find unique lines between two files
Try:
grep -Fvf file2 file1
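This prints the lines of file1 that do not contain any line of file2, whether as the whole line or as part of it. Add -x (grep -Fxvf file2 file1) to exclude only the lines that match a line of file2 exactly.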
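A small sketch of the difference between substring and whole-line matching. The file contents are invented for the example, and the files live in a throwaway directory.

```shell
dir=$(mktemp -d)
printf '%s\n' apple pineapple pear > "$dir/file1"
printf '%s\n' apple > "$dir/file2"

# -F fixed strings, -v invert the match, -f read the patterns from file2.
# "pineapple" is dropped as well, because it contains "apple".
substring=$(grep -Fvf "$dir/file2" "$dir/file1")

# Adding -x requires each pattern to match a whole line.
whole=$(grep -Fxvf "$dir/file2" "$dir/file1")

echo "substring match: $substring"
echo "whole-line match: $whole"
rm -rf "$dir"
```

With substring matching only pear survives; with -x both pineapple and pear do.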
How to print only the unique lines in BASH?
Using awk:
awk '{seen[$0]++} END{for(i in seen) if(seen[i]==1) print i}' file
eagle
forest
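The input file for that output is not shown, so here is a sketch with a made-up input chosen to give the same two lines. It also shows that sort piped into uniq -u should give the same answer. The order of for (i in seen) is unspecified in awk, so the awk result is sorted before comparing.

```shell
# Hypothetical input: "tree" occurs twice, the other lines once.
input=$(printf '%s\n' tree eagle tree forest)

# Count every line, then print the ones seen exactly once.
via_awk=$(printf '%s\n' "$input" |
    awk '{ seen[$0]++ } END { for (i in seen) if (seen[i] == 1) print i }' |
    sort)

# Same idea with sort and uniq -u (print only lines that are not repeated).
via_uniq=$(printf '%s\n' "$input" | sort | uniq -u)

echo "$via_awk"
```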
Select unique or distinct values from a list in UNIX shell script
You might want to look at the uniq and sort applications.
./yourscript.ksh | sort | uniq
(FYI, yes, the sort is necessary in this command line; uniq only strips duplicate lines that are immediately after each other.)
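A quick sketch of why the sort matters, using a three-line input made up for the example:

```shell
# "a" appears twice, but the two copies are not on adjacent lines.
unsorted=$(printf '%s\n' a b a | uniq)
sorted=$(printf '%s\n' a b a | sort | uniq)

echo "without sort:"; echo "$unsorted"
echo "with sort:";    echo "$sorted"
```

Without the sort, uniq never sees the two copies of a next to each other, so all three lines come back.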
EDIT:
Contrary to what has been posted by Aaron Digulla in relation to uniq's command-line options:
Given the following input:
class
jar
jar
jar
bin
bin
java
uniq will output all lines exactly once:
class
jar
bin
java
uniq -d will output all lines that appear more than once, and it will print them once:
jar
bin
uniq -u will output all lines that appear exactly once, and it will print them once:
class
java
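To check the three modes yourself, something like the following should do. The input here is already grouped, so no sort is needed; with scattered duplicates you would sort first.

```shell
# The same seven lines as in the example above.
input=$(printf '%s\n' class jar jar jar bin bin java)

all=$(printf '%s\n' "$input" | uniq)         # every distinct line once
dups=$(printf '%s\n' "$input" | uniq -d)     # only lines that were repeated
singles=$(printf '%s\n' "$input" | uniq -u)  # only lines that were not repeated

echo "$all"
```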
How to count the amount of unique lines, duplicate lines and lines that appear three times in a text file
$ echo 'Donald
Donald
Lisa
John
Lisa
Donald' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
1 1
1 2
1 3
The right column is the repetition count, and the left column is the number of unique names with that repetition count. E.g. “Donald” has a repetition count of 3.
Bigger example:
echo 'Donald
Donald
Rob
Lisa
WhatAmIDoing
John
Obama
Obama
Lisa
Washington
Donald' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
4 1
2 2
1 3
Four names (“Rob”, “WhatAmIDoing”, “John”, and “Washington”) each have a repetition count of 1. Two names (“Lisa” and “Obama”) each have a repetition count of 2. One name (“Donald”) has a repetition count of 3.
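If you want the three numbers from the question directly rather than the whole histogram, one possible sketch is below. It reuses the names from the bigger example; the variable names are just for illustration. Counting with awk instead of wc -l avoids the leading spaces some wc implementations print.

```shell
names=$(printf '%s\n' Donald Donald Rob Lisa WhatAmIDoing John Obama Obama Lisa Washington Donald)
sorted=$(printf '%s\n' "$names" | sort)

# Lines that occur exactly once.
once=$(printf '%s\n' "$sorted" | uniq -u | awk 'END { print NR }')

# Distinct lines that occur more than once.
repeated=$(printf '%s\n' "$sorted" | uniq -d | awk 'END { print NR }')

# Distinct lines that occur exactly three times.
thrice=$(printf '%s\n' "$sorted" | uniq -c | awk '$1 == 3 { n++ } END { print n + 0 }')

echo "once=$once repeated=$repeated thrice=$thrice"
```

Note that repeated counts Donald as well, since a line that appears three times also appears more than once.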
Extract All Unique Lines
Two nearly identical options:
Match All Lines That Are Not Repeated
(?sm)(^[^\r\n]+$)(?!.*^\1$)
The lines will be matched, but to extract them, you really want to replace the other ones.
Replace All Repeated Lines
This will work better in Notepad++:
Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1)
Replace: empty string
- (?s) activates DOTALL mode, allowing the dot to match across lines
- (?m) turns on multi-line mode, allowing ^ and $ to match on each line
- (^[^\r\n]*) captures a line to Group 1, i.e.
  - The ^ anchor asserts that we are at the beginning of a line
  - [^\r\n]* matches any chars that are not newline chars
- [\r\n] matches the newline chars
- The lookahead (?=.*^\1) asserts that we can match any number of characters .*, then...
  - ^\1 the same text as Group 1 at the start of a later line
- The first pattern uses the negative form (?!.*^\1$), which asserts the opposite: no later line is equal to Group 1
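Outside a regex-capable editor, roughly the same effect can be had with awk. This is a sketch of my own, not part of the answer above. Like the Notepad++ replacement it keeps the last occurrence of each line; the shorter awk '!seen[$0]++' keeps the first one instead. The sample input is made up, and the first command holds the whole input in memory.

```shell
input=$(printf '%s\n' a b a c b)

# Keep the last occurrence of each line, in original order.
last_kept=$(printf '%s\n' "$input" | awk '
    { line[NR] = $0; last[$0] = NR }   # remember each line and where it last appeared
    END {
        for (i = 1; i <= NR; i++)
            if (last[line[i]] == i) print line[i]
    }')

# Keep the first occurrence of each line, in original order.
first_kept=$(printf '%s\n' "$input" | awk '!seen[$0]++')

echo "$last_kept"
```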
Bash Script: count unique lines in file
You can use the uniq command to get counts of sorted repeated lines:
sort ips.txt | uniq -c
To get the most frequent results at top (thanks to Peter Jaric):
sort ips.txt | uniq -c | sort -bgr
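A sketch of the same pipeline limited to the top two entries. The addresses are invented stand-ins for the contents of ips.txt, sort -rn is used here as a plain numeric reverse sort, and the final awk only strips the padding uniq -c puts in front of the counts.

```shell
# Hypothetical addresses standing in for ips.txt.
ips=$(printf '%s\n' 10.0.0.1 10.0.0.3 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.1)

# Count, sort by count descending, keep the two most frequent lines.
top=$(printf '%s\n' "$ips" | sort | uniq -c | sort -rn | head -n 2 |
    awk '{ print $1, $2 }')

echo "$top"
```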
Build a table of unique lines in a file and the number of times each unique line was observed
Are you working on a Unix system? (This answer won't work on Windows out of the box)
I created a file named testtext.txt with the content as follows:
c
a
b
a
b
b
b
c
Then executing the following command in the terminal
sort testtext.txt | uniq -c > testcounts.txt
generates a file, testcounts.txt, with the content below.
2 a
4 b
2 c
I can't speak to how this will perform relative to other solutions, but it seems to be worth a shot.
You could also do it simultaneously across all files matching a pattern in the current directory. I made three: testtext.txt, testtext2.txt, and testtext3.txt
find . -type f -name 'testtext*' | xargs sort | uniq -c > Counts.txt
which then creates the file Counts.txt:
10 a
6 b
5 c
3 d
1 e
1 f
Alternatively (and particularly if memory usage is of concern) you could put the single file example in a simple bash script for loop to handle files one at a time. Either way, the Unix command line tools are shockingly efficient when used elegantly.
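The for loop variant could look something like this. It is only a sketch: the two small files, the .counts suffix and the merge step at the end are my own assumptions for illustration, and everything runs in a throwaway directory.

```shell
dir=$(mktemp -d)
printf '%s\n' c a b a > "$dir/testtext.txt"
printf '%s\n' b b a   > "$dir/testtext2.txt"

# Count each file separately so only one file is sorted at a time.
for f in "$dir"/testtext*.txt; do
    sort "$f" | uniq -c > "${f%.txt}.counts"
done

# Merge the per-file tables: the leading number is the count, the rest is the line.
merged=$(cat "$dir"/*.counts | awk '
    { n = $1; sub(/^ *[0-9]+ /, ""); total[$0] += n }
    END { for (k in total) print total[k], k }' | sort -k2)

echo "$merged"
rm -rf "$dir"
```

The merge step only has to hold one entry per distinct line in memory, which is usually much less than the full input.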
Credit: Unix.StackExchange: Sort and Count Number of Occurrence of Lines