Find Unique Lines

How to get unique lines from a very large file in Linux?

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
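
As a quick sanity check (using a small made-up file; the benefit only matters on large inputs):

```shell
# Hypothetical input standing in for a very large file.
printf 'b\na\nb\nc\na\n' > big.txt

# Single process; sort can discard duplicates during its merge phase.
sort -u big.txt

# Same output, but every duplicate line survives until uniq sees it.
sort big.txt | uniq
```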

Find unique lines between two files

Try:

grep -Fvf file2 file1

This prints the lines of file1 that are not matched, in whole or in part, by any line of file2.
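
Note that with -F (fixed strings) and without -x, each line of file2 can match anywhere inside a line of file1. A small sketch with hypothetical files:

```shell
printf 'apple\nbanana\ncherry\n' > file1
printf 'ban\ncherry\n' > file2

grep -Fvf file2 file1
# 'banana' is dropped too, because the pattern 'ban' matches inside it
```

Add -x (grep -Fxvf file2 file1) if you want whole-line matches only.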

How to print only the unique lines in BASH?

Using awk (here, "eagle" and "forest" are the only lines in file that occur exactly once):

awk '{!seen[$0]++};END{for(i in seen) if(seen[i]==1)print i}' file
eagle
forest
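
Because for (i in seen) iterates in an unspecified order, the output order above is arbitrary. If the original order matters, a two-pass variant (reading the file twice) preserves it; animals.txt is a made-up input here:

```shell
printf 'eagle\ndog\nforest\ndog\n' > animals.txt

# First pass (NR==FNR) counts every line; second pass prints the lines
# seen exactly once, in their original order.
awk 'NR==FNR {count[$0]++; next} count[$0]==1' animals.txt animals.txt
```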

Select unique or distinct values from a list in UNIX shell script

You might want to look at the uniq and sort applications.


./yourscript.ksh | sort | uniq

(FYI, yes, the sort is necessary in this command line; uniq only strips duplicate lines that are immediately after each other.)
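
To see why the sort matters, compare (three-line input made up for illustration):

```shell
# Without sorting, the two 'a' lines are not adjacent, so uniq keeps both.
printf 'a\nb\na\n' | uniq

# Sorting first makes the duplicates adjacent, so uniq can remove them.
printf 'a\nb\na\n' | sort | uniq
```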

EDIT:

Contrary to what has been posted by Aaron Digulla in relation to uniq's commandline options:

Given the following input:


class
jar
jar
jar
bin
bin
java

uniq will output all lines exactly once:


class
jar
bin
java

uniq -d will output all lines that appear more than once, and it will print them once:


jar
bin

uniq -u will output all lines that appear exactly once, and it will print them once:


class
java
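
The three behaviours above can be reproduced directly (the sample input already has its duplicates adjacent, so no sort is needed):

```shell
printf 'class\njar\njar\njar\nbin\nbin\njava\n' > input.txt

uniq input.txt     # every line once: class jar bin java
uniq -d input.txt  # only duplicated lines: jar bin
uniq -u input.txt  # only non-duplicated lines: class java
```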

How to count the number of unique lines, duplicate lines, and lines that appear three times in a text file

$ echo 'Donald
Donald
Lisa
John
Lisa
Donald' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
1 1
1 2
1 3

The right column is the repetition count, and the left column is the number of unique names with that repetition count. E.g. “Donald” has a repetition count of 3.

Bigger example:

echo 'Donald
Donald
Rob
Lisa
WhatAmIDoing
John
Obama
Obama
Lisa
Washington
Donald' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
4 1
2 2
1 3

Four names (“Rob”, “WhatAmIDoing”, “John”, and “Washington”) each have a repetition count of 1. Two names (“Lisa” and “Obama”) each have a repetition count of 2. One name (“Donald”) has a repetition count of 3.

Extract All Unique Lines

Two nearly identical options:

Match All Lines That Are Not Repeated

(?sm)(^[^\r\n]+$)(?!.*^\1$)

The lines will be matched, but to extract them it is easier to turn the problem around and delete the repeated lines instead.

Replace All Repeated Lines

This will work better in Notepad++:

Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1)

Replace: empty string

  • (?s) activates DOTALL mode, allowing the dot to match across lines
  • (?m) turns on multi-line mode, allowing ^ and $ to match at the start and end of each line
  • (^[^\r\n]*) captures a line to Group 1: the ^ anchor asserts that we are at the beginning of a line, and [^\r\n]* matches any characters that are not line-break characters
  • In the match pattern, the negative lookahead (?!.*^\1$) asserts that we cannot match any number of characters .* followed by ^\1$, a line identical to Group 1; in other words, the line never appears again
  • In the replace pattern, [\r\n] consumes the line break after the captured line, and the positive lookahead (?=.*^\1) asserts that the same line does appear again further down, so each repeated line is deleted until only its last occurrence remains

Bash Script: count unique lines in file

You can use the uniq command to get counts of sorted repeated lines:

sort ips.txt | uniq -c

To get the most frequent results at top (thanks to Peter Jaric):

sort ips.txt | uniq -c | sort -bgr
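
In the second sort, -b ignores leading blanks (uniq -c pads its counts with spaces), -g sorts by numeric value, and -r reverses the order so the largest counts come first. With a hypothetical ips.txt:

```shell
printf '1.1.1.1\n2.2.2.2\n1.1.1.1\n' > ips.txt

sort ips.txt | uniq -c | sort -bgr
# the address that occurs twice is listed before the one that occurs once
```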

Build a table of unique lines in a file and the number of times each unique line was observed

Are you working on a Unix system? (This answer won't work on Windows out of the box)

I created a file named testtext.txt with the content as follows:

c
a
b
a
b
b
b
c

Then executing the following command in the terminal

sort testtext.txt | uniq -c > testcounts.txt

generates a file, testcounts.txt with the content below.

2 a
4 b
2 c

I can't speak to how this will perform relative to other solutions, but it seems worth a shot.

You could also do it across all files matching a pattern in the current directory at once. I made three: testtext.txt, testtext2.txt, and testtext3.txt.

find . -type f -name 'testtext*' | xargs sort | uniq -c > Counts.txt

which then creates the file Counts.txt:

10 a
6 b
5 c
3 d
1 e
1 f

Alternatively (and particularly if memory usage is of concern) you could put the single-file example in a simple bash for loop to handle the files one at a time. Either way, the Unix command-line tools are shockingly efficient when used elegantly.
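
A minimal sketch of that loop (the file-name pattern and the counts_ output prefix are assumptions of mine):

```shell
# Count lines per file, one file at a time, so only one file's worth of
# data is being sorted at any moment.
for f in testtext*.txt; do
    sort "$f" | uniq -c > "counts_${f}"
done
```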

Credit: Unix.StackExchange: Sort and Count Number of Occurrences of Lines


