How to Filter Out All Unique Lines in a File

How do you filter out all unique lines in a file?

Remove duplicated lines:

awk '!a[$0]++' file

This is a famous awk one-liner, and there are many explanations of it online. Here is one:

This one-liner is very idiomatic. It registers the lines seen in the
associative-array "a" (arrays are always associative in Awk) and at
the same time tests if it had seen the line before. If it had seen the
line before, then a[line] > 0 and !a[line] == 0. Any expression that
evaluates to false is a no-op, and any expression that evals to true
is equal to "{ print }".
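To make the quoted explanation concrete, here is a minimal sketch that spells the one-liner out as an explicit awk action. The function name dedupe_lines and the array name seen are illustrative choices, not part of the original answer; the behavior should match awk '!a[$0]++'.

```shell
# dedupe_lines: print each distinct line once, keeping the first occurrence.
# Reads stdin, or the files given as arguments. (Hypothetical helper name.)
dedupe_lines() {
    awk '
    {
        if (seen[$0] == 0) {   # first time this exact line appears
            print              # same as the implicit "{ print }"
        }
        seen[$0]++             # record the line for later comparisons
    }' "$@"
}
```

For example, printf 'b\na\nb\n' | dedupe_lines prints b and then a. Input order is preserved, and the whole set of distinct lines is held in memory.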

How to get unique lines from a very large file in Linux?

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
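As a rough sketch of the two forms being compared, both functions below should produce the same output; the difference is only where the duplicates get dropped. The function names are made up for illustration, and LC_ALL=C is my own assumption, added so the ordering is byte-wise and reproducible across machines.

```shell
# unique_sorted: sort drops duplicates itself while sorting and merging.
unique_sorted() {
    LC_ALL=C sort -u "$@"
}

# unique_sorted_pipe: every duplicate is sorted and written to the pipe
# before uniq finally discards it.
unique_sorted_pipe() {
    LC_ALL=C sort "$@" | LC_ALL=C uniq
}
```

Unlike the awk one-liner, neither form keeps the original line order, but sort can spill to temporary files, which is what makes it suitable for very large inputs.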

Extract All Unique Lines

Two nearly identical options:

Match All Lines That Are Not Repeated

(?sm)(^[^\r\n]+$)(?!.*^\1$)

The lines will be matched, but to extract them, you really want to replace the other ones.

Replace All Repeated Lines

This will work better in Notepad++:

Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1)

Replace: empty string

(?s) activates DOTALL mode, allowing the dot to match across lines
(?m) turns on multi-line mode, allowing ^ and $ to match on each line
(^[^\r\n]*) captures a line to Group 1, i.e.
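The ^ anchor asserts that we are at the beginning of a line
[^\r\n]* matches any chars that are not newline chars
[\r\n] matches a newline char
The lookahead (?=.*^\1) asserts that we can match any number of characters .*, then...
^\1 the same text as Group 1 at the start of a later line
In the first pattern, the negative lookahead (?!.*^\1$) asserts the opposite: the line in Group 1 does not occur again further down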
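Outside an editor, the effect of the "Replace All Repeated Lines" search can be approximated with a two-pass awk script. This is a sketch under assumptions: keep_last_occurrence is a hypothetical name, and it compares whole lines exactly, whereas the regex as written ends in ^\1 without a $ and so can also fire when a line is only a prefix of a later one.

```shell
# keep_last_occurrence FILE: drop every line that appears again further down,
# so only the last copy of each line survives. (Hypothetical helper name.)
keep_last_occurrence() {
    awk '
    NR == FNR { last[$0] = FNR; next }   # pass 1: remember where each line last occurs
    last[$0] == FNR                      # pass 2: print a line only at that position
    ' "$1" "$1"
}
```

The file is passed twice on purpose: NR == FNR is true only while awk reads the first copy.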

How to print only the unique lines in BASH?

Using awk:

awk '{seen[$0]++} END{for(i in seen) if(seen[i]==1) print i}' file
eagle
forest
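Note that for (i in seen) visits the array in an unspecified order, so the output may not follow the order of the file. If order matters, one possible variant is to read the file twice; a sorted alternative with uniq -u is shown as well. Both function names are illustrative, and LC_ALL=C is an assumption added for reproducible sorting.

```shell
# only_unique_lines FILE: print lines that occur exactly once, in file order.
only_unique_lines() {
    awk '
    NR == FNR { count[$0]++; next }   # pass 1: count every line
    count[$0] == 1                    # pass 2: print lines counted exactly once
    ' "$1" "$1"
}

# only_unique_sorted FILE: same set of lines, but in sorted order.
only_unique_sorted() {
    LC_ALL=C sort "$1" | LC_ALL=C uniq -u   # -u keeps only non-repeated lines
}
```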

Extract unique lines from two sets of text files

The following might get you started.

Short version (using aliases)

compare -r $(gc C:\a\*.txt | sort -u) -d $(gc C:\b\*.txt | sort -u) | 
    ? {$_.SideIndicator -eq '<='} | 
    select -expand inputobject | 
    Out-File unique.txt

Long version

Compare-Object -ReferenceObject $(Get-Content C:\a\*.txt | Sort-Object -Unique) -DifferenceObject $(Get-Content C:\b\*.txt | Sort-Object -Unique) | 
    Where-Object {$PSItem.SideIndicator -eq '<='} | 
    Select-Object -ExpandProperty inputobject | 
    Out-File unique.txt

Note that I can't shake the feeling that the comparison with <= can and should be handled better, but I can't readily find a way.
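For readers working in a Unix-like shell rather than PowerShell, a comparable result can be sketched with sort -u and comm. This is not the answer's method, just an analogue: only_in_first and its two directory arguments are hypothetical, and LC_ALL=C is an assumption so that sort and comm agree on ordering.

```shell
# only_in_first DIR_A DIR_B: lines found in DIR_A/*.txt but in no DIR_B/*.txt.
# (Hypothetical helper; assumes every file ends with a newline.)
only_in_first() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    cat "$1"/*.txt | LC_ALL=C sort -u > "$tmp_a"
    cat "$2"/*.txt | LC_ALL=C sort -u > "$tmp_b"
    # -2 hides lines found only in B, -3 hides lines common to both
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

Here comm -23 plays the role of the SideIndicator -eq '<=' filter: it keeps only the lines that occur on the reference side.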

Bash: Remove unique and keep duplicate

Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.

awk -F'\t' 'NR==FNR { dup[$0]; next; } 
     $15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt

awk -F'\t' '{print $15}' file.txt | sort | uniq -d returns a list of all the duplicate values in column 15.

The NR==FNR line in the outer awk script reads that list first and stores each value as a key in the associative array dup.

The second line then processes file.txt and prints any line whose column 15 is in the array.
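The same filtering can also be sketched without sort and uniq by letting awk read the file twice. This is an alternative under assumptions, not the answer's own command: keep_duplicated_by_column and its column argument are hypothetical, and the counts for every distinct value are kept in memory.

```shell
# keep_duplicated_by_column COL FILE: print rows of a tab-separated file whose
# value in column COL occurs more than once. (Hypothetical helper name.)
keep_duplicated_by_column() {
    awk -F'\t' -v col="$1" '
    NR == FNR { count[$col]++; next }   # pass 1: count each value in the column
    count[$col] > 1                     # pass 2: print rows whose value repeats
    ' "$2" "$2"
}
```

Calling keep_duplicated_by_column 15 file.txt > newfile.txt would correspond to the command above, with rows kept in their original order.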