How to Filter Out All Unique Lines in a File

How do you filter out all unique lines in a file?

Remove duplicated lines:

awk '!a[$0]++' file

This is a famous awk one-liner; there are many explanations of it online. Here is one:

This one-liner is very idiomatic. It registers the lines seen in the
associative array "a" (arrays are always associative in Awk) and at
the same time tests whether it has seen the line before. If it has seen
the line before, then a[line] > 0 and !a[line] == 0. Any expression that
evaluates to false is a no-op, and any expression that evaluates to true
is equal to "{ print }".
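For illustration, here is a minimal sketch of the one-liner run against a small, made-up file (the file name and its contents are hypothetical):

# fruits.txt contains: apple, banana, apple, cherry, banana (one word per line)
awk '!a[$0]++' fruits.txt
apple
banana
cherry

Each line is printed only the first time it is seen; later duplicates fail the !a[$0]++ test and are skipped.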

How do you get unique lines from a very large file in Linux?

Use sort -u instead of sort | uniq

This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
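For example, both of the commands below write the distinct lines of a (hypothetical) big.log to unique.txt, but the first forces sort to hand every duplicate on to uniq, while the second lets sort discard duplicates as it works:

sort big.log | uniq > unique.txt
sort -u big.log > unique.txt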

Extract All Unique Lines

Two nearly identical options:

Match All Lines That Are Not Repeated

(?sm)(^[^\r\n]+$)(?!.*^\1$)

This matches the unique lines, but to actually extract them it is easier to delete the repeated ones instead.

Replace All Repeated Lines

This will work better in Notepad++:

Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1)

Replace: empty string

  • (?s) activates DOTALL mode, allowing the dot to match across lines
  • (?m) turns on multi-line mode, allowing ^ and $ to match at the start and end of each line
  • (^[^\r\n]*) captures a line to Group 1, where:
  • The ^ anchor asserts that we are at the beginning of a line
  • [^\r\n]* matches any chars that are not newline chars
  • [\r\n] matches one newline char after the line
  • The lookahead (?=.*^\1) asserts that we can match any number of characters .*, then...
  • ^\1 the same line as Group 1 appears again further down, so this earlier copy can be deleted
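Outside of Notepad++, roughly the same effect (delete a line whenever an identical line appears again later, i.e. keep only the last copy of each line) can be sketched on the command line, assuming GNU tac is available and file is the input:

# reverse the file, keep the first (originally last) copy of each line, reverse back
tac file | awk '!a[$0]++' | tac > deduped.txt

deduped.txt is just an illustrative output name.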

How to print only the unique lines in BASH?

Using awk, which prints only the lines that occur exactly once (in this sample file, eagle and forest):

awk '{seen[$0]++} END{for (i in seen) if (seen[i]==1) print i}' file
eagle
forest
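Note that for (i in seen) does not preserve the original line order. If order matters, a two-pass variant that reads the same file twice keeps it (a sketch using the same input file as above):

# first pass counts every line, second pass prints lines seen exactly once, in order
awk 'NR==FNR {count[$0]++; next} count[$0]==1' file file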

Extract unique lines from two sets of text files

The following might get you started.

Short version (using aliases)

compare -r $(gc C:\a\*.txt | sort -u) -d $(gc C:\b\*.txt | sort -u) | 
? {$_.SideIndicator -eq '<='} |
select -expand inputobject |
Out-File unique.txt

Long version

Compare-Object -ReferenceObject $(Get-Content C:\a\*.txt | Sort-Object -Unique) -DifferenceObject $(Get-Content C:\b\*.txt | Sort-Object -Unique) | 
Where-Object {$PSItem.SideIndicator -eq '<='} |
Select-Object -ExpandProperty inputobject |
Out-File unique.txt

Note that I can't shake the feeling that the comparison with '<=' can and should be handled better, but I can't readily find a way.
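For what it's worth, on Linux the same "lines only in the first set" result can be sketched with comm, assuming the two sets of files sit in directories a/ and b/ (paths are illustrative):

# comm -23 keeps lines that appear only in the first (sorted) input
comm -23 <(sort -u a/*.txt) <(sort -u b/*.txt) > unique.txt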

Bash: Remove unique lines and keep duplicates

Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.

awk -F'\t' 'NR==FNR { dup[$0]; next; } 
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt

awk -F'\t' '{print $15}' file.txt | sort | uniq -d returns a list of all the duplicate values in column 15.

The NR==FNR block in the first awk script loads that list into the associative array dup.

The second pattern, $15 in dup, processes file.txt and prints any line whose column 15 is in the array.
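If you'd rather avoid the process substitution, the same result can be sketched with a single awk program that reads file.txt twice: the first pass counts the column-15 values, the second prints rows whose value occurred more than once (same file names as above):

# pass 1: count values in column 15; pass 2: keep rows whose value is duplicated
awk -F'\t' 'NR==FNR {count[$15]++; next} count[$15] > 1' file.txt file.txt > newfile.txt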


