How do you filter out all unique lines in a file?
Remove duplicated lines:
awk '!a[$0]++' file
This is a famous awk one-liner, and there are many explanations of it online. Here is one:
This one-liner is very idiomatic. It registers the lines seen in the
associative-array "a" (arrays are always associative in Awk) and at
the same time tests if it had seen the line before. If it had seen the
line before, then a[line] > 0 and !a[line] == 0. Any expression that
evaluates to false is a no-op, and any expression that evals to true
is equal to "{ print }".
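To make the explanation above concrete, here is a rough sketch of the same logic written out long-hand. The helper name dedupe_keep_order and the sample input are invented for illustration; the behaviour should match the one-liner.

```shell
# Long-hand equivalent of: awk '!a[$0]++'
# dedupe_keep_order is a hypothetical helper name, not part of awk.
dedupe_keep_order() {
    awk '{
        if (a[$0] == 0)   # first time this line is seen
            print         # same as the implicit { print }
        a[$0]++           # remember it for later lines
    }' "$@"
}

# Made-up sample input: prints b, a, c (first occurrences, in input order).
printf 'b\na\nb\nc\na\n' | dedupe_keep_order
```

Note that the input order is preserved, which is the main difference from the sort-based approaches.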
How to get unique lines from a very large file in Linux?
Use sort -u instead of sort | uniq. This allows sort to discard duplicates earlier, and GNU coreutils is smart enough to take advantage of this.
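A quick way to convince yourself that the two forms agree, at least on small input. The sample lines are made up for illustration. Unlike the awk one-liner, both pipelines return the lines in sorted order rather than input order.

```shell
# Both pipelines should produce the same sorted, de-duplicated lines.
# The sample data is invented for this sketch.
sample='pear
apple
pear
fig
apple'

one_step=$(printf '%s\n' "$sample" | sort -u)
two_step=$(printf '%s\n' "$sample" | sort | uniq)

printf '%s\n' "$one_step"
[ "$one_step" = "$two_step" ] && echo "same result"
```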
Extract All Unique Lines
Two nearly identical options:
Match All Lines That Are Not Repeated
(?sm)(^[^\r\n]+$)(?!.*^\1$)
The lines will be matched, but to extract them, you really want to replace the other ones.
Replace All Repeated Lines
This will work better in Notepad++:
Search: (?sm)(^[^\r\n]*)[\r\n](?=.*^\1)
Replace: empty string
- (?s) activates DOTALL mode, allowing the dot to match across lines
- (?m) turns on multi-line mode, allowing ^ and $ to match on each line
- (^[^\r\n]*) captures a line to Group 1, i.e.
  - The ^ anchor asserts that we are at the beginning of a line
  - [^\r\n]* matches any chars that are not newline chars
- [\r\n] matches the newline char
- The negative lookahead (?!.*^\1$) asserts that we cannot match any number of characters .*, then ^\1$, the same line as Group 1; the positive lookahead (?=.*^\1) in the replace pattern asserts the opposite, that the same line does appear again later
How to print only the unique lines in BASH?
Using awk:
awk '{seen[$0]++} END{for (i in seen) if (seen[i]==1) print i}' file
eagle
forest
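A small sketch of the counting approach on invented input, wrapped in a hypothetical helper called only_unique. The output is piped through sort because awk does not define the iteration order of "for (i in seen)". The last line shows sort | uniq -u, a swapped-in coreutils alternative that should give the same result.

```shell
# only_unique is a hypothetical helper name.
# Prints the lines that occur exactly once, in sorted order.
only_unique() {
    awk '{ seen[$0]++ }
         END { for (i in seen) if (seen[i] == 1) print i }' "$@" | sort
}

# Made-up sample: "red" occurs twice, so only blue and green remain.
printf 'red\nblue\nred\ngreen\n' | only_unique

# Alternative without awk: uniq -u keeps only non-repeated lines.
printf 'red\nblue\nred\ngreen\n' | sort | uniq -u
```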
Extract unique lines from two sets of text files
The following might get you started.
Short version (using aliases)
compare -r $(gc C:\a\*.txt | sort -u) -d $(gc C:\b\*.txt | sort -u) |
? {$_.SideIndicator -eq '<='} |
select -expand inputobject |
Out-File unique.txt
Long version
Compare-Object -ReferenceObject $(Get-Content C:\a\*.txt | Sort-Object -Unique) -DifferenceObject $(Get-Content C:\b\*.txt | Sort-Object -Unique) |
Where-Object {$PSItem.SideIndicator -eq '<='} |
Select-Object -ExpandProperty inputobject |
Out-File unique.txt
Note that I can't shake the feeling that the comparison with <= can and should be handled better, but I can't readily find a way.
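For comparison only, on a Unix shell the same "left side only" idea can be sketched with comm instead of Compare-Object. This is a different tool, not a translation of the PowerShell above; the helper name left_only and the sample files are assumptions made for the sketch, and it takes two single files rather than two directories of *.txt files.

```shell
# left_only is a hypothetical helper: left_only FILE_A FILE_B
# Prints the distinct lines that appear in FILE_A but not in FILE_B.
left_only() {
    tmp=$(mktemp -d) || return 1
    sort -u "$1" > "$tmp/a"
    sort -u "$2" > "$tmp/b"
    # -2 drops lines found only in b, -3 drops lines common to both
    comm -23 "$tmp/a" "$tmp/b"
    rm -rf "$tmp"
}

# Made-up sample files; prints x and z.
demo=$(mktemp -d)
printf 'x\ny\nz\n' > "$demo/a.txt"
printf 'y\nw\n' > "$demo/b.txt"
left_only "$demo/a.txt" "$demo/b.txt"
rm -rf "$demo"
```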
Bash: Remove unique and keep duplicate
Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.
awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt
awk -F'\t' '{print $15}' file.txt | sort | uniq -d returns a list of all the duplicate values in column 15.
The NR==FNR line in the outer awk script loads that list into an associative array.
The second line processes file.txt and prints any lines where column 15 is in the array.
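The same two-step idea can be tried on a small file. The sketch below is a variant, not the command above verbatim: it reads the duplicate list from standard input (the - operand) instead of using <(...), and takes the column number as a parameter. The helper name keep_dups and the sample data are invented for illustration.

```shell
# keep_dups is a hypothetical helper: keep_dups COLUMN FILE
# Prints the rows of a tab-separated FILE whose COLUMN value occurs more than once.
keep_dups() {
    cut -f"$1" "$2" | sort | uniq -d |
        awk -F'\t' -v col="$1" 'NR==FNR { dup[$0]; next } $col in dup' - "$2"
}

# Made-up sample: "red" appears twice in column 2, so rows 1 and 3 are kept.
demo=$(mktemp)
printf '1\tred\n2\tblue\n3\tred\n4\tgreen\n' > "$demo"
keep_dups 2 "$demo"
rm -f "$demo"
```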