Get Lines of File1 Which Are Not in File2

This is what the comm command is for:

$ comm -23 file1 file2
0002_bcc

From man comm:

DESCRIPTION

Compare sorted files FILE1 and FILE2 line by line.

With no options, produce three-column output. Column one contains
lines unique to FILE1, column two contains lines unique to FILE2, and
column three contains lines common to both files.

-1 suppress column 1 (lines unique to FILE1)

-2 suppress column 2 (lines unique to FILE2)

-3 suppress column 3 (lines that appear in both files)
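
As a quick illustration of how the column options combine, here is a minimal sketch on two made-up, already sorted files. The file contents and the temporary directory are assumptions for the example, not data from the question.

```shell
# Hypothetical sample data; both files are already sorted, as comm requires.
tmp=$(mktemp -d)
printf '%s\n' apple banana cherry > "$tmp/file1"
printf '%s\n' banana cherry date > "$tmp/file2"

# Suppress columns 2 and 3: only lines unique to file1 remain.
only_in_file1=$(comm -23 "$tmp/file1" "$tmp/file2")
echo "$only_in_file1"    # prints: apple

# Suppress columns 1 and 3: only lines unique to file2 remain.
only_in_file2=$(comm -13 "$tmp/file1" "$tmp/file2")
echo "$only_in_file2"    # prints: date

rm -rf "$tmp"
```

Using -3 on its own would keep both of the "unique" columns, so the lines unique to file2 would still appear, indented by a tab.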

Fast way of finding lines in one file that are not in another?

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format=""  file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort them on the fly with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above, new and unchanged lines are suppressed, so only old lines (i.e. the lines removed from file1, which are the ones you want) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc.) for less strict matching.


Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
--new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u. Note that this only formats the lines themselves: the output lacks the ---, +++ and @@ lines at the top of each grouped change. You can also use this to do other useful things, like numbering each line with %dn.
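
To make the %dn idea concrete, here is a small sketch that prints only the removed lines, each prefixed with its line number in file1. It assumes GNU diff (the line-format options are GNU extensions), and the sample files are invented for the example.

```shell
# Hypothetical sample data.
tmp=$(mktemp -d)
printf '%s\n' a b c d > "$tmp/file1"
printf '%s\n' a c > "$tmp/file2"

# Print only old (removed) lines as "<line number>:<line>".
# diff exits with status 1 when the files differ, hence the "|| true".
numbered=$(diff --old-line-format='%dn:%L' --new-line-format='' \
                --unchanged-line-format='' "$tmp/file1" "$tmp/file2") || true
echo "$numbered"
# prints:
# 2:b
# 4:d

rm -rf "$tmp"
```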


The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort on the fly. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" } # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; } # file1, index by lineno
(NR!=FNR) { ss2[$0]++; } # file2, index by string
END {
for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will give different output from the diff method if there are duplicates.)
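
The remark about duplicates can be seen on a tiny example. The sketch below uses a compact one-line variant of the same membership test (it reads file2 first, which is an assumption of this example rather than the script above) and compares it with comm on sorted copies. The sample data is made up.

```shell
# Hypothetical sample data: file1 is unsorted and contains "apple" twice.
tmp=$(mktemp -d)
printf '%s\n' pear apple apple fig > "$tmp/file1"
printf '%s\n' apple > "$tmp/file2"

# Set-style test: remember every line of file2, then print the file1 lines
# that were never seen. Order of file1 is preserved.
awk_out=$(awk 'NR==FNR { in2[$0]; next } !($0 in in2)' "$tmp/file2" "$tmp/file1")
echo "$awk_out"
# prints:
# pear
# fig

# comm pairs lines one-to-one, so a single "apple" in file2 cancels only one
# of the two in file1. Its output also follows sort order.
sort "$tmp/file1" > "$tmp/file1.sorted"
sort "$tmp/file2" > "$tmp/file2.sorted"
comm_out=$(comm -23 "$tmp/file1.sorted" "$tmp/file2.sorted")
echo "$comm_out"
# prints:
# apple
# fig
# pear

rm -rf "$tmp"
```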

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) { # file1, index by lineno and string
ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) { # file2
if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}

The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.

In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), repeated runs with chunks of file1 and reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of -, meaning stdin, on the gawk command line. split supplies it with file1 in chunks of 20000 lines per invocation.
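
Here is a scaled-down sketch of the same split run, so the chunking is visible. It assumes GNU split, uses plain awk instead of gawk, saves a copy of the first script under the name linesnotin.awk, and uses invented sample data with chunks of 2 lines instead of 20000.

```shell
tmp=$(mktemp -d)

# A copy of the first script above, saved under the name the split command uses.
cat > "$tmp/linesnotin.awk" <<'EOF'
BEGIN { FS="" }
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }
(NR!=FNR) { ss2[$0]++; }
END {
  for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
EOF

# Hypothetical sample data.
printf '%s\n' a b c d e f > "$tmp/file1"
printf '%s\n' b e > "$tmp/file2"

# Three awk runs, one per chunk: (a b), (c d), (e f).
# Each run reads its chunk from stdin (-) and then the whole of file2.
split_out=$(split -l 2 --filter="awk -f '$tmp/linesnotin.awk' - '$tmp/file2'" < "$tmp/file1")
echo "$split_out"
# prints:
# a
# c
# d
# f

rm -rf "$tmp"
```

Only one chunk of file1 is held in memory at a time, at the cost of re-reading file2 for every chunk.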

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain. On OS X, the Apple Xcode tools provide GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.

Output line from file1 if not found in file2

grep -vhFxf file2 file1

Works great.

Show lines in unsorted file1 that do not exist in file2

Many thanks for the help and the suggestion to use comm, but meanwhile I found the answer using grep. I prefer grep here, since it does not require the input to be sorted.

$ fgrep -v -x -f file2 file1
chmod -f 644 /root/testme
chown -h root:root /root/testme

The solution is simply to add -x:

-x Select only those matches that exactly match the whole line
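
To show why -x (and fixed-string matching) matters, here is a small sketch with made-up data. It uses grep -F, which is equivalent to fgrep, and compares the result with and without each option.

```shell
# Hypothetical sample data.
tmp=$(mktemp -d)
printf '%s\n' foo foobar a.c abc > "$tmp/file1"
printf '%s\n' foo a.c > "$tmp/file2"

# Whole-line, fixed-string matching: the intended result.
exact=$(grep -v -F -x -f "$tmp/file2" "$tmp/file1")
echo "$exact"
# prints:
# foobar
# abc

# Without -x, "foo" also matches inside "foobar", so that line is lost.
no_x=$(grep -v -F -f "$tmp/file2" "$tmp/file1")
echo "$no_x"     # prints: abc

# Without -F, "a.c" is a regular expression and also matches "abc".
no_F=$(grep -v -x -f "$tmp/file2" "$tmp/file1")
echo "$no_F"     # prints: foobar

rm -rf "$tmp"
```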

Compare file1 and file2 but show only new lines which are not in file2

This should work:

awk -F= 'NR==FNR{a[$1]=$0;next}!($1 in a)' file2 file1
Austria=Wien

We read the entire file2 first, indexed by country. Then, for each line of file1, we check whether its country is absent from file2 and print the line if so. This won't give you the lines which are in file2 but not in file1, but it can be adjusted to give you those as well. I am not sure if that is your requirement. If it is, please update your question to reflect all your use cases for a more complete answer.
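
A minimal sketch of both directions, using invented country data (only Austria=Wien appears in the answer above; the rest is assumed for the example). Note that only the key before the = is compared, so a changed capital does not count as a new line.

```shell
# Hypothetical sample data.
tmp=$(mktemp -d)
printf '%s\n' 'Germany=Berlin' 'Austria=Wien' 'France=Paris' > "$tmp/file1"
printf '%s\n' 'Germany=Bonn' 'France=Paris' 'Italy=Rome' > "$tmp/file2"

# Lines of file1 whose country does not occur in file2.
new_in_file1=$(awk -F= 'NR==FNR{a[$1]=$0;next}!($1 in a)' "$tmp/file2" "$tmp/file1")
echo "$new_in_file1"    # prints: Austria=Wien

# Swapping the arguments gives the other direction.
new_in_file2=$(awk -F= 'NR==FNR{a[$1]=$0;next}!($1 in a)' "$tmp/file1" "$tmp/file2")
echo "$new_in_file2"    # prints: Italy=Rome

rm -rf "$tmp"
```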

Find lines from one file that do not appear (even partially) in another file

Here is one in awk. mawk is probably the fastest implementation, so use that one if you can:

$ awk '
NR==FNR { # process file1
a[$0] # hash all records to memory
next # process next record
}
{ # process file2
for(i in a) # for each file1 entry in memory
if($0 ~ i) # see if it is found in current file2 record
delete a[i] # and delete if found
}
END { # in the end
for(i in a) # all left from file1
print i # are outputted
}' file1 file2 # mind the order

Output:

bbb
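
One thing to be aware of: $0 ~ i treats each file1 line as a regular expression. If the lines should be taken literally, a possible variation (an assumption of this sketch, not part of the answer above) is to test with index() instead. The sample data is made up, and the output is piped through sort because for (i in a) has no guaranteed order.

```shell
# Hypothetical sample data: "a.c" matches "abc" as a regex, but not literally.
tmp=$(mktemp -d)
printf '%s\n' aaa bbb a.c > "$tmp/file1"
printf '%s\n' xaaay abc > "$tmp/file2"

# Regex matching, as in the script above.
regex_out=$(awk 'NR==FNR { a[$0]; next }
                 { for (i in a) if ($0 ~ i) delete a[i] }
                 END { for (i in a) print i }' "$tmp/file1" "$tmp/file2" | sort)
echo "$regex_out"      # prints: bbb

# Literal substring matching with index().
literal_out=$(awk 'NR==FNR { a[$0]; next }
                   { for (i in a) if (index($0, i)) delete a[i] }
                   END { for (i in a) print i }' "$tmp/file1" "$tmp/file2" | sort)
echo "$literal_out"
# prints:
# a.c
# bbb

rm -rf "$tmp"
```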

Fastest way to find lines in file1 which contains any keywords from file2?

Give this a shot.

Test Data:

%_Host@User> head file1.txt file2.txt
==> file1.txt <==
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |
server1:user5:x:14598:24:User Five:/users/user5:/bin/bash |

==> file2.txt <==
user1
user2
user3
#user4
%_Host@User>

Output:

%_Host@User> ./2comp.pl file1.txt file2.txt ; cat output_comp
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
%_Host@User>
%_Host@User>

Script: Please give this one more try and re-check the file order: file1 first, then file2, as in ./2comp.pl file1.txt file2.txt.

%_Host@User> cat 2comp.pl
#!/usr/bin/perl

use strict ;
use warnings ;
use Data::Dumper ;

my ($file2,$file1,$output) = (@ARGV,"output_comp") ; # note: $file2 gets the first argument (the data file), $file1 the second (the keyword list)
my (%hash,%tmp) ;

(scalar @ARGV != 2 ? (print "Need 2 files!\n") : ()) ? exit 1 : () ;

for (@ARGV) {
open FH, "<$_" or die "Cannot open $_\n" ; # "or", not "||", so that die fires when open fails
while (my $line = <FH>){$line =~ s/^.+[()].+$| +?$//g ; chomp $line ; $hash{$_}{$line} = "$line"}
close FH ;}

open FH, ">>$output" or die "Cannot open outfile!\n" ; # "or", not "||", so that die fires when open fails
foreach my $k1 (keys %{$hash{$file1}}){
foreach my $k2 (keys %{$hash{$file2}}){
if ($k2 =~ m/^.+?$k1.+?$/i){ # Case Insensitive matching.
if (!defined $tmp{"$hash{$file2}{$k2}"}){
print FH "$hash{$file2}{$k2}\n" ;
$tmp{"$hash{$file2}{$k2}"} = 1 ;
}}}} close FH ;
# End.
%_Host@User>

Thanks good luck.

Grep lines in file1 if not found in file2

You could use the comm command:

$ comm -23 file1 file2
bbbb

comm expects sorted input, so sort the files before feeding them to the comm command:

comm -23 <(sort file1) <(sort file2)

How can I print lines from file1 and file2 where columns 9 in file 1 is less than column 4 in file 2

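Your key from file1 is field 2, not field 1. The script below stores each file1 line under that key, then prints the pair whenever file1's columns 9 and 10 fall within the range given by columns 4 and 5 of the matching file2 line.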

$ awk 'NR==FNR {a[$2]=$0; next} 
$1 in a {split(a[$1],t);
if(t[9]>=$4 && t[10]<=$5) print a[$1], $0}' file1 file2 | column -t

BG chr2 100.000 15 0 0 1 15 216745730 216745744 5.1 chr2 hg38_refGene exon 216645730 216845744
BG chr1 100.000 15 0 0 1 15 6195235 6195335 5.1 chr1 hg38_refGene CDS 6095235 6395421

