Check If All Lines from One File Are Present Somewhere in Another File

Fast way of finding lines in one file that are not in another?

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format=""  file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort on the fly with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above, new and unchanged lines are suppressed, so only changed lines (i.e. lines removed from file1, in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w, etc.) for less strict matching.
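
As a quick (hypothetical) illustration, suppose file1 contains the sorted lines:

apple
banana
cherry

and file2 contains:

apple
cherry

Then the command

diff --new-line-format="" --unchanged-line-format="" file1 file2

outputs only the line present in file1 but missing from file2:

banana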


Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
--new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u does
(note that it only outputs differences; it lacks the ---, +++ and @@ lines at the top of each grouped change).
You can also use this to do other useful things like number each line with %dn.
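
For example, here is a sketch (using the %dn specifier mentioned above) that prefixes each missing line with its line number in file1, with new and unchanged lines suppressed:

diff --old-line-format="%dn: %L" --new-line-format="" --unchanged-line-format="" file1 file2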


The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort on the fly. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" } # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; } # file1, index by lineno
(NR!=FNR) { ss2[$0]++; } # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number indexed array ll1[], and the entire contents of file2 line by line in a line-content indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will give different output from the diff method if there are duplicates.)

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) { # file1, index by lineno and string
ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) { # file2
if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}

The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.
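
Assuming the script above is saved as linesnotin.awk (the name used in the split example below), it can be run directly as:

awk -f linesnotin.awk file1 file2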

In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension), making repeated runs with chunks of file1 and reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of - meaning stdin on the gawk command line. split provides this from file1 in chunks of 20000 lines per invocation.
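
To see how --filter hands each chunk to the command's standard input, here is a trivial (hypothetical) illustration that merely counts the lines of each chunk instead of running the awk script:

split -l 20000 --filter='wc -l' < file1

Each invocation of wc -l reads one 20000-line chunk on stdin (the last chunk may be shorter) and prints its line count.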

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provides GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.

Finding contents of one file in another file

grep itself is able to do so. Simply use the flag -f:

grep -f <patterns> <file>

<patterns> is a file containing one pattern per line, and <file> is the file in which you want to search.

Note that, to force grep to treat each line as a fixed string, even if its contents look like a regular expression, you should use the flag -F (--fixed-strings).

grep -F -f <patterns> <file>

If your file is a CSV, as you said, you may do:

grep -f <(tr ',' '\n' < data.csv) <file>

As an example, consider the file "a.txt", with the following lines:

alpha
0891234
beta

Now, the file "b.txt", with the lines:

Alpha
0808080
0891234
bEtA

The output of the following command is:

grep -f "a.txt" "b.txt"
0891234

You don't need a for loop here at all; grep itself offers this feature.


Now using your file names:

#!/bin/bash
patterns="/home/nimish/contents.txt"
search="/home/nimish/another_file.csv"
grep -f <(tr ',' '\n' < "${patterns}") "${search}"

You may change ',' to the separator you have in your file.
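
For instance, if the contents file were semicolon-separated instead, the same approach would be (hypothetical example):

grep -f <(tr ';' '\n' < "${patterns}") "${search}"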

How to find words from one file in another file?

You can use grep -f:

grep -Ff "first-file" "second-file"

Or, to match full words only:

grep -w -Ff "first-file" "second-file"

UPDATE: As per the comments:

awk 'FNR==NR{a[$1]; next} ($1 in a){delete a[$1]; print $1}' file1 file2
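
As a hypothetical illustration of that one-liner, suppose file1 contains the words:

apple
banana

and file2 contains:

banana split
apple pie
banana bread

Then the command prints each word that appears as the first field in both files, once per word, in the order it is first seen in file2:

banana
apple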

Find content of one file from another file in UNIX

One way with awk:

awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2

This should be pretty quick. On my machine, it took under 2 seconds to create a lookup of 1 million entries and compare it against 3 million lines.

Machine Specs:

Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (8 cores)
98 GB RAM

Find lines from a file which are not present in another file

The command to use here is not diff but comm:

comm -23 a.txt b.txt

By default, comm outputs 3 columns: left-only, right-only, both. The -1, -2 and -3 switches suppress these columns.

So, -23 hides the right-only and both columns, showing the lines that appear only in the first (left) file.

If you want to find lines that appear in both, you can use -12, which hides the left-only and right-only columns, leaving you with just the both column.
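
A quick illustration with two small (already sorted) files: if a.txt contains

apple
banana
cherry

and b.txt contains

banana
date

then comm -23 a.txt b.txt prints

apple
cherry

while comm -12 a.txt b.txt prints just

banana

Note that comm, like the diff approach above, expects both inputs to be sorted.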

Deleting lines from one file which are in another file

grep -v -x -f f2 f1 should do the trick.

Explanation:

  • -v to select non-matching lines
  • -x to match whole lines only
  • -f f2 to get patterns from f2

One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
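
A small hypothetical example: if f1 contains

apple
banana
cherry

and f2 contains

banana

then grep -v -x -f f2 f1 outputs

apple
cherry

i.e. f1 with the lines of f2 removed.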

Check if all of multiple strings or regexes exist in a file

Awk is the tool that the people who invented grep, shell, etc. created to do general text-manipulation jobs like this, so I'm not sure why you'd want to avoid it.

In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:

awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file

And here's a bunch of other information and options:

Assuming you're really looking for strings, it'd be:

awk -v strings='string1 string2 string3' '
BEGIN {
    numStrings = split(strings,tmp)
    for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
    for (str in strs) {
        if ( index($0,str) ) {
            delete strs[str]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file

The above will stop reading the file as soon as all strings have matched.
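
Since the script reports its result via the exit status, a typical (hypothetical) way to use it from the shell, assuming the program body above is saved as check_strings.awk, is:

if awk -v strings='string1 string2 string3' -f check_strings.awk file; then
    echo "all strings present"
else
    echo "at least one string missing"
fi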

If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:

awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file

Actually, even if it were strings you could do:

awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file

The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time whereas with the first awk script above, it'll work in any awk in any shell on any UNIX box and only stores one line of input at a time.

I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:

awk '
NR==FNR { strings[$0]; next }
{
    for (string in strings)
        if ( !index($0,string) )
            exit 1
}
' file_of_strings RS='^$' file_to_be_searched

and for regexps it'd be:

awk '
NR==FNR { regexps[$0]; next }
{
    for (regexp in regexps)
        if ( $0 !~ regexp )
            exit 1
}
' file_of_regexps RS='^$' file_to_be_searched

If you don't have GNU awk and your input file does not contain NUL characters then you can get the same effect as above by using RS='\0' instead of RS='^$' or by appending to variable one line at a time as it's read and then processing that variable in the END section.
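
Here is a hedged sketch of that last alternative (accumulating the input in a variable and testing it in the END section), which should work in any POSIX awk, shown for the strings case:

awk '
{ buf = buf $0 ORS }    # append each line to one big string
END { exit !(index(buf,"string1") && index(buf,"string2") && index(buf,"string3")) }
' file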

If your file_to_be_searched is too large to fit in memory then it'd be this for strings:

awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
    for (string in strings) {
        if ( index($0,string) ) {
            delete strings[string]
            numStrings--
        }
    }
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched

and the equivalent for regexps:

awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
    for (regexp in regexps) {
        if ( $0 ~ regexp ) {
            delete regexps[regexp]
            numRegexps--
        }
    }
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched

Fastest way to find lines of a file from another larger file in Bash

A small piece of Perl code solved the problem. This is the approach taken:

  • store the lines of file1.txt in a hash
  • read file2.txt line by line, parse and extract the second field
  • check if the extracted field is in the hash; if so, print the line

Here is the code:

#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
    printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
    exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file) || die "Can't open $big_file: " . $!;

# store contents of small file in a hash
while (<$small_fp>) {
    chomp;
    $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
    # no need for chomp
    $field = (split(/\|/, $_))[1];
    if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
    }
}

close($big_fp);
exit(0);

I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 12 seconds, producing 126K matches. Here is the time output:

real    0m11.694s
user    0m11.507s
sys     0m0.174s

I ran @Inian's awk code:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

It was way slower than the Perl solution, since it is looping 14K times for each line in file2.txt - which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
input record number 675280, file file2.txt
source line number 1

real 55m5.539s
user 54m53.080s
sys 0m5.095s

Using @Inian's other awk solution, which eliminates the looping issue:

time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real 0m39.966s
user 0m37.916s
sys 0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real 0m41.057s
user 0m38.475s
sys 0m0.904s

awk is very impressive here, given that we didn't have to write an entire program to do it.

I ran @oliv's Python code as well. It took about 15 hours to complete the job, and it looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

real    895m14.862s
user    806m59.219s
sys     1m12.147s

I tried to follow the suggestion to use parallel. However, it failed with an "fgrep: memory exhausted" error, even with very small block sizes.


What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, by which point it had produced about 100K matches. I wish fgrep had an option to keep the content of the -f file in a hash, just like the Perl code did.

I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.

Thanks everyone for your attention and responses.


