Fastest Way to Find Lines of a File from Another Larger File in Bash

Fastest way to find lines of a file from another larger file in Bash

A small piece of Perl code solved the problem. This is the approach taken:

  • store the lines of file1.txt in a hash
  • read file2.txt line by line, parse and extract the second field
  • check if the extracted field is in the hash; if so, print the line

Here is the code:

#!/usr/bin/perl -w

use strict;
if (scalar(@ARGV) != 2) {
    printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
    exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: " . $!;

# store contents of small file in a hash
while (<$small_fp>) {
    chomp;
    $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
    # no need for chomp
    $field = (split(/\|/, $_))[1];
    if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
    }
}

close($big_fp);
exit(0);

I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time output:

real    0m11.694s
user    0m11.507s
sys     0m0.174s

I ran @Inian's awk code:

awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt

It was way slower than the Perl solution, since it loops 14K times for each line of file2.txt, which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:

awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
input record number 675280, file file2.txt
source line number 1

real 55m5.539s
user 54m53.080s
sys 0m5.095s

Using @Inian's other awk solution, which eliminates the looping issue:

time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out

real 0m39.966s
user 0m37.916s
sys 0m0.743s

time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out

real 0m41.057s
user 0m38.475s
sys 0m0.904s

awk is very impressive here, given that we didn't have to write an entire program to do it.

I ran @oliv's Python code as well. It took about 15 hours to complete the job and looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:

real    895m14.862s
user    806m59.219s
sys     1m12.147s

I tried to follow the suggestion to use parallel. However, it failed with an "fgrep: memory exhausted" error, even with very small block sizes.
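
For reference, a minimal sketch of the kind of GNU parallel invocation that was suggested (the block size here is purely illustrative, and the exact options in the original suggestion may have differed); each grep/fgrep process still has to load all of file1.txt as patterns, which is presumably where the memory exhaustion came from:

# illustrative block size; each chunk of file2.txt is piped to its own grep -F
parallel --pipepart -a file2.txt --block 10M grep -F -f file1.txt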


What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, by which point it had produced about 100K matches. I wish fgrep had an option to force the contents of its -f file to be kept in a hash, just like the Perl code does.

I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
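
For completeness, a minimal sketch of what the join approach might have looked like, assuming file1.txt holds bare keys and the key is the second |-delimited field of file2.txt (untested here, and it would incur the sorting overhead mentioned above):

# join keys from file1.txt (field 1) against field 2 of file2.txt
join -t'|' -1 1 -2 2 <(sort file1.txt) <(sort -t'|' -k2,2 file2.txt)

Note that join prints the join field first by default, so reproducing the original file2.txt lines verbatim would also need an -o output format.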

Thanks everyone for your attention and responses.

Fastest way to find lines in file1 which contain any keywords from file2?

Give this a shot.

Test Data:

%_Host@User> head file1.txt file2.txt
==> file1.txt <==
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |
server1:user5:x:14598:24:User Five:/users/user5:/bin/bash |

==> file2.txt <==
user1
user2
user3
#user4
%_Host@User>

Output:

    %_Host@User> ./2comp.pl file1.txt file2.txt   ; cat output_comp
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
%_Host@User>
%_Host@User>

Script: Please give this one more try. Re-check the file order: file1 first and then file2 second: ./2comp.pl file1.txt file2.txt.

%_Host@User> cat 2comp.pl
#!/usr/bin/perl

use strict ;
use warnings ;

# called as: ./2comp.pl file1.txt file2.txt,
# so $file2 holds the data file (file1.txt) and $file1 the keyword file (file2.txt)
my ($file2, $file1, $output) = (@ARGV, "output_comp") ;
my (%hash, %tmp) ;

if (scalar @ARGV != 2) { print "Need 2 files!\n" ; exit 1 }

# hash every (cleaned) line of both files, keyed by file name
for my $file (@ARGV) {
    open my $fh, "<", $file or die "Cannot open $file: $!\n" ;
    while (my $line = <$fh>) {
        $line =~ s/^.+[()].+$| +?$//g ;   # blank out lines containing parentheses, strip trailing spaces
        chomp $line ;
        $hash{$file}{$line} = $line ;
    }
    close $fh ;
}

open my $out, ">>", $output or die "Cannot open outfile!\n" ;
foreach my $k1 (keys %{$hash{$file1}}) {          # each keyword
    foreach my $k2 (keys %{$hash{$file2}}) {      # each data line
        if ($k2 =~ m/^.+?$k1.+?$/i) {             # case-insensitive matching
            if (!defined $tmp{$hash{$file2}{$k2}}) {
                print $out "$hash{$file2}{$k2}\n" ;
                $tmp{$hash{$file2}{$k2}} = 1 ;    # avoid duplicate output lines
            }
        }
    }
}
close $out ;
# End.
%_Host@User>

Thanks, and good luck.

Fastest way to find lines of a file based on column from another larger file in awk

It's a typical awk multi-file processing problem.

The command: (I think the code explains itself)

awk -F',' ' fname != FILENAME{fname=FILENAME;idx++}
idx == 1 {a[$1]}
idx>1 && $1 in a{print > (fname".out")} ' 1.csv C_*.csv

Note:

  • I tested it with gawk
  • I simplified your filenames a bit; you can replace C_*.csv with the real pattern
  • Your CLEAN_*.csv files have leading spaces on each line. I've removed them. If your input data really contains those spaces, we need to handle that. It's pretty easy, but I guess the spaces come from your copy & paste.
  • After you run the command, you will get 10 CLEANUP_....csv.out files (if there are 10 CLEANUP_..csv files, as you said), which contain the matched lines
  • If you have a lot of input files, you may want to call close(fname".out") before switching to a new input file. Otherwise you may get a "too many open files" error. A sketch of that variant follows this list.
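
A minimal sketch of the same command with that close() call added (filenames are illustrative, matching the simplified C_*.csv pattern above):

awk -F',' '
fname != FILENAME  { if (idx > 1) close(fname".out"); fname = FILENAME; idx++ }
idx == 1           { a[$1] }
idx > 1 && $1 in a { print > (fname".out") }
' 1.csv C_*.csv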

If I run it with your shortened examples, I get:

kent$  head *.out
==> C_1.csv.out <==
15192151898, 732539481
12042050161, 770565711

==> c_2.csv.out <==
12042050198, 592494702
12042050158, 765193222

Fast way of finding lines in one file that are not in another?

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format=""  file1 file2

The input files should be sorted for this to work. With bash (and zsh) you can sort on the fly with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above, new and unchanged lines are suppressed, so only the old lines (i.e. the removed lines, in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w, etc.) for less strict matching.
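
For example, a hypothetical case-insensitive run would pair -i with case-folded sorting (sort -f) so that the sorted-input requirement still holds:

diff -i --new-line-format="" --unchanged-line-format="" <(sort -f file1) <(sort -f file2)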


Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
--new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs differences; it lacks the ---, +++ and @@ lines at the top of each grouped change).
You can also use this to do other useful things like number each line with %dn.
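
As a small illustration of %dn (a sketch of the same removed-lines setup as above), this prints each missing line prefixed by its line number in file1:

diff --new-line-format="" --unchanged-line-format="" --old-line-format="%dn: %L" file1 file2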


The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort on the fly. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" }                       # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }   # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }              # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number-indexed array ll1[], and the entire contents of file2 line by line in a line-content-indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will have different output from the diff method if there are duplicates.)

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) { # file1, index by lineno and string
ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) { # file2
if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}

The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.

In this case, with the problem as stated, you can also divide and conquer using GNU split (the --filter option is a GNU extension): run the script repeatedly on chunks of file1, reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of -, meaning stdin, on the gawk command line. This is fed by split from file1 in chunks of 20000 lines per invocation.

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools, which provides GNU diff and awk, though only a POSIX/BSD split rather than a GNU version.

Stream-filter a large number of lines, specified by line number, from stdin

Keeping with OP's current idea:

xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -

Where the - (at the end of the line) tells awk to read from stdin (in this case the output from xz that's being piped to the awk call).

Another way to do this (replaces all of the above code):

awk '
FNR==NR { nums[$1]; next } # process first file
FNR in nums # process subsequent file(s)
' select.txt <(xz -dcq huge.txt.xz)

Comments removed and cut down to a 'one-liner':

awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)

Adding some logic to implement Ed Morton's comment (exit processing once FNR > largest value from select.txt):

awk '
# process first file

FNR==NR { nums[$1]
          maxFNR = ($1 > maxFNR ? $1 : maxFNR)
          next
        }

# process subsequent file(s):

FNR > maxFNR { exit }
FNR in nums
' select.txt <(xz -dcq huge.txt.xz)

NOTES:

  • keeping in mind we're talking about scanning millions of lines of input ...
  • FNR > maxFNR will obviously add some cpu/processing time to the overall operation (though less time than FNR in nums)
  • if the operation routinely needs to pull rows from, say, the last 25% of the file then FNR > maxFNR is likely providing little benefit (and probably slowing down the operation)
  • if the operation routinely finds all desired rows in, say, the first 50% of the file then FNR> maxFNR is probably worth the cpu/processing time to keep from scanning the entire input stream (then again, the xz operation, on the entire file, is likely the biggest time consumer)
  • net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; OP would need to run some tests to see if there's a (noticeable) difference in overall runtime

Getting one line in a huge file with bash

If all the lines have the same length, the best way by far will be to use dd(1) and give it a skip parameter.

Let the block size be the length of each line (including the newline), then you can do:

$ dd if=filename bs=<line-length> skip=<line_no - 1> count=1 2>/dev/null

The idea is to seek past all the previous lines (skip=<line_no - 1>) and read a single line (count=1). Because the block size is set to the line length (bs=<line-length>), each block is effectively a single line. Redirect stderr so you don't get the annoying stats at the end.

That should be much more efficient than streaming all the lines before the one you want through a program that reads them and throws them away, since dd seeks directly to the position you want in the file and reads only one line's worth of data.
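
For instance, a hypothetical concrete case: if every line in filename is exactly 100 bytes long (newline included) and you want line 1000, you would skip the first 999 blocks:

# each 100-byte block is one line; skip 999 lines, then read exactly one
dd if=filename bs=100 skip=999 count=1 2>/dev/null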

Find lines from one file that do not appear (even partially) in another file

Here's one in awk; mawk is probably the fastest, so use that one:

$ awk '
NR==FNR { # process file1
a[$0] # hash all records to memory
next # process next record
}
{ # process file2
for(i in a) # for each file1 entry in memory
if($0 ~ i) # see if it is found in current file2 record
delete a[i] # and delete if found
}
END { # in the end
for(i in a) # all left from file1
print i # are outputted
}' file1 file2 # mind the order

Output:

bbb

