Fastest way to find lines of a file from another larger file in Bash
A small piece of Perl code solved the problem. This is the approach taken:
- store the lines of file1.txt in a hash
- read file2.txt line by line, parse and extract the second field
- check if the extracted field is in the hash; if so, print the line
Here is the code:
#!/usr/bin/perl -w

use strict;

if (scalar(@ARGV) != 2) {
    printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
    exit(2);
}

my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);

open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file)     || die "Can't open $big_file: " . $!;

# store contents of small file in a hash
while (<$small_fp>) {
    chomp;
    $small_hash{$_} = undef;
}
close($small_fp);

# loop through big file and find matches
while (<$big_fp>) {
    # no need for chomp
    $field = (split(/\|/, $_))[1];
    if (defined($field) && exists($small_hash{$field})) {
        printf("%s", $_);
    }
}
close($big_fp);
exit(0);
I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 12 seconds, producing 126K matches. Here is the time output:
real 0m11.694s
user 0m11.507s
sys 0m0.174s
I ran @Inian's awk code:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
It was way slower than the Perl solution, since it loops 14K times for each line of file2.txt, which is really expensive. It aborted after processing 592K records of file2.txt and producing 40K matched lines. Here is how long it took:
awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
input record number 675280, file file2.txt
source line number 1
real 55m5.539s
user 54m53.080s
sys 0m5.095s
Using @Inian's other awk solution, which eliminates the looping issue:
time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out
real 0m39.966s
user 0m37.916s
sys 0m0.743s
time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out
real 0m41.057s
user 0m38.475s
sys 0m0.904s
awk is very impressive here, given that we didn't have to write an entire program to do it.
I ran @oliv's Python code as well. It took about 15 hours to complete the job and appeared to produce the right results. Building a huge regex isn't as efficient as using a hash lookup. Here is the time output:
real 895m14.862s
user 806m59.219s
sys 1m12.147s
I tried to follow the suggestion to use parallel. However, it failed with an fgrep: memory exhausted error, even with very small block sizes.
What surprised me was that fgrep was totally unsuitable for this. I aborted it after 22 hours, by which point it had produced about 100K matches. I wish fgrep had an option to force the content of the -f file to be kept in a hash, just like the Perl code did.
I didn't check the join approach - I didn't want the additional overhead of sorting the files. Also, given fgrep's poor performance, I don't believe join would have done better than the Perl code.
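For reference, a join attempt would have looked something like the sketch below (hypothetical, using the same file1.txt/file2.txt layout as above: bare keys in file1.txt, and the key in the second pipe-delimited field of file2.txt). The sort step is exactly the overhead mentioned above:

```shell
# join needs both inputs sorted on the join field
sort file1.txt > keys.sorted
sort -t'|' -k2,2 file2.txt > big.sorted
# join field 1 of the keys file against field 2 of the big file
join -t'|' -1 1 -2 2 keys.sorted big.sorted
```

Note that join moves the join field to the front of each output line, so the matched lines would not come out in their original format, unlike the Perl script.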
Thanks everyone for your attention and responses.
Fastest way to find lines in file1 which contains any keywords from file2?
Give this a shot.
Test Data:
%_Host@User> head file1.txt file2.txt
==> file1.txt <==
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user4:x:14598:24:User Four:/users/user4:/bin/bash |
server1:user5:x:14598:24:User Five:/users/user5:/bin/bash |
==> file2.txt <==
user1
user2
user3
#user4
%_Host@User>
Output:
%_Host@User> ./2comp.pl file1.txt file2.txt ; cat output_comp
server1:user1:x:13621:22324:User One:/users/user1:/bin/ksh |
server1:user3:x:14598:24:User three:/users/user3:/bin/bash |
server1:user2:x:14537:100:User two:/users/user2:/bin/bash |
%_Host@User>
%_Host@User>
Script: Please give this one more try. Re-check the file order: file1 first and then file2, i.e. ./2comp.pl file1.txt file2.txt.
%_Host@User> cat 2comp.pl
#!/usr/bin/perl
use strict ;
use warnings ;
use Data::Dumper ;

my ($file2,$file1,$output) = (@ARGV,"output_comp") ;
my (%hash,%tmp) ;

if (scalar @ARGV != 2) { print "Need 2 files!\n" ; exit 1 }

for (@ARGV) {
    # "or" (not "||") so the die actually fires when open fails
    open FH, "<", $_ or die "Cannot open $_\n" ;
    while (my $line = <FH>){$line =~ s/^.+[()].+$| +?$//g ; chomp $line ; $hash{$_}{$line} = "$line"}
    close FH ;
}

open FH, ">>", $output or die "Cannot open outfile!\n" ;
foreach my $k1 (keys %{$hash{$file1}}){
    foreach my $k2 (keys %{$hash{$file2}}){
        if ($k2 =~ m/^.+?$k1.+?$/i){ # Case Insensitive matching.
            if (!defined $tmp{"$hash{$file2}{$k2}"}){
                print FH "$hash{$file2}{$k2}\n" ;
                $tmp{"$hash{$file2}{$k2}"} = 1 ;
            }
        }
    }
}
close FH ;
# End.
%_Host@User>
Thanks, and good luck.
Fastest way to find lines of a file based on column from another larger file in awk
It's a typical awk multiple-file processing problem.
The command (I think the code explains itself):
awk -F',' '
  fname != FILENAME  { fname=FILENAME; idx++ }
  idx == 1           { a[$1] }
  idx > 1 && $1 in a { print > (fname".out") }
' 1.csv C_*.csv
Note:
- I tested it with gawk.
- I simplified your filename a bit: C_*.csv; you can replace it with the real pattern.
- Your CLEAN_*.csv files have leading spaces on each line. I've removed them. If your input data really contains those spaces, we need to handle it. It's pretty easy, but I guess the spaces come from your copy & paste.
- After you run the command, you will get 10 CLEANUP_....csv.out files (if there are 10 CLEANUP_..csv files, as you said), which contain the matched lines.
- If you have a lot of input files, you may want to call close(fname".out") before switching to a new input file. Otherwise you may get a "too many open files" error message.
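Following the last note, a variant with close() might look like this (an untested sketch of the same command, just closing each output file before switching input files):

```shell
awk -F',' '
  fname != FILENAME  { if (fname != "") close(fname".out"); fname=FILENAME; idx++ }
  idx == 1           { a[$1] }
  idx > 1 && $1 in a { print > (fname".out") }
' 1.csv C_*.csv
```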
When I run it with your shortened examples, I get:
kent$ head *.out
==> C_1.csv.out <==
15192151898, 732539481
12042050161, 770565711
==> c_2.csv.out <==
12042050198, 592494702
12042050158, 765193222
Fast way of finding lines in one file that are not in another?
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:
diff --new-line-format="" --unchanged-line-format="" file1 file2
The input files should be sorted for this to work. With bash (and zsh) you can sort in-place with process substitution <( ):
diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)
In the above, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines, in your case) are output. You may also use a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -v etc.) for less strict matching.
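A quick worked example with hypothetical inputs (note that diff exits non-zero whenever it finds differences, which matters in scripts running under set -e):

```shell
printf 'a\nb\nc\n' > file1
printf 'b\nc\nd\n' > file2
# only the lines removed from file1 survive the formatting; here that is "a"
diff --new-line-format="" --unchanged-line-format="" file1 file2
```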
Explanation
The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. They format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.
If you are familiar with unified diff format, you can partly recreate it with:
diff --old-line-format="-%L" --unchanged-line-format=" %L" \
--new-line-format="+%L" file1 file2
The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u (note that it only outputs differences; it lacks the ---, +++ and @@ lines at the top of each grouped change).
You can also use this to do other useful things, like numbering each line with %dn.
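For instance, a sketch that numbers the removed lines (hypothetical inputs; %dn formats the input line number):

```shell
printf 'a\nb\n' > f1
printf 'b\nc\n' > f2
# prefix each removed line with its line number in f1
diff --old-line-format='%dn: %L' --new-line-format='' --unchanged-line-format='' f1 f2
```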
The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort in place. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files and outputs the missing lines in the order they occur in file1.
# output lines in file1 that are not in file2
BEGIN { FS="" } # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; } # file1, index by lineno
(NR!=FNR) { ss2[$0]++; } # file2, index by string
END {
for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
This stores the entire contents of file1, line by line, in a line-number-indexed array ll1[], and the entire contents of file2, line by line, in a line-content-indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line in file1 is present in file2. (This will have different output to the diff method if there are duplicates.)
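A usage sketch with hypothetical inputs (the script is written out with a heredoc so the example is self-contained):

```shell
cat > linesnotin.awk <<'EOF'
BEGIN { FS="" }                      # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; }  # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }             # file2, index by string
END {
  for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
EOF
printf 'a\nb\nc\n' > file1
printf 'b\n' > file2
awk -f linesnotin.awk file1 file2
# prints the file1-only lines in their original order: a, then c
```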
In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
BEGIN { FS="" }
(NR==FNR) { # file1, index by lineno and string
ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) { # file2
if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}
The above stores the entire contents of file1 in two arrays, one indexed by line number (ll1[]) and one indexed by line content (ss1[]). Then, as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end, the remaining lines from file1 are output, preserving the original order.
In this case, with the problem as stated, you can also divide and conquer using GNU split (filtering is a GNU extension): repeated runs with chunks of file1, reading file2 completely each time:
split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1
Note the use and placement of - (meaning stdin) on the gawk command line. This input is provided by split from file1, in chunks of 20000 lines per invocation.
For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain; on OSX, the Apple Xcode tools provide GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.
Stream filter large number of lines that are specified by line number from stdin
Keeping with OP's current idea:
xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -
Where the - (at the end of the line) tells awk to read from stdin (in this case the output from xz that's being piped into the awk call).
Another way to do this (replaces all of the above code):
awk '
FNR==NR { nums[$1]; next } # process first file
FNR in nums # process subsequent file(s)
' select.txt <(xz -dcq huge.txt.xz)
Comments removed and cut down to a 'one-liner':
awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)
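A quick sanity check of the same idea, with plain stdin standing in for the xz output (hypothetical data):

```shell
printf '2\n4\n' > select.txt
printf 'l1\nl2\nl3\nl4\nl5\n' | awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt -
# prints lines 2 and 4: l2, l4
```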
Adding some logic to implement Ed Morton's comment (exit processing once FNR > largest value from select.txt):
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' select.txt <(xz -dcq huge.txt.xz)
NOTES:
- Keeping in mind we're talking about scanning millions of lines of input ...
- FNR > maxFNR will obviously add some cpu/processing time to the overall operation (though less time than FNR in nums).
- If the operation routinely needs to pull rows from, say, the last 25% of the file, then FNR > maxFNR is likely providing little benefit (and probably slowing down the operation).
- If the operation routinely finds all desired rows in, say, the first 50% of the file, then FNR > maxFNR is probably worth the cpu/processing time to keep from scanning the entire input stream (then again, the xz operation on the entire file is likely the biggest time consumer).
- Net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; OP would need to run some tests to see if there's a (noticeable) difference in overall runtime.
Getting one line in a huge file with bash
If all the lines have the same length, the best way by far is to use dd(1) and give it a skip parameter.
Let the block size be the length of each line (including the newline); then you can do:
$ dd if=filename bs=<line-length> skip=<line_no - 1> count=1 2>/dev/null
The idea is to seek past all the previous lines (skip=<line_no - 1>) and read a single line (count=1). Because the block size is set to the line length (bs=<line-length>), each block is effectively a single line. Redirect stderr so you don't get the annoying stats at the end.
That should be much more efficient than streaming every line before the one you want through a program that reads and then throws them away, since dd seeks straight to the position you want in the file and reads only one line of data.
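A concrete sketch with a hypothetical file of fixed 10-byte lines (9 characters plus the newline):

```shell
printf 'line-0001\nline-0002\nline-0003\nline-0004\n' > fixed.txt
# fetch line 3: skip the first 2 ten-byte blocks and read exactly one block
dd if=fixed.txt bs=10 skip=2 count=1 2>/dev/null
# prints: line-0003
```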
Find lines from one file that do not appear (even partially) in another file
Here's one in awk; mawk is probably the fastest, so use that one:
$ awk '
NR==FNR { # process file1
a[$0] # hash all records to memory
next # process next record
}
{ # process file2
for(i in a) # for each file1 entry in memory
if($0 ~ i) # see if it is found in current file2 record
delete a[i] # and delete if found
}
END { # in the end
for(i in a) # all left from file1
print i # are outputted
}' file1 file2 # mind the order
Output:
bbb
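For instance, hypothetical inputs like these reproduce that output:

```shell
printf 'aaa\nbbb\n' > file1
printf 'xxaaayy\nzzz\n' > file2
awk '
NR==FNR { a[$0]; next }                 # hash file1 records
{ for(i in a) if($0 ~ i) delete a[i] }  # drop entries found (even partially) in file2
END { for(i in a) print i }             # print what was never matched
' file1 file2
# prints: bbb
```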