Fast way of finding lines in one file that are not in another?
You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff
output:
diff --new-line-format="" --unchanged-line-format="" file1 file2
The input files should be sorted for this to work. With bash
(and zsh
) you can sort in-place with process substitution <( )
:
diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)
In the above new and unchanged lines are suppressed, so only changed (i.e. removed lines in your case) are output. You may also use a few diff
options that other solutions don't offer, such as -i
to ignore case, or various whitespace options (-E
, -b
, -v
etc) for less strict matching.
Explanation
The options --new-line-format
, --old-line-format
and --unchanged-line-format
let you control the way diff
formats the differences, similar to printf
format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.
If you are familiar with unified diff format, you can partly recreate it with:
diff --old-line-format="-%L" --unchanged-line-format=" %L" \
--new-line-format="+%L" file1 file2
The %L
specifier is the line in question, and we prefix each with "+" "-" or " ", like diff -u
(note that it only outputs differences, it lacks the ---
+++
and @@
lines at the top of each grouped change).
You can also use this to do other useful things like number each line with %dn
.
The diff
method (along with other suggestions comm
and join
) only produce the expected output with sorted input, though you can use <(sort ...)
to sort in place. Here's a simple awk
(nawk) script (inspired by the scripts linked-to in Konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.
# output lines in file1 that are not in file2
BEGIN { FS="" } # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; } # file1, index by lineno
(NR!=FNR) { ss2[$0]++; } # file2, index by string
END {
for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}
This stores the entire contents of file1 line by line in a line-number indexed array ll1[]
, and the entire contents of file2 line by line in a line-content indexed associative array ss2[]
. After both files are read, iterate over ll1
and use the in
operator to determine if the line in file1 is present in file2. (This will have have different output to the diff
method if there are duplicates.)
In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.
BEGIN { FS="" }
(NR==FNR) { # file1, index by lineno and string
ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) { # file2
if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}
The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[]
, one indexed by line content ss1[]
. Then as file2 is read, each matching line is deleted from ll1[]
and ss1[]
. At the end the remaining lines from file1 are output, preserving the original order.
In this case, with the problem as stated, you can also divide and conquer using GNU split
(filtering is a GNU extension), repeated runs with chunks of file1 and reading file2 completely each time:
split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1
Note the use and placement of -
meaning stdin
on the gawk
command line. This is provided by split
from file1 in chunks of 20000 line per-invocation.
For users on non-GNU systems, there is almost certainly a GNU coreutils package you can obtain, including on OSX as part of the Apple Xcode tools which provides GNU diff
, awk
, though only a POSIX/BSD split
rather than a GNU version.
Finding contents of one file in another file
grep
itself is able to do so. Simply use the flag -f
:
grep -f <patterns> <file>
<patterns>
is a file containing one pattern in each line; and <file>
is the file in which you want to search things.
Note that, to force grep
to consider each line a pattern, even if the contents of each line look like a regular expression, you should use the flag -F, --fixed-strings
.
grep -F -f <patterns> <file>
If your file is a CSV, as you said, you may do:
grep -f <(tr ',' '\n' < data.csv) <file>
As an example, consider the file "a.txt", with the following lines:
alpha
0891234
beta
Now, the file "b.txt", with the lines:
Alpha
0808080
0891234
bEtA
The output of the following command is:
grep -f "a.txt" "b.txt"
0891234
You don't need at all to for
-loop here; grep
itself offers this feature.
Now using your file names:
#!/bin/bash
patterns="/home/nimish/contents.txt"
search="/home/nimish/another_file.csv"
grep -f <(tr ',' '\n' < "${patterns}") "${search}"
You may change ','
to the separator you have in your file.
How to find words from one file in another file?
You can use grep -f
:
grep -Ff "first-file" "second-file"
OR else to match full words:
grep -w -Ff "first-file" "second-file"
UPDATE: As per the comments:
awk 'FNR==NR{a[$1]; next} ($1 in a){delete a[$1]; print $1}' file1 file2
Find content of one file from another file in UNIX
One way with awk
:
awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2
This should be pretty quick. On my machine, it took under 2 seconds to create a lookup of 1 million entries and compare it against 3 million lines.
Machine Specs:
Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (8 cores)
98 GB RAM
Find lines from a file which are not present in another file
The command you have to use is not diff
but comm
comm -23 a.txt b.txt
By default, comm
outputs 3 columns: left-only, right-only, both. The -1
, -2
and -3
switches suppress these columns.
So, -23
hides the right-only and both columns, showing the lines that appear only in the first (left) file.
If you want to find lines that appear in both, you can use -12
, which hides the left-only and right-only columns, leaving you with just the both column.
Deleting lines from one file which are in another file
grep -v -x -f f2 f1
should do the trick.
Explanation:
-v
to select non-matching lines-x
to match whole lines only-f f2
to get patterns fromf2
One can instead use grep -F
or fgrep
to match fixed strings from f2
rather than patterns (in case you want remove the lines in a "what you see if what you get" manner rather than treating the lines in f2
as regex patterns).
Check if all of multiple strings or regexes exist in a file
Awk is the tool that the guys who invented grep, shell, etc. invented to do general text manipulation jobs like this so not sure why you'd want to try to avoid it.
In case brevity is what you're looking for, here's the GNU awk one-liner to do just what you asked for:
awk 'NR==FNR{a[$0];next} {for(s in a) if(!index($0,s)) exit 1}' strings RS='^$' file
And here's a bunch of other information and options:
Assuming you're really looking for strings, it'd be:
awk -v strings='string1 string2 string3' '
BEGIN {
numStrings = split(strings,tmp)
for (i in tmp) strs[tmp[i]]
}
numStrings == 0 { exit }
{
for (str in strs) {
if ( index($0,str) ) {
delete strs[str]
numStrings--
}
}
}
END { exit (numStrings ? 1 : 0) }
' file
the above will stop reading the file as soon as all strings have matched.
If you were looking for regexps instead of strings then with GNU awk for multi-char RS and retention of $0 in the END section you could do:
awk -v RS='^$' 'END{exit !(/regexp1/ && /regexp2/ && /regexp3/)}' file
Actually, even if it were strings you could do:
awk -v RS='^$' 'END{exit !(index($0,"string1") && index($0,"string2") && index($0,"string3"))}' file
The main issue with the above 2 GNU awk solutions is that, like @anubhava's GNU grep -P solution, the whole file has to be read into memory at one time whereas with the first awk script above, it'll work in any awk in any shell on any UNIX box and only stores one line of input at a time.
I see you've added a comment under your question to say you could have several thousand "patterns". Assuming you mean "strings" then instead of passing them as arguments to the script you could read them from a file, e.g. with GNU awk for multi-char RS and a file with one search string per line:
awk '
NR==FNR { strings[$0]; next }
{
for (string in strings)
if ( !index($0,string) )
exit 1
}
' file_of_strings RS='^$' file_to_be_searched
and for regexps it'd be:
awk '
NR==FNR { regexps[$0]; next }
{
for (regexp in regexps)
if ( $0 !~ regexp )
exit 1
}
' file_of_regexps RS='^$' file_to_be_searched
If you don't have GNU awk and your input file does not contain NUL characters then you can get the same effect as above by using RS='\0'
instead of RS='^$'
or by appending to variable one line at a time as it's read and then processing that variable in the END section.
If your file_to_be_searched is too large to fit in memory then it'd be this for strings:
awk '
NR==FNR { strings[$0]; numStrings=NR; next }
numStrings == 0 { exit }
{
for (string in strings) {
if ( index($0,string) ) {
delete strings[string]
numStrings--
}
}
}
END { exit (numStrings ? 1 : 0) }
' file_of_strings file_to_be_searched
and the equivalent for regexps:
awk '
NR==FNR { regexps[$0]; numRegexps=NR; next }
numRegexps == 0 { exit }
{
for (regexp in regexps) {
if ( $0 ~ regexp ) {
delete regexps[regexp]
numRegexps--
}
}
}
END { exit (numRegexps ? 1 : 0) }
' file_of_regexps file_to_be_searched
Fastest way to find lines of a file from another larger file in Bash
A small piece of Perl code solved the problem. This is the approach taken:
- store the lines of
file1.txt
in a hash - read
file2.txt
line by line, parse and extract the second field - check if the extracted field is in the hash; if so, print the line
Here is the code:
#!/usr/bin/perl -w
use strict;
if (scalar(@ARGV) != 2) {
printf STDERR "Usage: fgrep.pl smallfile bigfile\n";
exit(2);
}
my ($small_file, $big_file) = ($ARGV[0], $ARGV[1]);
my ($small_fp, $big_fp, %small_hash, $field);
open($small_fp, "<", $small_file) || die "Can't open $small_file: " . $!;
open($big_fp, "<", $big_file) || die "Can't open $big_file: " . $!;
# store contents of small file in a hash
while (<$small_fp>) {
chomp;
$small_hash{$_} = undef;
}
close($small_fp);
# loop through big file and find matches
while (<$big_fp>) {
# no need for chomp
$field = (split(/\|/, $_))[1];
if (defined($field) && exists($small_hash{$field})) {
printf("%s", $_);
}
}
close($big_fp);
exit(0);
I ran the above script with 14K lines in file1.txt and 1.3M lines in file2.txt. It finished in about 13 seconds, producing 126K matches. Here is the time
output for the same:
real 0m11.694s
user 0m11.507s
sys 0m0.174s
I ran @Inian's awk
code:
awk 'FNR==NR{hash[$1]; next}{for (i in hash) if (match($0,i)) {print; break}}' file1.txt FS='|' file2.txt
It was way slower than the Perl solution, since it is looping 14K times for each line in file2.txt - which is really expensive. It aborted after processing 592K records of file2.txt
and producing 40K matched lines. Here is how long it took:
awk: illegal primary in regular expression 24/Nov/2016||592989 at 592989
input record number 675280, file file2.txt
source line number 1
real 55m5.539s
user 54m53.080s
sys 0m5.095s
Using @Inian's other awk
solution, which eliminates the looping issue:
time awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk1.out
real 0m39.966s
user 0m37.916s
sys 0m0.743s
time LC_ALL=C awk -F '|' 'FNR==NR{hash[$1]; next}$2 in hash' file1.txt FS='|' file2.txt > awk.out
real 0m41.057s
user 0m38.475s
sys 0m0.904s
awk
is very impressive here, given that we didn't have to write an entire program to do it.
I ran @oliv's Python code as well. It took about 15 hours to complete the job, and looked like it produced the right results. Building a huge regex isn't as efficient as using a hash lookup. Here the time
output:
real 895m14.862s
user 806m59.219s
sys 1m12.147s
I tried to follow the suggestion to use parallel. However, it failed with fgrep: memory exhausted
error, even with very small block sizes.
What surprised me was that fgrep
was totally unsuitable for this. I aborted it after 22 hours and it produced about 100K matches. I wish fgrep
had an option to force the content of -f file
to be kept in a hash, just like what the Perl code did.
I didn't check join
approach - I didn't want the additional overhead of sorting the files. Also, given fgrep
's poor performance, I don't believe join
would have done better than the Perl code.
Thanks everyone for your attention and responses.
Related Topics
Reusing Custom Makefile for Static Library with Cmake
Set Environment Variables in an Aws Instance
On-The-Fly Output Redirection, Seeing The File Redirection Output While The Program Is Still Running
What Is The Fastest Way to Display an Image in Qt on X11 Without Opengl
Does File ".Bash_History" Always Record Every Command I Ever Issue
Replace System Call in Linux Kernel 3
Tar Error: Unexpected Eof in Archive
Why Is The Close Function Is Called Release in 'struct File_Operations' in The Linux Kernel
Recommended Irc Server (Ircd) for a Small Site
How to Find The Reason for a Dead Process Without Log File on Unix
Surprise! The Shell Suggests Command Line Switches
Bash Loop Through Directory Including Hidden File
How to Remove Special Characters in File Names
Running a Script Just Before Installation of Debian Finishes Using Preseed
Find The Depth of The Current Path