How to Check If One File Is Part of Another

How to check if one file is part of another?

I have a working version using perl.

I thought I had it working with GNU awk, but I didn't: setting RS to the empty string puts awk into paragraph mode, splitting records on blank lines rather than on my pattern. See the edit history for the broken awk version.

How can I search for a multiline pattern in a file? shows how to use pcregrep, but I can't see a way to make it work when the pattern to search for may contain regex special characters. Its -F fixed-string mode doesn't usefully combine with multi-line mode: it still treats the pattern as a set of lines to be matched separately, not as one multi-line fixed string. I see you were already using pcregrep in your attempt.
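One workaround is to backslash-escape the metacharacters yourself before handing the literal pattern to pcregrep -M. A sketch with a hypothetical helper (it leaves ] and } alone, which PCRE tolerates as literals):

```shell
# escape_re: backslash-escape PCRE metacharacters so a literal,
# possibly multi-line pattern can be fed to pcregrep -M (sketch only)
escape_re() {
    printf '%s' "$1" | sed 's/[.[\*^$()+?{|\\]/\\&/g'
}
escape_re 'a.b*c'   # prints: a\.b\*c
```

Then something like pcregrep -qM "$(escape_re "$pattern")" file might work, though note that command substitution strips trailing newlines from the pattern.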

BTW, I think you have a bug in your code in the non-sudo case:

function writeToFile {
    if [ -w "$1" ] ; then
        "$2" >> "$1"    # probably you mean: echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}
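For reference, a corrected sketch (the missing echo added; printf '%b\n' keeps the backslash-escape behaviour of echo -e but is portable to POSIX sh — keep echo -e if you prefer the original's exact form):

```shell
# append $2 to file $1, using sudo only when we lack write permission
writeToFile() {
    if [ -w "$1" ]; then
        printf '%b\n' "$2" >> "$1"
    else
        printf '%b\n' "$2" | sudo tee -a "$1" > /dev/null
    fi
}
touch demo.txt                 # demo.txt is a stand-in target file
writeToFile demo.txt "hello"   # writable, so no sudo is invoked
```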

Anyway, attempts at using line-based tools have met with failure, so it's time to pull out a more serious programming language that doesn't force the newline convention on us. Just read both files into variables, and use a non-regex search:

#!/usr/bin/perl -w
# multi_line_match.pl pattern_file target_file
# exit(0) if a match is found, else exit(1)

use File::Slurp;

my $pat    = read_file($ARGV[0]);
my $target = read_file($ARGV[1]);

if ((substr($target, 0, length($pat)) eq $pat)
    or index($target, "\n".$pat) >= 0) {
    exit(0);
}
exit(1);

See What is the best way to slurp a file into a string in Perl? to avoid the dependency on File::Slurp (which isn't part of the standard perl distribution, or of a default Ubuntu 15.04 system). I went with File::Slurp partly so that non-perl-geeks can read what the program is doing, compared to:

my $contents = do { local(@ARGV, $/) = $file; <> };

I was working on avoiding reading the full file into memory, with an idea from http://www.perlmonks.org/?node_id=98208. I think non-matching cases would usually still read the whole file at once. Also, the logic was pretty complex for handling a match at the front of the file, and I didn't want to spend a long time testing to make sure it was correct for all cases. Here's what I had before giving up:

use IO::File;

#IO::File->input_record_separator($pat);
$/ = $pat; # pat must include a trailing newline if you want it to match one

my $fh = IO::File->new($ARGV[1], O_RDONLY)   # the target file
    or die 'Could not open file ', $ARGV[1], ": $!";

$tail = substr($fh->getline, -1); # fast-forward to the first match
# print each occurrence in the file:
#print IO::File->input_record_separator while $fh->getline;

#FIXME: something clever here to handle the case where $pat matches at the beginning of the file.
do {
    # FIXME: need to check defined($fh->getline)
    if (($tail eq "\n") or ($tail = substr($fh->getline, -1))) {
        exit(0); # if there's a 2nd line
    }
} while ($tail);

$fh->close;
exit(1);

Another idea was to filter the patterns and the files to be searched through tr '\n' '\r' or similar, so they would all become single lines. (\r being a likely-safe choice that wouldn't collide with anything already in a file or a pattern.)
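A sketch of that idea: map newlines to carriage returns in both pattern and target, after which a plain fixed-string grep works (note this matches the pattern anywhere, not only at line boundaries):

```shell
# trailing \r survives command substitution (only trailing newlines are stripped)
pat=$(printf 'foo\nbar\n' | tr '\n' '\r')
printf 'x\nfoo\nbar\ny\n' | tr '\n' '\r' | grep -qF "$pat" && echo match
```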

How to check if one file is inside another?

You can split on /:

boolean areSubsets(File f1, File f2) throws IOException {
    String[] p = f1.getCanonicalPath().split("/");
    String[] q = f2.getCanonicalPath().split("/");
    for (int i = 0; i < p.length && i < q.length; i++)
        if (!p[i].equals(q[i]))
            return false;
    return true;
}

Based on fge's comment, in Java 7 you can do the following:

boolean areSubsets(File f1, File f2) {
    Path path1 = f1.toPath();
    Path path2 = f2.toPath();
    return path1.startsWith(path2) || path2.startsWith(path1);
}
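For comparison, the same containment test can be sketched in shell with realpath (a hypothetical helper; assumes a realpath binary is available):

```shell
# succeed when either canonical path is a component-wise prefix of the
# other, mirroring the startsWith() check above
is_inside() {
    a=$(realpath "$1") && b=$(realpath "$2") || return 2
    case "$a/" in "$b"/*) return 0 ;; esac
    case "$b/" in "$a"/*) return 0 ;; esac
    return 1
}
mkdir -p sandbox/sub
is_inside sandbox/sub sandbox && echo contained
```

Appending "/" before comparing avoids false positives like /a/bc being "inside" /a/b.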

Bash - Check if line in one file exists in another file

You can use awk:

awk -F '[, ]' 'FNR==NR{col1[$1]; next} $1 in col1{print $2}' a.txt b.txt
000000
000001
000003
000004
000005
000007
000008
000009
000011
000012
000013
000016
000017

bash text search: find if the content of one file exists in another file

Try grep:

grep -f a.txt b.txt

(There's no need for cat; grep takes the file to search as an argument.)
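If the lines of a.txt should match whole lines literally (not as regexes or substrings), add -F and -x. A quick sketch with stand-in files:

```shell
# -F: fixed strings, -x: whole-line match; a.txt/b.txt are demo stand-ins
printf 'foo\n'       > a.txt
printf 'foo\nfood\n' > b.txt
grep -Fxf a.txt b.txt    # prints only: foo (food is not a whole-line match)
```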

Fast way of finding lines in one file that are not in another?

You can achieve this by controlling the formatting of the old/new/unchanged lines in GNU diff output:

diff --new-line-format="" --unchanged-line-format=""  file1 file2

The input files must be sorted for this to work. With bash (and zsh) you can sort on the fly with process substitution <( ):

diff --new-line-format="" --unchanged-line-format="" <(sort file1) <(sort file2)

In the above, new and unchanged lines are suppressed, so only changed lines (i.e. removed lines, in your case) are output. You also get a few diff options that other solutions don't offer, such as -i to ignore case, or various whitespace options (-E, -b, -w etc.) for less strict matching.
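A quick demonstration with throwaway files (GNU diff assumed; unspecified line kinds fall back to printing the line itself, so only removed lines appear):

```shell
printf 'a\nb\nc\n' > file1
printf 'b\nd\n'    > file2
# prints a and c: the lines of file1 that are not in file2
diff --new-line-format="" --unchanged-line-format="" file1 file2
```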


Explanation

The options --new-line-format, --old-line-format and --unchanged-line-format let you control the way diff formats the differences, similar to printf format specifiers. These options format new (added), old (removed) and unchanged lines respectively. Setting one to empty "" prevents output of that kind of line.

If you are familiar with unified diff format, you can partly recreate it with:

diff --old-line-format="-%L" --unchanged-line-format=" %L" \
--new-line-format="+%L" file1 file2

The %L specifier is the line in question, and we prefix each with "+", "-" or " ", like diff -u
(note that it only outputs the differences; it lacks the ---, +++ and @@ lines at the top of each grouped change).
You can also use it for other useful things, like numbering each line with %dn.


The diff method (along with the other suggestions, comm and join) only produces the expected output with sorted input, though you can use <(sort ...) to sort on the fly. Here's a simple awk (nawk) script (inspired by the scripts linked to in Konsolebox's answer) which accepts arbitrarily ordered input files, and outputs the missing lines in the order they occur in file1.

# output lines in file1 that are not in file2
BEGIN { FS="" }                     # preserve whitespace
(NR==FNR) { ll1[FNR]=$0; nl1=FNR; } # file1, index by lineno
(NR!=FNR) { ss2[$0]++; }            # file2, index by string
END {
    for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll]
}

This stores the entire contents of file1 line by line in a line-number-indexed array ll1[], and the entire contents of file2 line by line in a line-content-indexed associative array ss2[]. After both files are read, iterate over ll1 and use the in operator to determine if the line from file1 is present in file2. (This will have different output from the diff method if there are duplicates.)
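The same script run inline, with tiny stand-in files:

```shell
printf 'a\nb\nc\n' > file1
printf 'b\n'       > file2
# prints a and c: lines of file1 missing from file2, in file1 order
awk 'BEGIN { FS="" }
     (NR==FNR) { ll1[FNR]=$0; nl1=FNR; }
     (NR!=FNR) { ss2[$0]++; }
     END { for (ll=1; ll<=nl1; ll++) if (!(ll1[ll] in ss2)) print ll1[ll] }' \
    file1 file2
```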

In the event that the files are sufficiently large that storing them both causes a memory problem, you can trade CPU for memory by storing only file1 and deleting matches along the way as file2 is read.

BEGIN { FS="" }
(NR==FNR) { # file1, index by lineno and string
    ll1[FNR]=$0; ss1[$0]=FNR; nl1=FNR;
}
(NR!=FNR) { # file2
    if ($0 in ss1) { delete ll1[ss1[$0]]; delete ss1[$0]; }
}
END {
    for (ll=1; ll<=nl1; ll++) if (ll in ll1) print ll1[ll]
}

The above stores the entire contents of file1 in two arrays, one indexed by line number ll1[], one indexed by line content ss1[]. Then as file2 is read, each matching line is deleted from ll1[] and ss1[]. At the end the remaining lines from file1 are output, preserving the original order.

In this case, with the problem as stated, you can also divide and conquer using GNU split (the --filter option is a GNU extension): repeated runs with chunks of file1, reading file2 completely each time:

split -l 20000 --filter='gawk -f linesnotin.awk - file2' < file1

Note the use and placement of -, meaning stdin, on the gawk command line. split feeds it chunks of file1, 20000 lines per invocation.
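To see the chunking in action, here's a toy run with wc -l standing in for the gawk filter (GNU split assumed):

```shell
# five lines split into chunks of two; each chunk arrives on the
# filter command's stdin, so wc -l reports 2, 2, 1
printf '%s\n' a b c d e | split -l 2 --filter='wc -l'
```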

For users on non-GNU systems, there is almost certainly a GNU coreutils package you can install. On OSX, the Apple Xcode tools provide GNU diff and awk, though only a POSIX/BSD split rather than the GNU version.

Checking if strings in one file occur in another set of files, list those that don't

Since you use fgrep, which is synonymous with grep -F, we know that the pattern file contains fixed strings. To find which patterns did not match, you can use the following method:

$ grep -oFf pattern_file search_file | grep -voFf - pattern_file

In case of the OP, this becomes:

$ grep -oFf [VLAN-File] [MAC-File] | grep -voFf - [VLAN-File]

You can also do this with awk in a single go:

$ awk '(NR==FNR){a[$0];next}($2 in a){a[$2]++}END{for(i in a) if (a[i]==0) print i}' [VLAN-File] [MAC-File]

The above works for exact matches, so there is no need for the extra spaces. If you want to keep the extra spaces, it is a bit trickier:

$ awk '(NR==FNR){a[$0];next}
       {for(i in a) if (i ~ $0) a[i]++}
       END{for(i in a) if (a[i]==0) print i}' [VLAN-File] [MAC-File]

All of the above will print the VLAN-File entries that do not appear in the MAC-File.
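A small sanity check of the single-pass awk version, with made-up data standing in for [VLAN-File] and [MAC-File]:

```shell
printf 'foo\nbar\n'  > vlan.txt    # stand-in for [VLAN-File]
printf '00:11 foo\n' > mac.txt     # stand-in for [MAC-File]; entry in column 2
# prints bar: the only VLAN entry never seen in column 2 of mac.txt
awk '(NR==FNR){a[$0];next}($2 in a){a[$2]++}END{for(i in a) if (a[i]==0) print i}' \
    vlan.txt mac.txt
```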


