How to Search For a Multiline Pattern in a File

How can I search for a multiline pattern in a file?

pcregrep, a grep that uses Perl Compatible Regular Expressions (PCRE), can do this.

Its -M option makes it possible to search for patterns that span line boundaries.

For example, to find files where the '_name' variable is followed on the next line by the '_description' variable:

find . -iname '*.py' | xargs pcregrep -M '_name.*\n.*_description'

Tip: you need to include the line break character in your pattern. Depending on your platform, it could be '\n', '\r', or '\r\n'.
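For comparison, the same search can be sketched in Python. This is only an illustration, not part of the original answer: the names PATTERN, has_match and files_with_match are made up here, and \r?\n is an assumption that covers '\n' and '\r\n' line breaks (a lone '\r' is not handled). Note that rglob('*.py') is case-sensitive on most Linux filesystems, unlike find -iname.

```python
import re
from pathlib import Path

# '_name' on one line and '_description' on the next.
# \r?\n accepts both '\n' and '\r\n' line breaks.
PATTERN = re.compile(r"_name[^\r\n]*\r?\n[^\r\n]*_description")


def has_match(text):
    """Return True if text contains the two-line pattern."""
    return PATTERN.search(text) is not None


def files_with_match(root="."):
    """Yield the *.py files under root whose contents match PATTERN."""
    for path in Path(root).rglob("*.py"):
        if not path.is_file():
            continue
        # newline="" keeps the file's own line endings untouched
        with open(path, newline="", encoding="utf-8", errors="replace") as fh:
            if has_match(fh.read()):
                yield path
```

Calling list(files_with_match('.')) would then give roughly the set of files that the find/pcregrep pipeline reports.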

Search for multiline String in a text file

Use a StringBuilder for that: read every line from the file and append it to the StringBuilder, followed by the line separator.

StringBuilder lineInFile = new StringBuilder();
String s;

// br is assumed to be a BufferedReader already opened on the file
while ((s = br.readLine()) != null) {
    lineInFile.append(s).append(System.lineSeparator());
}

Now build the search string and check whether lineInFile contains it:

StringBuilder searchString = new StringBuilder();

searchString.append("line one");
searchString.append(System.lineSeparator());
searchString.append("line two");

System.out.println(lineInFile.toString().contains(searchString));

Search a text file for a multi line string and return line number in Python

Use .count() and the match object to count the number of newlines before the match:

import re

with open('example.txt', 'r') as file:
    content = file.read()
match = re.search('second line\nThis third line', content)
if match:
    print('Found a match starting on line', content.count('\n', 0, match.start()))

match.start() is the position of the start of the match in content.

content.count('\n', 0, match.start()) counts the number of newlines in content between character position 0 and the start of the match.

Use 1 + content.count('\n', 0, match.start()) if you prefer line numbers to start at 1 instead of 0.
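If the pattern can occur more than once, the same counting idea extends to every match. The following is a small sketch, assuming 1-based line numbers are wanted; match_line_numbers and the sample text are made up for illustration.

```python
import re


def match_line_numbers(content, pattern):
    """Return the 1-based line number on which each match of pattern starts."""
    return [
        1 + content.count("\n", 0, m.start())  # newlines before the match
        for m in re.finditer(pattern, content)
    ]


sample = "first line\nsecond line\nThis third line\nfourth\n"
print(match_line_numbers(sample, "second line\nThis third line"))  # prints [2]
```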

How to find patterns across multiple lines using grep?

Grep is an awkward tool for this operation.

pcregrep, which is available on most modern Linux systems, can be used like this:

pcregrep -M  'abc.*(\n|.)*efg' test.txt

where -M (--multiline) allows patterns to match more than one line.
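The (\n|.)* part of that pattern exists only to let the match cross newlines. As a rough Python counterpart (a sketch, not the answer's own method), re.DOTALL makes . match newlines directly; the function name spans_abc_to_efg is hypothetical.

```python
import re


def spans_abc_to_efg(text):
    """Return the text from the first 'abc' to the last 'efg', or None."""
    # re.DOTALL lets '.' match newlines, so '.*' can span several lines.
    m = re.search(r"abc.*efg", text, flags=re.DOTALL)
    return m.group(0) if m else None
```

Because .* is greedy, the match runs to the last 'efg' in the text; use .*? instead to stop at the first one.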

There is a newer pcre2grep also. Both are provided by the PCRE project.

pcre2grep is available for macOS via MacPorts as part of the pcre2 port:

% sudo port install pcre2 

and via Homebrew, for the older pcregrep:

% brew install pcre

or, for pcre2grep:

% brew install pcre2

pcre2grep is also available on Linux (Ubuntu 18.04+)

$ sudo apt install pcre2-utils # PCRE2
$ sudo apt install pcregrep # Older PCRE

grep (bash) multi-line pattern

With GNU grep:

grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file | grep .

Output:


>chr2
TTGNACACCC
TGGGGGAGTA
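The same extraction can be sketched in Python with the identical character class. This is an illustration under assumptions: extract_record is a made-up name, and the whole file is assumed to fit in memory as one string. The trailing newline is stripped, which is roughly the job "| grep ." does in the pipeline above.

```python
import re


def extract_record(fasta_text, name="chr2"):
    """Return the '>name' header plus its sequence lines, or None if absent."""
    # Header line, then any run of sequence letters and newlines.
    pattern = ">" + re.escape(name) + r"\n[AGCTNacgtn\n]+"
    m = re.search(pattern, fasta_text)
    return m.group(0).rstrip("\n") if m else None
```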

Check for multi-line content in a file

Following up on a comment from Cyrus, who pointed to "How to know if a text file is a subset of another", the following Python one-liner does the trick:

python -c "content=open('content').read(); target=open('target').read(); exit(0 if content in target else 1);"

Regex (grep) for multi-line search needed

You can do a multiline search with GNU grep itself, without installing the grep variant pcregrep.

$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c

Explanation:

-P activates Perl-compatible regular expressions for grep (a powerful extension of regular expressions).

-z treats the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware that, combined with -o, each printed match is followed by a NUL character.

-o prints only the matching part. Because we're using -z, the whole file is like a single big line, so without -o a match would print the entire file.

In regexp:

(?s) activates PCRE_DOTALL, which means that . matches any character, including a newline

\N matches anything except a newline, even with PCRE_DOTALL activated

.*? matches any characters in non-greedy mode, that is, it stops as soon as possible.

^ matches the start of a line

\1 is a backreference to the first group (\s*). It is an attempt to match the closing brace at the same indentation as the start of the function.

As you can imagine, this search prints the main function of each C (*.c) source file.
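The same ideas map onto Python's re module, with two substitutions that are assumptions of this sketch rather than part of the answer: Python has no \N shorthand for "any character except newline", so [^\n]* is used, and the indentation group is narrowed to [ \t]* so that it cannot swallow newlines. re.DOTALL plays the role of (?s), and re.MULTILINE makes ^ match at the start of every line. MAIN_BLOCK and find_main are hypothetical names.

```python
import re

# ^([ \t]*)   indentation of the line that mentions main (group 1)
# [^\n]*main  the rest of that line up to 'main'
# .*?\{       shortest stretch up to the opening brace
# .*?^\1\}    shortest stretch up to a '}' at the same indentation
MAIN_BLOCK = re.compile(
    r"^([ \t]*)[^\n]*main.*?\{.*?^\1\}",
    re.DOTALL | re.MULTILINE,
)


def find_main(source):
    """Return the text of the main function in C source, or None."""
    m = MAIN_BLOCK.search(source)
    return m.group(0) if m else None
```

Like the grep version, this relies on consistent indentation and is not a real C parser.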

Using grep or other command to return the line number of a multiline pattern

Suppose that the pattern is stored in a file named pattern, like this:

$ cat pattern
op_3b : 001
ctrl_2b : 00
ini_count : 0

Then, try:

$ awk '$0 ~ pat' RS=  pat="$(cat pattern)" logfile
Packet LP
op_3b : 001
ctrl_2b : 00
ini_count : 0

How it works

  • RS=

    This sets the Record Separator RS to an empty string. This tells awk to use an empty line as the record separator.

  • pat="$(cat pattern)"

    This tells awk to create an awk variable pat which contains the contents of the file pattern.

    If your shell is bash, then a slightly more efficient form of this command would be pat="$(<pattern)". (Don't use this unless you are sure that your shell is bash.)

  • $0 ~ pat

    This tells awk to print any record that matches the pattern.

    $0 is the contents of the current record. ~ tells awk to do a match between the text in $0 and the regular expression in pat.

    (If the contents of pattern contained any regex-active characters, we would need to escape them. Your current example does not have any, so this is not a problem.)
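The three pieces above (blank-line record separator, pattern taken from a file, regex match per record) can be sketched in Python as follows. This is an illustration only: matching_records is a made-up name, and records are assumed to be separated by one or more completely empty lines, as in awk's paragraph mode.

```python
import re


def matching_records(text, pattern):
    """Return the blank-line-separated records in which pattern matches."""
    # Like RS= in awk: runs of empty lines separate the records.
    records = [r for r in re.split(r"\n{2,}", text.strip("\n")) if r]
    # Like '$0 ~ pat': keep the records the regular expression matches.
    return [r for r in records if re.search(pattern, r)]
```

A call such as matching_records(open('logfile').read(), open('pattern').read().rstrip('\n')) would mirror the awk command; the rstrip matters because $(cat pattern) also drops trailing newlines.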

Alternative style

Some people prefer a different style for defining awk variables:

$ awk -v RS=  -v pat="$(cat pattern)" '$0 ~ pat' logfile
Packet LP
op_3b : 001
ctrl_2b : 00
ini_count : 0

This works the same.

Displaying line numbers

$ awk -F'\n' '$0 ~ pat{print "Line Number="n+1; print "Packet" $0} {n=n+NF-1}' RS='Packet'  pat="$(cat pattern)" logfile
Line Number=20
Packet LP
op_3b : 001
ctrl_2b : 00
ini_count : 0
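When the pattern is plain text rather than a regular expression, a substring search avoids the escaping issue altogether. The sketch below is an assumption-laden alternative, not the awk answer's method: literal_block_line is a hypothetical name, it reports the line on which the block itself starts (not the preceding "Packet" line), and it does not check that the block begins at the start of a line.

```python
def literal_block_line(text, block):
    """Return the 1-based line number where block first occurs, or None."""
    pos = text.find(block)
    if pos == -1:
        return None
    # Every newline before the match position is one completed line.
    return 1 + text.count("\n", 0, pos)
```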

Perl regex: How do I search a file for a multiline pattern without reading the whole file into memory?

You can use the range operator to match everything between two patterns while reading line-by-line:

use strict;
use warnings 'all';

while (<DATA>) {
    print if /^assign / .. /;/;
}

__DATA__
foo
assign signal0 = (cond1) ? val1 :
(cond2) ? val2 :
val3;
bar
assign signal1[15:0] = {input1[7:0], input2[7:0]};
baz
assign signal2[34:0] = { 4'b0,
subsig0[3:0],
subsig1,
subsig2,
subsig3[18:2],
subsig4[5:0]
};
qux

Output:

assign signal0 = (cond1) ? val1 :
(cond2) ? val2 :
val3;
assign signal1[15:0] = {input1[7:0], input2[7:0]};
assign signal2[34:0] = { 4'b0,
subsig0[3:0],
subsig1,
subsig2,
subsig3[18:2],
subsig4[5:0]
};
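Perl's range operator behaves like a small on/off switch, and the same line-by-line approach can be sketched in Python with an explicit flag. This is an illustration, not a translation endorsed by the answer; assign_statements is a made-up name. As with Perl's two-dot form, the end condition is also tested on the line that switches the range on, so a one-line statement is handled.

```python
import re


def assign_statements(lines):
    """Yield each line from one starting with 'assign ' through the next line containing ';'."""
    inside = False
    for line in lines:
        if not inside and re.match(r"assign ", line):
            inside = True  # range switches on
        if inside:
            yield line
            if ";" in line:
                inside = False  # range switches off, possibly on the same line
```

Passing an open file object as lines keeps memory use to one line at a time.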

