How can I search for a multiline pattern in a file?
I discovered pcregrep, which stands for Perl Compatible Regular Expressions GREP.
Its -M option makes it possible to search for patterns that span line boundaries.
For example, to find files where the '_name' variable is followed on the next line by the '_description' variable:
find . -iname '*.py' | xargs pcregrep -M '_name.*\n.*_description'
Tip: you need to include the line break character in your pattern. Depending on your platform, it could be '\n', '\r', or '\r\n'.
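If pcregrep is not installed, the same search can be sketched in Python. This is only a minimal sketch under stated assumptions: the function name find_files_with_pattern is made up for illustration, and the pattern uses '\r?\n' so that both '\n' and '\r\n' line breaks are accepted.

```python
import re
from pathlib import Path

# Assumption: "\r?\n" accepts both Unix (\n) and Windows (\r\n) line breaks.
PATTERN = re.compile(r"_name.*\r?\n.*_description")

def find_files_with_pattern(root, pattern=PATTERN):
    """Return sorted paths of *.py files under root whose text matches pattern."""
    matches = []
    # Note: rglob("*.py") is case-sensitive on most Unix systems, unlike find -iname.
    for path in Path(root).rglob("*.py"):
        # newline="" keeps the original line endings instead of translating them.
        with open(path, "r", encoding="utf-8", errors="replace", newline="") as fh:
            if pattern.search(fh.read()):
                matches.append(str(path))
    return sorted(matches)
```

Calling find_files_with_pattern('.') would then list the matching files, roughly like the find | xargs pipeline above.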
Search for multiline String in a text file
Use a StringBuilder: read every line from the file and append each one to the StringBuilder, followed by System.lineSeparator().
// br is assumed to be an open BufferedReader over the file
StringBuilder lineInFile = new StringBuilder();
String s;
while ((s = br.readLine()) != null) {
    lineInFile.append(s).append(System.lineSeparator());
}
Now check whether lineInFile contains searchString by using contains:
StringBuilder searchString = new StringBuilder();
searchString.append("line one");
searchString.append(System.lineSeparator());
searchString.append("line two");
System.out.println(lineInFile.toString().contains(searchString));
Search a text file for a multi line string and return line number in Python
Use .count() and the match object to count the number of newlines before the match:
import re
with open('example.txt', 'r') as file:
    content = file.read()
match = re.search('second line\nThis third line', content)
if match:
    print('Found a match starting on line', content.count('\n', 0, match.start()))
match.start() is the position of the start of the match in content.
content.count('\n', 0, match.start()) counts the number of newlines in content between character position 0 and the start of the match.
Use 1 + content.count('\n', 0, match.start()) if you prefer line numbers to start at 1 instead of 0.
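If the string can occur more than once, the same counting trick works with re.finditer. A minimal sketch, where the helper name find_line_numbers and the sample text are made up for illustration:

```python
import re

def find_line_numbers(content, pattern):
    """Return the 1-based starting line number of every match of pattern."""
    return [
        1 + content.count("\n", 0, match.start())
        for match in re.finditer(pattern, content)
    ]

# Hypothetical file contents with the two-line string occurring twice.
text = "first line\nsecond line\nThis third line\nsecond line\nThis third line\n"
print(find_line_numbers(text, r"second line\nThis third line"))  # prints [2, 4]
```

Counting from position 0 for every match is quadratic in the worst case, which should be fine for ordinary files but may matter for very large ones.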
How to find patterns across multiple lines using grep?
Grep is an awkward tool for this operation.
pcregrep, which is found on most modern Linux systems, can be used as follows:
pcregrep -M 'abc.*(\n|.)*efg' test.txt
where -M, --multiline allows patterns to match more than one line.
There is a newer pcre2grep also. Both are provided by the PCRE project.
pcre2grep is available for Mac OS X via MacPorts as part of port pcre2:
% sudo port install pcre2
and via Homebrew as:
% brew install pcre
or, for pcre2:
% brew install pcre2
pcre2grep is also available on Linux (Ubuntu 18.04+):
$ sudo apt install pcre2-utils # PCRE2
$ sudo apt install pcregrep # Older PCRE
grep (bash) multi-line pattern
With GNU grep:
grep -Pzo '>chr2\n[AGCTNacgtn\n]+' file | grep .
Output:
>chr2
TTGNACACCC
TGGGGGAGTA
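For comparison, roughly the same extraction can be sketched with Python's re module. This is not the original answer's method: the function name extract_record and the sample data are made up, and the '^' anchor with re.MULTILINE is an added assumption so the header only matches at the start of a line.

```python
import re

def extract_record(fasta_text, header):
    """Return the header line plus its sequence lines, or None if absent."""
    pattern = re.compile(
        r"^>" + re.escape(header) + r"\n[AGCTNacgtn\n]+",
        re.MULTILINE,
    )
    match = pattern.search(fasta_text)
    # The character class also swallows the final newline, so strip it.
    return match.group(0).rstrip("\n") if match else None

# Hypothetical FASTA content; only the chr2 record mirrors the output above.
sample = ">chr1\nAAAA\n>chr2\nTTGNACACCC\nTGGGGGAGTA\n>chr3\nCCCC\n"
print(extract_record(sample, "chr2"))
```

Because the sequence class excludes '>', the match stops at the next header line, just as in the grep version.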
Check for multi-line content in a file
Following up on a comment from Cyrus, who pointed to "How to know if a text file is a subset of another", the following Python one-liner does the trick:
python -c "content=open('content').read(); target=open('target').read(); exit(0 if content in target else 1);"
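If you would rather call this from a script, the one-liner can be unpacked into a small function. A sketch only: the name file_contains_block is made up, and it assumes both files are text that fits in memory.

```python
from pathlib import Path

def file_contains_block(target_path, content_path):
    """Return True if the text of content_path occurs verbatim in target_path."""
    # read_text() opens in text mode, so \r\n and \n line endings compare equal.
    content = Path(content_path).read_text()
    target = Path(target_path).read_text()
    return content in target
```

A caller could then use sys.exit(0 if file_contains_block('target', 'content') else 1) to get the same exit status as the one-liner.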
Regex (grep) for multi-line search needed
Without needing to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P: activates perl-regexp for grep (a powerful extension of regular expressions).
-z: treats the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware that this also adds a trailing NUL character if used with -o.
-o: prints only the matching part. Because we're using -z, the whole file is like a single big line, so without -o a match would print the entire file.
In regexp:
(?s): activates PCRE_DOTALL, which means that . matches any character, including a newline.
\N: matches anything except a newline, even with PCRE_DOTALL activated.
.*?: matches . in non-greedy mode, that is, it stops as soon as possible.
^: matches the start of a line.
\1: backreference to the first group (\s*). This is an attempt to match the same indentation as the line where the method starts.
As you can imagine, this search prints the main method in a C (*.c) source file.
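The same idea can be tried out in Python, with two adjustments that are assumptions of this sketch rather than part of the answer above: Python's re has no \N, so [^\n] stands in for it, and the indentation group uses [ \t]* because \s* could also swallow newlines. The function name extract_main and the sample C source are made up.

```python
import re

# re.DOTALL plays the role of (?s); re.MULTILINE lets ^ match at each line start.
MAIN_RE = re.compile(
    r"^([ \t]*)[^\n]*main.*?\{.*?^\1\}",
    re.DOTALL | re.MULTILINE,
)

def extract_main(source):
    """Return the text of the first main function found, or None."""
    match = MAIN_RE.search(source)
    return match.group(0) if match else None

# Hypothetical C source used only to exercise the pattern.
c_source = (
    "#include <stdio.h>\n\n"
    "int helper(void) {\n    return 1;\n}\n\n"
    "int main(void) {\n"
    "    if (helper()) {\n        puts(\"hi\");\n    }\n"
    "    return 0;\n}\n"
)
print(extract_main(c_source))
```

The backreference is what skips the inner closing brace: the indented brace of the if block does not sit directly after the captured (empty) indentation, so the match runs on to the unindented brace that closes main.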
Using grep or other command to return the line number of a multiline pattern
Suppose that the pattern is in a file named pattern, like this:
$ cat pattern
op_3b : 001
ctrl_2b : 00
ini_count : 0
Then, try:
$ awk '$0 ~ pat' RS= pat="$(cat pattern)" logfile
Packet LP
op_3b : 001
ctrl_2b : 00
ini_count : 0
How it works
RS=
This sets the Record Separator RS to an empty string. This tells awk to use an empty line as the record separator.
pat="$(cat pattern)"
This tells awk to create an awk variable pat which contains the contents of the file pattern. If your shell is bash, then a slightly more efficient form of this command would be pat="$(<pattern)". (Don't use this unless you are sure that your shell is bash.)
$0 ~ pat
This tells awk to print any record that matches the pattern. $0 is the contents of the current record. ~ tells awk to do a match between the text in $0 and the regular expression in pat.
(If the contents of pattern had any regex-active characters, we would need to escape them. Your current example does not have any so this is not a problem.)
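The record-splitting idea can also be sketched in Python. This is an illustration under assumptions, not a drop-in replacement: the names RECORD_RE and matching_records and the sample log are made up, and the sketch does a literal substring test instead of a regex match, which sidesteps the escaping caveat just mentioned.

```python
import re

# Runs of consecutive non-empty lines, i.e. records separated by blank lines.
RECORD_RE = re.compile(r"[^\n]+(?:\n[^\n]+)*")

def matching_records(text, needle):
    """Return (first_line_number, record) for each record containing needle."""
    results = []
    for rec in RECORD_RE.finditer(text):
        if needle in rec.group(0):  # literal test, so no regex escaping needed
            line_no = 1 + text.count("\n", 0, rec.start())
            results.append((line_no, rec.group(0)))
    return results

# Hypothetical log contents, modelled on the output shown above.
log = (
    "Packet LP\nop_3b : 000\nctrl_2b : 00\nini_count : 0\n"
    "\n"
    "Packet LP\nop_3b : 001\nctrl_2b : 00\nini_count : 0\n"
)
needle = "op_3b : 001\nctrl_2b : 00\nini_count : 0"
for line_no, record in matching_records(log, needle):
    print("Line Number=%d" % line_no)
    print(record)
```

Each result carries the 1-based line number on which the record starts, which is the extra piece awk needs more work to produce.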
Alternative style
Some people prefer a different style for defining awk variables:
$ awk -v RS= -v pat="$(cat pattern)" '$0 ~ pat' logfile
Packet LP
op_3b : 001
ctrl_2b : 00
ini_count : 0
This works the same.
Displaying line numbers
$ awk -F'\n' '$0 ~ pat{print "Line Number="n+1; print "Packet" $0} {n=n+NF-1}' RS='Packet' pat="$(cat pattern)" logfile
Line Number=20
Packet LP
op_3b : 001
ctrl_2b : 00
ini_count : 0
Perl regex: How do I search a file for a multiline pattern without reading the whole file into memory?
You can use the range operator to match everything between two patterns while reading line-by-line:
use strict;
use warnings 'all';
while (<DATA>) {
print if /^assign / .. /;/;
}
__DATA__
foo
assign signal0 = (cond1) ? val1 :
(cond2) ? val2 :
val3;
bar
assign signal1[15:0] = {input1[7:0], input2[7:0]};
baz
assign signal2[34:0] = { 4'b0,
subsig0[3:0],
subsig1,
subsig2,
subsig3[18:2],
subsig4[5:0]
};
qux
Output:
assign signal0 = (cond1) ? val1 :
(cond2) ? val2 :
val3;
assign signal1[15:0] = {input1[7:0], input2[7:0]};
assign signal2[34:0] = { 4'b0,
subsig0[3:0],
subsig1,
subsig2,
subsig3[18:2],
subsig4[5:0]
};
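For readers more at home in Python, the range operator's behaviour can be approximated with a small generator that keeps one flag between lines. A sketch only: the name lines_in_ranges is made up, and it mimics the two-dot form, which tests the end pattern on the same line that opened the range.

```python
import re

def lines_in_ranges(lines, start_re, end_re):
    """Yield lines from one matching start_re through the next matching end_re."""
    inside = False
    for line in lines:
        if not inside and re.search(start_re, line):
            inside = True
        if inside:
            yield line
            # Checked on the opening line too, so one-line ranges close at once.
            if re.search(end_re, line):
                inside = False

# Hypothetical shortened input in the spirit of the __DATA__ section above.
sample_lines = [
    "foo\n",
    "assign signal0 = (cond1) ? val1 :\n",
    "                  val3;\n",
    "bar\n",
    "assign signal1[15:0] = {input1[7:0], input2[7:0]};\n",
    "baz\n",
]
for line in lines_in_ranges(sample_lines, r"^assign ", r";"):
    print(line, end="")
```

Passing an open file object instead of a list keeps memory use to one line at a time, since Python iterates over files lazily.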