Get Content Between a Pair of HTML Tags Using Bash

Plain-text processing is a poor fit for HTML/XML parsing; a real parser is more reliable. As a starting point, xmllint can select the <body> element with an XPath expression:

kent$  xmllint --xpath "//body" f.html 
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>

How to extract contents of specific HTML tag following a match?

You can single out the lines of interest with a sed address. Here the address is a regexp that matches the h3 lines containing a link to a post or blog page, and the substitution prints only the href value:

sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p' test.html 
/blog/2019/4-14-canaries-in-the-coal-mine.html
#post33

To restrict the match to one article id, put a grep in front of the sed command (-A3 also passes along the three lines that follow the matching line):

grep -A3 'article id="post36"' test.html | sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p'
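
As a self-contained sketch of the address-plus-substitution idea, the function below reads HTML on standard input and prints the href of each link that sits on an h3 line. The name extract_links and the sample markup are made up for illustration, and the address is simplified to /<h3/. It uses -E instead of -r, which both GNU and BSD sed accept.

```shell
# Hypothetical helper: print the href of every <a> found on an <h3> line.
extract_links() {
  # /<h3/ is the address: only lines matching it reach the s command.
  # -n suppresses default output; the p flag prints successful substitutions.
  sed -nE '/<h3/ s/.*<a href="([^"]+)".*/\1/p'
}

# Made-up sample input, not the original test.html.
sample='<h3><a href="/blog/first.html">First</a></h3>
<p><a href="/ignored.html">not in a heading</a></p>
<h3><a href="#post33">Second</a></h3>'

printf '%s\n' "$sample" | extract_links
```

The link inside the <p> line is skipped because that line never matches the address.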

In Linux, get content between two strings

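This should give you the first <html> block: the range address prints every line from <html> through </html>, and q quits as soon as the first </html> has been printed.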

sed -n '/<html>/,/<\/html>/p;/<\/html>/q' file

Example:

kent$  cat file
<html>
a
</html>
<html>
b
</html>
<html>
c
</html>

kent$ sed -n '/<html>/,/<\/html>/p;/<\/html>/q' file
<html>
a
</html>

By the way, I don't think the OP was actually parsing HTML/XML: an HTML document doesn't have multiple <html> tags, and the input file may not be XML at all.
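
The same "print the first block, then stop" logic can be sketched in awk, which some people find easier to read than the sed one-liner. The function name first_block and its two pattern arguments are hypothetical; the patterns are treated as awk regular expressions and the input comes from standard input.

```shell
# Hypothetical helper: print the first block delimited by lines matching
# the "from" and "to" patterns (inclusive), then stop reading.
first_block() {
  awk -v from="$1" -v to="$2" '
    $0 ~ from { inside = 1 }     # opening line seen: start printing
    inside    { print }          # print every line while inside the block
    inside && $0 ~ to { exit }   # closing line seen: quit after the first block
  '
}

first_block '<html>' '</html>' <<'EOF'
<html>
a
</html>
<html>
b
</html>
EOF
```

Like the sed version, it stops reading at the first closing line, so later blocks are never printed.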

Regex select all text between tags

You can use the pattern "<pre>(.*?)</pre>" (replacing pre with whichever tag you want) and extract the first capture group; the exact syntax for doing so depends on your language. This assumes, simplistically, that the HTML is very simple and valid.

As other commenters have suggested, if you're doing something complex, use an HTML parser.
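
For the shell case, here is one hedged way to apply that pattern. Plain grep and sed have no non-greedy .*?, so this sketch swaps in the negated class [^<]*, which stops at the next <. It therefore only handles elements that open and close on one line, carry no attributes, and contain no nested tags. The function name between_tags is made up, and grep -o is a widely supported extension rather than POSIX.

```shell
# Hypothetical helper: print the text of every <TAG>...</TAG> pair on stdin.
# Assumes each element sits on one line, has no attributes and no nested tags.
between_tags() {
  tag=$1
  grep -o "<$tag>[^<]*</$tag>" |   # -o prints each match on its own line
    sed 's|<[^>]*>||g'             # then strip the surrounding tags
}

printf '%s\n' '<pre>one</pre> and <pre>two</pre>' | between_tags pre
```

Each match is printed on its own line, which is the closest shell analogue of collecting the first capture group from every match.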

Extract text between the two anchor tags using sed, grep or awk

As Sundeep notes in a comment: best to use a proper HTML parser.

The standard utilities are mostly line-based and deal poorly with quoting; they are ill-equipped to robustly parse HTML, with all its variability around quoting styles and whitespace, let alone recognition of the actual syntax.

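GNU grep offers more flexibility than other implementations: -z treats the input as NUL-separated, so a typical text file is read as a single record and a pattern can match across lines, and -P enables PCREs, which support lookaround assertions and \K.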

While the following GNU grep command works with your sample input, it is still far from being a robust parsing solution:

 grep -zPo '<div class="summary_text" itemprop="description">\s*\K.*?(?=\s*</div>)' file

Extract number between html tags

grep can do it:

grep -Po '(?<=>)[0-9,]*(?=</a></td>)' file

It prints the run of digits and commas found between > and </a></td>; the lookbehind and lookahead keep those delimiters out of the output.

Test

$ cat a
>234,23</a></td>
>234,23</b></td>

$ grep -Po '(?<=>)[0-9,]*(?=</a></td>)' a
234,23
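
If your grep lacks -P (it is specific to GNU grep built with PCRE support), roughly the same extraction can be sketched with sed and a capture group instead of lookarounds. The function name extract_number is hypothetical. Unlike grep -o, this prints at most one value per input line.

```shell
# Hypothetical helper: print the digits-and-commas run that sits between
# ">" and "</a></td>" on each input line (at most one result per line).
extract_number() {
  # \( ... \) captures the number; \1 replaces the whole line with it.
  sed -n 's|.*>\([0-9,]*\)</a></td>.*|\1|p'
}

printf '%s\n' '>234,23</a></td>' '>234,23</b></td>' | extract_number
```

As in the test above, only the line ending in </a></td> produces output.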

Add HTML tags around each unix GREP result

Using egrep and sed

You currently have:

$ echo 'timestamp otherText' | egrep 'someText|otherText' | sed 's/timestamp//'
otherText

To wrap the text in <p> tags, add just one substitution to the sed command (egrep is the older spelling of grep -E, which newer GNU grep releases prefer):

$ echo 'timestamp otherText' | egrep 'someText|otherText' | sed 's/timestamp//; s|.*|<p>&</p>|'
<p> otherText</p>

Using awk

$ echo 'timestamp otherText' | awk '/someText|otherText/{sub(/timestamp/, ""); print "<p>" $0 "</p>"}'
<p> otherText</p>

Or, getting input from the file my.log:

awk '/someText|otherText/{sub(/timestamp/, ""); print "<p>" $0 "</p>"}' my.log
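
One caveat if the result is meant to be valid HTML: log lines that contain &, < or > would need escaping before being wrapped. The sketch below adds that step to the sed variant. The function name wrap_matches is made up, and the patterns someText, otherText and timestamp are the same placeholders used above.

```shell
# Hypothetical helper: filter lines, drop the "timestamp" placeholder,
# escape HTML special characters, then wrap each line in <p>...</p>.
wrap_matches() {
  grep -E 'someText|otherText' |
    sed -e 's/timestamp//' \
        -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' \
        -e 's|.*|<p>&</p>|'
}

printf '%s\n' 'timestamp otherText a<b' 'timestamp unrelated' | wrap_matches
```

The & substitution runs first so that the ampersands introduced by &lt; and &gt; are not escaped a second time.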

