Get content between a pair of HTML tags using Bash
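Plain-text processing is a poor fit for HTML/XML parsing; an XML-aware tool such as xmllint is more reliable (for HTML that is not well-formed XML, xmllint also accepts --html). I hope this gives you some idea: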
kent$ xmllint --xpath "//body" f.html
<body>
text
<div>
text2
<div>
text3
</div>
</div>
</body>
How to extract contents of specific HTML tag following a match?
You can single out the interesting line with sed addresses; in this case, a regexp pattern that matches the line containing the <a href link:
sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p' test.html
/blog/2019/4-14-canaries-in-the-coal-mine.html
#post33
To match by article id, put this grep in front of the sed command:
grep -A3 'article id="post36"' test.html | sed -nre '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p'
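The thread does not show test.html, so here is a minimal sketch of the same pipeline run against a made-up sample file. The file contents and the helper name link_for_article are assumptions for illustration only; the sed expression is the one from the answer, written with -E instead of -r.

```shell
#!/usr/bin/env bash
# Hypothetical sample input; the real test.html is not shown in the thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<article id="post36">
  <header>
    <h3><a href="/blog/2019/4-14-canaries-in-the-coal-mine.html">Canaries</a></h3>
  </header>
</article>
<article id="post33">
  <header>
    <h3><a href="#post33">Another post</a></h3>
  </header>
</article>
EOF

# link_for_article ID FILE (hypothetical helper): print the href of the
# <h3> link found within 3 lines after the matching <article> tag.
link_for_article() {
  grep -A3 "article id=\"$1\"" "$2" |
    sed -nE '/h3.*href.*(#post[0-9]+|\/blog\/)/ s/.*<a href="([^"]+)".*/\1/p'
}

link_for_article post36 "$sample"
link_for_article post33 "$sample"
```

Note that -A3 only works while the link sits within three lines of the <article> tag; if the markup layout changes, the window has to change with it.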
In Linux, get content between two strings
This should give you the first <html> block:
sed -n '/<html>/,/<\/html>/p;/<\/html>/q' file
example:
kent$ cat file
<html>
a
</html>
<html>
b
</html>
<html>
c
</html>
kent$ sed -n '/<html>/,/<\/html>/p;/<\/html>/q' file
<html>
a
</html>
By the way, I don't think the OP was parsing HTML/XML: an HTML document doesn't have multiple <html> tags, and the input file may not be XML at all.
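If a block other than the first is wanted, one possible variation is to count opening markers in awk. This is only a sketch: the helper name nth_block is made up here, and it assumes, like the sed command, that the markers sit on lines of their own.

```shell
#!/usr/bin/env bash
# Sample input mirroring the example file above.
blocks=$(mktemp)
printf '%s\n' '<html>' a '</html>' '<html>' b '</html>' '<html>' c '</html>' > "$blocks"

# nth_block N FILE (hypothetical helper): print the N-th block, 1-based.
nth_block() {
  awk -v n="$1" '
    /<html>/                 { count++ }   # a new block opens
    count == n               { print }     # line belongs to block n
    count == n && /<\/html>/ { exit }      # block n closed: stop reading
  ' "$2"
}

nth_block 2 "$blocks"
```

With the sample file this prints the second block (<html>, b, </html>) and stops reading, much as the q command does in the sed version.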
Regex select all text between tags
You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group (for more specific instructions, specify a language), but this assumes that you have very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use an HTML parser.
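To see why the ? after .* matters, here is a small sketch using GNU grep with -P (an assumption: this needs a grep built with PCRE support). Since grep cannot print a capture group, lookarounds stand in for the group here; the input line is made up.

```shell
#!/usr/bin/env bash
# Hypothetical one-line input with two <pre> elements.
line='<pre>one</pre> and <pre>two</pre>'

# Non-greedy: each match stops at the nearest closing tag.
lazy=$(printf '%s\n' "$line" | grep -oP '(?<=<pre>).*?(?=</pre>)')

# Greedy: a single match runs to the last closing tag on the line.
greedy=$(printf '%s\n' "$line" | grep -oP '(?<=<pre>).*(?=</pre>)')

printf 'lazy:\n%s\ngreedy:\n%s\n' "$lazy" "$greedy"
```

The non-greedy form yields one and two on separate lines, while the greedy form yields the single string one</pre> and <pre>two.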
Extract text between the two anchor tags using sed, grep or awk
As Sundeep notes in a comment: best to use a proper HTML parser.
The standard utilities are mostly line-based and deal poorly with quoting; they are ill-equipped to robustly parse HTML, with all its variability around quoting styles and whitespace, let alone recognition of the actual syntax.
GNU grep offers more flexibility than other implementations: multi-line matching (-z) and support for PCREs (-P), which enables lookaround assertions.
While the following GNU grep command works with your sample input, it is still far from being a robust parsing solution:
grep -zPo '<div class="summary_text" itemprop="description">\s*\K.*?(?=\s*</div>)' file
Extract number between html tags
grep can do it:
grep -Po '(?<=>)[0-9,]*(?=</a></td>)' file
It fetches the run of digits and commas in between > and </a></td>.
Test
$ cat a
>234,23</a></td>
>234,23</b></td>
$ grep -Po '(?<=>)[0-9,]*(?=</a></td>)' a
234,23
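On systems whose grep lacks -P (BSD/macOS grep, for instance), roughly the same extraction can be sketched with plain sed. The helper name extract_number is made up for illustration, and the tr step is an optional extra in case the commas are unwanted.

```shell
#!/usr/bin/env bash
# Same two sample lines as in the test above.
cells=$(mktemp)
printf '%s\n' '>234,23</a></td>' '>234,23</b></td>' > "$cells"

# extract_number FILE (hypothetical helper): print the digits-and-commas run
# that sits between ">" and "</a></td>", using only POSIX basic regex.
extract_number() {
  sed -n 's/.*>\([0-9,]*\)<\/a><\/td>.*/\1/p' "$1"
}

extract_number "$cells"              # value as written in the file
extract_number "$cells" | tr -d ','  # same value with commas stripped
```

Unlike grep -o, this prints at most one value per input line, so it fits best when each line holds a single table cell.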
Add HTML tags around each unix GREP result
Using egrep and sed
You currently have:
$ echo 'timestamp otherText' | egrep 'someText|otherText' | sed 's/timestamp//'
otherText
To put paragraph tags around the text, add just one substitution to the sed command:
$ echo 'timestamp otherText' | egrep 'someText|otherText' | sed 's/timestamp//; s|.*|<p>&</p>|'
<p> otherText</p>
Using awk
$ echo 'timestamp otherText' | awk '/someText|otherText/{sub(/timestamp/, ""); print "<p>" $0 "</p>"}'
<p> otherText</p>
Or, getting input from the file my.log:
awk '/someText|otherText/{sub(/timestamp/, ""); print "<p>" $0 "</p>"}' my.log
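If the log text can contain &, < or >, the wrapped output will not be valid HTML unless those characters are escaped first. A possible extension of the awk command is sketched below; the helper name wrap_matches and the sample line are assumptions, not part of the original question.

```shell
#!/usr/bin/env bash
# wrap_matches (hypothetical helper): read log lines on stdin, keep those that
# match, drop the literal word "timestamp", escape HTML-special characters,
# and wrap the result in <p> tags.
wrap_matches() {
  awk '/someText|otherText/ {
    sub(/timestamp/, "")
    gsub(/&/, "\\&amp;")   # escape & first so later entities stay intact
    gsub(/</, "\\&lt;")
    gsub(/>/, "\\&gt;")
    print "<p>" $0 "</p>"
  }'
}

echo 'timestamp otherText a<b & c' | wrap_matches
```

For the sample line this prints <p> otherText a&lt;b &amp; c</p>. To read from a file instead, redirect it into the function: wrap_matches < my.log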