Using Regex to Get Text Between Multiple HTML Tags

Regex select all text between tags

You can use "<pre>(.*?)</pre>" (replacing pre with whichever tag name you want) and extract the first capture group. For more specific instructions, specify a language. Note that this assumes very simple, valid HTML with no nested or malformed tags.
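
As a rough illustration, here is how that pattern might be applied in Python. The sample string and variable names are made up for the example, and re.DOTALL is added on the assumption that the tag contents may span several lines.

```python
import re

# Hypothetical sample input; "pre" can be swapped for any other tag name.
html = "<p>intro</p><pre>line one\nline two</pre><pre>second block</pre>"

# re.DOTALL lets "." match newlines; the lazy ".*?" stops at the first closing tag.
pattern = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)

# search() returns the first match only; group(1) is the text between the tags.
first = pattern.search(html)
first_text = first.group(1) if first else None

# findall() returns group 1 of every non-overlapping match.
all_texts = pattern.findall(html)

print(first_text)
print(all_texts)
```

This only holds up while the tag is never nested inside itself and carries no attributes; anything beyond that is where the pattern starts to fail.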

As other commenters have suggested, if you're doing something complex, use an HTML parser.
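
For comparison, a minimal parser-based sketch using Python's standard html.parser module might look like the following. The class name TagTextCollector and the sample input are hypothetical, chosen only for this example; a dedicated library may be a better fit for real pages.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collects the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.chunks = []    # text pieces of the element being read
        self.results = []   # finished text, one entry per outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        # Keep text only while inside the target tag (inner tags are skipped).
        if self.depth > 0:
            self.chunks.append(data)


# Hypothetical sample input with an inner tag that a naive regex would keep.
html = "<div><pre>first <b>bold</b></pre><p>skip</p><pre>second</pre></div>"

parser = TagTextCollector("pre")
parser.feed(html)
parser.close()
texts = parser.results
print(texts)
```

Unlike the regex, this copes with attributes on the tag and with other tags nested inside it.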

use RegEx to extract text between html tags

Parsing HTML with regex is not ideal. Others have suggested the HTML Agility Pack. However, if you can guarantee that your input is well-defined and you always know what to expect then using a regex is possible.

If you can make that guarantee, read on. Otherwise you need to consider the other suggestions or define your input better. In fact, you should define your input better regardless because my answer makes a few assumptions. Some questions to consider:

  • Will the HTML be on one line or multiple lines, separated by newline characters?
  • Will the HTML always be in the form of <div>...<h2...>...</h2><h3...>...</h3></div>? Or can there be h1-h6 tags?
  • Besides the hN tags, will the date and number always appear inside the tags whose id attributes are id-date and nr?

Depending on the answers to these questions, the pattern can change. The following code assumes each HTML fragment follows the structure you shared, that it will have an h2 and h3 with date and number, respectively, and that each tag will be on a new line. If you feed it different input, it will likely fail until you adjust the pattern to match your input's structure.

' Requires: Imports System.Text.RegularExpressions
Dim input As String = "<div id=""div"">" & Environment.NewLine & _
    "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
    "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
    "</div>"

' The named groups Date and Number capture the h2 and h3 contents.
Dim pattern As String = "<div[^>]+>.*?" & _
    "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
    "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"

Dim m As Match = Regex.Match(input, pattern, RegexOptions.Singleline)

If m.Success Then
    ' DateTime.Parse interprets the date according to the current culture.
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Else
    Console.WriteLine("No match!")
End If

The pattern can be on one line, but I broke it up for clarity. RegexOptions.Singleline lets the . metacharacter match \n, so the .*? parts of the pattern can cross line breaks.

You also said:

Also and this will be in loop, meaning
there are more div block needed to be
parsed.

Are you looping over separate strings? Or are you expecting multiple occurrences of the above HTML structure in a single string? If the former, the above code should be applied to each string. For the latter you'll want to use Regex.Matches and treat each Match result similarly to the above piece of code.


EDIT: here is some sample code to demonstrate parsing multiple occurrences.

' Requires: Imports System.Text.RegularExpressions
Dim input As String = "<div id=""div"">" & Environment.NewLine & _
    "<h2 id=""id-date"">09.09.2010</h2>" & Environment.NewLine & _
    "<h3 id=""nr"">000</h3>" & Environment.NewLine & _
    "</div>" & _
    "<div id=""div"">" & Environment.NewLine & _
    "<h2 id=""id-date"">09.14.2010</h2>" & Environment.NewLine & _
    "<h3 id=""nr"">123</h3>" & Environment.NewLine & _
    "</div>"

Dim pattern As String = "<div[^>]+>.*?" & _
    "<h2\sid=""id-date"">(?<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?" & _
    "<h3\sid=""nr"">(?<Number>\d+)</h3>.*?</div>"

' Regex.Matches returns every non-overlapping occurrence of the pattern.
For Each m As Match In Regex.Matches(input, pattern, RegexOptions.Singleline)
    ' DateTime.Parse interprets the date according to the current culture.
    Dim actualDate As DateTime = DateTime.Parse(m.Groups("Date").Value)
    Dim actualNumber As Integer = Int32.Parse(m.Groups("Number").Value)
    Console.WriteLine("Parsed Date: " & m.Groups("Date").Value)
    Console.WriteLine("Actual Date: " & actualDate)
    Console.WriteLine("Parsed Number: " & m.Groups("Number").Value)
    Console.WriteLine("Actual Number: " & actualNumber)
Next
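
For readers not on .NET, the same idea can be sketched in Python, where named groups are written (?P<name>...) and re.DOTALL plays the role of RegexOptions.Singleline. This is only an approximation of the code above: the month.day.year date format is an assumption inferred from the sample value 09.14.2010, and the variable names are invented for the example.

```python
import re
from datetime import datetime

# Same two-block sample as above, concatenated into one string.
html = (
    '<div id="div">\n'
    '<h2 id="id-date">09.09.2010</h2>\n'
    '<h3 id="nr">000</h3>\n'
    '</div>'
    '<div id="div">\n'
    '<h2 id="id-date">09.14.2010</h2>\n'
    '<h3 id="nr">123</h3>\n'
    '</div>'
)

# re.DOTALL lets ".*?" cross the newlines between the tags.
pattern = re.compile(
    r'<div[^>]+>.*?'
    r'<h2\sid="id-date">(?P<Date>\d{2}\.\d{2}\.\d{4})</h2>.*?'
    r'<h3\sid="nr">(?P<Number>\d+)</h3>.*?</div>',
    re.DOTALL,
)

records = []
for m in pattern.finditer(html):
    # Assumption: dates are month.day.year, as the value 09.14.2010 suggests.
    actual_date = datetime.strptime(m.group("Date"), "%m.%d.%Y").date()
    actual_number = int(m.group("Number"))
    records.append((actual_date, actual_number))

for actual_date, actual_number in records:
    print(actual_date.isoformat(), actual_number)
```

Spelling out the date format avoids depending on the machine's locale settings.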

Regex that extracts text between tags, but not the tags

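You can use the following regex, which captures the text between tags in group 1:

>([^<]*)<

Or use it without the capture group:

>[^<]*<

With the second form, each match still includes the surrounding '>' and '<', so strip those characters afterwards.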
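
A small Python sketch of both variants, using a made-up input string. Note that the gap between one closing tag and the next opening tag also matches and yields an empty string, so the example filters those out.

```python
import re

# Hypothetical sample input.
html = "<p>Hello</p><b>World</b>"

# Variant 1: the capture group returns only the text between ">" and "<".
captured = re.findall(r">([^<]*)<", html)
# The empty gap between "</p>" and "<b>" matches too, so drop blank entries.
texts = [t for t in captured if t.strip()]

# Variant 2: no group, so each match keeps its ">" and "<" delimiters.
raw = re.findall(r">[^<]*<", html)
# Slice off the first and last character, then drop blank entries.
stripped = [r[1:-1] for r in raw if r[1:-1].strip()]

print(texts)
print(stripped)
```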

How to regex match individual html tag with inner tag when multiple tags exists in text

In general, you should avoid trying to parse HTML using regex. Given that you are doing this from an IDE, you may not have any choice. One trick which can work here is to use a tempered dot to avoid parsing a closing </button> tag:

<button[^>]*>((?!</button>)[\s\S])*<span>[\s\S]*?</button>

Most of the pattern is probably familiar to you. Of note, I use [\s\S] to match across newlines. Also, consider the tempered dot trick:

((?!</button>)[\s\S])*

This uses a negative lookahead to match any character, one at a time, so long as the closing </button> tag is not encountered. This prevents the pattern from crossing tags while trying to find a <span>.
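
To see the difference the tempered dot makes, here is a hedged Python comparison against a naive lazy pattern. The sample text is invented for the example, and the group is written as non-capturing, (?:...), which does not change what is matched.

```python
import re

# Hypothetical input: the first button has no <span>, the second one does.
text = (
    "<button>plain</button>\n"
    "<button class='x'>\n"
    "  <span>icon</span> Save\n"
    "</button>"
)

# Tempered dot: consume characters only while "</button>" does not start here.
tempered = re.compile(
    r"<button[^>]*>(?:(?!</button>)[\s\S])*<span>[\s\S]*?</button>"
)

# Naive version: the lazy [\s\S]*? is free to run past the first </button>.
naive = re.compile(r"<button[^>]*>[\s\S]*?<span>[\s\S]*?</button>")

tempered_match = tempered.search(text).group(0)
naive_match = naive.search(text).group(0)

print(tempered_match)
print(naive_match)
```

The tempered pattern returns only the second button, while the naive one starts at the first button and swallows both.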

Python: Regular expression to extract text between any two tags in a html

Why not just do this:

import re

f = """
<div class='a'>
<div class='b'>
<div class='c'>
<button>text1</button>
<div class='d'>text2</div>
</div>
</div>
</div>
"""
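# Remove every tag; you can also use re.sub(r'<[A-Za-z/][^>]*>', '', f)
x = re.sub(r'<[^>]*>', '', f)

# split() breaks on any whitespace, so each remaining word prints on its own line
print('\n'.join(x.split()))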

This will have the following output:

text1
text2

RegEx to extract text between a HTML tag

Your comment shows that you have neglected to escape the backslashes in your regex string.

If you also want to match lowercase letters, include a-z in the character classes (as below), use Pattern.CASE_INSENSITIVE, or add (?i) to the beginning of the regex.

"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"

If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
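
The same pattern can be tried in Python, where a raw string avoids the double-backslash escaping that Java string literals need. This is only a sketch with an invented input string; re.DOTALL and re.IGNORECASE are the Python counterparts of the Java flags mentioned above.

```python
import re

# Hypothetical sample input; the <p> content spans two lines.
html = "<h1 class='t'>Title</h1>\n<p>line one\nline two</p>"

# Raw string: \b and \1 reach the regex engine without doubled backslashes.
# Group 1 is the tag name; \1 requires the closing tag to use the same name.
pattern = re.compile(
    r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>",
    re.DOTALL | re.IGNORECASE,
)

# Each pair holds the tag name (group 1) and the text between the tags (group 2).
pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(html)]

for tag, content in pairs:
    print(tag, repr(content))
```

As with the other patterns here, this breaks down once a tag is nested inside another tag of the same name.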


