Extract Content from Div Tag C# Regex

C# Regex extract content of a div

Regex is not a good choice for parsing HTML files.

HTML is neither strict nor regular in its format.

Use HtmlAgilityPack instead.


Why use a parser?

Consider your regex. There are countless cases that could break it:

  • Your regex won't work if there are nested divs.
  • Real-world HTML is often malformed, so a div may be missing its closing tag (only XHTML is guaranteed to be well-formed).
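To make the nested-div problem from the list above concrete, here is a minimal sketch using only System.Text.RegularExpressions. The sample markup and the names NestedDivDemo and ExtractWithRegex are made up for illustration; they are not from the original question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Hypothetical sample markup: an outer div that contains a nested div.
    public const string Html =
        "<div id=\"thumbs\"><div class=\"inner\">one</div>two</div>";

    // Returns what a lazy regex captures as the "content" of the outer div,
    // or null when there is no match.
    public static string ExtractWithRegex(string html)
    {
        Match m = Regex.Match(html, "<div id=\"thumbs\">(?<content>.*?)</div>");
        return m.Success ? m.Groups["content"].Value : null;
    }

    public static void Main()
    {
        // The match stops at the first </div>, which closes the inner div,
        // so the trailing text "two" is lost.
        Console.WriteLine(ExtractWithRegex(Html));
        // prints: <div class="inner">one
    }
}
```

A greedy .* would not help either: it would run to the last </div> in the input, which is only correct when the target div happens to be the last one.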

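You can use this code to retrieve it using HtmlAgilityPack:

// Needs the HtmlAgilityPack package, plus:
// using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);

// This XPath selects every div whose id is "thumbs".
// SelectNodes returns null when nothing matches, so guard before calling Select.
var nodes = doc.DocumentNode.SelectNodes("//div[@id='thumbs']");
var itemList = nodes == null
    ? new List<string>()
    : nodes.Select(p => p.InnerText).ToList();

// itemList now contains the inner text of every div whose id is "thumbs"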

Regex to extract the contents of a div tag

Your regex works for your example. There are some improvements that should be made, though:

<div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>

[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.

.*? (note the ?) means "match any number of characters, but as few as possible". This stops the match at the first </div> instead of running on to the last </div> it can reach.
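The difference between the lazy and greedy quantifier can be seen in a small C# sketch. The input string and the helper name FirstContent are assumptions made for illustration; the pattern is the one shown above with only the quantifier swapped.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedyDemo
{
    // Hypothetical input with two "entry" divs on a single line.
    public const string Html =
        "<div class=\"entry\">first</div><div class=\"entry\">second</div>";

    // Builds the pattern with the given quantifier inside the "content" group
    // and returns that group's value for the first match.
    public static string FirstContent(string quantifier)
    {
        string pattern =
            "<div[^<>]*class=\"entry\"[^<>]*>(?<content>" + quantifier + ")</div>";
        return Regex.Match(Html, pattern).Groups["content"].Value;
    }

    public static void Main()
    {
        Console.WriteLine(FirstContent(".*?")); // lazy: first
        Console.WriteLine(FirstContent(".*"));  // greedy: first</div><div class="entry">second
    }
}
```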

But your regex itself should still have matched something. Perhaps you're not using it correctly?

I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:

' SubjectString holds the HTML to search; ResultList collects each div's content.
Dim ResultList As New List(Of String)
Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>")
Dim MatchResult As Match = RegexObj.Match(SubjectString)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups("content").Value)
    MatchResult = MatchResult.NextMatch()
End While

I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:

<div[^<>]*class="entry"[^<>]*>\s*
<span[^<>]*class="title"[^<>]*>\s*
(?<title>.*?)
\s*</span>\s*
<span[^<>]*class="description"[^<>]*>\s*
<strong>\s*Address:\s*</strong>\s*
(?<address>.*?)
\s*<strong>\s*Telephone:\s*</strong>\s*
(?<phone>.*?)
\s*</span>\s*</div>

or (behold the joy of multiline strings in VB.NET):

Dim RegexObj As New Regex(
"<div[^<>]*class=""entry""[^<>]*>\s*" & chr(10) & _
"<span[^<>]*class=""title""[^<>]*>\s*" & chr(10) & _
"(?<title>.*?)" & chr(10) & _
"\s*</span>\s*" & chr(10) & _
"<span[^<>]*class=""description""[^<>]*>\s*" & chr(10) & _
"<strong>\s*Address:\s*</strong>\s*" & chr(10) & _
"(?<address>.*?)" & chr(10) & _
"\s*<strong>\s*Telephone:\s*</strong>\s*" & chr(10) & _
"(?<phone>.*?)" & chr(10) & _
"\s*</span>\s*</div>",
RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)

(Of course, now you need to store the results for MatchResult.Groups("title") etc...)
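For readers working in C#, the same pattern and the reading of its named groups could look roughly like the sketch below. The sample markup, the class name EntryParserDemo, and the Parse helper are hypothetical; the pattern and options are the ones given above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EntryParserDemo
{
    // Hypothetical sample markup in the fixed shape the pattern expects.
    public const string Html = @"<div class=""entry"">
  <span class=""title"">Some Company</span>
  <span class=""description"">
    <strong>Address:</strong> 1 Example Street
    <strong>Telephone:</strong> 555-0100
  </span>
</div>";

    // IgnorePatternWhitespace lets the pattern span several lines;
    // Singleline lets . match line breaks inside the captured groups.
    private static readonly Regex EntryRegex = new Regex(@"
        <div[^<>]*class=""entry""[^<>]*>\s*
        <span[^<>]*class=""title""[^<>]*>\s*
        (?<title>.*?)
        \s*</span>\s*
        <span[^<>]*class=""description""[^<>]*>\s*
        <strong>\s*Address:\s*</strong>\s*
        (?<address>.*?)
        \s*<strong>\s*Telephone:\s*</strong>\s*
        (?<phone>.*?)
        \s*</span>\s*</div>",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    // Returns { title, address, phone }, or null when the markup does not match.
    public static string[] Parse(string html)
    {
        Match m = EntryRegex.Match(html);
        if (!m.Success) return null;
        return new[]
        {
            m.Groups["title"].Value,
            m.Groups["address"].Value,
            m.Groups["phone"].Value
        };
    }

    public static void Main()
    {
        string[] fields = Parse(Html);
        Console.WriteLine(fields == null ? "no match" : string.Join(" | ", fields));
        // prints: Some Company | 1 Example Street | 555-0100
    }
}
```

Any change in the markup, such as an extra attribute-free tag or a reordered field, makes Parse return null, which is the fragility described above.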

How to get content between the div tags in C#

You need to use \s* after \n so that any spaces, or even further line breaks, following </h3>\n are matched. \s matches any vertical or horizontal whitespace character.

Regex.Match(data, @"<h3>Opening hours:</h3>\n\s*<div>(.+?)</div>");
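A minimal sketch of how the call above might be used to read the captured text. The input string, the class name OpeningHoursDemo, and the Extract helper are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OpeningHoursDemo
{
    public const string Pattern =
        @"<h3>Opening hours:</h3>\n\s*<div>(.+?)</div>";

    // Returns the text inside the div that follows the heading,
    // or null when the pattern does not match.
    public static string Extract(string data)
    {
        Match m = Regex.Match(data, Pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Hypothetical input: a line break and indentation before the div.
        string data = "<h3>Opening hours:</h3>\n    <div>Mon-Fri 9:00-17:00</div>";
        Console.WriteLine(Extract(data)); // prints: Mon-Fri 9:00-17:00
    }
}
```

Note that . does not match a line break by default, so this only works when the div's content sits on one line. If it can span lines, passing RegexOptions.Singleline is one option.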
