C# Regex extract content of a div
Regex is not a good choice for parsing HTML files.
HTML is neither strict nor regular in its format.
Use HtmlAgilityPack instead.
Why use a parser?
Consider your regex: there are countless cases in which it could break your code.
- Your regex won't work if there are nested divs.
- Real-world HTML is often malformed, so some divs have no closing tag at all (only well-formed XHTML guarantees one).
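The nested-div problem can be reproduced with a few lines of C#. This is only a minimal sketch: the helper name CaptureDivContent, the id "outer", and the sample markup are made up for illustration, and the code assumes a top-level program (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns whatever a lazy "<div ...>(.*?)</div>" pattern captures.
static string CaptureDivContent(string html)
{
    Match match = Regex.Match(
        html,
        @"<div[^<>]*id=""outer""[^<>]*>(?<content>.*?)</div>",
        RegexOptions.Singleline);
    return match.Success ? match.Groups["content"].Value : string.Empty;
}

// Made-up sample: an inner div nested inside the outer one.
string sample = @"<div id=""outer"">before<div>inner</div>after</div>";

// The lazy quantifier stops at the first </div>, which closes the inner div,
// so the trailing text "after" is never captured.
Console.WriteLine(CaptureDivContent(sample));
```

For this sample the captured text is before<div>inner, which cuts the outer div's content short. A parser that tracks open and close tags does not have this problem.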
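You can use this code to retrieve it using HtmlAgilityPack:
using System.Linq;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load(yourStream);
// SelectNodes returns null when nothing matches, so fall back to an empty sequence.
var nodes = doc.DocumentNode.SelectNodes("//div[@id='thumbs']"); // this XPath selects all divs with the id "thumbs"
var itemList = (nodes ?? Enumerable.Empty<HtmlNode>())
    .Select(p => p.InnerText)
    .ToList();
// itemList now contains the inner text of every div whose id is "thumbs"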
Regex to extract the contents of a div tag
Your regex works for your example. There are some improvements that should be made, though:
<div[^<>]*class="entry"[^<>]*>(?<content>.*?)</div>
[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.
.*? (note the ?) means "match any number of characters, but only as few as possible". This avoids matching from the first <div class="entry"> tag all the way to the last </div> in your page.
But your regex itself should still have matched something. Perhaps you're not using it correctly?
I don't know Visual Basic, so this is just a shot in the dark, but RegexBuddy suggests the following approach:
Dim RegexObj As New Regex("<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>")
Dim MatchResult As Match = RegexObj.Match(SubjectString)
While MatchResult.Success
    ResultList.Add(MatchResult.Groups("content").Value)
    MatchResult = MatchResult.NextMatch()
End While
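Since the question is about C#, here is a rough C# counterpart of the VB.NET loop above. It is a sketch rather than a definitive version: the helper name ExtractEntries and the sample page are invented, RegexOptions.Singleline is an assumption added so that . also matches line breaks inside the div, and the code assumes a top-level program (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper mirroring the VB.NET loop: collects the "content" group of every match.
static List<string> ExtractEntries(string subject)
{
    var regex = new Regex(
        @"<div[^<>]*class=""entry""[^<>]*>(?<content>.*?)</div>",
        RegexOptions.Singleline); // assumption: let . match newlines too
    var results = new List<string>();
    foreach (Match m in regex.Matches(subject))
    {
        results.Add(m.Groups["content"].Value);
    }
    return results;
}

// Made-up sample page with two entry divs, one of them spanning two lines.
string page = "<div class=\"entry\">first</div>\n<div id=\"x\" class=\"entry\">second\nline</div>";
foreach (string entry in ExtractEntries(page))
{
    Console.WriteLine(entry);
}
```

Regex.Matches returns every match at once, so the explicit NextMatch() loop is not needed here.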
I would recommend against taking the regex approach any further than this. If you insist, you'll end up with a monster regex like the following, which will only work if the form of the div's contents never varies:
<div[^<>]*class="entry"[^<>]*>\s*
<span[^<>]*class="title"[^<>]*>\s*
(?<title>.*?)
\s*</span>\s*
<span[^<>]*class="description"[^<>]*>\s*
<strong>\s*Address:\s*</strong>\s*
(?<address>.*?)
\s*<strong>\s*Telephone:\s*</strong>\s*
(?<phone>.*?)
\s*</span>\s*</div>
or (behold the joy of multiline strings in VB.NET):
Dim RegexObj As New Regex(
"<div[^<>]*class=""entry""[^<>]*>\s*" & chr(10) & _
"<span[^<>]*class=""title""[^<>]*>\s*" & chr(10) & _
"(?<title>.*?)" & chr(10) & _
"\s*</span>\s*" & chr(10) & _
"<span[^<>]*class=""description""[^<>]*>\s*" & chr(10) & _
"<strong>\s*Address:\s*</strong>\s*" & chr(10) & _
"(?<address>.*?)" & chr(10) & _
"\s*<strong>\s*Telephone:\s*</strong>\s*" & chr(10) & _
"(?<phone>.*?)" & chr(10) & _
"\s*</span>\s*</div>",
RegexOptions.Singleline Or RegexOptions.IgnorePatternWhitespace)
(Of course, now you need to store the results for MatchResult.Groups("title") etc...)
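For C# readers, the same pattern can be written as a verbatim string, and the three named groups read back individually. This is only a sketch under stated assumptions: the helper TryParseEntry and the sample entry are hypothetical, and the code assumes a top-level program (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: applies the long pattern and hands back the three named groups.
static bool TryParseEntry(string html, out string title, out string address, out string phone)
{
    // IgnorePatternWhitespace lets the pattern span several lines;
    // Singleline lets . match line breaks in the input.
    var regex = new Regex(@"
        <div[^<>]*class=""entry""[^<>]*>\s*
        <span[^<>]*class=""title""[^<>]*>\s*
        (?<title>.*?)
        \s*</span>\s*
        <span[^<>]*class=""description""[^<>]*>\s*
        <strong>\s*Address:\s*</strong>\s*
        (?<address>.*?)
        \s*<strong>\s*Telephone:\s*</strong>\s*
        (?<phone>.*?)
        \s*</span>\s*</div>",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    Match m = regex.Match(html);
    // For a failed match these groups are simply empty strings.
    title = m.Groups["title"].Value;
    address = m.Groups["address"].Value;
    phone = m.Groups["phone"].Value;
    return m.Success;
}

// Made-up entry in exactly the shape the pattern expects.
string entryHtml =
    "<div class=\"entry\"><span class=\"title\">Cafe Example</span>" +
    "<span class=\"description\"><strong>Address:</strong> 1 Main Street " +
    "<strong>Telephone:</strong> 555-0100</span></div>";

if (TryParseEntry(entryHtml, out string t, out string a, out string p))
{
    Console.WriteLine($"{t} | {a} | {p}");
}
```

Any change in the markup, such as an extra span or a reordered field, makes the match fail, which is why this approach is fragile.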
How to get content between the div tags in C#
You need to use \s* after \n, so that the spaces or even line breaks after the </h3>\n are matched. \s matches any kind of vertical or horizontal whitespace character.
Regex.Match(data, @"<h3>Opening hours:</h3>\n\s*<div>(.+?)</div>");
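To read the captured text, take group 1 of the match. A minimal sketch, assuming a top-level program (C# 9 or later); the helper name ExtractOpeningHours and the sample data are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the div content that follows the heading, or "" if absent.
static string ExtractOpeningHours(string data)
{
    Match m = Regex.Match(data, @"<h3>Opening hours:</h3>\n\s*<div>(.+?)</div>");
    return m.Success ? m.Groups[1].Value : string.Empty;
}

// Made-up input: a line break plus indentation between </h3> and <div>.
string data = "<h3>Opening hours:</h3>\n    <div>Mon-Fri 9:00-17:00</div>";
Console.WriteLine(ExtractOpeningHours(data));
```

Note that the pattern requires \n directly after </h3>, so input with Windows line endings (\r\n) would not match. Dropping the \n and relying on \s* alone would be more tolerant. Without RegexOptions.Singleline, the div content must also sit on a single line.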