HTMLAgilityPack - Get element in class by class
Your XPath //div[@class='listicle-page'] matches the div node with all of its descendants. If you need to select only the child h2 node, specify it explicitly by adding /h2:
//div[@class='listicle-page']/h2
Html Agility Pack get all elements by class
(Updated 2018-03-17)
The problem:
The problem, as you've spotted, is that String.Contains does not perform a word-boundary check, so Contains("float") will return true for both "foo float bar" (correct) and "unfloating" (which is incorrect).
The solution is to ensure that "float" (or whatever your desired class-name is) appears alongside a word-boundary at both ends. A word-boundary is either the start (or end) of a string (or line), whitespace, certain punctuation, etc. In most regular-expression dialects this is written \b. So the regex you want is simply: \bfloat\b.
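One caveat worth checking before relying on \b: it treats punctuation such as a hyphen as a boundary, so \bfloat\b also matches inside a class like float-left. The sketch below compares it with a stricter, whitespace-delimited pattern. The class name ClassTokenRegexDemo and the second pattern are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class ClassTokenRegexDemo
{
    // The pattern from the answer: \b is a word boundary, and '-' counts as one.
    public static readonly Regex WordBoundary = new Regex(@"\bfloat\b");

    // Assumed stricter variant: the name must be preceded and followed by
    // whitespace or by the start/end of the string.
    public static readonly Regex WhitespaceDelimited = new Regex(@"(?<!\S)float(?!\S)");

    static void Main()
    {
        foreach (string classAttr in new[] { "foo float bar", "unfloating", "float-left" })
        {
            Console.WriteLine(classAttr
                + " | word-boundary: " + WordBoundary.IsMatch(classAttr)
                + " | whitespace-delimited: " + WhitespaceDelimited.IsMatch(classAttr));
        }
    }
}
```

Both patterns accept "foo float bar" and reject "unfloating"; only the \b pattern accepts "float-left". Which behaviour you want depends on whether hyphenated class names occur in your markup.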
A downside to using a Regex instance is that it can be slow to run if you don't use the RegexOptions.Compiled option, and slow to construct if you do. So you should cache the Regex instance. This is more difficult if the class-name you're looking for changes at runtime.
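If the class-name does change at runtime, one way to still cache is to key the compiled instances by class-name. This is only a minimal sketch: ClassRegexCache and its For method are hypothetical names, and the cache is unbounded, which is only reasonable when the set of class-names is small.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

static class ClassRegexCache
{
    // One compiled Regex per distinct class-name, shared across calls and threads.
    private static readonly ConcurrentDictionary<string, Regex> _cache =
        new ConcurrentDictionary<string, Regex>(StringComparer.Ordinal);

    public static Regex For(string className)
    {
        // The factory runs only when the class-name is not cached yet.
        return _cache.GetOrAdd(
            className,
            name => new Regex(@"\b" + Regex.Escape(name) + @"\b", RegexOptions.Compiled));
    }

    static void Main()
    {
        Regex first = For("float");
        Regex second = For("float");
        Console.WriteLine(ReferenceEquals(first, second)); // the second call reuses the cached instance
        Console.WriteLine(first.IsMatch("foo float bar"));
    }
}
```

If class-names come from untrusted or unbounded input, a size-limited cache would be a safer choice than this dictionary.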
Alternatively you can search a string for words by word-boundaries without using a regex by implementing the same check as a C# string-processing function, being careful not to cause any new string or other object allocation (e.g. not using String.Split).
Approach 1: Using a regular-expression:
Suppose you just want to look for elements with a single, design-time specified class-name:
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class Program {
    // Compiled once and reused for every call.
    private static readonly Regex _classNameRegex = new Regex( @"\bfloat\b", RegexOptions.Compiled );

    private static IEnumerable<HtmlNode> GetFloatElements(HtmlDocument doc) {
        return doc.DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e => e.Name == "div" && _classNameRegex.IsMatch( e.GetAttributeValue("class", "") ) );
    }
}
If you need to choose a single class-name at runtime then you can build a regex:
private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {
    // Regex.Escape keeps characters such as '.' in the class-name from being read as regex syntax.
    Regex regex = new Regex( "\\b" + Regex.Escape( className ) + "\\b", RegexOptions.Compiled );
    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e => e.Name == "div" && regex.IsMatch( e.GetAttributeValue("class", "") ) );
}
If you have multiple class-names and you want to match all of them, you could create an array of Regex objects and ensure they're all matching, or combine them into a single Regex using lookarounds, but this results in horrendously complicated expressions - so using a Regex[] is probably better:
using System.Linq;

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String[] classNames) {
    // One regex per class-name; an element must match all of them.
    Regex[] exprs = new Regex[ classNames.Length ];
    for( Int32 i = 0; i < exprs.Length; i++ ) {
        exprs[i] = new Regex( "\\b" + Regex.Escape( classNames[i] ) + "\\b", RegexOptions.Compiled );
    }
    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            exprs.All( r =>
                r.IsMatch( e.GetAttributeValue("class", "") )
            )
        );
}
Approach 2: Using non-regex string matching:
The advantage of using a custom C# method to do string matching instead of a regex is hypothetically faster performance and reduced memory usage (though Regex may be faster in some circumstances - always profile your code first, kids!)
The method below, CheapClassListContains, provides a fast word-boundary-checking string matching function that can be used the same way as regex.IsMatch:
private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {
    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            CheapClassListContains(
                e.GetAttributeValue("class", ""),
                className,
                StringComparison.Ordinal
            )
        );
}
/// <summary>Performs optionally-whitespace-padded string search without new string allocations.</summary>
/// <remarks>A regex might also work, but constructing a new regex every time this method is called would be expensive.</remarks>
private static Boolean CheapClassListContains(String haystack, String needle, StringComparison comparison)
{
    if( String.Equals( haystack, needle, comparison ) ) return true;
    Int32 idx = 0;
    while( idx + needle.Length <= haystack.Length )
    {
        idx = haystack.IndexOf( needle, idx, comparison );
        if( idx == -1 ) return false;
        Int32 end = idx + needle.Length;
        // Needle must be enclosed in whitespace or be at the start/end of string
        Boolean validStart = idx == 0 || Char.IsWhiteSpace( haystack[idx - 1] );
        Boolean validEnd = end == haystack.Length || Char.IsWhiteSpace( haystack[end] );
        if( validStart && validEnd ) return true;
        idx++;
    }
    return false;
}
Approach 3: Using a CSS Selector library:
HtmlAgilityPack is somewhat stagnant and doesn't support .querySelector and .querySelectorAll, but there are third-party libraries that extend HtmlAgilityPack with them: namely Fizzler and CssSelectors. Both Fizzler and CssSelectors implement QuerySelectorAll, so you can use it like so:
private static IEnumerable<HtmlNode> GetDivElementsWithFloatClass(HtmlDocument doc) {
    // Called on DocumentNode, matching the Fizzler usage further down this page.
    return doc.DocumentNode.QuerySelectorAll( "div.float" );
}
With runtime-defined classes:
private static IEnumerable<HtmlNode> GetDivElementsWithClasses(HtmlDocument doc, IEnumerable<String> classNames) {
    // Produces e.g. "div.foo.bar", which requires every listed class on the element.
    String selector = "div." + String.Join( ".", classNames );
    return doc.DocumentNode.QuerySelectorAll( selector );
}
HTML agility pack get all divs with class
You could use the SelectNodes method:
// Note: SelectNodes returns null when nothing matches, so check for null before iterating over untrusted input.
foreach(HtmlNode div in document.DocumentNode.SelectNodes("//div[contains(@class,'listevent')]"))
{
}
If you are more familiar with CSS-style selectors, try Fizzler and do this:
document.DocumentNode.QuerySelectorAll("div.listevent");
How to Get element by class in HtmlAgilityPack
Html Agility Pack has XPATH support, so you can do something like this:
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span[@class='" + ClassToGet + "']"))
{
    string value = node.InnerText;
    // etc...
}
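This means: get all SPAN elements from the top of the document (first /), recursively (second /), whose CLASS attribute is exactly the given value. Then for each element, get the inner text. Because the comparison is an exact string match, an element with class="a b" is not matched when you ask for "a".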
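If you need to match one class among several in the attribute, a common XPath 1.0 idiom pads the normalized attribute value with spaces and searches for the padded name. The sketch below builds such an expression. ClassXPath and ForClass are hypothetical helper names, and the check uses System.Xml's XmlDocument as a stand-in so that it runs without any NuGet package; the same expression string can be passed to HtmlAgilityPack's SelectNodes.

```csharp
using System;
using System.Xml;

static class ClassXPath
{
    // Builds an XPath that matches elements whose class attribute contains
    // className as a whole whitespace-separated token.
    public static string ForClass(string tagName, string className)
    {
        if (string.IsNullOrWhiteSpace(className) || className.IndexOf('\'') >= 0)
            throw new ArgumentException("Class name must be non-empty and must not contain a single quote.", nameof(className));

        return "//" + tagName
            + "[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<root><span class='price big'>1</span><span class='pricey'>2</span></root>");

        string xpath = ForClass("span", "price");
        Console.WriteLine(xpath);
        Console.WriteLine(doc.SelectNodes(xpath).Count); // only the first span has the "price" token
    }
}
```

Rejecting single quotes keeps the class name from breaking out of the XPath string literal; a fuller implementation would need proper quoting instead.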
HtmlAgilityPack: get all elements by class
Learn XPath! :-) It's really simple, and will serve you well. In this case, what you want is:
SelectNodes("//*[@class='" + classValue + "']") ?? Enumerable.Empty<HtmlNode>();
HtmlAgilitypack enumerate all classes
I think you need something like:
private static List<string> _InetReadEx(string sUrl) // Returns string list
{
    var aRet = new List<string>(); // string list var
    try
    {
        var website = new HtmlAgilityPack.HtmlWeb(); // Init the object
        var htmlDoc = website.Load(sUrl); // Load doc from URL
        var allElementsWithClassFloat = htmlDoc.DocumentNode.SelectNodes("//*[contains(@class,'pid')]"); // Get all nodes with class value containing pid
        if (allElementsWithClassFloat != null) // If nodes found
        {
            for (int i = 0; i < allElementsWithClassFloat.Count; i++)
            {
                if (!string.IsNullOrWhiteSpace(allElementsWithClassFloat[i].InnerText) && // if not blank/null
                    !aRet.Contains(allElementsWithClassFloat[i].InnerText)) // if not already present
                {
                    aRet.Add(allElementsWithClassFloat[i].InnerText); // Add to result
                    Console.WriteLine(allElementsWithClassFloat[i].InnerText); // Demo line
                }
            }
        }
        return aRet;
    }
    catch (Exception)
    {
        throw; // Rethrow without resetting the stack trace (throw ex; would reset it)
    }
}
The XPath is //*[contains(@class,'pid')]:
//* - get all element nodes that...
[contains( - contain...
@class,'pid' - the pid substring inside the class attribute value
)] - end of the contains condition
HtmlAgilityPack, PCL, without XPath: How to get all elements by class?
I've found a way to find elements by class in PCL projects. But you will have to use AngleSharp for this, not HtmlAgilityPack, because XPath is not available in PCL. Check the AngleSharp link for more.
Select all elements by class in AngleSharp:
// Must run inside an async method because of the await.
string html;
using (var client = new HttpClient())
{
    html = await client.GetStringAsync("http://your.content.com/some.html");
}
var parser = new HtmlParser();
var doc = parser.Parse(html); // In newer AngleSharp releases this method is named ParseDocument
var divs = doc.All.Where(e => e.LocalName == "div" && e.ClassList.Contains("your-class"));
Note: don't use the data from the site I linked above. That website needs JavaScript to add the os_poll elements, so it will not work. That's another problem entirely and is outside the scope of this question.
how to get all the divs from a html with the same class in html agility pack
Use //div in your selection. The double slash selects all matching divs from the whole source, wherever they sit in the tree. Putting a dot before the slashes (.//div) restricts the search to divs inside the current node.
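The difference between the two forms is easiest to see by running both from the same context node. The sketch below uses System.Xml's XmlDocument rather than HtmlAgilityPack so that it runs without extra packages; the markup and the RelativeXPathDemo name are made up for illustration, and the XPath rules shown are the standard XPath 1.0 ones.

```csharp
using System;
using System.Xml;

static class RelativeXPathDemo
{
    // Runs the given XPath with the second <section> as the context node.
    public static int CountFromSectionB(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(
            "<body>" +
            "<section id='a'><div class='item'>one</div></section>" +
            "<section id='b'><div class='item'>two</div><div class='item'>three</div></section>" +
            "</body>");

        XmlNode sectionB = doc.SelectSingleNode("//section[@id='b']");
        return sectionB.SelectNodes(xpath).Count;
    }

    static void Main()
    {
        // "//" starts at the document root even when called on a child node.
        Console.WriteLine(CountFromSectionB("//div[@class='item']"));
        // ".//" starts at the context node, so only its descendants are searched.
        Console.WriteLine(CountFromSectionB(".//div[@class='item']"));
    }
}
```

The first call counts all three divs in the document and the second counts only the two inside section b.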