HTML Agility Pack Get All Elements by Class

HTMLAgilityPack - Get element in class by class

Your XPath //div[@class='listicle-page'] matches the div node itself, together with all of its descendants. If you need only its child h2 nodes, say so explicitly by appending /h2:

//div[@class='listicle-page']/h2

Html Agility Pack get all elements by class

(Updated 2018-03-17)

The problem:

The problem, as you've spotted, is that String.Contains does not perform a word-boundary check, so Contains("float") will return true for both "foo float bar" (correct) and "unfloating" (which is incorrect).

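The solution is to ensure that "float" (or whatever your desired class-name is) appears with a word-boundary at both ends. A word-boundary is the start or end of the string (or line), or any position between a word character and a non-word character such as whitespace or punctuation. In most regular-expression dialects this is \b, so the regex you want is simply \bfloat\b. Be aware that \b also matches next to a hyphen, so \bfloat\b will match the class "float-left" as well; if that matters, require whitespace or a string edge on both sides instead, e.g. (?<!\S)float(?!\S).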
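To make the difference concrete, here is a minimal sketch comparing a plain substring test, the \b regex, and a stricter whitespace-delimited regex. The class name WordBoundaryDemo, the method names, and the third pattern are illustrative assumptions of mine, not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Substring test: true whenever "float" occurs anywhere in the attribute value.
    public static bool SubstringMatch(string classAttr) => classAttr.Contains("float");

    // \b test: "float" must have a regex word boundary on both sides.
    public static bool WordBoundaryMatch(string classAttr) => Regex.IsMatch(classAttr, @"\bfloat\b");

    // Stricter variant (assumption): "float" must be delimited by whitespace
    // or by the start/end of the string, which is how class lists are tokenized.
    public static bool TokenMatch(string classAttr) => Regex.IsMatch(classAttr, @"(?<!\S)float(?!\S)");

    public static void Main()
    {
        foreach (string value in new[] { "foo float bar", "unfloating", "float-left" })
        {
            Console.WriteLine(
                $"{value,-15} substring={SubstringMatch(value)} boundary={WordBoundaryMatch(value)} token={TokenMatch(value)}");
        }
    }
}
```

The hyphenated value is the interesting case: the substring and \b tests both accept it, while the whitespace-delimited pattern rejects it.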

A downside of using a Regex is that it can be slow to run if you don't use the RegexOptions.Compiled option, and a compiled regex is in turn slow to construct. So you should cache the Regex instance, which is more difficult if the class-name you're looking for changes at runtime.

Alternatively you can search a string for words by word-boundaries without using a regex by implementing the regex as a C# string-processing function, being careful not to cause any new string or other object allocation (e.g. not using String.Split).

Approach 1: Using a regular-expression:

Suppose you just want to look for elements with a single, design-time specified class-name:

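using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

class Program {

    // Compiled once and cached for the lifetime of the process.
    private static readonly Regex _classNameRegex = new Regex( @"\bfloat\b", RegexOptions.Compiled );

    private static IEnumerable<HtmlNode> GetFloatElements(HtmlDocument doc) {
        // Descendants() lives on HtmlNode, so start from the document's root node.
        return doc.DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e => e.Name == "div" && _classNameRegex.IsMatch( e.GetAttributeValue("class", "") ) );
    }
}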

If you need to choose a single class-name at runtime then you can build a regex:

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {

    // Regex.Escape guards against regex metacharacters in the class-name.
    Regex regex = new Regex( "\\b" + Regex.Escape( className ) + "\\b", RegexOptions.Compiled );

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e => e.Name == "div" && regex.IsMatch( e.GetAttributeValue("class", "") ) );
}

If you have multiple class-names and you want to match all of them, you could create an array of Regex objects and ensure they're all matching, or combine them into a single Regex using lookarounds, but this results in horrendously complicated expressions - so using a Regex[] is probably better:

using System.Linq;

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String[] classNames) {

    // One regex per class-name; an element must match every one of them.
    Regex[] exprs = new Regex[ classNames.Length ];
    for( Int32 i = 0; i < exprs.Length; i++ ) {
        exprs[i] = new Regex( "\\b" + Regex.Escape( classNames[i] ) + "\\b", RegexOptions.Compiled );
    }

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            exprs.All( r =>
                r.IsMatch( e.GetAttributeValue("class", "") )
            )
        );
}

Approach 2: Using non-regex string matching:

The advantage of using a custom C# method to do string matching instead of a regex is hypothetically faster performance and reduced memory usage (though Regex may be faster in some circumstances - always profile your code first, kids!)

The method below, CheapClassListContains, provides a fast word-boundary-checking string matching function that can be used the same way as regex.IsMatch:

private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className) {

    return doc.DocumentNode
        .Descendants()
        .Where( n => n.NodeType == HtmlNodeType.Element )
        .Where( e =>
            e.Name == "div" &&
            CheapClassListContains(
                e.GetAttributeValue("class", ""),
                className,
                StringComparison.Ordinal
            )
        );
}

/// <summary>Performs optionally-whitespace-padded string search without new string allocations.</summary>
/// <remarks>A regex might also work, but constructing a new regex every time this method is called would be expensive.</remarks>
private static Boolean CheapClassListContains(String haystack, String needle, StringComparison comparison)
{
    if( String.Equals( haystack, needle, comparison ) ) return true;
    Int32 idx = 0;
    while( idx + needle.Length <= haystack.Length )
    {
        idx = haystack.IndexOf( needle, idx, comparison );
        if( idx == -1 ) return false;

        Int32 end = idx + needle.Length;

        // Needle must be enclosed in whitespace or be at the start/end of string
        Boolean validStart = idx == 0 || Char.IsWhiteSpace( haystack[idx - 1] );
        Boolean validEnd = end == haystack.Length || Char.IsWhiteSpace( haystack[end] );
        if( validStart && validEnd ) return true;

        idx++;
    }
    return false;
}
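When hand-writing a matcher like the one above, it can help to check it against a simple baseline that is obviously correct, even if it allocates. The sketch below is such a baseline, assuming ordinal comparison; the names ClassListBaseline and ClassListContains are illustrative and not from the original answer.

```csharp
using System;
using System.Linq;

public static class ClassListBaseline
{
    // Splits the class attribute on whitespace and looks for an exact token match.
    // This allocates an array and substrings, so treat it as a reference for tests,
    // not as a replacement for the allocation-free version.
    public static bool ClassListContains(string classAttr, string className)
    {
        return classAttr
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // null separator = split on whitespace
            .Contains(className, StringComparer.Ordinal);
    }

    public static void Main()
    {
        foreach (string value in new[] { "float", "foo float bar", "unfloating", "float-left", "" })
        {
            Console.WriteLine($"\"{value}\" -> {ClassListContains(value, "float")}");
        }
    }
}
```

Feeding the same inputs to both functions in a unit test is a cheap way to catch off-by-one mistakes in the index arithmetic.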

Approach 3: Using a CSS Selector library:

HtmlAgilityPack has somewhat stagnated and doesn't support .querySelector and .querySelectorAll, but there are third-party libraries that extend HtmlAgilityPack with them, namely Fizzler and CssSelectors. Both Fizzler and CssSelectors implement QuerySelectorAll as an extension method, so you can use it like so:

private static IEnumerable<HtmlNode> GetDivElementsWithFloatClass(HtmlDocument doc) {

    // Fizzler's extension method is defined on HtmlNode, so query from the root node.
    return doc.DocumentNode.QuerySelectorAll( "div.float" );
}

With runtime-defined classes:

private static IEnumerable<HtmlNode> GetDivElementsWithClasses(HtmlDocument doc, IEnumerable<String> classNames) {

    // Produces e.g. "div.float.left"; an element must carry every listed class.
    String selector = "div." + String.Join( ".", classNames );

    return doc.DocumentNode.QuerySelectorAll( selector );
}
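One edge case in the selector-building step: with an empty list, "div." + String.Join(...) yields "div.", which is not a valid selector. A minimal sketch of a more defensive builder follows; SelectorBuilder and BuildDivSelector are hypothetical names, and the sketch assumes the class names need no CSS escaping.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SelectorBuilder
{
    // Builds "div.a.b" from { "a", "b" }; falls back to plain "div" when
    // no usable class names are supplied. Assumes names need no CSS escaping.
    public static string BuildDivSelector(IEnumerable<string> classNames)
    {
        string[] names = classNames
            .Where(n => !string.IsNullOrWhiteSpace(n)) // drop null/blank entries
            .Select(n => n.Trim())
            .ToArray();

        return names.Length == 0 ? "div" : "div." + string.Join(".", names);
    }

    public static void Main()
    {
        Console.WriteLine(BuildDivSelector(new[] { "float", "left" }));
        Console.WriteLine(BuildDivSelector(new string[0]));
    }
}
```

The resulting string can then be passed to QuerySelectorAll as in the method above.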

HTML agility pack get all divs with class

You could use the SelectNodes method:

// Note: SelectNodes returns null (not an empty collection) when nothing matches.
foreach(HtmlNode div in document.DocumentNode.SelectNodes("//div[contains(@class,'listevent')]"))
{
}

If you are more familiar with CSS-style selectors, try Fizzler and do this:

document.DocumentNode.QuerySelectorAll("div.listevent"); 
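Note that contains(@class,'listevent') is a substring test, so it would also match a class such as listevents-old. A common XPath 1.0 idiom pads the normalized class list with spaces and searches for the padded name. The sketch below compares the two expressions; the sample HTML and the names XPathClassDemo and CountMatches are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathClassDemo
{
    // Loads the HTML and counts the nodes selected by the given XPath.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath); // null when nothing matches
        return nodes?.Count ?? 0;
    }

    public static void Main()
    {
        // Hypothetical sample: two divs really have the class, one only contains the substring.
        const string html =
            "<div class='listevent'></div>" +
            "<div class='big listevent'></div>" +
            "<div class='listevents-old'></div>";

        const string loose = "//div[contains(@class,'listevent')]";
        const string strict = "//div[contains(concat(' ', normalize-space(@class), ' '), ' listevent ')]";

        Console.WriteLine("loose:  " + CountMatches(html, loose));
        Console.WriteLine("strict: " + CountMatches(html, strict));
    }
}
```

The strict form is more verbose, but it behaves like the CSS selector div.listevent without needing an extra library.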

How to Get element by class in HtmlAgilityPack

Html Agility Pack has XPath support, so you can do something like this:

foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//span[@class='" + ClassToGet + "']"))
{
    string value = node.InnerText;
    // etc...
}

This means: get all SPAN elements, starting from the top of the document (first /) and searching recursively (second /), whose CLASS attribute is exactly equal to the given value. Then for each element, get the inner text. Because the comparison is exact, a span with several classes (e.g. class="a b") will not match a search for "a" alone.

HtmlAgilityPack: get all elements by class

Learn XPath! :-) It's really simple, and will serve you well. In this case, what you want is:

SelectNodes("//*[@class='" + classValue + "']") ?? Enumerable.Empty<HtmlNode>();

HtmlAgilitypack enumerate all classes

I think you need something like this:

private static List<string> _InetReadEx(string sUrl)    // Returns string list
{
    var aRet = new List<string>(); // string list var
    try
    {
        var website = new HtmlAgilityPack.HtmlWeb(); // Init the object
        var htmlDoc = website.Load(sUrl); // Load doc from URL

        // Get all nodes with class value containing pid (null when nothing matches)
        var nodesWithPidClass = htmlDoc.DocumentNode.SelectNodes("//*[contains(@class,'pid')]");
        if (nodesWithPidClass != null) // If nodes found
        {
            for (int i = 0; i < nodesWithPidClass.Count; i++)
            {
                if (!string.IsNullOrWhiteSpace(nodesWithPidClass[i].InnerText) && // if not blank/null
                    !aRet.Contains(nodesWithPidClass[i].InnerText)) // if not already present
                {
                    aRet.Add(nodesWithPidClass[i].InnerText); // Add to result
                    Console.WriteLine(nodesWithPidClass[i].InnerText); // Demo line
                }
            }
        }
        return aRet;
    }
    catch (Exception)
    {
        throw; // Rethrow without resetting the stack trace (unlike "throw ex;")
    }
}

The XPath is //*[contains(@class,'pid')]:

  • //* - get all element nodes that...
  • [contains( - contain...
  • @class,'pid' - pid substring inside the class attribute value
  • )] - end of the contains condition

HtmlAgilityPack, PCL, without XPath: How to get all elements by class?

I've found a way to find elements by class in PCL projects. But you will have to use AngleSharp for this, not HtmlAgilityPack, because XPath is not available in PCL. See the AngleSharp documentation for more.

Select all elements by class in AngleSharp:

string html;
using (var client = new HttpClient())
{
    html = await client.GetStringAsync("http://your.content.com/some.html");
}
var parser = new HtmlParser();
var doc = parser.Parse(html); // In newer AngleSharp releases this method is named ParseDocument
var divs = doc.All.Where(e => e.LocalName == "div" && e.ClassList.Contains("your-class"));

Note: don't use the data from the site I linked above. That website needs JavaScript to add the os_poll elements, so this approach will not work there. That's another problem entirely and is outside the scope of this question.

how to get all the divs from a html with the same class in html agility pack

Use //div in your selection. The double slash selects every matching div in the whole document, even when you call it on a child node. Putting a dot before the slashes (.//div) restricts the search to the descendants of the current node.
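A small sketch of that difference, assuming a made-up document with a wrapper div whose id is container (the id, the sample HTML, and the names RelativeXPathDemo and CountDivs are illustrative only):

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Returns how many divs "//div" and ".//div" select when both are
    // evaluated on the same container node.
    public static (int WholeDocument, int InsideContainer) CountDivs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Hypothetical wrapper element; assumed to exist in the sample HTML.
        HtmlNode container = doc.DocumentNode.SelectSingleNode("//div[@id='container']");

        int whole = container.SelectNodes("//div")?.Count ?? 0;   // absolute: searches from the document root
        int inside = container.SelectNodes(".//div")?.Count ?? 0; // relative: only descendants of container
        return (whole, inside);
    }

    public static void Main()
    {
        const string html =
            "<div id='container'><div class='item'></div><div class='item'></div></div>" +
            "<div class='item'></div>";

        var counts = CountDivs(html);
        Console.WriteLine($"//div: {counts.WholeDocument}, .//div: {counts.InsideContainer}");
    }
}
```

Forgetting the leading dot is an easy mistake to make when looping over a node collection and querying inside each node.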


