How to Use HTML Agility Pack

Using HtmlAgilityPack to parse web page information in C#

I've an article that demonstrates scraping DOM elements with HAP (HTML Agility Pack) using ASP.NET. It simply lets you go through the whole process step by step. You can have a look and try it.

Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET

As for your code, it works fine for me. I tried it the same way you did, with a single change.

string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    outputLabel.Text += node.InnerHtml;
}

I got the expected output. The problem is that you are asking for DocumentElement on the HtmlDocument object, when the property you need is DocumentNode. Here's a response from a developer of HTML Agility Pack about the problem you are facing.

HTMLDocument.DocumentElement not in object browser
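A related pitfall with the loop above: by default SelectNodes returns null, not an empty collection, when nothing matches. Here is a minimal sketch of a null-safe version that works on an HTML string instead of a live URL. The class and method names (LinkExtractor, GetLinkTargets) are made up for illustration and are not part of HAP.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a href="..."> in the HTML.
    public static List<string> GetLinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // Go through DocumentNode (not DocumentElement) to query the tree.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");

        // With default options, no match means null, so guard before iterating.
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }

        return hrefs;
    }
}
```

Calling LinkExtractor.GetLinkTargets on a page without any links then gives you an empty list rather than a NullReferenceException.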

Add element to html using htmlagilitypack

Please check the code below. You need to set InnerHtml on the new element, append it, and then save the HTML document by calling doc.Save(yourfilepath).

if (item.Name == "span")
{
    // Create a <b> element, fill it, and attach it to the span
    HtmlNode bold = doc.CreateElement("b");
    bold.InnerHtml = "Hello world";
    item.AppendChild(bold);
    doc.Save(yourfilepath);
}
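If you want to try this without a file on disk, here is a small self-contained sketch of the same idea working on a string. SnippetEditor and AppendBold are hypothetical names chosen for this example, and it only touches the first span it finds.

```csharp
using HtmlAgilityPack;

public static class SnippetEditor
{
    // Hypothetical helper: appends <b>text</b> to the first <span> in the HTML
    // and returns the modified markup. Returns the input unchanged if there is no span.
    public static string AppendBold(string html, string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        if (span == null)
        {
            return html;
        }

        HtmlNode bold = doc.CreateElement("b");
        // InnerHtml is parsed as markup, so encode the text first if it is untrusted.
        bold.InnerHtml = text;
        span.AppendChild(bold);

        // Read the result back as a string instead of saving to a file.
        return doc.DocumentNode.OuterHtml;
    }
}
```

For example, SnippetEditor.AppendBold("<span>Hi </span>", "there") should give you markup where the span contains a b element with the text "there".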

How do I use HTML Agility Pack to edit an HTML snippet

  1. The same as a full HTML document. It doesn't matter.
  2. There are 2 options: you may edit the InnerHtml property directly (or Text on text nodes), or modify the DOM tree by using e.g. AppendChild, PrependChild etc.
  3. You may use the HtmlDocument.DocumentNode.OuterHtml property or the HtmlDocument.Save method (personally I prefer the second option).

As to parsing, I select the text nodes which contain the search term inside your div, and then just use string.Replace method to replace it:

var doc = new HtmlDocument();
doc.LoadHtml(html);
var textNodes = doc.DocumentNode.SelectNodes("/div/text()[contains(.,'specialSearchWord')]");
if (textNodes != null)
{
    foreach (HtmlTextNode node in textNodes)
    {
        node.Text = node.Text.Replace("specialSearchWord", "<a class='special' href='http://mysite.com/search/specialSearchWord'>specialSearchWord</a>");
    }
}

And saving the result to a string:

string result = null;
using (StringWriter writer = new StringWriter())
{
    doc.Save(writer);
    result = writer.ToString();
}

HtmlAgilityPack - How To Get Last Item Value

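SelectSingleNode picks up only the first node that matches the criteria. I would suggest using SelectNodes and then the LINQ Last() or LastOrDefault() method to get the last node of the results. Note that, by default, SelectNodes returns null when nothing matches, so check for that before calling either method.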

var pages = htmlDoc.DocumentNode.SelectNodes("//a[contains(@class,'page-link rounded')]").Last();
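The one-liner above throws if the page has no matching links. Here is a rough sketch of a null-tolerant variant, assuming you only need the text of the last link. Pager and GetLastPageText are names I made up, and the XPath is loosened to 'page-link' for the example.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class Pager
{
    // Hypothetical helper: returns the inner text of the last pagination link,
    // or null when the HTML contains no such link.
    public static string GetLastPageText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection links =
            doc.DocumentNode.SelectNodes("//a[contains(@class,'page-link')]");

        // ?. covers both cases: SelectNodes returned null, or the collection is empty.
        return links?.LastOrDefault()?.InnerText;
    }
}
```

The caller then only has to handle a single null result instead of catching an exception.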

How to use HtmlAgilityPack to get specific data from stock website

You are running into this problem because the website loads the data into the table via an AJAX request after the page has loaded, and HtmlAgilityPack can only see what the server sends back in the initial response.

You can confirm this by looking at the source that HtmlWeb downloads: in the DocumentNode HTML, the table with id "Listed_IncomeStatement_tableResult" has no data in its tbody.
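If you would rather check this in code than by eye, something along these lines could work. It is only a sketch: StaticHtmlCheck and TableHasDataRows are hypothetical names, and it assumes you pass in the HTML you already downloaded (for example doc.DocumentNode.OuterHtml).

```csharp
using HtmlAgilityPack;

public static class StaticHtmlCheck
{
    // Hypothetical helper: true if the table with the given id already contains
    // at least one row with a <td> cell in the HTML the server sent.
    // Note: tableId is pasted into the XPath as-is, so it must not contain a single quote.
    public static bool TableHasDataRows(string html, string tableId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection rows =
            doc.DocumentNode.SelectNodes("//table[@id='" + tableId + "']//tr[td]");

        // null means no matching rows: the table is missing, or filled in later by script.
        return rows != null && rows.Count > 0;
    }
}
```

If this returns false for the downloaded page while the table is clearly populated in your browser, the rows are being added by JavaScript after load.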

To avoid this problem, you should use Selenium WebDriver.

It drives a real browser (Firefox or Chrome, for example) that executes the complete page with all of its JavaScript, and then gives you back the complete source of the page after the scripts have run.

Here you can find the driver to use Chrome: Chrome Driver

After you have imported all the libraries, you only have to execute the following code:

// Make sure to pass the path of the folder where you extracted chromedriver.exe:
IWebDriver driver = new ChromeDriver(@"Path\To\Chromedriver");
driver.Navigate().GoToUrl("https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml");

After that, you can access the web page directly from the driver object, for example:

IWebElement myField = driver.FindElement(By.Id("tools"));

The only problem with ChromeDriver is that it opens a browser window to render everything. To avoid this, you can try another driver such as PhantomJS, which does the same as Chrome but does not open any window.

For more examples of how to use Selenium WebDriver with C#, I recommend taking a look at:

Selenium C# tutorial

Html Agility Pack how to get dynamically generated content after page loads

The content you want is generated after the page loads, using JavaScript and Ajax. HAP cannot get it unless it runs a browser in the background and executes the scripts on the page.

.NET Core 2.0

Prerequisites: you need the Chrome web browser installed on your PC.

  1. Create a console application

  2. Install NuGet packages
    Install-Package HtmlAgilityPack
    Install-Package Selenium.WebDriver
    Install-Package Selenium.Chrome.WebDriver

  3. Replace Main method by the following

Code:

static void Main(string[] args)
{
    string url = "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys";
    var browser = new ChromeDriver(Environment.CurrentDirectory);
    browser.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(30);
    browser.Navigate().GoToUrl(url);

    // FindElementByClassName is the Selenium 3 API;
    // in Selenium 4 use browser.FindElement(By.ClassName("ss-results")) instead.
    var results = browser.FindElementByClassName("ss-results");
    var doc = new HtmlDocument();
    doc.LoadHtml(results.GetAttribute("innerHTML"));

    // Show results
    var list = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row ss-targeted']");
    foreach (var title in list.SelectNodes(".//h2[@class='c-ProductListItem__title ng-binding']"))
    {
        Console.WriteLine(title.InnerText);
    }
    Console.ReadLine();
}

.NET Framework 4.6

  1. Create a console application

  2. Install NuGet package Install-Package HtmlAgilityPack

  3. In Solution Explorer add a reference to System.Windows.Forms

  4. Add using statements as required

  5. Replace Main method by the following

Code:

[STAThread]
static void Main(string[] args)
{
    string url = "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys";

    var web = new HtmlWeb();
    web.BrowserTimeout = TimeSpan.FromSeconds(30);

    var doc = web.LoadFromBrowser(url, o =>
    {
        var webBrowser = (WebBrowser)o;

        // Wait until the list shows up
        return webBrowser.Document.Body.InnerHtml.Contains("c-ProductList");
    });

    // Show results
    var list = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row ss-targeted']");
    foreach (var title in list.SelectNodes(".//h2[@class='c-ProductListItem__title ng-binding']"))
    {
        Console.WriteLine(title.InnerText);
    }
    Console.ReadLine();
}

Displays a list starting with:

Iron Man Mark L

John Wick

The Punisher War Machine Armor

Wonder Woman Deluxe Version

Finding node using HTML agility pack

It looks like you have multiple span elements with class="nameAndIcons". So in order to get them all you could use the SelectNodes function:

var nodes = doc.DocumentNode.SelectNodes("//span[@class='nameAndIcons']");
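One caveat: @class='nameAndIcons' only matches when the attribute is exactly that string, so a span with class="x nameAndIcons" is skipped. If your markup mixes in other classes, the usual XPath 1.0 workaround is to match the class as a whole token. A minimal sketch, where ClassMatcher and GetNames are hypothetical names used only for this example:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ClassMatcher
{
    // Hypothetical helper: returns the trimmed text of every <span> whose class list
    // contains the token "nameAndIcons", regardless of other classes on the element.
    public static List<string> GetNames(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the class attribute with spaces so that only whole tokens match
        // (this avoids matching e.g. "nameAndIconsExtra").
        const string xpath =
            "//span[contains(concat(' ', normalize-space(@class), ' '), ' nameAndIcons ')]";

        var names = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return names; // nothing matched
        }

        foreach (HtmlNode node in nodes)
        {
            names.Add(node.InnerText.Trim());
        }

        return names;
    }
}
```

The plain @class='nameAndIcons' test remains fine when the element never carries any other class.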

