Using HtmlAgilityPack to parse web page information in C#
I've an article that demonstrates scraping DOM elements with HAP (HTML Agility Pack) using ASP.NET. It simply lets you go through the whole process step by step. You can have a look and try it.
Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET
As for your code, it works fine for me. I tried it the same way you did, with a single change.
string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
outputLabel.Text += node.InnerHtml;
}
Got the output as expected. The problem is that you are asking for DocumentElement on the HtmlDocument object, when it should actually be DocumentNode. Here's a response from a developer of HtmlAgilityPack about the problem you are facing.
HTMLDocument.DocumentElement not in object browser
Add element to html using htmlagilitypack
Please check the code below. You need to set InnerHtml and save the HTML document by calling the Save method, doc.Save(yourfilepath).
if (item.Name == "span")
{
HtmlNode div = doc.CreateElement("b");
div.InnerHtml = "Hello world";
item.AppendChild(div);
doc.Save(yourfilepath);
}
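For reference, here is a self-contained sketch of the same idea that works on an HTML string instead of a file. The class name SpanAppender and the method name AppendBoldToSpans are made up for illustration, and it assumes you want the new element appended to every span in the markup.

```csharp
using HtmlAgilityPack;

public static class SpanAppender
{
    // Appends <b>Hello world</b> to every <span> in the given markup
    // and returns the modified HTML as a string.
    public static string AppendBoldToSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var spans = doc.DocumentNode.SelectNodes("//span");
        if (spans != null)
        {
            foreach (HtmlNode span in spans)
            {
                HtmlNode bold = doc.CreateElement("b");
                bold.InnerHtml = "Hello world";
                span.AppendChild(bold);
            }
        }

        // Call doc.Save(path) here instead if the result should go to a file.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Returning the markup as a string keeps the sketch independent of the file system; in your case you would keep the doc.Save(yourfilepath) call.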
How do I use HTML Agility Pack to edit an HTML snippet
- The same as a full HTML document. It doesn't matter.
- There are 2 options: you may edit the InnerHtml property directly (or Text on text nodes), or modify the DOM tree by using e.g. AppendChild, PrependChild, etc.
- You may use the HtmlDocument.DocumentNode.OuterHtml property or the HtmlDocument.Save method (personally I prefer the second option).
As to parsing, I select the text nodes which contain the search term inside your div, and then just use the string.Replace method to replace it:
var doc = new HtmlDocument();
doc.LoadHtml(html);
var textNodes = doc.DocumentNode.SelectNodes("/div/text()[contains(.,'specialSearchWord')]");
if (textNodes != null)
foreach (HtmlTextNode node in textNodes)
node.Text = node.Text.Replace("specialSearchWord", "<a class='special' href='http://mysite.com/search/specialSearchWord'>specialSearchWord</a>");
And saving the result to a string:
string result = null;
using (StringWriter writer = new StringWriter())
{
doc.Save(writer);
result = writer.ToString();
}
HtmlAgilityPack - How To Get Last Item Value
When you use SelectSingleNode, it picks up the first node that matches the criteria. I would suggest using SelectNodes and then the Last() or LastOrDefault() method to get the last node of the results.
var pages = htmlDoc.DocumentNode.SelectNodes("//a[contains(@class,'page-link rounded')]").Last();
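Note that SelectNodes returns null when nothing matches, so calling Last() directly on it can throw. Here is a minimal sketch of a null-safe variant; the class name LastNodeExample and the method name LastPageLinkText are hypothetical, and it assumes you only need the link's text.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class LastNodeExample
{
    // Returns the inner text of the last matching link,
    // or null when no link matches the XPath expression.
    public static string LastPageLinkText(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        var links = htmlDoc.DocumentNode
            .SelectNodes("//a[contains(@class,'page-link rounded')]");

        // Guard against the null collection before using LINQ on it.
        return links?.LastOrDefault()?.InnerText;
    }
}
```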
How to use HtmlAgilityPack to get specific data from stock website
You are running into this problem because the website loads the data into the table via an AJAX request after the page is loaded, but HtmlAgilityPack can only download what the server directly sends you.
You can verify this by looking at the source it downloads via HtmlWeb; in fact, in the DocumentNode HTML, the table tag with id "Listed_IncomeStatement_tableResult" has no data in its tbody.
To avoid this problem, you can use Selenium WebDriver.
It drives a real browser (Firefox or Chrome, for example) that executes the complete page, including all of its JavaScript, and then gives you back the complete source of the page after it has run.
Here you can find the driver to use Chrome: Chrome Driver
After you have imported all the libraries, you only have to execute the following code:
// Make sure to pass the path to the folder where you extracted chromedriver.exe:
IWebDriver driver = new ChromeDriver(@"Path\To\Chromedriver");
driver.Navigate().GoToUrl("https://www.vndirect.com.vn/portal/bao-cao-ket-qua-kinh-doanh/vjc.shtml");
After that, you will be able to access the web page directly from the driver object, like:
IWebElement myField = driver.FindElement(By.Id("tools"));
The only problem with ChromeDriver is that it opens a browser window to render everything. To avoid this, you can try another driver such as PhantomJS, which does the same as Chrome but does not open any window.
For more examples of how to use Selenium WebDriver with C#, I recommend taking a look at:
Selenium C# tutorial
Html Agility Pack how to get dynamically generated content after page loads
The content you want to get is generated after the page loads, using JavaScript and Ajax. HAP cannot get it unless it runs a browser in the background and executes the scripts on the page.
.Net Core 2.0
Pre-requisites: you need the Chrome web browser installed on your PC.
Create a console application
Install Nuget packages
Install-Package HtmlAgilityPack
Install-Package Selenium.WebDriver
Install-Package Selenium.Chrome.WebDriver
Replace the Main method with the following code:
static void Main(string[] args)
{
string url = "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys";
var browser = new ChromeDriver(Environment.CurrentDirectory);
browser.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(30);
browser.Navigate().GoToUrl(url);
var results = browser.FindElement(By.ClassName("ss-results"));
var doc = new HtmlDocument();
doc.LoadHtml(results.GetAttribute("innerHTML"));
// Show results
var list = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row ss-targeted']");
foreach (var title in list.SelectNodes(".//h2[@class='c-ProductListItem__title ng-binding']"))
{
Console.WriteLine(title.InnerText);
}
Console.ReadLine();
}
.Net 4.6
Create a console application
Install Nuget package
Install-Package HtmlAgilityPack
In Solution Explorer, add a reference to System.Windows.Forms
Add using statements as required
Replace the Main method with the following code:
[STAThread]
static void Main(string[] args)
{
string url = "https://www.sideshow.com/collectibles?manufacturer=Hot+Toys";
var web = new HtmlWeb();
web.BrowserTimeout = TimeSpan.FromSeconds(30);
var doc = web.LoadFromBrowser(url, o =>
{
var webBrowser = (WebBrowser)o;
// Wait until the list shows up
return webBrowser.Document.Body.InnerHtml.Contains("c-ProductList");
});
// Show results
var list = doc.DocumentNode.SelectSingleNode("//div[@class='c-ProductList row ss-targeted']");
foreach (var title in list.SelectNodes(".//h2[@class='c-ProductListItem__title ng-binding']"))
{
Console.WriteLine(title.InnerText);
}
Console.ReadLine();
}
Displays a list starting with:
Iron Man Mark L
John Wick
The Punisher War Machine Armor
Wonder Woman Deluxe Version
Finding node using HTML agility pack
It looks like you have multiple span elements with class="nameAndIcons". So in order to get them all you could use the SelectNodes function:
var nodes = doc.DocumentNode.SelectNodes("//span[@class='nameAndIcons']");
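To actually use the result you still have to loop over the collection, and check for null first because SelectNodes returns null when nothing matches. Here is a rough sketch, assuming you want the trimmed text of each span; the class name NameAndIconsExample and the method name GetTexts are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class NameAndIconsExample
{
    // Collects the trimmed inner text of every span whose class
    // attribute is exactly "nameAndIcons".
    public static List<string> GetTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//span[@class='nameAndIcons']");

        // No match means null, so hand back an empty list instead.
        if (nodes == null)
        {
            return new List<string>();
        }

        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }
}
```

The @class='nameAndIcons' test only matches when that is the whole attribute value. If your spans carry additional classes, contains(@class,'nameAndIcons') is a looser alternative.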