Parsing HTML Page with HtmlAgilityPack

Parsing HTML page with HtmlAgilityPack

There are a number of ways to select elements using the Agility Pack.

Let's assume we have defined our HtmlDocument as follows:

string html = @"<TD class=texte width=""50%"">
<DIV align=right>Name :<B> </B></DIV></TD>
<TD width=""50%"">
<INPUT class=box value=John maxLength=16 size=16 name=user_name>
</TD>
<TR vAlign=center>";

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

1. Simple LINQ
We could use the Descendants() method, passing the name of the element we are looking for:

var inputs = htmlDoc.DocumentNode.Descendants("input");

foreach (var input in inputs)
{
    Console.WriteLine(input.Attributes["value"].Value);
    // John
}

2. More advanced LINQ
We could narrow that down by using fancier LINQ:

var inputs = from input in htmlDoc.DocumentNode.Descendants("input")
             where input.Attributes["class"].Value == "box"
             select input;

foreach (var input in inputs)
{
    Console.WriteLine(input.Attributes["value"].Value);
    // John
}

3. XPath
Or we could use XPath.

string name = htmlDoc.DocumentNode
    .SelectSingleNode("//td/input")
    .Attributes["value"].Value;

Console.WriteLine(name);
//John

Parsing HTML files using HtmlAgilityPack

// Description: HAP - Load (From File)
// Website: https://html-agility-pack.net/
// Run: https://dotnetfiddle.net/EsvZyg

// @nuget: HtmlAgilityPack

using System;
using System.Xml;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        SaveHtmlFile();

        #region example

        var path = @"test.html";

        var doc = new HtmlDocument();
        doc.Load(path);

        var node = doc.DocumentNode.SelectSingleNode("//body");

        Console.WriteLine(node.OuterHtml);

        #endregion
    }

    private static void SaveHtmlFile()
    {
        var html =
            @"<!DOCTYPE html>
            <html>
            <body>
                <h1>This is <b>bold</b> heading</h1>
                <p>This is <u>underlined</u> paragraph</p>
                <h2>This is <i>italic</i> heading</h2>
            </body>
            </html>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        htmlDoc.Save("test.html");
    }
}

Using HtmlAgilityPack for parsing web page information in C#

I have an article that demonstrates scraping DOM elements with HAP (Html Agility Pack) in ASP.NET. It walks you through the whole process step by step. You can have a look and try it:

Scraping HTML DOM elements using HtmlAgilityPack (HAP) in ASP.NET

As for your code, it works fine for me. I tried it the same way you did, with a single change:

string url = "https://www.google.com";
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load(url);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a"))
{
    outputLabel.Text += node.InnerHtml;
}

I got the output as expected. The problem is that you are asking for DocumentElement on the HtmlDocument object, when it should actually be DocumentNode. Here's a response from a developer of HtmlAgilityPack about the problem you are facing:

HTMLDocument.DocumentElement not in object browser
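In short, the fix is a one-property change (a two-line sketch, using the doc variable from the example above):

// HtmlAgilityPack.HtmlDocument exposes DocumentNode, not DocumentElement:
// var links = doc.DocumentElement.SelectNodes("//a");  // does not compile
var links = doc.DocumentNode.SelectNodes("//a");         // works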

Parsing HTML using HtmlAgilityPack

XPath is your friend. Try this and forget about that crappy XLink syntax :-)

HtmlNodeCollection tl = document.DocumentNode.SelectNodes("//p[not(@*)]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Console.WriteLine(node.InnerText.Trim());
}

This expression will select all <p> nodes that don't have any attributes set. See here for other examples: XPath Syntax
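For example, given this hypothetical markup, only the second paragraph matches, because the first one carries a class attribute:

var demo = new HtmlDocument();
demo.LoadHtml("<p class=\"note\">skipped</p><p>selected</p>");

foreach (var node in demo.DocumentNode.SelectNodes("//p[not(@*)]"))
{
    Console.WriteLine(node.InnerText.Trim()); // prints: selected
}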

C# Html Agility Pack Parsing Data From Website

When you visit a site, you can press F12 and see all the network calls being made. You can then use those API calls to retrieve the data yourself, either with Postman or from C# using a REST client.

This is an example of how you can get the data you are looking for. I used the DevTools in Chrome to see the call being made under the Network tab.

public class Event
{
    public string eventId { get; set; }
    public string time { get; set; }
    public string agency { get; set; }
    public string lat { get; set; }
    public string lon { get; set; }
    public string depth { get; set; }
    public string rms { get; set; }
    public string type { get; set; }
    public string m { get; set; }
    public object place { get; set; }
    public string country { get; set; }
    public string city { get; set; }
    public string district { get; set; }
    public string town { get; set; }
    public string other { get; set; }
    public object mapImagePath { get; set; }
    public object strike1 { get; set; }
    public object dip1 { get; set; }
    public object rake1 { get; set; }
    public object strike2 { get; set; }
    public object dip2 { get; set; }
    public object rake2 { get; set; }
    public object ftype { get; set; }
    public object pic { get; set; }
    public object file { get; set; }
    public object focalId { get; set; }
    public string time2 { get; set; }
}

You can use the above class in your main program like this:

// Requires the RestSharp NuGet package.
var client = new RestClient("https://deprem.afad.gov.tr/latestCatalogsList");
client.Timeout = -1;
var request = new RestRequest(Method.POST);
request.AddHeader("Content-Type", "multipart/form-data");
request.AlwaysMultipartFormData = true;
request.AddParameter("m", "0");
request.AddParameter("utc", "0");
request.AddParameter("lastDay", "1");
var response = client.Execute<List<Event>>(request);

List<Event> myData = response.Data;
Console.WriteLine(response.Content);

You will have an object with all the data from the site. You can do whatever you need to with that data.
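For example, a minimal usage sketch that walks the deserialized list and prints a few of the properties defined above:

foreach (var e in myData)
{
    // time, m, city and district are properties of the Event class shown earlier.
    Console.WriteLine($"{e.time}  M{e.m}  {e.city} / {e.district}");
}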

Please do mark the post as answered if it helped.

Parsing HTML Document using Html Agility Pack

You are using the wrong method to load the HTML file; that's why the subsequent SelectNodes XPath query doesn't work.

doc.LoadHtml(string html) expects a string containing the full HTML markup, not a path to the document file.

Try this instead:

doc.Load("E://text.html");

Parse HTML Data Using HtmlAgilityPack

I have solved your problem without using HtmlAgilityPack. Here I am using System.Xml.

Note: you should add some unique value to identify the main li elements; here I have added the class 'Main'.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;

namespace Test
{
    public class Book
    {
        public string Href { get; set; }
        public string Title { get; set; }
        public string Author { get; set; }
        public string Characters { get; set; }
    }

    class Program
    {
        static void Main(string[] args)
        {
            string str = "<div id='title'><li class='Main'><h3><a href='www.harrypotter.com'>Harry Potter</a></h3><div>Harry James Potter is the title character of J. K. Rowling's Harry Potter series. </div>";
            str += "<ul><li>Harry Potter</li><li>Hermione Granger</li><li>Ron Weasley</li></ul></li><li class='Main'><h3><a href='www.littleprince.com'>Little Prince</a></h3><div>A little girl lives in a very grown-up world with her mother, who tries to prepare her for it. </div></li></div>";

            XmlDocument doc = new XmlDocument();
            doc.LoadXml(str);

            XmlNodeList xnList = doc.SelectNodes("//*[@id=\"title\"]//li[@class=\"Main\"]");

            List<Book> BookList = new List<Book>();

            for (int i = 0; i < xnList.Count; i++)
            {
                XmlNode TitleNode = xnList[i].SelectSingleNode("h3");
                XmlNode DescNode = xnList[i].SelectSingleNode("div");
                XmlNode AuthorNode = xnList[i].SelectSingleNode("ul");

                Book list = new Book();
                if (TitleNode != null)
                    list.Title = TitleNode.InnerText;
                else
                    list.Title = "";

                if (DescNode != null)
                    list.Author = DescNode.InnerText;
                else
                    list.Author = string.Empty;

                if (AuthorNode != null)
                    list.Characters = AuthorNode.InnerText;
                else
                    list.Characters = string.Empty;

                if (TitleNode != null && TitleNode.ChildNodes.Count > 0)
                {
                    XmlNode HrefNode = TitleNode.ChildNodes[0];
                    if (HrefNode != null && HrefNode.Attributes.Count > 0 && HrefNode.Attributes["href"] != null)
                        list.Href = HrefNode.Attributes["href"].Value;
                    else
                        list.Href = string.Empty;
                }
                else
                {
                    list.Href = string.Empty;
                }

                BookList.Add(list);
            }
        }
    }
}
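For comparison, since the rest of this page is about HtmlAgilityPack: a minimal sketch of the same extraction with HAP (assuming the same str markup as above), which has the advantage of not requiring well-formed XML:

var hapDoc = new HtmlAgilityPack.HtmlDocument();
hapDoc.LoadHtml(str);

foreach (var li in hapDoc.DocumentNode.SelectNodes("//div[@id='title']//li[@class='Main']"))
{
    // The h3/a element holds both the link target and the book title.
    var link = li.SelectSingleNode("h3/a");
    Console.WriteLine(link.GetAttributeValue("href", "") + " - " + link.InnerText);
}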

Parsing HTML page with HtmlAgilityPack using LINQ

Please try the following. You might also consider pulling the table apart, as it is a little better formed than the free text in the 'p' tag.

Cheers, Aaron.

// download the site content and create a new html document
// NOTE: make this asynchronous etc when considering IO performance
var url = "http://explorer.litecoin.net/address/Li7x5UZqWUy7o1tEC2x5o6cNsn2bmDxA2N";
var data = new WebClient().DownloadString(url);
var doc = new HtmlDocument();
doc.LoadHtml(data);

// extract the transactions 'h3' title, the node we want is directly before it
var transTitle =
    (from h3 in doc.DocumentNode.Descendants("h3")
     where h3.InnerText.ToLower() == "transactions"
     select h3).FirstOrDefault();

// tokenise the summary, one line per 'br' element, split each line by the ':' symbol
var summary = transTitle.PreviousSibling.PreviousSibling;
var tokens =
    (from row in summary.InnerHtml.Replace("<br>", "|").Split('|')
     where !string.IsNullOrEmpty(row.Trim())
     let line = row.Trim().Split(':')
     where line.Length == 2
     select new { name = line[0].Trim(), value = line[1].Trim() });

// using LINQPad to debug; the Dump command writes the current variable to the output
tokens.Dump();

Dump() is a LINQPad extension method that writes the variable to the output pane; the following is a sample of the output from the Dump command:

  • Balance: 5 LTC
  • Transactions in: 2
  • Received: 5 LTC
  • Transactions out: 0
  • Sent: 0 LTC
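If you do decide to pull the table apart instead of the free text, a rough sketch (assuming the transactions are rendered as rows of an HTML table on that page, which you should verify against the actual markup) could look like this:

// Hypothetical sketch: flatten every table row on the page into its cell texts.
// Adjust the selection to the specific table you are interested in.
var rows =
    from table in doc.DocumentNode.Descendants("table")
    from tr in table.Descendants("tr")
    select tr.Descendants("td").Select(td => td.InnerText.Trim()).ToList();

foreach (var cells in rows)
    Console.WriteLine(string.Join(" | ", cells));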

