Html Agility Pack - Parsing Tables

HTML Agility pack - parsing tables

How about something like this, using HTML Agility Pack:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table")) {
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr")) {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td")) {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}
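
One thing to watch for in the loop above: SelectNodes returns null rather than an empty collection when nothing matches, so a page without tables (or a row without cells) throws a NullReferenceException in the foreach. Below is a minimal sketch of one way to guard against that. The TableDump class and its Select helper are hypothetical names made up for this example, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class TableDump
{
    // Hypothetical helper: treat "no match" (null) as an empty sequence.
    static IEnumerable<HtmlNode> Select(HtmlNode node, string xpath) =>
        node.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();

    // Collects the trimmed text of every th/td cell in every table.
    public static List<string> CellTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var cells = new List<string>();
        foreach (var table in Select(doc.DocumentNode, "//table"))
        {
            foreach (var row in Select(table, "tr"))
            {
                foreach (var cell in Select(row, "th|td"))
                {
                    cells.Add(cell.InnerText.Trim());
                }
            }
        }
        return cells;
    }

    static void Main()
    {
        string html = @"<table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table>";
        foreach (string text in CellTexts(html))
        {
            Console.WriteLine("cell: " + text);
        }
    }
}
```

With this in place, a document that contains no table simply yields an empty list instead of an exception.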

Note that you can make it prettier with LINQ-to-Objects if you want:

var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
            from row in table.SelectNodes("tr").Cast<HtmlNode>()
            from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
            select new {Table = table.Id, CellText = cell.InnerText};

foreach (var cell in query) {
    Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
}

htmlAgilityPack parse table to datatable or array

Some notes:

  • You do not need a cast.
  • You are assuming that each row has headers.
  • SelectNodes takes an XPath expression. A bare name such as "tr" is evaluated relative to the current node and matches direct children only, so prefer explicit paths such as ".//tr" or "./td".

If I were you, I would use a foreach and model my data; that way I get more control and efficiency. But if you still want to do it your way, this is how it should look:

var query = from table in doc.DocumentNode.SelectNodes("//table")
            where table.Descendants("tr").Count() > 1 // make sure there are rows other than the header row
            from row in table.SelectNodes(".//tr[position()>1]") // skip the header row
            from cell in row.SelectNodes("./td")
            from header in table.SelectNodes(".//tr[1]/th") // select the header row cells, which is the first tr
            select new
            {
                Table = table.Id,
                Row = row.InnerText,
                Header = header.InnerText,
                CellText = cell.InnerText
            };
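
Note that the query above pairs every cell with every header cell, not only with the header of its own column. For the foreach-and-model approach mentioned earlier, here is a rough sketch that fills a DataTable and matches cells to headers by position. It assumes the first tr holds th cells with unique, non-empty texts and that the table has no nested tables; HtmlTableReader and ToDataTable are hypothetical names for this example.

```csharp
using System;
using System.Data;
using System.Linq;
using HtmlAgilityPack;

static class HtmlTableReader
{
    // Builds a DataTable from one <table> node.
    // Assumptions: first row = headers (unique texts), no nested tables.
    public static DataTable ToDataTable(HtmlNode table)
    {
        var result = new DataTable();
        var rows = table.Descendants("tr").ToList();
        if (rows.Count == 0) return result;

        // One column per header cell in the first row.
        var headers = rows[0].Elements("th")
                             .Select(h => h.InnerText.Trim())
                             .ToList();
        foreach (string header in headers)
        {
            result.Columns.Add(header);
        }

        // Remaining rows: cell i goes into column i; extra cells are ignored.
        foreach (HtmlNode row in rows.Skip(1))
        {
            object[] values = row.Elements("td")
                                 .Select(c => (object)c.InnerText.Trim())
                                 .Take(headers.Count)
                                 .ToArray();
            if (values.Length > 0) result.Rows.Add(values);
        }
        return result;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><th>Name</th><th>Age</th></tr><tr><td>Ann</td><td>30</td></tr></table>");
        DataTable data = ToDataTable(doc.DocumentNode.SelectSingleNode("//table"));
        foreach (DataRow row in data.Rows)
        {
            Console.WriteLine("{0} / {1}", row["Name"], row["Age"]);
        }
    }
}
```

Rows that have fewer cells than there are headers are still added; the missing columns are left as DBNull.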

Html Agility Pack loop through table rows and columns

I had to provide the full XPath. I got it by using Firebug, following a suggestion by @Coda (https://stackoverflow.com/a/3104048/1238850), and I ended up with this code:

foreach (HtmlNode row in doc.DocumentNode.SelectNodes("/html/body/table/tbody/tr/td/table[@id='table2']/tbody/tr"))
{
    HtmlNodeCollection cells = row.SelectNodes("td");
    for (int i = 0; i < cells.Count; ++i)
    {
        if (i == 0)
        {
            Response.Write("Person Name : " + cells[i].InnerText + "<br>");
        }
        else
        {
            Response.Write("Other attributes are: " + cells[i].InnerText + "<br>");
        }
    }
}

I am sure it can be written way better than this but it is working for me now.

HtmlAgilityPack - Parse table and assign rows to custom model

I ended up resolving this. I was missing two things, and it turns out it wasn't related to HtmlAgilityPack.

  1. I needed to add .Skip(1) to my foreach over the rows so that it skipped the table header row.
foreach (HtmlNode row in htmlDocument.DocumentNode.SelectNodes(xPath).Skip(1))

  2. I needed to fix my SalaryLoss value. I was assigning it as an int, but I needed to change that to a double as it was a currency value.
SalaryLoss = double.Parse(arr[6], System.Globalization.NumberStyles.Currency)
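
A side note on the currency parsing: the two-argument double.Parse overload uses the current thread culture, so the same cell text can parse differently (or fail) on a machine with other regional settings. Here is a small sketch that passes the culture explicitly. The en-US culture and the sample strings are assumptions for illustration, and SalaryParsing and ParseCurrency are hypothetical names. For money values decimal is often preferred over double, but double is kept here to match the fix above.

```csharp
using System;
using System.Globalization;

static class SalaryParsing
{
    // Parses a currency-formatted string using an explicit culture
    // instead of whatever culture the current thread happens to have.
    public static double ParseCurrency(string text, CultureInfo culture) =>
        double.Parse(text.Trim(), NumberStyles.Currency, culture);

    static void Main()
    {
        // Assumption: the scraped values are US-formatted, e.g. "$1,234.50".
        CultureInfo enUS = CultureInfo.GetCultureInfo("en-US");

        double salaryLoss = ParseCurrency(" $1,234.50 ", enUS);
        Console.WriteLine(salaryLoss.ToString(CultureInfo.InvariantCulture));
    }
}
```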

parse table with href html agility pack

Inside your foreach you need to check whether the cell contains an <a> tag. If it does, read the href attribute from that tag.

Something like this (untested):

foreach (var cell in table.SelectNodes(".//tr/td"))
{
    string someVariable = cell.InnerText;
    Debug.WriteLine(someVariable);

    // SelectNodes returns null when the cell contains no links
    var links = cell.SelectNodes(".//a");
    if (links == null || !links.Any())
    {
        continue;
    }

    foreach (var link in links)
    {
        // GetAttributeValue avoids a NullReferenceException for an <a> without href
        var href = link.GetAttributeValue("href", string.Empty);
        // do whatever you want with the link.
    }
}

Get specific Tables with Html Agility Pack

The error is in your second call: "//tr/td" goes back to the root element and searches the whole document. Your indexer is the correct solution for the first part of your problem; the second part can be fixed by specifying that you want to navigate from the node you are currently at:

HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var cell in table.SelectNodes(".//tr/td")) // notice the leading "."
{
    string someVariable = cell.InnerText;
}
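
To make the difference visible, here is a small sketch that runs both expressions from the same table node and counts the matches. The two-table sample markup is made up for this example, and XPathScope and CountCells are hypothetical names.

```csharp
using System;
using HtmlAgilityPack;

static class XPathScope
{
    // Counts td cells found from the first table node with an absolute
    // ("//tr/td") and a relative (".//tr/td") XPath expression.
    public static (int FromRoot, int FromTable) CountCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");

        // "//" starts at the document root, even when called on a node.
        int fromRoot = table.SelectNodes("//tr/td")?.Count ?? 0;

        // ".//" starts at the current node, so only this table is searched.
        int fromTable = table.SelectNodes(".//tr/td")?.Count ?? 0;

        return (fromRoot, fromTable);
    }

    static void Main()
    {
        string html = "<table><tr><td>a</td></tr></table>" +
                      "<table><tr><td>b</td><td>c</td></tr></table>";
        var counts = CountCells(html);
        Console.WriteLine("from root: {0}, from table: {1}", counts.FromRoot, counts.FromTable);
    }
}
```

For this sample the absolute expression finds the cells of both tables, while the relative one finds only the single cell of the first table.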

Not sure what else is going on, but by extending your test table to this code, the following just works on my test. It might mean that you need to share a little more context.

This is the Document I used for the tests:

<!DOCTYPE html>

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title></title>
</head>
<body>
<table class="newTable">
<tr>
<td>
<table border="0" cellpadding="3" cellspacing="2" width="100%">
<tr><td>
//table 1 - A contents
</td></tr>
</table>
</td>
</tr>

</table>
<table border="0" cellpadding="0" cellspacing="0" class="newTable">
<tr>
<td>
//table 2 contents
<table width="100%" border="0" cellspacing="2" cellpadding="0">
<tr>
<td>
//table 2 - A contents
</td>
</tr>
</table>
<table width="100%" border="0" cellspacing="2" cellpadding="0">
<tr>
<td>
//table 2 - B contents
</td>
</tr>
</table>
<table width="100%" cellspacing="2" cellpadding="0">
<tr>
<td>
//table 2 - C contents
</td>
</tr>
</table>
</td>
</tr>
</table>
<table>
<tr>
<td>
//table 3 contents
</td>
</tr>
</table>
</body>
</html>

And this is the code to extract the values you're after:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);

var node1A = doc.DocumentNode.SelectSingleNode("//table[1]//table[1]");
string content1A = node1A.InnerText;
Console.WriteLine(content1A);

var node2C = doc.DocumentNode.SelectSingleNode("//table[2]//table[3]");
string content2C = node2C.InnerText;
Console.WriteLine(content2C);

Shows:

[Screenshot of the console output; the image is not preserved here.]

Update

OK, I took your actual HTML and I get a NullReferenceException as well. Something in it must greatly confuse the Agility Pack, though I am not sure what. Some experimentation with the LINQ API seems to work, so I hope it can be an alternative for you:

var table = doc.DocumentNode.DescendantsAndSelf("table").Skip(1).First().Descendants("table").First();
var tds = table.Descendants("td");

HtmlAgilityPack only parses table rows partially

The HTML on that page is malformed. One possible workaround is to cut out the markup of the last table and parse it as a document of its own.

var client = new WebClient();
string html = client.DownloadString(url); // url: address of the page in question
int lastTableOpen = html.LastIndexOf("<table");
int lastTableClose = html.LastIndexOf("</table");
// + 8 is the length of "</table>", so the closing tag is included
string lastTable = html.Substring(lastTableOpen, lastTableClose - lastTableOpen + 8);

Then use HtmlAgilityPack:

var table = new HtmlDocument();
table.LoadHtml(lastTable);
foreach (var row in table.DocumentNode.SelectNodes("//table//tr"))
{
    // print the row's markup
    Console.WriteLine(row.OuterHtml);
}

But I don't know if there are problems in the table itself.
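
The string slicing above throws if the page has no table at all, because LastIndexOf returns -1. Below is a hedged sketch of the same idea with guards added; it is not a general fix for malformed HTML. LastTable and TryExtract are hypothetical names. Note that with nested tables the last "<table" may belong to an inner table while the last "</table>" belongs to the outer one, so the slice can be unbalanced.

```csharp
using System;

static class LastTable
{
    const string OpenTag = "<table";
    const string CloseTag = "</table>";

    // Tries to cut the last <table ...>...</table> span out of raw HTML.
    // Returns false when no opening tag, or no closing tag after it, exists.
    public static bool TryExtract(string html, out string table)
    {
        table = string.Empty;

        int open = html.LastIndexOf(OpenTag, StringComparison.OrdinalIgnoreCase);
        int close = html.LastIndexOf(CloseTag, StringComparison.OrdinalIgnoreCase);
        if (open < 0 || close < open)
        {
            return false;
        }

        // Include the closing tag itself in the slice.
        table = html.Substring(open, close - open + CloseTag.Length);
        return true;
    }

    static void Main()
    {
        string html = "<p>x</p><table><tr><td>1</td></tr></table>" +
                      "<table id='t2'><tr><td>2</td></tr></table><div>";
        if (TryExtract(html, out string last))
        {
            Console.WriteLine(last);
        }
    }
}
```

The extracted string can then be passed to LoadHtml exactly as in the snippet above.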


