HTML Agility pack - parsing tables
How about something like:
Using HTML Agility Pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table")) {
    Console.WriteLine("Found: " + table.Id);
    foreach (HtmlNode row in table.SelectNodes("tr")) {
        Console.WriteLine("row");
        foreach (HtmlNode cell in row.SelectNodes("th|td")) {
            Console.WriteLine("cell: " + cell.InnerText);
        }
    }
}
Note that you can make it prettier with LINQ-to-Objects if you want:
var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
from row in table.SelectNodes("tr").Cast<HtmlNode>()
from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
select new {Table = table.Id, CellText = cell.InnerText};
foreach(var cell in query) {
Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
}
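One caveat with both versions above: SelectNodes returns null, not an empty collection, when the XPath matches nothing, so a page without tables (or a table without rows) throws. A minimal sketch of a guard, assuming a hypothetical extension method named SelectNodesOrEmpty (it is not part of HTML Agility Pack) and a version of the library where HtmlNodeCollection is generic:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlNodeExtensions
{
    // Hypothetical helper, not part of HTML Agility Pack:
    // replaces the null that SelectNodes returns on "no match" with an empty sequence.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(this HtmlNode node, string xpath)
    {
        IEnumerable<HtmlNode> nodes = node.SelectNodes(xpath);
        return nodes ?? Enumerable.Empty<HtmlNode>();
    }
}

public static class TableDump
{
    // Returns the text of every th/td cell of every table in the given HTML.
    public static List<string> CellTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var query = from table in doc.DocumentNode.SelectNodesOrEmpty("//table")
                    from row in table.SelectNodesOrEmpty("tr")
                    from cell in row.SelectNodesOrEmpty("th|td")
                    select cell.InnerText;
        return query.ToList();
    }
}
```

With this guard the same query simply yields nothing for a document that has no tables.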
htmlAgilityPack parse table to datatable or array
Some notes:
- You do not need a cast.
- You are assuming that every row has headers.
- SelectNodes needs to receive an XPath expression, and you are passing just element names.
If I were you, I would use a foreach and model my data; that way I get more control and efficiency. But if you still want to do it your way, this is how it should look:
var query = from table in doc.DocumentNode.SelectNodes("//table")
where table.Descendants("tr").Count() > 1 //make sure there are rows other than header row
from row in table.SelectNodes(".//tr[position()>1]") //skip the header row
from cell in row.SelectNodes("./td")
from header in table.SelectNodes(".//tr[1]/th") //select the header row cells which is the first tr
select new
{
Table = table.Id,
Row = row.InnerText,
Header = header.InnerText,
CellText = cell.InnerText
};
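For comparison, the foreach approach suggested above could look roughly like the sketch below, which fills a System.Data.DataTable. The class and method names are hypothetical, and it assumes the first row holds the th header cells, the header texts are unique, and the table contains no nested tables:

```csharp
using System.Data;
using System.Linq;
using HtmlAgilityPack;

public static class TableConverter
{
    // Hypothetical helper: converts the first table in the HTML to a DataTable.
    // Assumes row 1 contains <th> cells and every later row contains <td> cells.
    public static DataTable ToDataTable(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new DataTable();

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table");
        if (table == null) return result;          // no table in the document

        HtmlNodeCollection rows = table.SelectNodes(".//tr");
        if (rows == null) return result;           // table without rows

        // One column per header cell (duplicate header texts would throw here).
        foreach (HtmlNode header in rows[0].Elements("th"))
        {
            result.Columns.Add(header.InnerText.Trim());
        }

        foreach (HtmlNode row in rows.Skip(1))
        {
            string[] cells = row.Elements("td")
                                .Select(td => td.InnerText.Trim())
                                .ToArray();

            // Skip rows that are empty or have more cells than there are headers.
            if (cells.Length == 0 || cells.Length > result.Columns.Count) continue;
            result.Rows.Add(cells);
        }
        return result;
    }
}
```

Unlike the query above, each cell ends up under exactly one header, because the columns are matched by position instead of being combined with every header cell.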
Html Agility Pack loop through table rows and columns
I had to provide the full XPath, which I got from Firebug following a suggestion by @Coda (https://stackoverflow.com/a/3104048/1238850). I ended up with this code:
foreach (HtmlNode row in doc.DocumentNode.SelectNodes("/html/body/table/tbody/tr/td/table[@id='table2']/tbody/tr"))
{
HtmlNodeCollection cells = row.SelectNodes("td");
for (int i = 0; i < cells.Count; ++i)
{
if (i == 0)
{ Response.Write("Person Name : " + cells[i].InnerText + "<br>"); }
else {
Response.Write("Other attributes are: " + cells[i].InnerText + "<br>");
}
}
}
I am sure it can be written way better than this but it is working for me now.
HtmlAgilityPack - Parse table and assign rows to custom model
I ended up resolving this. I was missing two things, and it turns out it wasn't related to HtmlAgilityPack.
- I needed to add .Skip(1) to my foreach row so that it skipped the table header row.
foreach (HtmlNode row in htmlDocument.DocumentNode.SelectNodes(xPath).Skip(1))
- I needed to fix my SalaryLoss value. I was assigning it as an int, but I needed to change that to a double as it was a currency value.
SalaryLoss = double.Parse(arr[6], System.Globalization.NumberStyles.Currency)
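The two fixes can be tried without any HTML at all. The sketch below uses hypothetical names and assumes the values look like "$1,234.50"; it passes an explicit NumberFormatInfo so the result does not depend on the machine's current culture:

```csharp
using System.Globalization;
using System.Linq;

public static class SalaryParsing
{
    // Assumption: amounts use "$" as the currency symbol, "," for grouping and "." for decimals.
    private static readonly NumberFormatInfo Dollars = new NumberFormatInfo { CurrencySymbol = "$" };

    // Parses a single currency string such as "$1,234.50".
    public static double ParseCurrency(string text)
    {
        return double.Parse(text, NumberStyles.Currency, Dollars);
    }

    // Skips the first entry (the header) and parses the remaining entries.
    public static double[] ParseColumn(string[] columnIncludingHeader)
    {
        return columnIncludingHeader.Skip(1).Select(ParseCurrency).ToArray();
    }
}
```

For example, SalaryParsing.ParseColumn(new[] { "Salary Loss", "$1,234.50", "$10.00" }) returns the two values 1234.5 and 10.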
parse table with href html agility pack
Inside your foreach, you need to check whether the cell contains an <a> tag. If it does, just read the href attribute from that tag.
Something like this (untested):
foreach (var cell in table.SelectNodes(".//tr/td"))
{
    string someVariable = cell.InnerText;
    Debug.WriteLine(someVariable);

    // SelectNodes returns null when the cell contains no <a> elements
    var links = cell.SelectNodes(".//a");
    if (links == null || !links.Any())
    {
        continue;
    }

    foreach (var link in links)
    {
        // GetAttributeValue falls back to the default when the <a> has no href
        var href = link.GetAttributeValue("href", string.Empty);
        // do whatever you want with the link.
    }
}
Get specific Tables with Html Agility Pack
The error is in your second call: "//tr/td" goes back to the root of the document. Your indexer is the correct solution for the first part of your problem; the second part can be fixed by making the XPath relative to the node you are currently at:
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");
foreach (var cell in table.SelectNodes(".//tr/td")) // **notice the .**
{
    string someVariable = cell.InnerText;
}
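The effect of the leading dot can be checked by counting matches both ways. This is a small sketch with hypothetical names, assuming a document with two sibling tables:

```csharp
using HtmlAgilityPack;

public static class XPathScope
{
    // Counts td cells found from the first table with an absolute and a relative XPath.
    public static (int fromRoot, int fromTable) CountCells(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]");

        // "//" starts at the document root, even though it is called on the table node.
        int fromRoot = table.SelectNodes("//tr/td").Count;

        // ".//" starts at the current node, so only this table's cells are matched.
        int fromTable = table.SelectNodes(".//tr/td").Count;

        return (fromRoot, fromTable);
    }
}
```

For a document with one cell in the first table and two cells in the second, fromRoot is 3 and fromTable is 1.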
Not sure what else is going on, but after extending your test table into the document below, the following code just works in my test. You may need to share a little more context.
This is the Document I used for the tests:
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<title></title>
</head>
<body>
<table class="newTable">
<tr>
<td>
<table border="0" cellpadding="3" cellspacing="2" width="100%">
<tr><td>
//table 1 - A contents
</td></tr>
</table>
</td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" class="newTable">
<tr>
<td>
//table 2 contents
<table width="100%" border="0" cellspacing="2" cellpadding="0">
<tr>
<td>
//table 2 - A contents
</td>
</tr>
</table>
<table width="100%" border="0" cellspacing="2" cellpadding="0">
<tr>
<td>
//table 2 - B contents
</td>
</tr>
</table>
<table width="100%" cellspacing="2" cellpadding="0">
<tr>
<td>
//table 2 - C contents
</td>
</tr>
</table>
</td>
</tr>
</table>
<table>
<tr>
<td>
//table 3 contents
</td>
</tr>
</table>
</body>
</html>
And this is the code to extract the values you're after:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(text);
var node1A = doc.DocumentNode.SelectSingleNode("//table[1]//table[1]");
string content1A = node1A.InnerText;
Console.WriteLine(content1A);
var node2C = doc.DocumentNode.SelectSingleNode("//table[2]//table[3]");
string content2C = node2C.InnerText;
Console.WriteLine(content2C);
Shows:
Update
OK, I took your actual HTML and I get a NullReferenceException as well. Something in it seems to greatly confuse the Agility Pack; I'm not sure what. Some experimentation with the LINQ API seems to work, though, so I hope it can be an alternative for you:
var table = doc.DocumentNode.DescendantsAndSelf("table").Skip(1).First().Descendants("table").First();
var tds = table.Descendants("td");
HtmlAgilityPack only parses table rows partially
The HTML on that page is malformed. One possible workaround is to cut out the markup of the last table and parse it as a separate document.
var client = new WebClient();
string html = client.DownloadString(url);
int lastTableOpen = html.LastIndexOf("<table");
int lastTableClose = html.LastIndexOf("</table");
string lastTable = html.Substring(lastTableOpen, lastTableClose - lastTableOpen + 8);
Then use HtmlAgilityPack:
var table = new HtmlDocument();
table.LoadHtml(lastTable);
foreach (var row in table.DocumentNode.SelectNodes("//table//tr"))
{
    // print the text content of the row
    Console.WriteLine(row.InnerText);
}
But I don't know if there are problems in the table itself.
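The substring arithmetic above adds 8 because that is the length of "</table>". A slightly more defensive sketch of the same idea, with hypothetical names, returns null when no complete table is found instead of throwing:

```csharp
using System;

public static class LastTable
{
    private const string CloseTag = "</table>";

    // Hypothetical helper: returns the markup of the last <table ...>...</table> in html,
    // or null when no opening/closing pair is found.
    public static string Extract(string html)
    {
        int open = html.LastIndexOf("<table", StringComparison.OrdinalIgnoreCase);
        int close = html.LastIndexOf("</table", StringComparison.OrdinalIgnoreCase);
        if (open < 0 || close < open) return null;

        // Include the closing tag itself; clamp in case the final ">" is missing.
        int length = Math.Min(close - open + CloseTag.Length, html.Length - open);
        return html.Substring(open, length);
    }
}
```

Like the original snippet, this is plain string slicing, so it picks the innermost table when the last table is nested inside another one.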