Parsing HTML With C#.Net

Parsing HTML with C#.NET

Take a look at the HTML Agility Pack. It's a pretty decent HTML parser.

http://html-agility-pack.net/?z=codeplex

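Here's some code to get you started (it only guards against missing links and missing href attributes; add further error checking as needed):

// Requires: using HtmlAgilityPack;
HtmlDocument document = new HtmlDocument();
string htmlString = "<html>blabla</html>";
document.LoadHtml(htmlString);
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection collection = document.DocumentNode.SelectNodes("//a");
if (collection != null)
{
    foreach (HtmlNode link in collection)
    {
        // GetAttributeValue returns the fallback when href is missing.
        string target = link.GetAttributeValue("href", string.Empty);
    }
}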

C# parsing HTML for general use?

I use the mshtml API.

Simply add a reference to the mshtml assembly, then include the namespace.

From there you can declare an HTMLDocument object, which is queryable. It's a bit of a headache in places because the API design forces you to do a lot of casting, but it gets the job done, and the casting can always be put into a utility class of its own so the oddities stay out of your main application classes.
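As a rough illustration of the casting described above, here is a minimal sketch. It assumes a COM reference to Microsoft.mshtml on Windows; the class name MshtmlLinkExtractor and the method GetLinkTargets are hypothetical names chosen for this example, not part of the mshtml API.

```csharp
// Sketch only: assumes a COM reference to Microsoft.mshtml (Windows only).
using System;
using System.Collections.Generic;
using mshtml;

static class MshtmlLinkExtractor   // hypothetical helper name
{
    // Returns the href value of every link mshtml finds in the markup.
    public static List<string> GetLinkTargets(string html)
    {
        // The object is created as HTMLDocument, then cast to the
        // interface that exposes write/close and the links collection.
        IHTMLDocument2 doc = (IHTMLDocument2)new HTMLDocument();
        doc.write(new object[] { html });
        doc.close();

        var targets = new List<string>();
        foreach (IHTMLElement element in doc.links)
        {
            // getAttribute returns object, so one more conversion is needed.
            // 0 selects the default attribute lookup.
            object href = element.getAttribute("href", 0);
            if (href != null)
            {
                targets.Add(href.ToString());
            }
        }
        return targets;
    }
}
```

Keeping the casts inside one helper like this is the "util class" idea from the answer: calling code only sees a method that takes a string and returns a list.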

Parsing HTML content with C# Parser

Here you go with a quick and dirty approach:

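// Requires: using System.Collections.Generic; using System.Globalization; using HtmlAgilityPack;
class RoomInfo
{
    public string Name { get; set; }
    public Dictionary<string, double> Prices { get; set; }
}

private static void HtmlFile()
{
    List<RoomInfo> rooms = new List<RoomInfo>();

    HtmlDocument document = new HtmlDocument();
    document.Load("file.txt");

    // SelectNodes returns null when nothing matches, so guard before iterating.
    var h2Nodes = document.DocumentNode.SelectNodes("//h2");
    if (h2Nodes == null) return;

    foreach (var h2Node in h2Nodes)
    {
        RoomInfo roomInfo = new RoomInfo
        {
            Name = h2Node.InnerText.Trim(),
            Prices = new Dictionary<string, double>()
        };

        // Assumes the element holding the labels is two siblings after the <h2>.
        var labels = h2Node.NextSibling?.NextSibling?.SelectNodes(".//label");
        if (labels != null)
        {
            foreach (var label in labels)
            {
                // "precio" (Spanish for "price") is the attribute read from each label.
                roomInfo.Prices.Add(label.InnerText.Trim(), Convert.ToDouble(label.Attributes["precio"].Value, CultureInfo.InvariantCulture));
            }
        }
        rooms.Add(roomInfo);
    }
}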

The rest is up to you! ;-)

Does .NET framework offer methods to parse an HTML string?

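The Windows Forms classes HtmlDocument and HtmlElement (namespace System.Windows.Forms) can do this; HtmlDocument.GetElementById looks up an element by its id.

You can create a dummy HTML document: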

WebBrowser w = new WebBrowser();
w.Navigate(String.Empty);
HtmlDocument doc = w.Document;
doc.Write("<html><head></head><body><img id=\"myImage\" src=\"c:\"/><a id=\"myLink\" href=\"myUrl\"/></body></html>");
Console.WriteLine(doc.Body.Children.Count);
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));
Console.WriteLine(doc.GetElementById("myLink").GetAttribute("href"));
Console.ReadKey();

Output:

2

file:///c:

about:myUrl

Editing elements:

HtmlElement imageElement = doc.GetElementById("myImage");
string newSource = "d:";
imageElement.OuterHtml = imageElement.OuterHtml.Replace(
"src=\"c:\"",
"src=\"" + newSource + "\"");
Console.WriteLine(doc.GetElementById("myImage").GetAttribute("src"));

Output:

file:///d:

Parsing HTML to get content using C#

It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:

string html;
// obtain some arbitrary html....
using (var client = new WebClient())
{
    html = client.DownloadString("http://stackoverflow.com/questions/2038104");
}
// use the html agility pack: http://www.codeplex.com/htmlagilitypack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
StringBuilder sb = new StringBuilder();
foreach (HtmlTextNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    sb.AppendLine(node.Text);
}
string final = sb.ToString();

Parsing HTML String

You can use the excellent HTML Agility Pack.

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't actually have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml offers, but for HTML documents (or streams).
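To make the "read/write DOM" point concrete, here is a minimal sketch that loads markup, changes an attribute through the DOM, and serializes the result. It assumes the HtmlAgilityPack NuGet package is referenced; LinkRewriter and RewriteLinks are hypothetical names used only for this example.

```csharp
// Sketch only: assumes the HtmlAgilityPack NuGet package is referenced.
using System;
using HtmlAgilityPack;

static class LinkRewriter   // hypothetical helper name
{
    // Points every <a href> at newTarget and returns the resulting markup.
    public static string RewriteLinks(string html, string newTarget)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);   // LoadHtml accepts imperfect, real-world markup

        // SelectNodes returns null when the XPath matches nothing.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                link.SetAttributeValue("href", newTarget);
            }
        }

        // OuterHtml serializes the modified DOM back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

A call such as LinkRewriter.RewriteLinks(pageHtml, "b.html") would return the page with every link retargeted; input without any links comes back unchanged apart from whatever normalization the parser applies.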


