Parsing HTML to Get Content Using C#

Parsing HTML to get content using C#

It isn't 100% clear what you want, but I'm assuming you want the text minus markup; so:

    using System.Net;
    using System.Text;
    using HtmlAgilityPack;

    string html;
    // obtain some arbitrary html....
    using (var client = new WebClient()) {
        html = client.DownloadString("http://stackoverflow.com/questions/2038104");
    }
    // use the html agility pack: http://www.codeplex.com/htmlagilitypack
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
    StringBuilder sb = new StringBuilder();
    // SelectNodes returns null (not an empty collection) when nothing matches
    var textNodes = doc.DocumentNode.SelectNodes("//text()");
    if (textNodes != null) {
        foreach (HtmlTextNode node in textNodes) {
            sb.AppendLine(node.Text);
        }
    }
    string final = sb.ToString();

Parsing HTML content with C# Parser

Here you go with a quick and dirty approach:

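    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using HtmlAgilityPack;

    class RoomInfo
    {
        public String Name { get; set; }
        public Dictionary<String, Double> Prices { get; set; }
    }

    private static void HtmlFile()
    {
        List<RoomInfo> rooms = new List<RoomInfo>();

        HtmlDocument document = new HtmlDocument();
        document.Load("file.txt");

        // SelectNodes returns null when the document has no <h2> elements
        var h2Nodes = document.DocumentNode.SelectNodes("//h2");
        if (h2Nodes == null)
        {
            return;
        }

        foreach (var h2Node in h2Nodes)
        {
            RoomInfo roomInfo = new RoomInfo
            {
                Name = h2Node.InnerText.Trim(),
                Prices = new Dictionary<string, double>()
            };

            // Assumes the element holding the labels is the second sibling
            // after each <h2> (the first sibling is usually a whitespace text node)
            var labels = h2Node.NextSibling?.NextSibling?.SelectNodes(".//label");
            if (labels != null)
            {
                foreach (var label in labels)
                {
                    // The price is read from the custom "precio" attribute
                    roomInfo.Prices.Add(label.InnerText.Trim(), Convert.ToDouble(label.Attributes["precio"].Value, CultureInfo.InvariantCulture));
                }
            }
            rooms.Add(roomInfo);
        }
    }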

The rest is up to you! ;-)

Parsing HTML with c#.net

Take a look at the HtmlAgilityPack. It's a pretty decent HTML parser:

http://html-agility-pack.net/?z=codeplex

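Here's some code to get you started (it still needs more error checking):

    HtmlDocument document = new HtmlDocument();
    string htmlString = "<html>blabla</html>";
    document.LoadHtml(htmlString);
    HtmlNodeCollection collection = document.DocumentNode.SelectNodes("//a");
    // SelectNodes returns null when there are no matches, as with this sample string
    if (collection != null)
    {
        foreach (HtmlNode link in collection)
        {
            // Falls back to an empty string when the <a> has no href attribute
            string target = link.GetAttributeValue("href", string.Empty);
        }
    }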

How can I parse this HTML to get the content I want?

Use HTMLAgilityPack to load the HTML document and then extract the footnotes with this XPath:

//td[text()='[hide]']/following-sibling::td

Basically, it first selects every td node whose text is exactly [hide], and then selects the td siblings that follow each of them. If you only want the td immediately after, append [1] to the expression. Once you have this collection of nodes you can extract their inner text (in C#, with the support provided in HtmlAgilityPack).
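As a rough sketch of how that XPath could be used, assuming the HtmlAgilityPack NuGet package is referenced: the class name `FootnoteExtractor`, the method `ExtractFootnotes` and the sample markup are made up for illustration and are not from the original question.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class FootnoteExtractor
{
    // Hypothetical helper: returns the text of every <td> that follows
    // a <td> whose text is exactly "[hide]".
    public static List<string> ExtractFootnotes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        var cells = doc.DocumentNode.SelectNodes(
            "//td[text()='[hide]']/following-sibling::td");

        // SelectNodes returns null when nothing matches
        if (cells == null)
        {
            return result;
        }

        foreach (HtmlNode cell in cells)
        {
            // Decode entities such as &amp; and trim surrounding whitespace
            result.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
        }
        return result;
    }
}
```

Calling `FootnoteExtractor.ExtractFootnotes("<table><tr><td>[hide]</td><td>Some footnote</td></tr></table>")` should give a list with the single entry `Some footnote`.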

How to PARSE HTML Files and SUBMIT information programmatically

I'm not sure if you want all of the things that you mention to execute 'server-side', but assuming that this is the case:

01 - Connect to an HTML file on the web.

Check out the WebClient class, and the HttpWebRequest class for more advanced scenarios.

02 - Parse its content (text content).
03 - Find out specific content in a page (for example looking for specific keywords).

You might want to look at the Html Agility Pack, or if Bobince doesn't notice, regular expressions.
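For the keyword part, once the page text has been extracted (for example with a `//text()` query in the Html Agility Pack), a plain case-insensitive substring search may be enough. This is only a sketch; `KeywordSearch` and `FindKeywords` are hypothetical names, and it does no word-boundary matching.

```csharp
using System;
using System.Collections.Generic;

static class KeywordSearch
{
    // Hypothetical helper: returns the keywords that occur in the text, ignoring case.
    public static List<string> FindKeywords(string text, IEnumerable<string> keywords)
    {
        var found = new List<string>();
        foreach (string keyword in keywords)
        {
            if (text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                found.Add(keyword);
            }
        }
        return found;
    }
}
```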

04 - How to submit information programmatically in HTML page (filling in forms).

Typically, this will require sending an HTTP POST request, which can also be accomplished with the HttpWebRequest class.
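A minimal sketch of such a POST with HttpWebRequest follows. The names `FormPoster`, `BuildFormBody` and `Post` are illustrative, and the field names and URL you pass in depend entirely on the form being submitted. It assumes the server accepts a standard application/x-www-form-urlencoded body; spaces are sent as %20 here. On recent .NET versions HttpWebRequest is marked obsolete in favour of HttpClient, but it still works.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;

static class FormPoster
{
    // Encodes name/value pairs as an application/x-www-form-urlencoded body.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            Uri.EscapeDataString(f.Key) + "=" + Uri.EscapeDataString(f.Value)));
    }

    // Posts the fields to the given URL and returns the response body as text.
    public static string Post(string url, IEnumerable<KeyValuePair<string, string>> fields)
    {
        byte[] body = Encoding.UTF8.GetBytes(BuildFormBody(fields));

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = body.Length;

        // Write the encoded form fields into the request body
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }

        // GetResponse throws a WebException for 4xx/5xx status codes
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The field names must match the `name` attributes of the form's inputs, which you can read from the page with the parser first.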


