Parse HTML Links Using C#

Parse HTML links using C#

SubSonic.Sugar.Web.ScrapeLinks seems to do part of what you want, but it fetches the HTML from a URL rather than taking it from a string. Its implementation is still worth reading as a reference.

Parsing HTML page to extract links

You can use this regular expression (quotes escaped as in a C# string literal):

href=\"[^\"]+\"
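A minimal sketch of applying that pattern to an HTML string, using only System.Text.RegularExpressions. The class name HrefRegexSketch and the method ExtractHrefs are hypothetical, and a capture group has been added around the URL so the value can be read without the surrounding href="...". Note the limits of the approach: it only sees double-quoted values, and it matches href on any tag, not just on anchors.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegexSketch
{
    // Hypothetical helper: returns the value of every double-quoted href attribute.
    public static List<string> ExtractHrefs(string html)
    {
        var results = new List<string>();

        // Same pattern as above, plus a capture group around the URL itself.
        MatchCollection matches =
            Regex.Matches(html, "href=\"([^\"]+)\"", RegexOptions.IgnoreCase);

        foreach (Match m in matches)
        {
            // Groups[1] is the text between the quotes.
            results.Add(m.Groups[1].Value);
        }

        return results;
    }
}
```

Calling HrefRegexSketch.ExtractHrefs(htmlString) returns the links in document order. For anything beyond quick extraction from markup you control, a real HTML parser is likely the safer choice.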


Parsing Hyperlinks from a webpage

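Try HtmlAgilityPack:

// Requires: using System; using HtmlAgilityPack;
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://www.msdn.com");
// By default SelectNodes returns null when no node matches, so guard before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Console.WriteLine(link.GetAttributeValue("href", null));
    }
}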

This prints every link found on the page at that URL.

If you want to store the links in a list:

// Requires: using System.Linq;
var linkList = doc.DocumentNode.SelectNodes("//a[@href]")
    .Select(i => i.GetAttributeValue("href", null)).ToList();

Parsing HTML with C#.NET

Take a look at HtmlAgilityPack. It's a pretty decent HTML parser.

http://html-agility-pack.net/?z=codeplex

Here's some code to get you started (requires error checking)

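HtmlDocument document = new HtmlDocument();
string htmlString = "<html>blabla</html>";
document.LoadHtml(htmlString);
// By default SelectNodes returns null when nothing matches (as with this sample string).
// Selecting only anchors that have an href keeps the attribute lookup below safe.
HtmlNodeCollection collection = document.DocumentNode.SelectNodes("//a[@href]");
if (collection != null)
{
    foreach (HtmlNode link in collection)
    {
        string target = link.Attributes["href"].Value;
    }
}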

How to extract a specific link in C#?

Use an XPath expression as a selector:

var alink = htmlDocument.DocumentNode
    .SelectSingleNode("//li/a[contains(@onclick, 'PDF')]")
    .GetAttributeValue("href", "");

Explanation of the XPath (as requested):

Match an a tag that is an immediate child of an li tag at any depth in the document, and whose onclick attribute contains the string 'PDF'. SelectSingleNode returns the first such a tag.
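To make the selector concrete, here is a small sketch that runs the same XPath against an HTML string. The class name PdfLinkSketch, the method FindPdfHref, and the sample markup in the usage note are made up for illustration. It assumes the HtmlAgilityPack package is referenced, and it returns null instead of throwing when nothing matches.

```csharp
using HtmlAgilityPack;

public static class PdfLinkSketch
{
    // Hypothetical helper: returns the href of the first <li>/<a> whose
    // onclick attribute contains "PDF", or null when there is no such link.
    public static string FindPdfHref(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when no node matches the expression.
        HtmlNode node = doc.DocumentNode
            .SelectSingleNode("//li/a[contains(@onclick, 'PDF')]");

        return node?.GetAttributeValue("href", "");
    }
}
```

Given a list such as <ul><li><a href="/view" onclick="open('HTML')">View</a></li><li><a href="/report.pdf" onclick="open('PDF')">Download</a></li></ul>, the first anchor is skipped because its onclick does not contain 'PDF', and the method returns "/report.pdf".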
