How to Get Img/Src or A/Hrefs Using HTML Agility Pack

How to get img/src or a/hrefs using Html Agility Pack?

The first example on the home page does something very similar, but consider:

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // use doc.LoadHtml(htmlSource) if the HTML is a string rather than a file
// note: SelectNodes returns null (not an empty collection) when nothing matches
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string href = link.Attributes["href"].Value;
    // store href somewhere
}

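So for img/@src, just replace each a with img and each href with src.
You might even be able to simplify to:

foreach (HtmlNode node in doc.DocumentNode
    .SelectNodes("//a[@href] | //img[@src]"))
{
    // HtmlNode has no Value property, so read the attribute that matches the element
    string attr = node.Name == "a" ? "href" : "src";
    list.Add(node.GetAttributeValue(attr, ""));
}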

For relative url handling, look at the Uri class.
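
As a rough sketch of what that could look like, the snippet below resolves a few made-up href/src values against a hypothetical page address using Uri.TryCreate from System. The UrlHelper class, the ToAbsolute method and the example.com addresses are illustrative assumptions, not part of Html Agility Pack.

```csharp
using System;

public static class UrlHelper
{
    // Hypothetical helper: resolves a possibly relative href/src value against
    // the address of the page it was found on. Returns null if it cannot be parsed.
    public static string ToAbsolute(Uri pageUri, string value)
    {
        return Uri.TryCreate(pageUri, value, out Uri result) ? result.AbsoluteUri : null;
    }

    public static void Main()
    {
        // Assumed address of the page the links were scraped from.
        var pageUri = new Uri("http://example.com/articles/index.htm");

        // Relative to the page's folder: http://example.com/articles/images/logo.png
        Console.WriteLine(ToAbsolute(pageUri, "images/logo.png"));
        // Relative to the site root: http://example.com/about.htm
        Console.WriteLine(ToAbsolute(pageUri, "/about.htm"));
        // Already absolute, so it is returned unchanged: http://other.example.org/x.htm
        Console.WriteLine(ToAbsolute(pageUri, "http://other.example.org/x.htm"));
    }
}
```

Each value read from an href or src attribute could be passed through a helper like this before it is stored.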

Set img src with Html Agility Pack

As far as I can tell, you have two options:

// Will give you a raw string.
// Not ideal if you are planning to
// send this over the network, or save as a file.
var updatedStr = html.DocumentNode.OuterHtml;

// Will let you write to any stream.
// Here, I'm just writing to a string builder as an example.
var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
{
    html.Save(writer);
}

// These two methods generate the same result, though.
Debug.Assert(string.Equals(updatedStr, sb.ToString()));

HTML Agility pack - parsing img src and href from relative paths

I can't test or run this right now, but you can try something like this:

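var htmlStr = "yourhtml";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var baseUri = new Uri("baseUriOfYourSite");
// SelectNodes returns null when nothing matches, so fall back to an empty sequence
var images = doc.DocumentNode.SelectNodes("//img[@src]") ?? Enumerable.Empty<HtmlNode>();
var links = doc.DocumentNode.SelectNodes("//a[@href]") ?? Enumerable.Empty<HtmlNode>();
foreach (var item in images.Concat(links))
{
    // rewrite the attribute itself; InnerText is read-only and is not the attribute value
    var name = item.Name == "img" ? "src" : "href";
    item.SetAttributeValue(name, new Uri(baseUri, item.GetAttributeValue(name, "")).AbsoluteUri);
}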

How to get first occurrence of src with HTML Agility Pack

I think I got it, with RegEx I just do:

var items = doc.DocumentNode.SelectNodes(".//item").ToArray();
foreach (var item in items)
{
    // Groups[1] captures the src value of the first img tag inside the item
    string matchString = Regex.Match(item.OuterHtml, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
    Console.WriteLine("img: " + matchString);
}

How can I get a single image from a website using HtmlAgilityPack?

The src of an HTML element is just an attribute, so it can be retrieved through the node's Attributes property.

However, the code in the question selects a div, so you first need to select the child img element; then you can access its source:

var imgContainer = document.DocumentNode.SelectSingleNode("//div[@class = 'featured']");

// ".//img" searches below imgContainer; "//img" would search the whole document
var imgNode = imgContainer.SelectSingleNode(".//img");

var src = imgNode.Attributes["src"].Value;

Alternatively, find the img directly using its id:

var imgNode = document.DocumentNode.SelectSingleNode("//img[@id = 'mainpic']");
Console.WriteLine(imgNode.Attributes["src"].Value);

How to get value of nested img src with Html Agility Pack?

With the url you mentioned in comments, you can do:

var web = new HtmlWeb();
var doc = web.Load("https://www.investing.com/");
var images = doc.DocumentNode.SelectNodes("//*[contains(@class,'js-articles')]//a[@class='img']//img");

foreach (var image in images)
{
    string source = image.Attributes["data-src"].Value;
    string label = image.Attributes["alt"].Value;
    Console.WriteLine($"\"{label}\" {source}");
}

How can I use HTML Agility Pack to retrieve all the images from a website?

You can do this using LINQ, like this:

var document = new HtmlWeb().Load(url);
var urls = document.DocumentNode.Descendants("img")
    .Select(e => e.GetAttributeValue("src", null))
    .Where(s => !String.IsNullOrEmpty(s));

EDIT: This code now actually works; I had forgotten to write document.DocumentNode.


