How to get img/src or a/hrefs using Html Agility Pack?
The first example on the home page does something very similar, but consider:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // use doc.LoadHtml(htmlSource) if the HTML is a string rather than a file
// Note: SelectNodes returns null when nothing matches.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string href = link.Attributes["href"].Value;
    // store href somewhere
}
So you can imagine that for img@src, just replace each a with img, and href with src.
You might even be able to simplify to:
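// SelectNodes yields element nodes rather than attributes, so match the elements and read the attribute.
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[@href] | //img[@src]"))
{
    list.Add(node.GetAttributeValue(node.Name == "img" ? "src" : "href", ""));
}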
For relative URL handling, look at the Uri class.
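As a rough illustration of what the Uri class can do here, the following sketch resolves an href or src value against the URL of the page it came from. The class name UrlResolver and the method name ToAbsolute are made up for this example; only System.Uri itself is part of .NET.

```csharp
using System;

// Hypothetical helper, not part of Html Agility Pack.
public static class UrlResolver
{
    // Resolves a possibly relative href/src value against the page URL.
    // Returns null when the value cannot be combined into a valid URI.
    public static string ToAbsolute(string pageUrl, string hrefOrSrc)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        return Uri.TryCreate(baseUri, hrefOrSrc, out Uri resolved)
            ? resolved.AbsoluteUri
            : null;
    }
}
```

For instance, ToAbsolute("http://example.com/docs/page.htm", "../img/logo.png") should give "http://example.com/img/logo.png", while a value that is already absolute is returned unchanged.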
Set img src with Html Agility Pack
As far as I can tell, you have two options:
// Will give you a raw string.
// Not ideal if you are planning to
// send this over the network, or save as a file.
var updatedStr = html.DocumentNode.OuterHtml;
// Will let you write to any stream.
// Here, I'm just writing to a string builder as an example.
var sb = new StringBuilder();
using (var writer = new StringWriter(sb))
{
    html.Save(writer);
}
// These two methods generate the same result, though.
Debug.Assert(string.Equals(updatedStr, sb.ToString()));
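The snippet above only covers getting the HTML back out. Changing the src itself might look like the sketch below, which assumes the HtmlAgilityPack NuGet package is referenced; the class name ImgSrcRewriter and the method name ReplaceAllSrc are invented for illustration.

```csharp
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper built on the calls used in the answer above.
public static class ImgSrcRewriter
{
    // Sets the src of every <img> to newSrc and returns the serialized document.
    public static string ReplaceAllSrc(string htmlSource, string newSrc)
    {
        var html = new HtmlDocument();
        html.LoadHtml(htmlSource);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var images = html.DocumentNode.SelectNodes("//img");
        if (images != null)
        {
            foreach (HtmlNode img in images)
            {
                // Adds the attribute if it is missing, otherwise overwrites its value.
                img.SetAttributeValue("src", newSrc);
            }
        }

        using (var writer = new StringWriter())
        {
            html.Save(writer);
            return writer.ToString();
        }
    }
}
```

In practice you would probably narrow the XPath (for example by id or class) rather than rewriting every image on the page.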
HTML Agility pack - parsing img src and href from relative paths
I can't test or run this right now, but you can try something like this:
var htmlStr = "yourhtml";
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var baseUri = new Uri("baseUriOfYourSite");
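// Select the elements themselves; SelectNodes returns null when nothing matches.
var images = doc.DocumentNode.SelectNodes("//img[@src]")?.ToList() ?? new List<HtmlNode>();
var links = doc.DocumentNode.SelectNodes("//a[@href]")?.ToList() ?? new List<HtmlNode>();
foreach (var item in images.Concat(links))
{
    var name = item.Name == "img" ? "src" : "href";
    item.SetAttributeValue(name, new Uri(baseUri, item.GetAttributeValue(name, "")).AbsoluteUri);
}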
How to get first occurrence of src with HTML Agility Pack
I think I got it, with RegEx I just do:
var items = doc.DocumentNode.SelectNodes(".//item").ToArray();
foreach (var item in items)
{
    // Group 1 captures the value of the first src attribute found in the item's markup.
    string matchString = Regex.Match(item.OuterHtml, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
    Console.WriteLine("img: " + matchString);
}
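If the img is parsed as a real descendant of each item (which will not be the case when the markup sits inside CDATA or escaped text), an XPath-only alternative to the regex may be enough. The sketch below assumes the HtmlAgilityPack NuGet package; the class name FirstImage and the method name FirstSrcPerItem are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: XPath alternative to the regex shown above.
public static class FirstImage
{
    // Returns the src of the first <img> found inside each <item> element.
    public static List<string> FirstSrcPerItem(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        var result = new List<string>();
        var items = doc.DocumentNode.SelectNodes("//item");
        if (items == null)
        {
            return result; // no <item> elements at all
        }

        foreach (HtmlNode item in items)
        {
            // The leading "." keeps the search inside the current item.
            // SelectSingleNode returns the first match, or null if there is none.
            HtmlNode img = item.SelectSingleNode(".//img[@src]");
            if (img != null)
            {
                result.Add(img.GetAttributeValue("src", ""));
            }
        }
        return result;
    }
}
```

Items without any img are simply skipped, so the returned list can be shorter than the number of items.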
How can I get a single image from a website using HtmlAgilityPack?
The src of an HTML element is just an attribute, so it can be retrieved through the node's Attributes property.
However, the code in the question selects a div, so you first need to select the child img element; then you can access its source:
var imgContainer = document.DocumentNode.SelectSingleNode("//div[@class = 'featured']");
var imgNode = imgContainer.SelectSingleNode(".//img"); // ".//" searches inside the div; "//img" would search the whole document
var src = imgNode.Attributes["src"].Value;
Alternatively, find the img directly using its id:
var imgContainer = document.DocumentNode.SelectSingleNode("//img[@id = 'mainpic']");
Console.WriteLine(imgContainer.Attributes["src"].Value);
How to get value of nested img src with Html Agility Pack?
With the URL you mentioned in the comments, you can do:
var web = new HtmlWeb();
var doc = web.Load("https://www.investing.com/");
var images = doc.DocumentNode.SelectNodes("//*[contains(@class,'js-articles')]//a[@class='img']//img");
foreach (var image in images)
{
    string source = image.Attributes["data-src"].Value;
    string label = image.Attributes["alt"].Value;
    Console.WriteLine($"\"{label}\" {source}");
}
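Note that image.Attributes["data-src"] is null when an img has no such attribute, so the loop above throws on the first image that lacks it. A more defensive lookup could be sketched as follows, assuming the HtmlAgilityPack NuGet package; the class name LazyImage and the method name BestSource are invented for this example, and whether a given site uses data-src at all is an assumption to check against its markup.

```csharp
using HtmlAgilityPack;

// Hypothetical helper for images that may or may not be lazy-loaded.
public static class LazyImage
{
    // Prefers data-src and falls back to src.
    // Returns an empty string when neither attribute is present.
    public static string BestSource(HtmlNode image)
    {
        string lazy = image.GetAttributeValue("data-src", "");
        return lazy.Length > 0 ? lazy : image.GetAttributeValue("src", "");
    }
}
```

Inside the loop you would then call LazyImage.BestSource(image) instead of reading Attributes["data-src"].Value directly.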
How can I use HTML Agility Pack to retrieve all the images from a website?
You can do this using LINQ, like this:
var document = new HtmlWeb().Load(url);
var urls = document.DocumentNode.Descendants("img")
    .Select(e => e.GetAttributeValue("src", null))
    .Where(s => !String.IsNullOrEmpty(s));
EDIT: This code now actually works; I had forgotten to write document.DocumentNode.