How do you parse an HTML string for image tags to get at the SRC information?
If your input string is valid XHTML, you can treat it as XML, load it into an XmlDocument, and do XPath magic :) But that's not always the case.
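As a rough sketch of the XML route, assuming the input really is well-formed XHTML, Python's standard `xml.etree.ElementTree` can walk the tree and collect the `src` values. The function name `img_srcs_from_xhtml` and the sample markup are made up for illustration. Ordinary tag-soup HTML makes it raise `ParseError`, which is the limitation mentioned above.

```python
import xml.etree.ElementTree as ET


def img_srcs_from_xhtml(xhtml):
    """Return the src value of every <img> element in a well-formed XHTML string."""
    root = ET.fromstring(xhtml)  # raises ET.ParseError if the markup is not well-formed
    srcs = []
    for elem in root.iter():
        # Tags look like "{namespace}img" when the document declares a namespace.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "img" and elem.get("src"):
            srcs.append(elem.get("src"))
    return srcs


# Hypothetical sample input, not taken from the question.
sample = (
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<img src="a.png" alt="" /><p><img src="b.jpg" alt="" /></p>'
    "</div>"
)
print(img_srcs_from_xhtml(sample))
```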
Otherwise you can try this function, which returns all image links from htmlSource:
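// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;
public List<Uri> FetchLinksFromSource(string htmlSource)
{
    List<Uri> links = new List<Uri>();
    // Group 1 captures the src value, with or without surrounding quotes.
    string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
    MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    foreach (Match m in matchesImgSrc)
    {
        string href = m.Groups[1].Value;
        // new Uri(href) alone throws on relative paths, so accept both kinds.
        links.Add(new Uri(href, UriKind.RelativeOrAbsolute));
    }
    return links;
}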
You can use it like this:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
    }
}
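A regex like the one above is sensitive to quoting and to what follows the attribute. For comparison, here is a minimal sketch of the same extraction with a tolerant parser, using Python's standard `html.parser`. The names `ImgSrcCollector` and `fetch_img_srcs` are hypothetical, and the sketch returns the raw `src` strings without validating them as URLs.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        # Self-closing tags (<img ... />) are routed here as well by default.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def fetch_img_srcs(html_source):
    """Return the src values of all <img> tags in html_source, in document order."""
    collector = ImgSrcCollector()
    collector.feed(html_source)
    collector.close()
    return collector.srcs


# Mixed case, unquoted and self-closing tags are all handled.
print(fetch_img_srcs('<p><IMG SRC=one.png><img alt="x" src="two.jpg" /></p>'))
```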
Parsing an Html string in order to retrieve and change the SRC url from an image tag
I found this to be the easiest way to do it, and also more efficient than jQuery. Basically, you inject the string into a div and use JavaScript to walk through the HTML, which is now parsed inside that div. Placing the string in a div gives you more functionality and flexibility if you want to add more logic or filtering to your solution. Below is a very simple block of code that anyone can follow and change to suit their needs. Good luck.
function parseHtmlString(yourHtmlString) {
    // Parse the string by assigning it to a detached div.
    var element = document.createElement('div');
    element.innerHTML = yourHtmlString;
    var imgSrcUrls = element.getElementsByTagName("img");
    for (var i = 0; i < imgSrcUrls.length; i++) {
        var urlValue = imgSrcUrls[i].getAttribute("src");
        if (urlValue) {
            imgSrcUrls[i].setAttribute("src", "Your Desired Change");
        }
    }
    // Return the modified markup so the caller can actually use it.
    return element.innerHTML;
}
Extract image src from a string
You need to use a capture group () to extract the URLs, and if you want to match globally (g), i.e. more than once, when using capture groups, you need to use exec in a loop (match ignores capture groups when matching globally).
For example:
var m,
    urls = [],
    str = '<img src="http://site.org/one.jpg />\n <img src="http://site.org/two.jpg />',
    rex = /<img[^>]+src="?([^"\s]+)"?\s*\/>/g;
while ( m = rex.exec( str ) ) {
    urls.push( m[1] );
}
console.log( urls );
// [ "http://site.org/one.jpg", "http://site.org/two.jpg" ]
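For comparison, here is the same capture-group idea sketched in Python, where `re.findall` returns the contents of the group for every match, so no explicit loop is needed. The pattern is the one from the answer. The sample string is an assumption: it is the answer's input with the closing quotes added.

```python
import re

# Same pattern as above; group 1 captures the URL.
IMG_SRC = re.compile(r'<img[^>]+src="?([^"\s]+)"?\s*/>')

html = '<img src="http://site.org/one.jpg" />\n <img src="http://site.org/two.jpg" />'

# With exactly one capture group, findall returns that group for each match.
urls = IMG_SRC.findall(html)

# finditer is the closer analogue of the exec loop: one match object per hit.
urls_from_iter = [m.group(1) for m in IMG_SRC.finditer(html)]

print(urls)
```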
Selecting and stripping img src in HTML string
Regex is not the right tool for the job. A more robust solution is to use an HTML parser like BeautifulSoup to extract the src attribute of the img tag, and a URL parser to remove the query from the URL:
from bs4 import BeautifulSoup
from urllib.parse import urlsplit
input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'''
soup = BeautifulSoup(input_str, "html.parser")
img_url = soup.find('img')['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
soup.find('img')['src'] = new_url
print(soup)
Output:
<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
Edit: if you have more than one img tag per string, you can use:
input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
<img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br><p><br></p><p> This is extra text in the body.</p>'''
soup = BeautifulSoup(input_str, "html.parser")
for img in soup.find_all('img'):
    img_url = img['src']
    new_url = urlsplit(img_url)._replace(query=None).geturl()
    img['src'] = new_url
print(soup)
This will update the src attribute of each img tag:
<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
<img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/><p><br/></p><p> This is extra text in the body.</p>
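To make the URL-parser step above concrete, here is a small standalone sketch using only the standard library. The helper name `strip_query` and the example.com URLs are illustrative assumptions. It rebuilds the URL from its parts with an empty query, which should be equivalent to the `_replace(query=...)` call in the answer.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url without its ?query part; scheme, host, path and fragment are kept."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


print(strip_query("https://example.com/media/photo.jpeg?X-Amz-Expires=3600&X-Amz-Date=20190430"))
```

Relative values such as `img/a.png?v=3` go through the same function unchanged apart from the query.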
Need to parse image src from HTML page then display it
Here's an AsyncTask that connects to lulpix and fakes a referrer and user agent (lulpix apparently tries to block scraping with some pretty lame checks). Start it like this in your Activity:
new ForTheLulz().execute();
The resulting Bitmap is downloaded in a pretty lame way (no caching, and no check for whether the image has already been downloaded), and error handling is pretty much non-existent, but the basic concept should be OK.
class ForTheLulz extends AsyncTask<Void, Void, Bitmap> {
    @Override
    protected Bitmap doInBackground(Void... args) {
        Bitmap result = null;
        try {
            Document doc = Jsoup.connect("http://lulpix.com")
                    .referrer("http://www.google.com")
                    .userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
                    .get();
            //parse("http://lulpix.com");
            if (doc != null) {
                Elements elems = doc.getElementsByAttributeValue("class", "pic rounded-8");
                if (elems != null && !elems.isEmpty()) {
                    Element elem = elems.first();
                    elems = elem.getElementsByTag("img");
                    if (elems != null && !elems.isEmpty()) {
                        elem = elems.first();
                        // absUrl resolves a relative "src" against the page's base URI
                        // and returns an empty string if no absolute URL can be built.
                        String src = elem.absUrl("src");
                        if (!src.isEmpty()) {
                            URL url = new URL(src);
                            InputStream is = url.openStream();
                            try {
                                result = BitmapFactory.decodeStream(is);
                            } finally {
                                is.close();
                            }
                        }
                    }
                }
            }
        } catch (IOException e) {
            // Error handling goes here
        }
        return result;
    }

    @Override
    protected void onPostExecute(Bitmap result) {
        ImageView lulz = (ImageView) findViewById(R.id.lulpix);
        if (result != null) {
            lulz.setImageBitmap(result);
        } else {
            //Your fallback drawable resource goes here
            //lulz.setImageResource(R.drawable.nolulzwherehad);
        }
    }
}
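The weak spot in the code above is what happens when `src` is a relative URL. As a small illustration of how relative values resolve against the page address, here is a sketch using Python's standard `urljoin`. The helper name `resolve_src` and the example.com addresses are assumptions made for the example.

```python
from urllib.parse import urljoin


def resolve_src(page_url, src):
    """Resolve an img src (relative or absolute) against the URL of the page it came from."""
    return urljoin(page_url, src)


page = "http://example.com/gallery/index.html"
for src in ["pics/1.jpg", "/img/2.png", "//cdn.example.org/3.gif", "http://other.example.net/4.jpg"]:
    print(src, "->", resolve_src(page, src))
```

Absolute URLs pass through untouched, so the same call can be applied to every `src` without checking first.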
How to get image source attribute value from img html tag in c#?
The following code will extract the value of the src attribute.
string str = "<div> <img src=\"https://i.testimg.com/images/g/test/s-l400.jpg\" style=\"width: 100%;\"> <div>Test</div> </div>";
// Get the index of where the value of src starts.
int start = str.IndexOf("<img src=\"") + 10;
// Get the substring that starts at start, and goes up to first \".
string src = str.Substring(start, str.IndexOf("\"", start) - start);
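The index arithmetic above assumes the `<img src="` marker is present and that `src` is the first attribute of the tag. Here is a hedged sketch of the same steps in Python with the not-found cases checked. The function name `first_img_src` is made up for illustration.

```python
def first_img_src(html):
    """Return the src of the first tag that starts with <img src=", or None if absent."""
    marker = '<img src="'
    start = html.find(marker)
    if start == -1:
        return None  # no tag of that exact shape
    start += len(marker)  # index where the src value begins
    end = html.find('"', start)  # the closing quote of the value
    if end == -1:
        return None  # unterminated attribute
    return html[start:end]


html = '<div> <img src="https://i.testimg.com/images/g/test/s-l400.jpg" style="width: 100%;"> <div>Test</div> </div>'
print(first_img_src(html))
```

Like the original, this misses tags where another attribute comes before `src`. An HTML parser is the safer choice for input you don't control.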
Iterate through an html string to find all img tags and replace the src attribute values
If I understand your need correctly, you can use HtmlAgilityPack for this purpose. Using regex may cause unwanted behavior. Can you try the code below?
// Requires: using System; using System.IO; using System.Linq; using System.Net;
// using System.Net.Mail; using System.Net.Mime; using HtmlAgilityPack;
public static string DoIt()
{
    string htmlString = "";
    using (WebClient client = new WebClient())
        htmlString = client.DownloadString("http://dean.edwards.name/my/base64-ie.html"); //This is an example source for base64 img src, you can change this directly to your source.
    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(htmlString);
    document.DocumentNode.Descendants("img")
        .Where(e =>
        {
            string src = e.GetAttributeValue("src", null) ?? "";
            return !string.IsNullOrEmpty(src) && src.StartsWith("data:image");
        })
        .ToList()
        .ForEach(x =>
        {
            string currentSrcValue = x.GetAttributeValue("src", null);
            currentSrcValue = currentSrcValue.Split(',')[1];//Base64 part of string
            byte[] imageData = Convert.FromBase64String(currentSrcValue);
            string contentId = Guid.NewGuid().ToString();
            LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
            inline.ContentId = contentId;
            inline.TransferEncoding = TransferEncoding.Base64;
            // If the HTML is sent as a mail body, add "inline" to the AlternateView's
            // LinkedResources so that the cid: reference below resolves.
            x.SetAttributeValue("src", "cid:" + inline.ContentId);
        });
    // The method is declared to return a string, so return the rewritten HTML.
    return document.DocumentNode.OuterHtml;
}
You can retrieve HtmlAgilityPack from https://www.nuget.org/packages/HtmlAgilityPack
Hope this helps
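The key step in the code above is splitting a `data:` URI into its header and Base64 payload and swapping in a `cid:` reference. Here is a minimal sketch of just that step in Python, using only the standard library. The names `split_data_uri` and `new_cid_src` are hypothetical, and only the `;base64` form of data URIs is handled.

```python
import base64
import uuid


def split_data_uri(src):
    """Return (mime_type, raw_bytes) for a base64-encoded data: URI."""
    header, _, payload = src.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


def new_cid_src():
    """Return a fresh cid: reference to use as the replacement src value."""
    return "cid:" + str(uuid.uuid4())


mime_type, data = split_data_uri("data:image/png;base64,aGVsbG8=")
print(mime_type, len(data), new_cid_src())
```

Reading the MIME type from the header avoids hard-coding "image/jpeg" for every image.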