How to Parse an HTML String for Image Tags to Get at The Src Information

How do you parse an HTML string for image tags to get at the SRC information?

If your input string is valid XHTML you can treat is as xml, load it into an xmldocument, and do XPath magic :) But it's not always the case.

Otherwise you can try this function, that will return all image links from HtmlSource :

public List<Uri> FetchLinksFromSource(string htmlSource)
{
List<Uri> links = new List<Uri>();
string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
foreach (Match m in matchesImgSrc)
{
string href = m.Groups[1].Value;
links.Add(new Uri(href));
}
return links;
}

And you can use it like this :

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.example.com");
request.Credentials = System.Net.CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
if (response.StatusCode == HttpStatusCode.OK)
{
using(StreamReader sr = new StreamReader(response.GetResponseStream()))
{
List<Uri> links = FetchLinksFromSource(sr.ReadToEnd());
}
}

Parsing an Html string in order to retrieve and change the SRC url from an image tag

I found this to be the easiest way to do it, and also more efficient than jQuery. basically you are injecting the string into a div and using JavaScript to parse through the Html string which is now in a div. By placing the string in the div it give more functionality and flexibility if you want to add more logic/filtering to your solution. I placed a very simple block of code anyone can follow and change to accommodate their solution. Good luck

function parseHtmlString(yourHtmlString) {
var element = document.createElement('div');
element.innerHTML = yourHtmlString;
var imgSrcUrls = element.getElementsByTagName("img");
for (var i = 0; i < imgSrcUrls.length; i++) {
var urlValue = imgSrcUrls[i].getAttribute("src");
if (urlValue) {
imgSrcUrls[i].setAttribute("src", "You Desired Change");
}
}
}

Extract image src from a string

You need to use a capture group () to extract the urls, and if you're wanting to match globally g, i.e. more than once, when using capture groups, you need to use exec in a loop (match ignores capture groups when matching globally).

For example

var m,
urls = [],
str = '<img src="http://site.org/one.jpg />\n <img src="http://site.org/two.jpg />',
rex = /<img[^>]+src="?([^"\s]+)"?\s*\/>/g;

while ( m = rex.exec( str ) ) {
urls.push( m[1] );
}

console.log( urls );
// [ "http://site.org/one.jpg", "http://site.org/two.jpg" ]

Selecting and stripping img src in HTML string

Regex is not the tool for the job. A more robust solution is using a HTML parser like BeautifulSoup to extract the src attribute of the img tag, and a URL parser to remove the query from the URL:

from bs4 import BeautifulSoup
from urllib.parse import urlsplit

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")
img_url = soup.find('img')['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
soup.find('img')['src'] = new_url
print(soup)

Output:

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>

Edit: if you have more than one img tag per string, you can use:

input_str = '''<p><img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br></p><p><br></p><p> This is extra text in the body.</p>
<img src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJZALJ3EN746L6QWQ%2F20190430%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20190430T021347Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=daf406a830d7d0f1ac2d631603b95e7e2ce0bdacd58d5a383d35f6dcd1466012" style="width: 50%; float: right;" class="note-float-right"><br><p><br></p><p> This is extra text in the body.</p>'''

soup = BeautifulSoup(input_str, "html.parser")

for img in soup.find_all('img'):
img_url = img['src']
new_url = urlsplit(img_url)._replace(query=None).geturl()
img['src'] = new_url
print(soup)

This will update the src attribute of each img tag:

<p><img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/></p><p><br/></p><p> This is extra text in the body.</p>
<img class="note-float-right" src="https://s3beanzoid.s3.us-east-2.amazonaws.com/media/django-summernote/2019-04-30/ec707c65-aa6d-4b81-a252-2fa1c1aef087.jpeg" style="width: 50%; float: right;"/><br/><p><br/></p><p> This is extra text in the body.</p>

Need to parse image src from HTML page then display it

Here's an AsyncTask that connects to lulpix, fakes a referrer & user-agent (lulpix tries to block scraping with some pretty lame checks apparently). Starts like this in your Activity:

new ForTheLulz().execute();

The resulting Bitmap is downloaded in a pretty lame way (no caching or checks if the image is already DL:ed) & error handling is overall pretty non-existent - but the basic concept should be ok.

class ForTheLulz extends AsyncTask<Void, Void, Bitmap> {
@Override
protected Bitmap doInBackground(Void... args) {
Bitmap result = null;
try {
Document doc = Jsoup.connect("http://lulpix.com")
.referrer("http://www.google.com")
.userAgent("Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
.get();
//parse("http://lulpix.com");
if (doc != null) {
Elements elems = doc.getElementsByAttributeValue("class", "pic rounded-8");
if (elems != null && !elems.isEmpty()) {
Element elem = elems.first();
elems = elem.getElementsByTag("img");
if (elems != null && !elems.isEmpty()) {
elem = elems.first();
String src = elem.attr("src");
if (src != null) {
URL url = new URL(src);
// Just assuming that "src" isn't a relative URL is probably stupid.
InputStream is = url.openStream();
try {
result = BitmapFactory.decodeStream(is);
} finally {
is.close();
}
}
}
}
}
} catch (IOException e) {
// Error handling goes here
}
return result;
}
@Override
protected void onPostExecute(Bitmap result) {
ImageView lulz = (ImageView) findViewById(R.id.lulpix);
if (result != null) {
lulz.setImageBitmap(result);
} else {
//Your fallback drawable resource goes here
//lulz.setImageResource(R.drawable.nolulzwherehad);
}
}
}

How to get image source attribute value from img html tag in c#?

The following code will extract the value of the src attribute.

string str = "<div> <img src=\"https://i.testimg.com/images/g/test/s-l400.jpg\" style=\"width: 100%;\"> <div>Test</div> </div>";

// Get the index of where the value of src starts.
int start = str.IndexOf("<img src=\"") + 10;

// Get the substring that starts at start, and goes up to first \".
string src = str.Substring(start, str.IndexOf("\"", start) - start);

Iterate through an html string to find all img tags and replace the src attribute values

If I understand your need correctly you can use HtmlAgilityPack for this purpose. Using regex may cause unwanted behavior. Can you try the code below ?

public static string DoIt()
{
string htmlString = "";
using (WebClient client = new WebClient())
htmlString = client.DownloadString("http://dean.edwards.name/my/base64-ie.html"); //This is an example source for base64 img src, you can change this directly to your source.

HtmlDocument document = new HtmlDocument();
document.LoadHtml(htmlString);
document.DocumentNode.Descendants("img")
.Where(e =>
{
string src = e.GetAttributeValue("src", null) ?? "";
return !string.IsNullOrEmpty(src) && src.StartsWith("data:image");
})
.ToList()
.ForEach(x =>
{
string currentSrcValue = x.GetAttributeValue("src", null);
currentSrcValue = currentSrcValue.Split(',')[1];//Base64 part of string
byte[] imageData = Convert.FromBase64String(currentSrcValue);
string contentId = Guid.NewGuid().ToString();
LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
inline.ContentId = contentId;
inline.TransferEncoding = TransferEncoding.Base64;

x.SetAttributeValue("src", "cid:" + inline.ContentId);
});

string result = document.DocumentNode.OuterHtml;
}

You can retrieve HtmlAgilityPack from https://www.nuget.org/packages/HtmlAgilityPack

Hope this helps



Related Topics



Leave a reply



Submit