Invalidprogramexception/Common Language Runtime Detected an Invalid Program

Regular Expression to get the SRC of images in C#

string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;

Regular expression for getting filenames from img tag of html code

Use the question "Regular Expression to get the SRC of images in C#" to get the src, then use "new FileInfo(path).Name versus Path.GetFileName(path)" to get the base name.

So it is "kind of a duplicate", but you need to combine the code from both questions to accomplish what you want. Generally, it's not the best idea to use regex to parse HTML: there are so many ways HTML can be constructed and placed on a page, so be sure to test all your patterns.
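
As a rough sketch of combining the two steps: the pattern is the one from the first question, and Path.GetFileName takes the base name of the captured value. The class and method names are hypothetical, and the sketch assumes the src has no query string.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class ImageFileNameSketch
{
    // Hypothetical helper: file name of the first img src, or "" if no img is found.
    public static string FirstImageFileName(string html)
    {
        // Step 1: capture the src value (group 1) of the first img tag.
        string src = Regex.Match(html, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;

        // Step 2: keep only the part after the last directory separator.
        return Path.GetFileName(src);
    }

    public static void Main()
    {
        Console.WriteLine(FirstImageFileName("<img alt=\"x\" src=\"/images/photo.jpg\" />")); // photo.jpg
    }
}
```

If the src can carry a query string (for example photo.jpg?v=2), it would need to be stripped before calling Path.GetFileName.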

Regular Expression to get the SRC of images in asp.net

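Julio's answer is a good one, but the next regex uses a backreference so that the closing quote must match the opening one (which allows the src to contain the other kind of quote), and it also handles empty src values:

<img[^>]*?\ssrc=(["'])(.*?)\1

The value is matched with .*? rather than [^\1]*? because \1 is not a backreference inside a character class.

The full src of the img (without quotes) will be group number 2 in the regular expression.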
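
A minimal sketch of the backreference approach in C#. It uses .*? for the value instead of [^\1]*?, since a backreference has no special meaning inside a character class. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedSrcSketch
{
    // Group 1 is the opening quote; \1 requires the same quote to close the value.
    private static readonly Regex ImgSrc =
        new Regex(@"<img[^>]*?\ssrc=([""'])(.*?)\1", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the src value (group 2), or null when no img matches.
    public static string FirstSrc(string html)
    {
        Match m = ImgSrc.Match(html);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(FirstSrc("<img class='a' src=\"it's.png\">")); // it's.png
        Console.WriteLine(FirstSrc("<img src=''>") == "");               // True
    }
}
```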

Regular Expression to remove the first section from src attribute of an image

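using System;
using System.Text.RegularExpressions;

public class Test
{
    public static void Main()
    {
        var example = "<img src=\"data:image/jpeg;base64,/9j/.....";
        // Strip the "data:image/<type>;base64," prefix from the src value.
        string res = Regex.Replace(example, "data:image/\\w+;base64,", "");
        Console.WriteLine(res);
    }
}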

and the output is

<img src="/9j/.....

How to get all Images src's of some html

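You should use Regex.Matches instead of Match to get every occurrence. The Multiline option is not needed here, since it only changes how ^ and $ behave; if a tag can span several lines, RegexOptions.Singleline is the option that lets . match newlines. The pattern below also ends in .*?> rather than .+?> so that a tag whose src is its last attribute still matches:

foreach (Match m in Regex.Matches(sometext, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase))
{
    string src = m.Groups[1].Value;
    // add src to some collection
}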
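
For illustration, a self-contained version of that loop that collects the values into a list. The class and method names are hypothetical, and the pattern ends in .*?> so that an img whose src is its last attribute is still matched.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AllImageSrcSketch
{
    // Hypothetical helper: returns every img src found in the given HTML text.
    public static List<string> GetImageSources(string html)
    {
        var sources = new List<string>();
        foreach (Match m in Regex.Matches(html, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase))
        {
            sources.Add(m.Groups[1].Value);
        }
        return sources;
    }

    public static void Main()
    {
        var found = GetImageSources("<p><img src=\"a.png\"><img class=\"x\" src='b.jpg' /></p>");
        Console.WriteLine(string.Join(", ", found)); // a.png, b.jpg
    }
}
```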

Regular Expression to find src from IMG tag

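You don't want a regular expression, you want a parser such as Html Agility Pack. From this question:

using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        // SelectNodes returns null when nothing matches the XPath.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null) return;

        foreach (var node in nodes)
        {
            // HtmlNode has no src property; read the attribute instead.
            Console.WriteLine(node.GetAttributeValue("src", ""));
        }
    }
}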

Simple regex to extract img src attribute's value

This appears to meet your requirements:

 id="zoom-product-image" src="([^_]*_img_b\.jpg)

Breaking it down:

  • id="zoom-product-image" src=" : match this literal text, which
    must come immediately before the URL
  • ( : begin capture
  • [^_]* : match 0 or more characters that are NOT _
  • _img_b\.jpg : match the literal string _img_b.jpg (the dot is escaped)
  • ) : end capture
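
A quick way to try the pattern in C#. The sample markup and URL are made up for illustration, and the class and method names are hypothetical. Because of [^_]*, the pattern only matches when the URL contains no underscore before _img_b.jpg.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZoomImageSketch
{
    // Hypothetical helper: returns the captured URL, or null when the pattern does not match.
    public static string ZoomImageSrc(string html)
    {
        Match m = Regex.Match(html, @"id=""zoom-product-image"" src=""([^_]*_img_b\.jpg)");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Made-up sample input.
        string html = "<img id=\"zoom-product-image\" src=\"http://example.com/p/12345_img_b.jpg\" />";
        Console.WriteLine(ZoomImageSrc(html)); // http://example.com/p/12345_img_b.jpg
    }
}
```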

How do I select src between if img exists?

First, the standard disclaimer: if you are using regexes to parse an HTML DOM, you are DOING IT WRONG. As with all structured data (XML, JSON, and so forth), the right way to parse HTML is to use something built for that purpose, and query it using its querying system.

That said, it is often the case that what you want is a quick hack on the commandline or the search field of an editor or whatever, and you don't want or need to faff with writing an application that loads in DOM-parsing libraries.

In that case, if you're not actually writing a program, and you don't mind that there are edge-cases where any regex you try will break, then consider something like this:

/<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"/i ... maybe replacing the leading / and trailing /i with whatever other thing your language uses to denote a case-insensitive regular expression.

Note that this makes assumptions: that the URL is quoted with double quotes, the tag is correctly formed, there are no extraneous <img strings in the document, there are no double quotes in the URL, and countless others that I didn't think of, but a proper parser would. These assumptions are a large part of why using a parser is so important: it makes no such assumptions, and if fed garbage, will correctly let you know that you did so, rather than trying to digest it and giving you pain later on.

  • <img\b - an img tag. The word boundary ensures this isn't an imgur tag or whatever.
  • [^<>]+ - one or more characters, none of which is the > that would close the tag or, for safety, a < that would open another one.
  • \bsrc\s*=\s* - 'src=', but with optional whitespace, and another word-boundary check.
  • "([^"]+)" - some URL consisting of non-quote characters, within quotes.
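
If the quick hack does end up in C#, the same pattern could be tried like this. This is only a sketch under the assumptions listed above (double-quoted URL, well-formed tag); the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuickImgSrcSketch
{
    // Same pattern as above; RegexOptions.IgnoreCase plays the role of the trailing /i.
    private static readonly Regex ImgSrc =
        new Regex(@"<img\b[^<>]+\bsrc\s*=\s*""([^""]+)""", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the first double-quoted img src, or null if there is none.
    public static string FirstSrc(string html)
    {
        Match m = ImgSrc.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // The <imgur> tag is skipped because of the word boundary after "img".
        string html = "<imgur src=\"no.png\"><img alt=\"\" src = \"/directory/Images/a.png\">";
        Console.WriteLine(FirstSrc(html)); // /directory/Images/a.png
    }
}
```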

Now, be aware that since we're doing NO security checking on the URL, you could be grabbing anything, such as javascript:...something malicious..., or it could be 6GB long - you just don't know. You could add in checking for such things, but you'll always miss something, unless you control the input and know exactly what you're parsing.

Your mention of "my application" does mean that I must reiterate: the above is almost certainly the wrong way to do it if you are writing an application, and the question you should be asking is probably closer to "how do I get the value of the src attribute of an img tag from an HTML page, in my chosen programming language?" rather than "how do I use regexes to extract this token from this HTML tag?"

When I say this, I don't mean "ivory-tower computer scientists will look down their nose at you" - though I admit there can be a lot of that kind of snootiness in programming :D

I mean something more like... "you're setting yourself up for pain as you run into edge-case after edge-case, and spiral down into a deep rabbit-hole of infinitely refining your regex. And you can likely avoid the pain with a simple one-liner, infinitely nicer than regex, perhaps document.querySelector('img[src^="/directory/Images"]') as @LGSon suggests in a comment."

People will say this because they've had this pain, and they're wincing at the idea that you might suffer it too.


