Regex to Get Src Value from an Img Tag

Regular Expression to extract src attribute from img tag

Your pattern should be (unescaped):

src\s*=\s*"(.+?)"

The important part is the question mark after the +: it makes the quantifier lazy, so the group matches as few characters as possible and stops at the first closing quote.
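To see what the question mark changes, here is a small sketch comparing the lazy pattern with a greedy variant. The sample tag is made up for illustration, and the greedy pattern is only there for contrast.

```javascript
// Hypothetical sample input with a second quoted attribute after src.
const tag = '<img src="a.png" alt="logo">';

// Lazy: (.+?) stops at the first closing quote.
const lazy = /src\s*=\s*"(.+?)"/;
// Greedy: (.+) runs on to the last quote it can find.
const greedy = /src\s*=\s*"(.+)"/;

const lazyValue = tag.match(lazy)[1];
const greedyValue = tag.match(greedy)[1];

console.log(lazyValue);   // a.png
console.log(greedyValue); // a.png" alt="logo
```

Without the question mark, the capture swallows every attribute up to the last double quote in the input.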

regex extract img src javascript

Not a big fan of using regex to parse html content, so here goes the longer way

var html = "<img height=\"100\" src=\"data:image/png;base64,testurlhere\" width=\"200\"></img>";
// Let the browser parse the markup inside a detached element
var tmp = document.createElement('div');
tmp.innerHTML = html;
var src = tmp.querySelector('img').getAttribute('src');
console.log(src);

Regex & PHP - isolate src attribute from img tag

If you don't wish to use regex (or any non-standard PHP components), a reasonable solution using the built-in DOMDocument class would be as follows:

<?php
$doc = new DOMDocument();
$doc->loadHTML('<img src="http://example.com/img/image.jpg" ... />');
$imageTags = $doc->getElementsByTagName('img');

foreach ($imageTags as $tag) {
    echo $tag->getAttribute('src');
}
?>

Regular expression to find src attribute of HTML img element in PHP

Thanks, everyone, for helping me out.

I found my solution by using:

pattern = "/src=([^\\\"]+)/"

Regex to get src value from an img tag

Parse your HTML with something else. HTML is not regular and thus regular expressions aren't at all suited to parsing it.

Use an HTML parser, or an XML parser if the HTML is strict. It's a lot easier to get the src attribute's value using XPath:

//img/@src

In .NET, XML parsing is built into the System.Xml namespace. It's incredibly powerful. HTML parsing is a bit more difficult if the HTML isn't strict, but there are lots of libraries around that will do it for you.

Simple regex to extract img src attribute's value

This appears to meet your requirements:

 id="zoom-product-image" src="([^_]*_img_b\.jpg)

Breaking it down:

  • id="zoom-product-image" src=" : match text beginning with this literal string
  • ( : begin capture
  • [^_]* : match 0 or more characters that are NOT _
  • _img_b\.jpg : match this literal string (the backslash escapes the dot so it matches only a real .)
  • ) : end capture
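As a rough check of the breakdown above, this sketch runs the pattern against a made-up tag. The file path and the alt attribute are assumptions for illustration only.

```javascript
// The pattern from the answer, including its leading space before id.
const ZOOM_SRC = / id="zoom-product-image" src="([^_]*_img_b\.jpg)/;

// Hypothetical markup; only the id and the _img_b.jpg suffix come from the answer.
const sampleTag = '<img id="zoom-product-image" src="/images/12345_img_b.jpg" alt="Product">';

const zoomMatch = sampleTag.match(ZOOM_SRC);
const zoomSrc = zoomMatch ? zoomMatch[1] : null;

console.log(zoomSrc); // /images/12345_img_b.jpg
```

Because [^_]* cannot cross an underscore, a path that contains an earlier underscore (for example /my_images/...) would not match.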

Regex img Tag parsing with src, width, height

To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use

"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"

Here is a Java demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
// Same pattern as above, written as a Java string literal
Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (!matcher.group(1).isEmpty()) { // We have a new IMG tag
        System.out.println("\n--- NEW MATCH ---");
    }
    System.out.println(matcher.group(2) + ": " + matcher.group(4));
}

The regex details:

  • (<img\\b|(?!^)\\G) - the initial boundary, matching either the start of an <img> tag or the end of the previous successful match
  • [^>]*? - match any optional attributes we are not interested in (0+ characters other than > so as to stay inside the tag)
  • \\b(src|width|height)= - a whole word src=, width= or height=
  • ([\"']?) - a technical 3rd group to check the attribute value delimiter
  • ([^>]*?) - Group 4 containing the attribute value (0+ characters other than >, as few as possible, up to the first occurrence of the Group 3 delimiter)
  • \\3 - the attribute value delimiter matched by Group 3 (NOTE: if the delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)

The logic:

  • Match the start of img tag
  • Then, match everything that is inside, but only capture the attributes we need
  • Since we are going to have multiple matches, not groups, we need to find a boundary for each new img tag. This is done by checking if the first group is not empty (if (!matcher.group(1).isEmpty()))
  • All that remains is to add a list for keeping the matches.
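For comparison, here is how the "keep a list of matches" step might look in JavaScript. This is not a port of the Java pattern: JavaScript regexes have no \G anchor, so the sketch swaps in a two-pass approach (find each img tag, then scan that tag for attributes). The function name collectImgAttributes is made up, and unlike the Java pattern it assumes every attribute value is quoted.

```javascript
// Pass 1 finds each <img ...> tag; pass 2 pulls the wanted attributes out of it.
const IMG_TAG = /<img\b[^>]*>/gi;
// Group 1: attribute name, group 2: quote character, group 3: value.
const ATTRIBUTE = /\b(src|width|height)\s*=\s*(["'])(.*?)\2/gi;

// Returns one object per img tag, e.g. { src: "...", width: "...", height: "..." }.
function collectImgAttributes(html) {
  const images = [];
  for (const tag of html.match(IMG_TAG) ?? []) {
    const attrs = {};
    for (const [, name, , value] of tag.matchAll(ATTRIBUTE)) {
      attrs[name.toLowerCase()] = value;
    }
    images.push(attrs);
  }
  return images;
}

// Same sample input as the Java demo.
const sample =
  '<img height="132" src="NEW_myurl.jpg" width="112">' +
  '<link src="/test/test.css"/>' +
  '<img src="myurl.jpg" width="12" height="32">';

console.log(collectImgAttributes(sample));
```

Each img tag yields its own object, so the "new tag" boundary check from the Java version is not needed here. The link tag's src is ignored because pass 1 never selects it.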

How do I select src between if img exists?

First, the standard disclaimer: if you are using regexes to parse an HTML DOM, you are DOING IT WRONG. As with all structured data (XML, JSON, and so forth), the right way to parse HTML is to use something built for that purpose, and query it using its querying system.

That said, it is often the case that what you want is a quick hack on the command line or in the search field of an editor or whatever, and you don't want or need to faff with writing an application that loads in DOM-parsing libraries.

In that case, if you're not actually writing a program, and you don't mind that there are edge-cases where any regex you try will break, then consider something like this:

/<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"/i

Replace the leading / and trailing /i with whatever your language uses to denote a case-insensitive regular expression.

Note that this makes assumptions: that the URL is quoted with double quotes, that the tag is correctly formed, that there are no extraneous <img strings in the document, that there are no double quotes in the URL, and countless others that I didn't think of but a proper parser would. These assumptions are a large part of why using a parser is so important: it makes no such assumptions, and if fed garbage, will correctly let you know that you did so, rather than trying to digest it and giving you pain later on.

  • <img\b - an img tag. The word boundary ensures this isn't an imgur tag or whatever.
  • [^<>]+ - one or more characters that are not > (so we stay inside the tag) and, for safety, not < either.
  • \bsrc\s*=\s* - 'src=', but with optional whitespace, and another word-boundary check.
  • "([^"]+)" - some URL consisting of non-quote characters, within quotes.
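In the quick-hack spirit described above, this is roughly how the pattern could be used in JavaScript. The g flag is added so every match is returned, and the helper name extractImgSources and the sample markup are made up for illustration.

```javascript
// The pattern from above, with the g flag added so matchAll can return every match.
const IMG_SRC = /<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"/gi;

// Returns the captured src value of every img tag the pattern recognises.
function extractImgSources(html) {
  return Array.from(html.matchAll(IMG_SRC), (m) => m[1]);
}

// Hypothetical input: mixed case, spaces around =, and an <imgur> decoy tag.
const page =
  '<p>imgur</p>' +
  '<IMG alt="a" SRC = "/one.png">' +
  '<imgur src="no.png">' +
  '<img class="x" src="two.jpg">';

console.log(extractImgSources(page)); // [ '/one.png', 'two.jpg' ]
```

The decoy is skipped because of the word boundary after img. A single-quoted or unquoted src would be missed, as the listed assumptions warn.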

Now, be aware that since we're doing NO security checking on the URL, you could be grabbing anything, such as javascript:...something malicious..., or it could be 6GB long - you just don't know. You could add in checking for such things, but you'll always miss something, unless you control the input and know exactly what you're parsing.
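As one example of the kind of checking mentioned above, this sketch rejects overlong values and anything that does not resolve to http or https. The length limit, the allowed protocols, the base URL and the name isAcceptableSrc are all assumptions chosen for illustration, and the caveat still applies: a check like this will not catch everything.

```javascript
// Assumed policy: only http(s) URLs, and an arbitrary upper bound on length.
const ALLOWED_PROTOCOLS = new Set(['http:', 'https:']);
const MAX_SRC_LENGTH = 2048;

// Resolves src against a (hypothetical) page URL and checks the resulting protocol.
function isAcceptableSrc(src, baseUrl = 'https://example.com/') {
  if (src.length > MAX_SRC_LENGTH) return false;
  try {
    const { protocol } = new URL(src, baseUrl);
    return ALLOWED_PROTOCOLS.has(protocol);
  } catch {
    return false; // not parseable as a URL at all
  }
}

console.log(isAcceptableSrc('/img/photo.png'));      // true
console.log(isAcceptableSrc('javascript:alert(1)')); // false
```

Resolving through the URL constructor, rather than testing the string with another regex, leaves scheme parsing to code built for that purpose.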

Your mention of "my application" does mean that I must reiterate: the above is almost certainly the wrong way to do it if you are writing an application, and the question you should be asking is probably closer to "how do I get the value of the src attribute of an img tag from an HTML page, in my chosen programming language?" rather than "how do I use regexes to extract this token from this HTML tag?"

When I say this, I don't mean "ivory-tower computer scientists will look down their nose at you" - though I admit there can be a lot of that kind of snootiness in programming :D

I mean something more like... "you're setting yourself up for pain as you run into edge-case after edge-case, and spiral down into a deep rabbit-hole of infinitely refining your regex. And you can likely avoid the pain with a simple one-liner, infinitely nicer than regex, perhaps document.querySelector('img[src^="/directory/Images"]') as @LGSon suggests in a comment."

People will say this because they've had this pain, and they're wincing at the idea that you might suffer it too.


