Using Regular Expressions to Extract the First Image Source from HTML Codes

Using regular expressions to extract the first image source from HTML code?

While regular expressions are good for a large variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of a document is so variable that it is hard to extract a tag accurately (and by accurately I mean a 100% success rate with no false positives).

What I recommend you do is use a DOM parser such as SimpleHTML and use it as such:

function get_first_image($html) {
    require_once('SimpleHTML.class.php');

    $post_html = str_get_html($html);

    // find('img', 0) returns the first <img> element, or null if there is none
    $first_img = $post_html->find('img', 0);

    if ($first_img !== null) {
        return $first_img->src;
    }

    return null;
}

Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.

A regular expression could be devised to achieve the same goal, but it would be limited: it would force the alt attribute to come after the src attribute (or the other way around), and overcoming this limitation would add more complexity to the regular expression.
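For comparison, here is a minimal Python sketch of the same parser-based idea, using the standard library's html.parser instead of the PHP library above. The class name FirstImageParser and the (src, alt) return shape are my own assumptions for illustration. It shows that once a parser has split the tag into attributes, the order of src and alt no longer matters.

```python
from html.parser import HTMLParser


class FirstImageParser(HTMLParser):
    """Records the attributes of the first <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.first_img = None  # dict of attributes, or None if no <img> seen

    def handle_starttag(self, tag, attrs):
        # html.parser hands over tag and attribute names in lowercase.
        if tag == "img" and self.first_img is None:
            self.first_img = dict(attrs)


def get_first_image(html):
    """Return (src, alt) of the first <img>, or None if there is none."""
    parser = FirstImageParser()
    parser.feed(html)
    parser.close()
    if parser.first_img is None:
        return None
    return parser.first_img.get("src"), parser.first_img.get("alt")


# alt comes before src, the names are uppercase, and src is unquoted.
print(get_first_image('<p><IMG alt="A cat" SRC=cat.png><img src="dog.png"></p>'))
```

A missing attribute simply comes back as None, so adding more attributes later does not require touching a pattern.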

Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:

<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>

And then again, the above can fail if:

  • The attribute or tag name is in uppercase and the i modifier is not used.
  • Quotes are not used around the src attribute.
  • An attribute other than src contains the > character somewhere in its value.
  • Some other reason I have not foreseen.

So again, simply don't use regular expressions to parse a DOM document.
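The failure cases listed above are easy to reproduce. The Python sketch below uses a simplified stand-in for the pattern (an assumption on my part: the original uses a possessive quantifier, which only newer Python versions accept), and the helper name first_src is hypothetical.

```python
import re

# Simplified stand-in for the pattern above (assumption): quoted src only.
IMG_SRC = r'<\s*img\s+[^>]*?src\s*=\s*(["\'])(.*?)\1[^>]*>'


def first_src(html, flags=0):
    """Return the src captured by the regex, or None if it does not match."""
    m = re.search(IMG_SRC, html, flags)
    return m.group(2) if m else None


# The well-behaved case works.
print(first_src('<img src="a.png">'))

# Uppercase names are missed unless the case-insensitive flag is passed.
print(first_src('<IMG SRC="a.png">'))
print(first_src('<IMG SRC="a.png">', re.IGNORECASE))

# An unquoted src value is missed.
print(first_src('<img src=a.png>'))

# A ">" inside another attribute's value ends the tag too early for the regex.
print(first_src('<img title="a > b" src="c.png">'))
```

All of these inputs are markup that browsers accept, yet only the first call and the case-insensitive call find the source.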


EDIT: If you want all the images:

function get_images($html) {
    require_once('SimpleHTML.class.php');

    $post_dom = str_get_html($html);

    // Without an index, find() returns an array of all matching elements
    $img_tags = $post_dom->find('img');

    $images = array();

    foreach ($img_tags as $image) {
        $images[] = $image->src;
    }

    return $images;
}

How do I extract HTML img sources with a regular expression?

The following regexp snippet should work.

<img[^>]+src="([^">]+)"

It looks for text that starts with <img, followed by one or more characters that are not >, then src=". It then grabs everything between that point and the next " or >.
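As a quick illustration, here is that same pattern tried from Python. The sample markup and the variable names are made up for the example.

```python
import re

IMG_SRC = re.compile(r'<img[^>]+src="([^">]+)"')

html = '<p><img class="hero" src="/images/a.png" alt="A"><img src="/images/b.png"></p>'

# search() stops at the first match; group(1) is the text between the quotes.
first = IMG_SRC.search(html)
print(first.group(1))

# findall() returns the captured group for every match in the document.
print(IMG_SRC.findall(html))

# The pattern insists on double quotes, so a single-quoted src is missed.
print(IMG_SRC.search("<img src='/images/c.png'>"))
```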

But if at all possible, use a real HTML parser. It's more solid, and will handle edge cases much better.

Regular expression to extract image url from html code

This post is an answer to the question, not a guideline.

The question was not "RegExp vs DOM", the question was "Regular expression to extract image url from html code".

Here it is:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String htmlFragment =
    "<img onerror=\"img_onerror(this);\" data-logit=\"true\" data-pid=\"MOBDDDBRHVWQZHYY\"\n" +
    " data-imagesize=\"thumb\"\n" +
    " data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" +
    " src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" +
    " alt=\"Samsung Galaxy S Duos S7562: Mobile\"\n" +
    " title=\"Samsung Galaxy S Duos S7562: Mobile\"></img></a>";
// Group 1: everything before src, group 2: the src value, group 3: the rest
Pattern pattern =
    Pattern.compile( "(?m)(?s)<img\\s+(.*)src\\s*=\\s*\"([^\"]+)\"(.*)" );
Matcher matcher = pattern.matcher( htmlFragment );
if( matcher.matches()) {
    System.err.println(
        "OK:\n" +
        "1: '" + matcher.group(1) + "'\n" +
        "2: '" + matcher.group(2) + "'\n" +
        "3: '" + matcher.group(3) + "'\n" );
}

and the output:

OK:
1: 'onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
data-imagesize="thumb"
data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
'
2: 'http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg'
3: '
alt="Samsung Galaxy S Duos S7562: Mobile"
title="Samsung Galaxy S Duos S7562: Mobile"></img></a>'

Regex to find the first image in an image tag in an HTML document

As anubhava correctly points out, regex is not 100% reliable for parsing HTML. However, for one-shot tasks (i.e. not production code), a regex solution can do a pretty good job (and is quite fast as well):

Capture the image URL filename (sans query or fragment) from the first IMG element into group $1:

<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)

Note that there are certainly edge cases where this does not work.
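To make both the behaviour and one such edge case concrete, here is a small Python sketch using the pattern above. The helper name first_img_filename and the example.com URL are assumptions for illustration.

```python
import re

# The pattern from above, unchanged.
FIRST_IMG = re.compile(r"""<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)""")


def first_img_filename(html):
    """Return group 1 of the first match, or None if nothing matches."""
    m = FIRST_IMG.search(html)
    return m.group(1) if m else None


# The query string and fragment are left out of the captured group.
print(first_img_filename('<img class=x src="http://example.com/a/b.png?v=2#top" alt="">'))

# Unquoted attribute values are accepted as well.
print(first_img_filename('<img src=pic.jpg>'))

# Edge case: "src" also matches inside "data-src", so the wrong value is captured.
print(first_img_filename('<img data-src="lazy.png" src="real.png">'))
```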

Edit: Added ">" to the negated SRC attribute value character class.

Regular Expression to extract src attribute from img tag

Your pattern should be (unescaped):

src\s*=\s*"(.+?)"

The important part is the added question mark, which makes the quantifier lazy so that the group matches as few characters as possible.
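The difference is easiest to see side by side. This Python sketch runs the lazy pattern and a greedy variant (the same pattern without the question mark) against a made-up tag.

```python
import re

html = '<img src="a.png" alt="first">'

# Lazy: stops at the first closing quote after src=".
lazy = re.search(r'src\s*=\s*"(.+?)"', html).group(1)

# Greedy: runs on to the last quote in the input, swallowing the alt attribute.
greedy = re.search(r'src\s*=\s*"(.+)"', html).group(1)

print(lazy)    # a.png
print(greedy)  # a.png" alt="first
```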

regex extract img src javascript

I'm not a big fan of using regex to parse HTML content, so here goes the longer way:

var url = "<img height=\"100\" src=\"\" width=\"200\"></img>";
// Let the browser parse the markup inside a detached element
var tmp = document.createElement('div');
tmp.innerHTML = url;
var src = tmp.querySelector('img').getAttribute('src');
console.log(src);

Use regular expression to extract img tag from HTML in Perl

Aside from the fact that using regular expressions on HTML isn't very reliable, your regular expression in the following code isn't going to work because it's missing a capture group, so $1 won't be assigned a value.

if ($html =~ /<img. *?src. *?>/)
{
$img = $1;
}

If you want to extract parts of the text using a regular expression, you need to put that part inside parentheses. For example:

$example = "hello world";
$example =~ /(hello) world/;

This will set $1 to "hello".

The regular expression itself doesn't make much sense: where you have ". *?", it matches any single character followed by zero or more spaces. Is that a typo for ".*?", which matches any number of characters but, unlike the greedy ".*", stops as soon as the next part of the regex can match?
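Both points (the stray space and the missing capture group) can be checked quickly. The sketch below uses Python's re module rather than Perl, and the sample tag is made up, but these constructs behave the same way in both.

```python
import re

html = '<img class="x" src="/captcha/1.png">'

# ". *?" is one arbitrary character followed by optional spaces, so it cannot
# skip over the class attribute to reach "src".
typo = re.search(r'<img. *?src. *?>', html)

# ".*?" lazily skips any run of characters, so the tag is matched.
lazy = re.search(r'<img.*?src.*?>', html)

# Without parentheses there are no groups to read; with them, group 1 holds
# the text matched inside the parentheses.
captured = re.search(r'<img.*?src="(.*?)"', html)

print(typo)               # None
print(lazy.group(0))      # <img class="x" src="/captcha/1.png">
print(lazy.groups())      # ()
print(captured.group(1))  # /captcha/1.png
```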

This regular expression is possibly closer to what you're looking for. It matches the first img tag that has a src attribute starting with "/captcha/" and stores the image URL in $1:

$html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s;

To break down how it works: "m%....%" is just a different way of writing "/.../" that lets you put slashes in the regex without escaping them. "[^>]*" matches zero or more of any character except ">", so it won't run past the end of the tag. "(/captcha/[^"]*)" uses a capture group to grab everything inside the double quotes, which is the URL. The "s" modifier on the end makes "." match newlines as well; since this pattern contains no ".", it isn't actually needed, and the negated character classes already match "\n", so an img tag split over multiple lines still works.
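Here is a rough Python port of that pattern, mainly to show that a tag split over several lines is still matched without any "dot matches newline" flag. The sample markup and the name CAPTCHA_IMG are assumptions for the example.

```python
import re

CAPTCHA_IMG = re.compile(r'<img[^>]*src="(/captcha/[^"]*)"')

# The first tag is not a captcha image; the second one is split over lines.
html = '<img src="/logo.png">\n<img\n  class="c"\n  src="/captcha/abc123.png">'

m = CAPTCHA_IMG.search(html)
print(m.group(1))  # /captcha/abc123.png
```

The first tag is skipped because [^>]* cannot cross its closing ">" to reach the later src attribute.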

Python Regex to extract content of src of an html tag?

I'm not good at regex, so my answer may not be the best.

Try this:

x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

Then x looks like this:

['/pic/earth.jpg', '/pic/redrose.jpg']

Regex explanation:

(?=src) : positive lookahead --> only matches at positions where the word src follows

src=\" : must match the literal text src="

(?P<src>something) : a named group; whatever something matches is captured under the name src

[^\"]+ : one or more characters other than "
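A slightly fuller sketch of the same idea follows, with made-up sample markup. It also shows how to read the named group through finditer, and that the lookahead does not change the result here because the literal src=" right after it already requires the same text.

```python
import re

html = '<img src="/pic/earth.jpg"><p>text</p><img src="/pic/redrose.jpg">'

SRC = re.compile(r'(?=src)src="(?P<src>[^"]+)')

# With a single group, findall() returns that group's text for each match.
x = SRC.findall(html)
print(x)  # ['/pic/earth.jpg', '/pic/redrose.jpg']

# finditer() yields match objects, so the group can be read by name.
by_name = [m.group("src") for m in SRC.finditer(html)]

# The same pattern without the lookahead finds the same values.
plain = re.findall(r'src="([^"]+)', html)

# Caveat: nothing ties the pattern to img tags, so other src attributes match too.
other = SRC.findall('<script src="app.js"></script>')
```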


