Using regular expressions to extract the first image source from html codes?
While regular expressions are good for a wide variety of tasks, I find they usually fall short when parsing an HTML DOM. The problem with HTML is that the structure of a document is so variable that it is hard to extract a tag accurately (and by accurately I mean a 100% success rate with no false positives).
What I recommend is to use a DOM parser such as SimpleHTML and use it like this:
function get_first_image($html) {
    require_once('SimpleHTML.class.php');
    // Parse the markup and look up the first <img> element (index 0).
    $post_html = str_get_html($html);
    $first_img = $post_html->find('img', 0);
    if ($first_img !== null) {
        return $first_img->src;
    }
    return null;
}
Some may think this is overkill, but in the end, it will be easier to maintain and also allows for more extensibility. For example, using the DOM parser, I can also get the alt attribute.
A regular expression could be devised to achieve the same goal, but it would be limited in such a way that it would force the alt attribute to come after the src attribute, or the opposite. Overcoming this limitation would add more complexity to the regular expression.
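As a rough illustration of the parser approach in another language, here is a minimal Python sketch that uses only the standard library's html.parser module. The names FirstImageParser and get_first_image are made up for this example, and it is not a port of SimpleHTML. It reads src and alt from the first <img> tag regardless of attribute order or letter case.

```python
from html.parser import HTMLParser


class FirstImageParser(HTMLParser):
    """Records the attributes of the first <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.attrs = None  # becomes a dict once an <img> has been seen

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lowercase.
        if tag == "img" and self.attrs is None:
            self.attrs = dict(attrs)


def get_first_image(html):
    """Return (src, alt) of the first <img>, or None if there is none."""
    parser = FirstImageParser()
    parser.feed(html)
    if parser.attrs is None:
        return None
    return parser.attrs.get("src"), parser.attrs.get("alt")


print(get_first_image('<p><IMG ALT="a > b" SRC="/pic/earth.jpg"></p>'))
```

Because the parser tokenizes the tag itself, alt can come before or after src without any change to the code.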
Also, consider the following. To properly match an <img> tag using regular expressions and to get only the src attribute (captured in group 2), you need the following regular expression:
<\s*?img\s+[^>]*?\s*src\s*=\s*(["'])((\\?+.)*?)\1[^>]*?>
And then again, the above can fail if:
- The attribute or tag name is in capitals and the i modifier is not used.
- Quotes are not used around the src attribute.
- An attribute other than src uses the > character somewhere in its value.
- Some other reason I have not foreseen.
So again, simply don't use regular expressions to parse a DOM document.
EDIT: If you want all the images:
function get_images($html) {
    require_once('SimpleHTML.class.php');
    // Same parser call as in get_first_image(), but collect every <img>.
    $post_dom = str_get_html($html);
    $img_tags = $post_dom->find('img');
    $images = array();
    foreach ($img_tags as $image) {
        $images[] = $image->src;
    }
    return $images;
}
How do I extract HTML img sources with a regular expression?
The following regexp snippet should work.
<img[^>]+src="([^">]+)"
It looks for text that starts with <img, followed by one or more characters that are not >, then src=". It then grabs everything between that point and the next " or >.
But if at all possible, use a real HTML parser. It's more solid, and will handle edge cases much better.
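To make those edge cases concrete, the following Python sketch runs the same pattern over a few hand-written tags. The sample strings and the helper name first_src are assumptions made for this illustration. Only the first sample is expected to match; the others return None because the pattern is case-sensitive, requires double quotes, and cannot step over a > inside another attribute.

```python
import re

# The pattern from above, unchanged.
IMG_SRC = re.compile(r'<img[^>]+src="([^">]+)"')


def first_src(html):
    """Return the first captured src value, or None when nothing matches."""
    match = IMG_SRC.search(html)
    return match.group(1) if match else None


samples = [
    '<img class="a" src="/pic/a.png">',    # plain case
    '<IMG SRC="/pic/a.png">',              # uppercase, no IGNORECASE flag
    "<img src='/pic/a.png'>",              # single quotes
    '<img src=/pic/a.png>',                # no quotes at all
    '<img alt="a > b" src="/pic/a.png">',  # ">" inside another attribute
]
for html in samples:
    print(html, "->", first_src(html))
```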
Regular expression to extract image url from html code
This post is an answer to the question, not a guideline.
The question was not "RegExp vs DOM", the question was "Regular expression to extract image url from html code".
Here it is:
String htmlFragment =
"<img onerror=\"img_onerror(this);\" data-logit=\"true\" data-pid=\"MOBDDDBRHVWQZHYY\"\n" +
" data-imagesize=\"thumb\"\n" +
" data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" +
" src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" +
" alt=\"Samsung Galaxy S Duos S7562: Mobile\"\n" +
" title=\"Samsung Galaxy S Duos S7562: Mobile\"></img></a>";
Pattern pattern =
Pattern.compile( "(?m)(?s)<img\\s+(.*)src\\s*=\\s*\"([^\"]+)\"(.*)" );
Matcher matcher = pattern.matcher( htmlFragment );
if( matcher.matches()) {
System.err.println(
"OK:\n" +
"1: '" + matcher.group(1) + "'\n" +
"2: '" + matcher.group(2) + "'\n" +
"3: '" + matcher.group(3) + "'\n" );
}
and the output:
OK:
1: 'onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
data-imagesize="thumb"
data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
'
2: 'http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg'
3: '
alt="Samsung Galaxy S Duos S7562: Mobile"
title="Samsung Galaxy S Duos S7562: Mobile"></img></a>'
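The Java code uses matcher.matches(), which only succeeds when the pattern covers the whole input, hence the greedy (.*) groups on both sides. If only the URL is wanted, a search-based variant is enough. The Python sketch below is one possible way to do that; the pattern is a hypothetical one written for this example, not the one from the answer, and the fragment is a shortened copy of the one above.

```python
import re

# Hypothetical pattern for this sketch: lazily skip other attributes inside
# the tag, then capture a double-quoted src value that follows whitespace.
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc\s*=\s*"([^"]+)"', re.IGNORECASE)

html_fragment = (
    '<img onerror="img_onerror(this);" data-logit="true"\n'
    '     data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"\n'
    '     src="http://img8a.flixcart.com/image/mobile/h/y/y/'
    'samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg"\n'
    '     alt="Samsung Galaxy S Duos S7562: Mobile"></img></a>'
)

# search() scans for the first match anywhere in the input, so the pattern
# does not have to account for the text before or after the tag.
match = IMG_SRC.search(html_fragment)
src = match.group(1) if match else None
print(src)
```

Requiring whitespace directly before src keeps the pattern from stopping at attributes such as data-error-url.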
Regex to find the first image in an image tag in an HTML document
As anubhava correctly points out, regex is not 100% reliable for parsing HTML. However, for one-shot tasks (i.e. not production code), a regex solution can do a pretty good job (and is quite fast as well).
Capture the image URL filename (sans query or fragment) from the first IMG element into group $1:
<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)
Note that there are certainly edge cases where this does not work.
Edit: Added ">" to the negated SRC attribute value character class.
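A quick way to try the pattern is a small Python sketch like the one below. The helper name first_image_path and the sample strings are assumptions for this example, and the re.IGNORECASE flag is an addition that the answer itself does not mention.

```python
import re

# The pattern from above; group 1 stops at whitespace, a quote, "?", "#" or ">".
FIRST_IMG = re.compile(r'''<img\b[^>]+?src\s*=\s*['"]?([^\s'"?#>]+)''', re.IGNORECASE)


def first_image_path(html):
    """Return the src of the first <img> without query string or fragment."""
    match = FIRST_IMG.search(html)
    return match.group(1) if match else None


print(first_image_path('<p><img class="x" src="/pic/earth.jpg?v=2#top" alt="Earth"></p>'))
print(first_image_path("<img src=/pic/rose.jpg>"))
```

One edge case worth knowing about: because the pattern looks for the bare text src, an earlier attribute whose name ends in src (for example data-src) would be picked up first.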
Regular Expression to extract src attribute from img tag
Your pattern should be (unescaped):
src\s*=\s*"(.+?)"
The important part is the added question mark, which makes the quantifier lazy so that the group matches as few characters as possible.
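The difference the question mark makes can be seen in a short Python sketch. The sample tag is made up for this illustration.

```python
import re

html = '<img src="/pic/earth.jpg" alt="Earth">'

# Lazy: stop at the first closing double quote.
lazy = re.search(r'src\s*=\s*"(.+?)"', html).group(1)

# Greedy: run on to the last double quote in the input.
greedy = re.search(r'src\s*=\s*"(.+)"', html).group(1)

print(lazy)    # /pic/earth.jpg
print(greedy)  # /pic/earth.jpg" alt="Earth
```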
regex extract img src javascript
Not a big fan of using regex to parse html content, so here goes the longer way
var url = "<img height=\"100\" src=\"data:image/png;base64,testurlhere\" width=\"200\"></img>";
// Let the browser parse the markup inside a detached element.
var tmp = document.createElement('div');
tmp.innerHTML = url;
var src = tmp.querySelector('img').getAttribute('src');
console.log(src);
Use regular expression to extract img tag from HTML in Perl
Aside from the fact that using regular expressions on HTML isn't very reliable, the regular expression in the following code isn't going to work because it's missing a capture group, so $1 won't be assigned a value.
if ($html =~ /<img. *?src. *?>/)
{
$img = $1;
}
If you want to extract part of the text using a regular expression, you need to put that part inside parentheses. For example:
$example = "hello world";
$example =~ /(hello) world/;
This will set $1 to "hello".
The regular expression itself doesn't make much sense either: where you have ". *?", it will match any single character followed by zero or more spaces. Is that a typo for ".*?"? That would match any number of characters but, unlike the greedy ".*", it stops as soon as the next part of the regex can match.
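The same two points can be checked in Python, whose re module treats these constructs the same way as Perl. The sample tag below is an assumption made for this sketch.

```python
import re

html = '<img alt="captcha" src="/captcha/abc.png">'

# ". *?" means one arbitrary character, then optional spaces. After "<img "
# the next character is "a", not "src", so nothing matches.
typo = re.search(r'<img. *?src. *?>', html)

# ".*?" means as few characters as possible. The parentheses add the capture
# group that the original code was missing.
fixed = re.search(r'(<img.*?src.*?>)', html)

print(typo)
print(fixed.group(1))
```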
This regular expression is possibly closer to what you're looking for. It'll match the first img tag that has a src attribute that starts with "/captcha/" and store the image URL in $1:
$html =~ m%<img[^>]*src="(/captcha/[^"]*)"%s;
To break down how it works: "m%...%" is just a different way of writing "/.../" that allows you to put slashes in the regex without needing to escape them. "[^>]*" will match zero or more of any character except ">", so it won't run past the end of the tag. "(/captcha/[^"]*)" uses a capture group to grab everything inside the double quotes, which will be the URL. The "/s" modifier on the end makes "." match newlines as well, so $html is treated as one long line of text. It isn't needed here, since this pattern contains no "." and "[^>]*" already matches newlines, so an img tag split over multiple lines will still match.
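For comparison, here is the same pattern tried in Python on a tag that is split over several lines. The sample markup is made up for this sketch. A negated character class such as [^>]* matches newline characters by itself, so no extra flag is passed.

```python
import re

# Same pattern as above, written for Python's re module.
CAPTCHA_IMG = re.compile(r'<img[^>]*src="(/captcha/[^"]*)"')

html = '<p><img\n    alt="captcha"\n    src="/captcha/abc.png"></p>'

match = CAPTCHA_IMG.search(html)
captcha_url = match.group(1) if match else None
print(captcha_url)

# An image outside /captcha/ is not captured.
other = CAPTCHA_IMG.search('<img src="/pic/earth.jpg">')
print(other)
```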
Python Regex to extract content of src of an html tag?
I'm not good at regEx. So my answer may not be best.
Try this.
x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)
Then x will look like this:
['/pic/earth.jpg', '/pic/redrose.jpg']
RegEx explanation:
(?=src) : positive lookahead --> only proceed where the word src follows
src=\" : must be followed by the literal text src="
(?P<src>something) : groups something under the name src
[^\"]+ : one or more characters other than "
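A small, self-contained check of that explanation is sketched below, with a made-up html string. It suggests that the lookahead is redundant here, because the literal src= that follows already requires the same text, and it shows how the named group can be read back with finditer.

```python
import re

html = '<img src="/pic/earth.jpg"><img alt="rose" src="/pic/redrose.jpg">'

# The expression from the answer.
with_lookahead = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

# The same expression without the lookahead and without the group name.
without_lookahead = re.findall(r'src="([^"]+)', html)

# The group name pays off when working with match objects.
by_name = [m.group("src") for m in re.finditer(r'src="(?P<src>[^"]+)', html)]

print(with_lookahead)
print(without_lookahead == with_lookahead, by_name == with_lookahead)
```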