How to extract img src, title and alt from html using php?
EDIT : now that I know better
Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better to use an HTML parser.
Solution With regexp
In that case it's better to split the process into two parts:
- get all the img tags
- extract their metadata
I will assume your doc is not strict XHTML, so you can't use an XML parser. E.g. with this web page source code:
/* preg_match_all match the regexp in all the $html string and output everything as
an array in $result. "i" option is used to make it case insensitive */
preg_match_all('/<img[^>]+>/i',$html, $result);
print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[...]
)
)
Then we get all the img tag attributes with a loop :
$img = array();
// $result[0] holds the full matches, i.e. the img tags themselves
foreach ($result[0] as $img_tag)
{
preg_match_all('/(alt|title|src)=("[^"]*")/i',$img_tag, $img[$img_tag]);
}
print_r($img);
Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)
)
[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)
)
[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)
)
[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)
)
[..]
)
)
Regexps are CPU intensive so you may want to cache this page. If you have no cache system, you can build your own by using ob_start and loading / saving from a text file.
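A home-made cache along those lines could look like the sketch below. The function name cached_output, the cache file path and the 300-second lifetime are all assumptions made for illustration; the helper buffers whatever the callback prints and stores it in a plain text file.

```php
<?php
// Hypothetical helper: run $render at most once per $ttl seconds and
// serve the saved output from $cacheFile the rest of the time.
function cached_output(string $cacheFile, int $ttl, callable $render): string
{
    // Serve the stored copy while it is still fresh.
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }

    // Otherwise capture everything the callback prints...
    ob_start();
    $render();
    $output = ob_get_clean();

    // ...and save it for the next request.
    file_put_contents($cacheFile, $output, LOCK_EX);

    return $output;
}

// Usage: the path and lifetime are made-up values.
$cacheFile = sys_get_temp_dir() . '/img-tags.cache.txt';
echo cached_output($cacheFile, 300, function () {
    // The expensive regexp work would go here.
    echo "list of images\n";
});
```

The freshness check relies on the file's modification time, so deleting the file is enough to force a refresh.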
How does this stuff work ?
First, we use preg_match_all, a function that finds every string matching the pattern and outputs it in its third parameter.
The regexps :
<img[^>]+>
We apply it to the whole HTML page. It can be read as: every string that starts with "<img", contains one or more characters other than ">", and ends with a ">".
(alt|title|src)=("[^"]*")
We apply it successively to each img tag. It can be read as: every string starting with "alt", "title" or "src", then a "=", then a ' " ', a run of characters that are not ' " ', and a closing ' " '. The parentheses isolate the sub-strings.
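Put together, the two patterns can be wrapped in one small function. This is a minimal sketch rather than the answer's own code: the function name extract_img_attributes and the sample $html are made up for illustration, the value group is moved inside the quotes so the captured values come back unquoted, and only double-quoted attributes are handled.

```php
<?php
// Hypothetical helper: returns one array of attribute => value per img tag.
function extract_img_attributes(string $html): array
{
    $images = [];

    // Step 1: grab every <img ...> tag.
    preg_match_all('/<img[^>]+>/i', $html, $tags);

    foreach ($tags[0] as $tag) {
        // Step 2: pull alt, title and src out of this tag.
        // PREG_SET_ORDER gives one entry per attribute: [full, name, value].
        preg_match_all('/(alt|title|src)="([^"]*)"/i', $tag, $attrs, PREG_SET_ORDER);

        $image = [];
        foreach ($attrs as $attr) {
            // Normalise the attribute name so SRC and src end up under one key.
            $image[strtolower($attr[1])] = $attr[2];
        }
        $images[] = $image;
    }

    return $images;
}

// Sample input (an assumption, modelled on the tags shown above).
$html = '<p><img src="/a.png" alt="first"><IMG class="x" SRC="/b.png" title="second" /></p>';
print_r(extract_img_attributes($html));
```

Like the patterns it is built from, this would also pick up attributes whose names merely end in src, alt or title (data-src, for instance), which is one more reason to prefer a parser.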
Finally, every time you want to deal with regexps, it is handy to have good tools to quickly test them. Check this online regexp tester.
EDIT : answer to the first comment.
It's true that I did not think about the (hopefully few) people using single quotes.
Well, if you use only ', just replace all the " by '.
If you mix both, first you should slap yourself :-), then try to use ("|') instead of " and [^ø] to replace [^"].
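One way to cope with mixed quoting, sketched here as an alternative rather than as the suggestion above, is to capture the opening quote and require the same character to close the value via a backreference. The sample tag is made up for illustration.

```php
<?php
// Sample tag mixing both quote styles (made up for illustration).
$tag = "<img src='/img/a.png' alt=\"it's a logo\" title='say \"hi\"' />";

// Group 2 captures the opening quote; \2 demands the same quote at the end.
preg_match_all('/(alt|title|src)=(["\'])(.*?)\2/is', $tag, $matches, PREG_SET_ORDER);

$attributes = [];
foreach ($matches as $m) {
    $attributes[strtolower($m[1])] = $m[3]; // value without its quotes
}

print_r($attributes);
```

Because the closing quote has to match the opening one, an apostrophe inside a double-quoted value (or the reverse) does not cut the value short.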
Get img src with PHP
Use an HTML parser like DOMDocument and then evaluate the value you're looking for with DOMXPath:
$html = '<img id="12" border="0" src="/images/image.jpg"
alt="Image" width="100" height="100" />';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$src = $xpath->evaluate("string(//img/@src)"); # "/images/image.jpg"
Or for those who really need to save space (PHP 5 and 7 only: calling loadHTML() statically throws an Error as of PHP 8):
$xpath = new DOMXPath(@DOMDocument::loadHTML($html));
$src = $xpath->evaluate("string(//img/@src)");
And for the one-liners out there (same PHP 5/7 restriction):
$src = (string) reset(simplexml_import_dom(DOMDocument::loadHTML($html))->xpath("//img/@src"));
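Note that string(//img/@src) yields only the first match in the document. To collect every src, a sketch along these lines queries the attribute nodes themselves (the sample $html is made up for illustration):

```php
<?php
$html = '<p><img src="/images/one.jpg" alt="One"><img src="/images/two.jpg" alt="Two"></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$sources = [];
// Each result is a DOMAttr node; nodeValue holds the attribute's text.
foreach ($xpath->query('//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```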
Using PHP to extract the alt and/or title attributes from images
Here's a solution using PHP's DOM parser:
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML(file_get_contents("http://stackoverflow.com"));
libxml_use_internal_errors(false);
$items = $domd->getElementsByTagName("img");
$data = array();
foreach($items as $item) {
$data[] = array(
"src" => $item->getAttribute("src"),
"alt" => $item->getAttribute("alt"),
"title" => $item->getAttribute("title"),
);
}
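To try the same loop without a network request, the sketch below runs it on an inline HTML string (made up for illustration) and uses hasAttribute() to tell a missing attribute apart from an empty one, since getAttribute() returns an empty string in both cases. The helper name img_attributes is hypothetical.

```php
<?php
// Hypothetical helper: src, alt and title for every img in $html.
function img_attributes(string $html): array
{
    $doc = new DOMDocument();

    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $data = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $row = [];
        foreach (['src', 'alt', 'title'] as $name) {
            // null marks "attribute not present".
            $row[$name] = $img->hasAttribute($name) ? $img->getAttribute($name) : null;
        }
        $data[] = $row;
    }

    return $data;
}

$html = '<div><img src="/a.png" alt=""><img src="/b.png" title="B"></div>';
var_dump(img_attributes($html));
```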
Get all images and return the src
Don't use regex, use a parser. Example:
$string = '<img src="You want this" style="width:200px;" />';
$doc = new DOMDocument();
$doc->loadHTML($string);
$images = $doc->getElementsByTagName('img');
foreach ($images as $image) {
echo $image->getAttribute('src') . "\n";
}
Output:
You want this
How can i get src alone from Image tag using php?
Load it into a DOM tree with loadHTML and do an XPath query on it, or directly traverse it if the structure is always the same.
See How to extract img src, title and alt from html using php?
Regex to capture img src in php
You can use this:
(width|height|src)=("[^"]*"|'[^']*')
I've basically used an alternation to either match "fds" or 'fds'.
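A short sketch of that pattern in use, with a made-up sample tag. Since the captured value still carries its quotes, the first and last characters are dropped afterwards:

```php
<?php
// Sample tag mixing double and single quotes (made up for illustration).
$tag = '<img src="/pics/a.png" width=\'250\' height="70">';

preg_match_all('/(width|height|src)=("[^"]*"|\'[^\']*\')/i', $tag, $m, PREG_SET_ORDER);

$attrs = [];
foreach ($m as $match) {
    // Drop the surrounding quote characters from the captured value.
    $attrs[$match[1]] = substr($match[2], 1, -1);
}

print_r($attrs);
```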
Extract image name from img tag source
You can use XPath to only target the img nodes you want:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile($filePath, LIBXML_HTML_NODEFDTD);
// or $dom->loadHTML($htmlString, LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$nodeList = $xp->query('//img[starts-with(@src, "media/lib/pics/")]');
$newPath = 'my/new/path/';
foreach ($nodeList as $node) {
$imgFileName = basename($node->getAttribute('src'));
$imgNode = $dom->createElement('img'); // create a new img element to replace the old img node
$imgNode->setAttribute('src', $newPath . $imgFileName);
$node->parentNode->replaceChild($imgNode, $node);
}
$result = $dom->saveHTML();
XPath query details:
// # everywhere in the DOM tree
img # an img element
[ # open a predicate
starts-with(@src, "media/lib/pics/") # with a src attribute that starts with "media/lib/pics/"
] # close the predicate
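Other predicates follow the same shape. As a small sketch on a made-up HTML fragment, the queries below count the nodes matched by starts-with(), contains() and not():

```php
<?php
// Made-up fragment: two images under media/lib/pics/, one elsewhere.
$html = '<div><img src="media/lib/pics/a.png" alt="A">'
      . '<img src="other/b.png"><img src="media/lib/pics/c.gif"></div>';

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);

// Same predicate as above: src starts with the given prefix.
$inLibrary = $xp->query('//img[starts-with(@src, "media/lib/pics/")]')->length;

// contains() matches the substring anywhere in the attribute.
$pngs = $xp->query('//img[contains(@src, ".png")]')->length;

// not(@alt) selects images that have no alt attribute at all.
$missingAlt = $xp->query('//img[not(@alt)]')->length;

printf("%d in library, %d png, %d without alt\n", $inLibrary, $pngs, $missingAlt);
```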
extract image src from text?
$dom = new DOMDocument;
// preserveWhiteSpace must be set before loading to have any effect
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$imgs = $dom->getElementsByTagName("img");
$links = array();
for($i = 0; $i < $imgs->length; $i++) {
$links[] = $imgs->item($i)->getAttribute("src");
}