Error Tolerant HTML/XML/SGML parsing in PHP
You can suppress warnings with libxml_use_internal_errors while loading the document, e.g.:
libxml_use_internal_errors(true);
$doc = new DomDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);
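If, for some reason, you need access to the warnings, use libxml_get_errors. Call it before libxml_use_internal_errors(false), because disabling internal errors also clears the error buffer.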
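As a minimal sketch of reading the collected warnings (the printed format is only an illustration, and which warnings appear depends on the installed libxml2 version):

```php
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>');

// Fetch the buffered warnings while internal error handling is still on.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors(false);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with level, code, line and message.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```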
Can simplexml be used to rifle through html?
You can use the loadHTML function from the DOM module, and then import that DOM into SimpleXML via simplexml_import_dom:
$html = file_get_contents('http://example.com/');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
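Once imported, the usual SimpleXML accessors work on the HTML tree. A small sketch, assuming an inline HTML string in place of the remote page (the markup and variable names are made up for illustration):

```php
<?php
$html = '<html><body><h1>Title</h1><p>First <a href="/one">link</a></p></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// The returned SimpleXMLElement represents the root <html> element.
$sxml = simplexml_import_dom($doc);

// Property access walks child elements; array access reads attributes.
$title = (string) $sxml->body->h1;
$links = [];
foreach ($sxml->xpath('//a') as $a) {
    $links[] = (string) $a['href'];
}

echo $title, "\n";
print_r($links);
```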
Parsing HTML - is regex the only option in this case?
XPath can do both (1) and (2):
To test if there's a style tag in the body:
//body//style
To test if there's a div with a style attribute using width or background-image:
//div[contains(@style,'width:') or contains(@style,'background-image:')]
And, as you were curious about in your comments, seeing if a style tag contains a:hover or font-size:
//style[contains(text(),'a:hover') or contains(text(),'font-size:')]
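These expressions can be run from PHP with DOMXPath. The following is a rough sketch; the sample markup is invented, and a non-zero length on the returned DOMNodeList is treated as "found":

```php
<?php
$html = <<<'HTML'
<html>
<head><style>a:hover { color: red; }</style></head>
<body><div style="width: 100px;">sized</div><div>plain</div></body>
</html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Each query returns a DOMNodeList; its length says how many nodes matched.
$bodyStyles = $xpath->query('//body//style');
$styledDivs = $xpath->query("//div[contains(@style,'width:') or contains(@style,'background-image:')]");
$styleTags  = $xpath->query("//style[contains(text(),'a:hover') or contains(text(),'font-size:')]");

printf("style tags in body: %d\n", $bodyStyles->length);
printf("divs with width/background-image: %d\n", $styledDivs->length);
printf("style tags with a:hover/font-size: %d\n", $styleTags->length);
```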
Simplexml: parsing HTML leaves out nested elements inside an element with a text node
Your observation is correct: SimpleXML only exposes the child element nodes here, not the child text nodes. The solution is to switch to DOMDocument, which can access all child nodes, both text and element.
// first span element
$span = dom_import_simplexml($xml->span);
foreach ($span->childNodes as $child) {
    printf(" - %s : %s\n", get_class($child), $child->nodeValue);
}
This example shows that dom_import_simplexml is used on the more specific <span> element node, and the traversal is then done over the children of the corresponding DOMElement object.
The output:
- DOMText : Nista; nula. Isto
- DOMElement : zilch; zip.
- DOMText :
The first entry is the first text node within the <span> element. It is followed by the <b> element (which again contains some text) and then by another text node that consists of whitespace only.
The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated data access within the XML document, as in the case you face here.
The example in full:
$html = <<<HTML
<p>
<b>
<span>zot; zotz </span>
</b>
<span>Nista; nula. Isto
<b>zilch; zip.</b>
</span>
</p>
HTML;
$xml = simplexml_load_string($html);
// first span element
$span = dom_import_simplexml($xml->span);
foreach ($span->childNodes as $child) {
    printf(" - %s : %s\n", get_class($child), $child->nodeValue);
}
Recommendation for parsing HTML and SGML file
For an HTML parser, use the Html Agility Pack. It is an open source HTML parser for .NET.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
You can use this to query HTML and extract whatever data you wish.
For SGML Parser
Check out this link, SGMLReader - Convert any HTML to valid XML:
http://developer.mindtouch.com/Community/SgmlReader
Reference: SGML parser .NET recommendations
Error tolerant XML reader
Look at HTML parsers, because HTML is almost XML and those parsers are built to tolerate malformed markup.
PHP DOM append HTML to existing document without DOMDocumentFragment::appendXML
The solution that I came up with is to use DOMDocument::loadHTML, as @FrankFarmer suggests, and then to take the parsed nodes and import them into my current document. My implementation looks like this:
/**
* Parses HTML into DOMElements
* @param string $html the raw html to transform
* @param \DOMDocument $doc the document to import the nodes into
* @return array an array of DOMElements on success or an empty array on failure
*/
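protected function htmlToDOM($html, $doc) {
    // Wrap the fragment so its top-level nodes can be found again after parsing.
    $html = '<div id="html-to-dom-input-wrapper">' . $html . '</div>';
    // loadHTML() is an instance method; calling it statically fails on PHP 8.
    $hdoc = new \DOMDocument();
    $hdoc->loadHTML($html);
    $child_array = array();
    try {
        $children = $hdoc->getElementById('html-to-dom-input-wrapper')->childNodes;
        foreach ($children as $child) {
            // Deep-copy each parsed node into the target document.
            $child = $doc->importNode($child, true);
            array_push($child_array, $child);
        }
    } catch (\Exception $ex) {
        error_log($ex->getMessage(), 0);
    }
    return $child_array;
}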
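To show how the returned nodes might be appended to an existing document, here is a rough sketch. It assumes a standalone function named html_to_dom (a hypothetical stand-in for the method above, with a shortened wrapper id) and a made-up target document with a <root> element:

```php
<?php
// Hypothetical standalone variant of the method above, for illustration only.
function html_to_dom(string $html, DOMDocument $doc): array
{
    libxml_use_internal_errors(true);
    $hdoc = new DOMDocument();
    $hdoc->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();

    $nodes = [];
    $wrapper = $hdoc->getElementById('wrapper');
    if ($wrapper !== null) {
        foreach ($wrapper->childNodes as $child) {
            // importNode() copies the node so it belongs to $doc.
            $nodes[] = $doc->importNode($child, true);
        }
    }
    return $nodes;
}

// Assumed target document; any existing DOMDocument would do.
$doc = new DOMDocument();
$doc->loadXML('<root/>');

$imported = html_to_dom('<p>Hello <b>world</b></p><p>Again</p>', $doc);
foreach ($imported as $node) {
    $doc->documentElement->appendChild($node);
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n";
```

Imported nodes are not part of the tree until they are appended, which is why the loop calls appendChild on each one.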
extract image elements from html
I am in no way an expert on these matters (yet), but I hope this helps in some way.
According to this answer by troelskn you can make the DOM parser more tolerant of badly formed HTML by using libxml_use_internal_errors. That might help you get rid of that error.
Parsing all images of a document can be done by using DOMXPath. It takes a DOMDocument as a parameter and lets you run XPath queries on the document.
// Suppress parse errors; this must be enabled before loading the HTML.
libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($your_html);
libxml_clear_errors();
$xpath = new DOMXPath($document);
// Find all img tags.
$img_nodes = $xpath->query('//img');
DOMXPath::query returns a DOMNodeList which can be looped through using DOMNodeList::item, which returns a DOMNode.
for ($i = 0; $i < $img_nodes->length; $i++)
{
    $node = $img_nodes->item($i);
    // Manipulate the node.
}
Disclaimer: The code I posted is untested and was put together using the manual.
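Putting the steps together, a minimal sketch that collects the src attribute of every image could look like this (the sample HTML and the $sources variable are assumptions for illustration):

```php
<?php
$html = '<p><img src="a.png" alt="A"><img src="b.jpg"></p>';

libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

$sources = [];
// DOMNodeList is traversable, so foreach works as well as item($i).
foreach ($xpath->query('//img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```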