Error Tolerant HTML/XML/SGML Parsing in PHP

You can suppress warnings by calling libxml_use_internal_errors(true) before loading the document. For example:

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML("<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>");
libxml_use_internal_errors(false);

If, for some reason, you need access to the warnings, call libxml_get_errors() after loading and before switching internal errors off again.
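As a rough sketch of reading the collected warnings: the snippet below reuses the markup from the example above and prints whatever libxml reports. The exact messages (and whether any are reported at all) depend on the installed libxml2 version, so no output is shown.

```php
<?php
// Collect libxml warnings internally instead of emitting PHP warnings.
// The return value is the previous setting, so it can be restored later.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<strong>This is an example of a <pseud-template>fake tag</pseud-template></strong>');

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Calling libxml_clear_errors() matters in long-running scripts, because the buffer otherwise keeps growing with every document parsed.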

Can simplexml be used to rifle through html?

You can use the loadHTML method of DOMDocument (from the DOM extension), and then import that DOM into SimpleXML via simplexml_import_dom:

$html = file_get_contents('http://example.com/');
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);
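Once imported, the usual SimpleXML access patterns should work on the HTML tree. Here is a small offline sketch; the inline markup and variable names are made up for illustration and stand in for the fetched page:

```php
<?php
// Hypothetical sample markup used instead of a remote page.
$html = '<html><body><h1>Title</h1><a href="/one">One</a><a href="/two">Two</a></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Importing a DOMDocument yields its root element (<html>).
$sxml = simplexml_import_dom($doc);

// Property access walks child elements by name.
$heading = (string) $sxml->body->h1;
echo $heading, "\n";

// XPath is available on SimpleXMLElement as well.
$links = $sxml->xpath('//a');
foreach ($links as $a) {
    echo $a['href'], ' => ', $a, "\n";
}
```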

Parsing HTML - is regex the only option in this case?

XPath can do both (1) and (2):

To test if there's a style tag in the body:

//body//style

To test if there's a div with a style attribute using width or background-image:

//div[contains(@style,'width:') or contains(@style,'background-image:')]

And, as you were curious about in your comments, seeing if a style tag contains a:hover or font-size:

//style[contains(text(),'a:hover') or contains(text(),'font-size:')]
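To run these expressions from PHP, something along these lines should do, using DOMXPath on a document loaded with loadHTML. The sample markup is invented for the sketch, and where the parser ends up placing a style tag found inside the body may vary with the libxml2 version:

```php
<?php
// Hypothetical sample page; nowdoc syntax avoids any variable interpolation.
$html = <<<'HTML'
<html><head><style>a:hover { color: red; }</style></head>
<body>
  <div style="width: 100px;">box</div>
  <style>p { font-size: 12px; }</style>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// (1) style tags inside the body
$styleInBody = $xpath->query('//body//style');

// (2) divs whose style attribute sets width or background-image
$divs = $xpath->query("//div[contains(@style,'width:') or contains(@style,'background-image:')]");

// style tags whose text mentions a:hover or font-size
$hover = $xpath->query("//style[contains(text(),'a:hover') or contains(text(),'font-size:')]");

printf("style in body: %d\n", $styleInBody->length);
printf("matching divs: %d\n", $divs->length);
printf("matching style tags: %d\n", $hover->length);
```

A length greater than zero on the returned DOMNodeList means the test matched.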

Simplexml: parsing HTML leaves out nested elements inside an element with a text node

Your observation is correct: SimpleXML only exposes the child element nodes here, not the child text nodes. The solution is to switch to DOMDocument, which can access all child nodes, both text and element.

// first span element
$span = dom_import_simplexml($xml->span);

foreach ($span->childNodes as $child) {
    printf(" - %s : %s\n", get_class($child), $child->nodeValue );
}

This example shows that dom_import_simplexml is applied to the more specific <span> element node, and that the traversal is then done over the children of the corresponding DOMElement object.

The output:

 - DOMText : Nista; nula. Isto

 - DOMElement : zilch; zip.
 - DOMText :

The first entry is the first text node within the <span> element. It is followed by the <b> element (which again contains some text) and then by another text node that consists of whitespace only.

The dom_import_simplexml function is especially useful when SimpleXMLElement is too simple for more differentiated data access within the XML document, as in the case you face here.

The example in full:

$html = <<<HTML
<p>
    <b>
        <span>zot; zotz </span>
    </b>
    <span>Nista; nula. Isto
        <b>zilch; zip.</b>
    </span>
</p>
HTML;

$xml = simplexml_load_string($html);

// first span element
$span = dom_import_simplexml($xml->span);

foreach ($span->childNodes as $child) {
    printf(" - %s : %s\n", get_class($child), $child->nodeValue );
}

Recommendation for parsing HTML and SGML file

For an HTML parser, use the Html Agility Pack. It is an open source HTML parser for .NET.

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).

You can use this to query HTML and extract whatever data you wish.

For SGML Parser

Check out this link, SGMLReader - Convert any HTML to valid XML:

http://developer.mindtouch.com/Community/SgmlReader

Reference: SGML parser .NET recommendations

Error tolerant XML reader

Look into HTML parsers, because HTML is almost XML and those parsers are built to tolerate malformed input.

PHP DOM append HTML to existing document without DOMDocumentFragment::appendXML

The solution that I came up with is to use DOMDocument::loadHTML, as @FrankFarmer suggests, and then to take the parsed nodes and import them into my current document. My implementation looks like this:

/**
 * Parses HTML into DOM nodes
 * @param string $html the raw html to transform
 * @param \DOMDocument $doc the document to import the nodes into
 * @return array an array of DOMNodes on success or an empty array on failure
 */
protected function htmlToDOM($html, $doc) {
    // Wrap the input so all parsed nodes share one known parent.
    $html = '<div id="html-to-dom-input-wrapper">' . $html . '</div>';
    // loadHTML is an instance method; calling it statically fails on PHP 8.
    $hdoc = new \DOMDocument();
    $hdoc->loadHTML($html);
    $child_array = array();
    try {
        $wrapper = $hdoc->getElementById('html-to-dom-input-wrapper');
        if ($wrapper === null) {
            return $child_array;
        }
        foreach ($wrapper->childNodes as $child) {
            // Deep-copy the node into the target document.
            $child_array[] = $doc->importNode($child, true);
        }
    } catch (\Exception $ex) {
        error_log($ex->getMessage(), 0);
    }
    return $child_array;
}
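For illustration, the same idea as a standalone sketch: parse the fragment in a scratch document, import each child, and append it to an element of the existing document. The ids "target" and "wrapper" and the sample markup are invented for this example.

```php
<?php
// The existing document we want to append to.
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="target"></div></body></html>');
$target = $doc->getElementById('target');

// Parse the new HTML in a separate scratch document, wrapped in a known parent.
$scratch = new DOMDocument();
$scratch->loadHTML('<div id="wrapper"><p>Hello</p><p>World</p></div>');
$wrapper = $scratch->getElementById('wrapper');

foreach ($wrapper->childNodes as $child) {
    // importNode copies the node into $doc; the copy still has to be attached.
    $target->appendChild($doc->importNode($child, true));
}

// Serialize only the target element.
echo $doc->saveHTML($target), "\n";
```

Nodes returned by importNode belong to the target document but are not part of its tree until they are appended somewhere.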

extract image elements from html

I am in no way an expert on these matters (yet), but I hope this helps in some way.

According to this answer by troelskn, you can make the DOM parser more tolerant of badly formed HTML by using libxml_use_internal_errors. That might help you get rid of that error.

Parsing all images of a document can be done by using DOMXPath. It takes a DOMDocument as a parameter and lets you run XPath queries on the document.

// Collect parse errors internally instead of emitting warnings.
// This has to be called before loadHTML to have any effect.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($your_html);

$xpath = new DOMXPath($document);

// Find all img tags.
$img_nodes = $xpath->query('//img');

DOMXPath::query returns a DOMNodeList which can be looped through using DOMNodeList::item, which returns a DOMNode.

for ($i = 0; $i < $img_nodes->length; $i++)
{
    $node = $img_nodes->item($i);
    // Manipulate the node.
}
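Since DOMNodeList can also be iterated with foreach, collecting the image sources might look like the sketch below. The sample markup and the $sources variable are made up for illustration.

```php
<?php
// Hypothetical input with two images.
$html = '<p><img src="a.png" alt="A"><img src="b.png"></p>';

libxml_use_internal_errors(true);
$document = new DOMDocument();
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

$sources = array();
foreach ($xpath->query('//img') as $img) {
    // getAttribute returns an empty string when the attribute is missing.
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```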

Disclaimer: The code I posted is untested and was put together using the manual.
