Disable Warnings When Loading Non-well-formed HTML by Domdocument (PHP)

Disable warnings when loading non-well-formed HTML by DomDocument (PHP)

You can install a temporary error handler with set_error_handler:

class ErrorTrap {
  protected $callback;
  protected $errors = array();
  function __construct($callback) {
    $this->callback = $callback;
  }
  function call() {
    set_error_handler(array($this, 'onError'));
    try {
      return call_user_func_array($this->callback, func_get_args());
    } finally {
      // always restore the previous handler, even if the callback throws
      restore_error_handler();
    }
  }
  function onError($errno, $errstr, $errfile, $errline) {
    $this->errors[] = array($errno, $errstr, $errfile, $errline);
  }
  function ok() {
    return count($this->errors) === 0;
  }
  function errors() {
    return $this->errors;
  }
}

Usage:

// create a DOM document and load the HTML data
$xmlDoc = new DomDocument();
$caller = new ErrorTrap(array($xmlDoc, 'loadHTML'));
// this doesn't dump out any warnings
$caller->call($fetchResult);
if (!$caller->ok()) {
  var_dump($caller->errors());
}

PHP DOMDocument errors/warnings on html5-tags

No, there is no way of specifying a particular doctype to use, or to modify the requirements of the existing one.

Your best workable solution is going to be to disable error reporting with libxml_use_internal_errors:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML('...');
libxml_clear_errors();
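
If you would rather inspect what libxml complained about than discard it, a minimal sketch along these lines collects the messages before clearing them. The helper name loadHtmlCollectingErrors is made up for illustration and is not part of PHP; only the libxml_* functions and DOMDocument are standard.

```php
<?php
// Hypothetical helper (not part of PHP): parse HTML quietly and
// return the document together with whatever libxml reported.
function loadHtmlCollectingErrors(string $html): array
{
    // Switch to internal error handling and remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html); // $html is assumed to be non-empty

    // Each entry is a LibXMLError object (level, code, line, message, ...).
    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Put the setting back so other code is unaffected.
    libxml_use_internal_errors($previous);

    return array($dom, $errors);
}

list($dom, $errors) = loadHtmlCollectingErrors('<section><p>Hello</p></section>');
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

Whether HTML5 tags such as section are reported at all depends on the libxml version PHP was built against, so the list may well be empty.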

Basic xPath getting lots of warnings

Disable warnings with libxml_use_internal_errors(true):
http://www.php.net/manual/en/function.libxml-use-internal-errors.php

The HTML is malformed, and there is nothing you can really do about that if you do not control it.

DOMDocument::loadHTML(): Empty string supplied as input

Alright @Bruce, I understand the issue now. What you want to do is test the return value of file_get_contents():

<?php
error_reporting(-1);
ini_set("display_errors", 1);

$article_url = 'http://google.com';
if (isset($article_url)){
  $title = 'contact us';
  $str = @file_get_contents($article_url);
  // return an error
  if ($str === FALSE) {
    echo 'problem getting url';
    return false;
  }

  // Continue
  $test1 = str_word_count(strip_tags(strtolower($str)));
  if ($test1 === FALSE) $test1 = 0;

  if ($test1 > 550) {
    echo '<div><i class="fa fa-check-square-o" style="color:green"></i> This article has ' . $test1 . ' words.</div>';
  } else {
    echo '<div><i class="fa fa-times-circle-o" style="color:red"></i> This article has ' . $test1 . ' words. You are required to have a minimum of 500 words.</div>';
  }

  $document = new DOMDocument();
  $libxml_previous_state = libxml_use_internal_errors(true);
  $document->loadHTML($str);
  libxml_use_internal_errors($libxml_previous_state);

  $tags = array ('h1', 'h2');
  $texts = array ();

  foreach($tags as $tag) {
    $elementList = $document->getElementsByTagName($tag);
    foreach($elementList as $element) {
      // append, so several headings with the same tag do not overwrite each other
      $texts[] = strtolower($element->textContent);
    }
  }

  if (in_array(strtolower($title),$texts)) {
    echo '<div><i class="fa fa-check-square-o" style="color:green"></i> This article used the correct title tag.</div>';
  } else {
    echo "no";
  }
}
?>

So test if ($str === FALSE), report an error, and don't let the script continue. You could return false like I am doing, or just use an if/else.
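
The same guard can be wrapped in a small function so the empty-input case never reaches loadHTML. This is only a sketch: parseFetchedHtml is a hypothetical name, and treating a whitespace-only body as "nothing to parse" is my assumption, not something the question requires.

```php
<?php
// Hypothetical helper: returns a DOMDocument, or null when there is
// nothing to parse (failed download or empty body).
function parseFetchedHtml($html)
{
    // file_get_contents() returns false on failure, and an empty string
    // makes loadHTML() complain about empty input, so stop early.
    if ($html === false || trim($html) === '') {
        return null;
    }

    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $document;
}
```

It would be called as parseFetchedHtml(@file_get_contents($article_url)), with the caller checking for null before touching the document.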

Having difficulties parsing dirty html code with PHP DOMDocument

There is no clean way to parse HTML with namespaces using DOMDocument without losing the namespaces, but there are some workarounds:

Use another parser that accepts namespaces in HTML code. Look here for a nice and detailed list of HTML parsers. This is probably the most efficient way to do it.
If you want to stick with DOMDocument, you basically have to pre- and postprocess the code.
- Before you send the code to DOMDocument->loadHTML, use regex, loops or whatever you want to find all namespaced tags and add a custom attribute to the opening tags containing the namespace.
```
<fb:like send="true" width="450" show_faces="true"></fb:like>
```
  would then result in
```
<fb:like xmlNamespace="fb" send="true" width="450" show_faces="true"></fb:like>
```
- Now give the edited code to DOMDocument->loadHTML. It will strip out the namespaces but it will keep the attributes resulting in
```
<like xmlNamespace="fb" send="true" width="450" show_faces="true"></like>
```
- Now (again using regex, loops or whatever you want) find all tags with the attribute xmlNamespace and replace the attribute with the actual namespace. Don't forget to also add the namespace to the closing tags!
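
The pre- and postprocessing steps above can be sketched with two regex helpers. Both function names are hypothetical, and the restore step is deliberately naive: it pairs each marked opening tag with the next closing tag of the same name, so nested elements with the same local name are not handled. The attribute name is matched case-insensitively because, as far as I know, the HTML parser may lowercase attribute names on output.

```php
<?php
// Hypothetical helper: add xmlNamespace="prefix" to every namespaced
// opening tag before the markup goes to DOMDocument::loadHTML().
function markNamespaces(string $html): string
{
    // Closing tags start with "</" and are therefore left untouched.
    return preg_replace(
        '/<([a-zA-Z][\w-]*):([a-zA-Z][\w-]*)/',
        '<${1}:${2} xmlNamespace="${1}"',
        $html
    );
}

// Hypothetical helper: turn <like xmlNamespace="fb" ...>...</like>
// back into <fb:like ...>...</fb:like> after saveHTML().
function restoreNamespaces(string $html): string
{
    // Naive, non-greedy pairing of opening and closing tag (no nesting).
    return preg_replace(
        '/<([\w-]+) xmlNamespace="([\w-]+)"(.*?)>(.*?)<\/\1>/is',
        '<${2}:${1}${3}>${4}</${2}:${1}>',
        $html
    );
}
```

Between the two calls, the marked string is loaded with loadHTML, modified as needed, and written back out with saveHTML.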

PHP domDocument parsing with HTML Table (PHP Fatal error: Call to a member function getElementsByTagName() on a non-object)

The DOMNodeList returned by getElementsByTagName is zero-indexed, so in this case $tables[1] does not exist: you only have one table in the HTML, and it is the item at index 0. You need to change the definition of $rows to this:

$rows = $tables->item(0)->getElementsByTagName('tr');

You also have an error in the loop: the entries of a DOMNodeList are fetched with item() rather than with an index like you are using. You'd need to change the assignment of $betreffzeile to this: $betreffzeile .= $cols->item(2)->nodeValue;
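
Put together, the two fixes might look like the following sketch. The function name thirdColumnText and the sample markup in the usage line are assumptions for illustration; the question's real HTML is not shown here.

```php
<?php
// Hypothetical helper: concatenate the text of the third cell of each
// row in the first table of the given HTML.
function thirdColumnText(string $html): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // item(0) is the first table; item() returns null if there is none.
    $table = $dom->getElementsByTagName('table')->item(0);
    if ($table === null) {
        return '';
    }

    $text = '';
    foreach ($table->getElementsByTagName('tr') as $row) {
        // item(2) is the third cell; rows with fewer cells are skipped.
        $cell = $row->getElementsByTagName('td')->item(2);
        if ($cell !== null) {
            $text .= $cell->nodeValue;
        }
    }
    return $text;
}

echo thirdColumnText('<table><tr><td>a</td><td>b</td><td>c</td></tr></table>');
```

Checking item() for null before using the result is what avoids the "member function on a non-object" fatal error from the question title.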

Hope this helps.

domDocument is not returning node information

You are looking for $dom->documentElement, which returns a DOMNode object.

Also: get rid of the htmlentities call, because it will mess up the HTML code you fetch. For example, < becomes &lt;, which loadHTML won't interpret as the start of a tag. Take a look at: Disable warnings when loading non-well-formed HTML by DomDocument (PHP)
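
A quick way to see the effect of htmlentities is to count elements before and after escaping. This is a small sketch; countTags is a hypothetical helper written for this illustration.

```php
<?php
// Hypothetical helper: how many elements named $tag does the parser find?
function countTags(string $html, string $tag): int
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom->getElementsByTagName($tag)->length;
}

$raw     = '<b>Hi</b>';
$escaped = htmlentities($raw); // the angle brackets become &lt; and &gt;

echo countTags($raw, 'b'), "\n";     // the parser sees a real <b> element
echo countTags($escaped, 'b'), "\n"; // the escaped markup is plain text
```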

Dummy-Dump:

function dump(DOMNode $node)
{
    echo $node->nodeName;
    if ($node->hasChildNodes())
    {
        echo '<div style="margin-left:20px; border-left:1px solid black; padding-left: 5px;">';
        foreach ($node->childNodes as $childNode)
        {
            dump($childNode);
        }
        echo '</div>';
    }
}

dump($dom->documentElement);


Parsing html code with html error problem

The page is written in very old HTML (you can tell by the FONT tags, capitalization, etc.), so <br> tags, and probably paragraphs and other elements as well, are not closed. In this case I recommend using regular expressions to find them.
