PHP - Processing Invalid XML

What you need is something that will use libxml's internal errors to locate invalid characters and escape them accordingly. Here's a mockup of how I'd write it. Take a look at the result of libxml_get_errors() for error info.

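function load_invalid_xml($xml)
{
    // Collect libxml errors instead of emitting warnings; remember the old setting
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $sxe = simplexml_load_string($xml);

    // Compare with false: an empty SimpleXMLElement evaluates to false
    if ($sxe !== false)
    {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }

    $fixed_xml = '';
    $last_pos = 0;

    foreach (libxml_get_errors() as $error)
    {
        // $pos is the byte offset of the faulty character in $xml;
        // compute_position() is a placeholder, you have to write it yourself
        $pos = compute_position($xml, $error->line, $error->column);

        // Skip positions that are out of range or were already handled
        if ($pos < $last_pos || !isset($xml[$pos]))
        {
            continue;
        }

        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($xml[$pos]);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);

    libxml_clear_errors();
    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}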
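
The mockup leaves the position computation open. A helper in the role of compute_position() needs the XML string as well as the line and column. Below is a minimal sketch of one way to write it; the name line_column_to_offset is hypothetical, and it assumes "\n" line endings and byte-based columns. libxml's reported column may not point exactly at the offending character, so check the result against your own data.

```php
<?php
// Hypothetical helper (not part of the original answer): converts a 1-based
// line/column pair, as found in a LibXMLError, into a 0-based byte offset.
// Assumption: lines end with "\n" and the column counts bytes.
function line_column_to_offset(string $xml, int $line, int $column): int
{
    if ($line < 1 || $column < 1) {
        return -1;
    }

    $offset = 0;
    // Skip over the ($line - 1) lines that precede the target line.
    for ($i = 1; $i < $line; $i++) {
        $newline = strpos($xml, "\n", $offset);
        if ($newline === false) {
            return -1; // the document has fewer lines than requested
        }
        $offset = $newline + 1;
    }

    $offset += $column - 1;

    // Return -1 when the position falls outside the string.
    return $offset < strlen($xml) ? $offset : -1;
}
```

A return value of -1 signals a position that does not exist in the string, so the caller can skip that error instead of indexing out of range.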

Parse invalid XML manually

The DOMDocument::loadHTML method is more lenient than the XML parser and can automatically fix many errors. The problem is that you have no control over how libxml fixes them.

That's why I suggest another approach with DOMDocument::loadXML (which uses the XML parser), this time correcting the errors with custom rules. These rules aren't universal fixes; they are adapted to the specific situation.

When you switch libxml_use_internal_errors() to true, all XML errors are stored in an array of LibXMLError instances. Each of them contains an error code, the error line and the error column. (Note that line and column numbers start at 1.)
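
Before the full example, here is a minimal sketch of what those error objects look like. The sample document is made up for illustration (it contains a bare "&"), and the exact codes and columns you see depend on your input and libxml version.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
libxml_clear_errors();

// Deliberately broken sample input (made up for illustration): a bare "&".
$sample = "<root>\n<item>Fish & chips</item>\n</root>";

$doc = new DOMDocument;
$loaded = $doc->loadXML($sample);

$report = [];
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError with public code, line, column and message.
    $report[] = sprintf(
        'code %d at line %d, column %d: %s',
        $error->code,
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Restore the previous error handling state.
libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($report);
```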

$xml = file_get_contents('file.xml');

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadXML($xml);
$errors = libxml_get_errors();

if ($errors) {
    // LIBXML constant name, LIBXML error code // LIBXML error message
    define('XML_ERR_LT_IN_ATTRIBUTE', 38); // Unescaped '<' not allowed in attributes values
    define('XML_ERR_ATTRIBUTE_WITHOUT_VALUE', 41); // Specification mandate value for attribute
    define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name

    $rules = [
        XML_ERR_LT_IN_ATTRIBUTE => [
            'pattern' => '~(?:(?!\A)|.{%d}")[^<"]*\K<~A',
            'replacement' => [ 'string' => '&lt;', 'size' => 3 ]
        ],
        XML_ERR_ATTRIBUTE_WITHOUT_VALUE => [
            'pattern' => '~^.{%d}\h+\w+\h*=\h*"[^"]*\K"([^"]*)"~',
            'replacement' => [ 'string' => '&quot;$1&quot;', 'size' => 10 ]
        ],
        XML_ERR_NAME_REQUIRED => [
            'pattern' => '~^.{%d}[^&]*\K&~',
            'replacement' => [ 'string' => '&amp;', 'size' => 4 ]
        ]
    ];

    $previousLineNo = 0;
    $lines = explode("\n", $xml);

    foreach ($errors as $error) {

        if (!isset($rules[$error->code])) continue;

        $currentLineNo = $error->line;

        if ( $currentLineNo != $previousLineNo )
            $offset = -1;

        $currentLine = &$lines[$currentLineNo - 1];
        $pattern = sprintf($rules[$error->code]['pattern'], $error->column + $offset);
        $currentLine = preg_replace($pattern,
                                    $rules[$error->code]['replacement']['string'],
                                    $currentLine, -1, $count);
        $offset += $rules[$error->code]['replacement']['size'] * $count;
        $previousLineNo = $currentLineNo;
    }

    $xml = implode("\n", $lines);

    libxml_clear_errors();
    $dom->loadXML($xml);
    $errors = libxml_get_errors();
}

var_dump($errors);

$s = simplexml_import_dom($dom);

echo $s->product[0]["name"];

The size in the rules array is the difference between the size of the replacement string and the size of the replaced string. This way when there are several errors on the same line, the position of the next error is updated with $offset.
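
The arithmetic behind those size values can be checked with a short sketch. It assumes the replacement strings are the escaped entities (&lt;, &amp;, &quot;); the sample line is made up for illustration.

```php
<?php
// 'size' = length of the replacement minus length of the text it replaces.
$sizeLt   = strlen('&lt;') - strlen('<');          // 4 - 1
$sizeAmp  = strlen('&amp;') - strlen('&');         // 5 - 1
$sizeQuot = 2 * (strlen('&quot;') - strlen('"'));  // two quotes escaped per match

// Made-up line with two bare ampersands.
$line = 'a & b & c';
$secondBefore = strrpos($line, '&');

// Escape only the first ampersand, as the loop would for the first error.
$afterFirstFix = preg_replace('~&~', '&amp;', $line, 1);
$secondAfter = strrpos($afterFirstFix, '&');

// The second ampersand has moved right by exactly $sizeAmp bytes.
printf(
    "second '&' moved from %d to %d (shift %d)\n",
    $secondBefore,
    $secondAfter,
    $secondAfter - $secondBefore
);
```

This is why the column reported for a later error on the same line has to be shifted by the accumulated size before it is used in the next pattern.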

The libxml error constants are not available in PHP, which is why they are defined manually here (only to make the code more readable). You can find them in libxml2's xmlerror.h header (the xmlParserErrors enum).

Load an invalid XML in PHP DOM

First, check that it's the & that's causing the error and not something else.

One way or another, you'll have to modify the XML to get it parsed. Since loadHTML loads the markup from a string, can't you just replace the invalid characters with the correct ones before loading it?
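
If the culprit really is the bare "&", the replacement can be sketched as below. The function name escape_bare_ampersands is hypothetical, and the sketch assumes the input has no CDATA sections or comments where a literal "&" is legitimate; those would be escaped too.

```php
<?php
// Hypothetical helper: escapes only those ampersands that do not already
// start an entity reference (&name;) or a character reference (&#38; / &#x26;).
// Assumption: no CDATA sections or comments containing a literal "&".
function escape_bare_ampersands(string $xml): string
{
    $pattern = '/&(?!(?:[A-Za-z_:][\w:.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/';

    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($pattern, '&amp;', $xml) ?? $xml;
}

$broken = '<p>Fish & chips</p>';
$fixed  = escape_bare_ampersands($broken);
$parsed = simplexml_load_string($fixed);
```

Existing references are left untouched, so running the function twice does not double-escape anything.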

If your installation supports the PHP Tidy extension (http://php.net/manual/en/book.tidy.php) you could try to clean it up with that, though in my experience it's far from foolproof.

Check if XML is invalid from the response in PHP

You can use libxml_use_internal_errors(true) to suppress the parser warnings and libxml_get_errors() to fetch the error information as needed.

http://php.net/manual/en/function.libxml-use-internal-errors.php
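
A minimal sketch of such a check follows. The function name is_well_formed_xml is hypothetical, and "invalid" is taken here to mean "not well-formed"; validation against a DTD or schema is a separate step.

```php
<?php
// Hypothetical helper: returns true when the string parses as XML.
// The collected LibXMLError objects are handed back through $errors.
function is_well_formed_xml(string $xml, array &$errors = []): bool
{
    // Collect errors internally and remember the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $result = simplexml_load_string($xml);
    $errors = libxml_get_errors();

    // Leave the global libxml state as we found it.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Compare with false: an empty SimpleXMLElement is also falsy.
    return $result !== false;
}

$response = '<reply><status>ok</reply>'; // made-up response with a mismatched tag
if (!is_well_formed_xml($response, $problems)) {
    foreach ($problems as $problem) {
        echo trim($problem->message), ' (line ', $problem->line, ")\n";
    }
}
```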

PHP - Won't load invalid XML into DOM Document

Your code is most likely not causing parsing errors. (If it were, you would have seen a warning once error logging or reporting is enabled, but I don't think that's the case.)

Instead, the document loads fine. Because XML is UTF-8 encoded by default, the entities you use are not needed: the XML can contain the characters themselves directly.

Therefore both the entity definitions and the entities themselves inside the XML are superfluous. I guess DOMDocument just removes them.
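
As a small illustration of the first point, the sketch below loads the same text once with a literal character and once with a numeric character reference. The sample strings are made up, and the snippet assumes the PHP source file itself is saved as UTF-8. It does not show what DOMDocument does with entity definitions in a DTD.

```php
<?php
// Assumption: this source file is saved as UTF-8, so 'é' is a UTF-8 sequence.
$literal = new DOMDocument;
$literal->loadXML('<name>Café</name>');

// The same text, written with a numeric character reference instead.
$reference = new DOMDocument;
$reference->loadXML('<name>Caf&#233;</name>');

// Both documents end up with the same text content.
$same = $literal->documentElement->textContent
    === $reference->documentElement->textContent;

var_dump($same);
```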

Additionally, if you had provided an example XML chunk for testing purposes, you would have gotten a more concrete answer.


