PHP - Processing Invalid XML
What you need is something that uses libxml's internal errors to locate invalid characters and escape them accordingly. Here's a mockup of how I'd write it. Take a look at the result of libxml_get_errors() for error info.
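function load_invalid_xml($xml)
{
    // Collect libxml errors instead of emitting warnings; remember the old setting.
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments
    $sxe = simplexml_load_string($xml);
    if ($sxe !== false) // an empty element is falsy, so compare against false
    {
        libxml_use_internal_errors($use_internal_errors);
        return $sxe;
    }
    $fixed_xml = '';
    $last_pos = 0;
    foreach (libxml_get_errors() as $error)
    {
        // $pos is the position of the faulty character;
        // compute_position() is a placeholder you have to write yourself
        $pos = compute_position($xml, $error->line, $error->column);
        if ($pos < $last_pos || !isset($xml[$pos]))
        {
            continue; // skip positions that are out of order or out of range
        }
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($xml[$pos]);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);
    libxml_clear_errors();
    libxml_use_internal_errors($use_internal_errors);
    return simplexml_load_string($fixed_xml);
}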
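The mockup leaves the position computation open. As a minimal sketch, the line and column reported by libxml could be turned into a string offset as shown here. The helper name line_column_to_offset is hypothetical, and the sketch assumes "\n" line endings and a column counted in bytes. The reported column does not always land exactly on the faulty character, so the result may still need adjusting for a given error type.

```php
<?php
// Hypothetical helper: convert a 1-based (line, column) pair into a 0-based
// byte offset in $xml. Assumes "\n" line endings and a byte-based column.
// Returns null when the pair falls outside the string.
function line_column_to_offset(string $xml, int $line, int $column): ?int
{
    if ($line < 1 || $column < 1) {
        return null;
    }
    $offset = 0;
    // Skip the ($line - 1) lines that precede the target line.
    for ($i = 1; $i < $line; $i++) {
        $newline = strpos($xml, "\n", $offset);
        if ($newline === false) {
            return null; // the document has fewer lines than requested
        }
        $offset = $newline + 1;
    }
    $offset += $column - 1;
    return $offset < strlen($xml) ? $offset : null;
}

$sample = "<a>\n<b>x & y</b>\n</a>";
$offset = line_column_to_offset($sample, 2, 6); // offset of the '&' on line 2
```

A function like this could serve as the starting point for compute_position, with a per-error-code correction added on top if the column turns out to be off by a few characters.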
Parse invalid XML manually
The DOMDocument::loadHTML method is more lenient than the XML parser and can automatically fix many errors. The problem is that you have no control over how libxml fixes these errors.
That's why I suggest another approach with DOMDocument::loadXML (which uses the XML parser), but this time I will try to correct errors with custom rules (that aren't universal fixes but are adapted to the specific situation).
When you switch libxml_use_internal_errors() to true, all XML errors are stored in an array of LibXMLError instances. Each of them contains an error code, the error line and the error column. (Note that the first line and the first column are 1.)
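To see what those error objects look like before writing any rules, a small sketch like the following can help. The sample document is an assumption made up for illustration; the properties read here (code, line, column, message) are the public fields of PHP's LibXMLError class.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);
libxml_clear_errors();

// Sample input (an assumption for illustration): a bare '&' on line 2.
$xml = "<root>\n<item>Fish & Chips</item>\n</root>";

$dom = new DOMDocument;
$dom->loadXML($xml);

$errors = libxml_get_errors();
$report = [];
foreach ($errors as $error) {
    // Line and column are 1-based.
    $report[] = sprintf(
        'code %d at line %d, column %d: %s',
        $error->code,
        $error->line,
        $error->column,
        trim($error->message)
    );
}

// Leave libxml in the state we found it.
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo implode("\n", $report), "\n";
```

Printing the codes for a representative broken file is a quick way to decide which error codes actually need a rule.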
$xml = file_get_contents('file.xml');
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadXML($xml);
$errors = libxml_get_errors();
if ($errors) {
    // LIBXML constant name, LIBXML error code // LIBXML error message
    define('XML_ERR_LT_IN_ATTRIBUTE', 38); // Unescaped '<' not allowed in attributes values
    define('XML_ERR_ATTRIBUTE_WITHOUT_VALUE', 41); // Specification mandate value for attribute
    define('XML_ERR_NAME_REQUIRED', 68); // xmlParseEntityRef: no name
    $rules = [
        XML_ERR_LT_IN_ATTRIBUTE => [
            'pattern' => '~(?:(?!\A)|.{%d}")[^<"]*\K<~A',
            'replacement' => [ 'string' => '&lt;', 'size' => 3 ]
        ],
        XML_ERR_ATTRIBUTE_WITHOUT_VALUE => [
            'pattern' => '~^.{%d}\h+\w+\h*=\h*"[^"]*\K"([^"]*)"~',
            'replacement' => [ 'string' => '&quot;$1&quot;', 'size' => 10 ]
        ],
        XML_ERR_NAME_REQUIRED => [
            'pattern' => '~^.{%d}[^&]*\K&~',
            'replacement' => [ 'string' => '&amp;', 'size' => 4 ]
        ]
    ];
    $previousLineNo = 0;
    $lines = explode("\n", $xml);
    foreach ($errors as $error) {
        if (!isset($rules[$error->code])) continue;
        $currentLineNo = $error->line;
        // Reset the offset whenever the error is on a new line
        if ($currentLineNo != $previousLineNo) {
            $offset = -1;
        }
        $currentLine = &$lines[$currentLineNo - 1];
        $pattern = sprintf($rules[$error->code]['pattern'], $error->column + $offset);
        $currentLine = preg_replace($pattern,
            $rules[$error->code]['replacement']['string'],
            $currentLine, -1, $count);
        // Shift later columns on this line by the number of characters added
        $offset += $rules[$error->code]['replacement']['size'] * $count;
        $previousLineNo = $currentLineNo;
    }
    unset($currentLine); // drop the reference into $lines
    $xml = implode("\n", $lines);
    libxml_clear_errors();
    $dom->loadXML($xml);
    $errors = libxml_get_errors();
}
var_dump($errors);
$s = simplexml_import_dom($dom);
echo $s->product[0]["name"];
The size in the rules array is the difference between the size of the replacement string and the size of the replaced string. This way, when there are several errors on the same line, the position of the next error is updated with $offset.
libxml error constants are not available in PHP, which is why they are defined manually (only to make the code more readable). Their values come from the xmlParserErrors enumeration in libxml2's xmlerror.h.
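The size arithmetic can be checked with a few lines of PHP. This is only a sketch: replacement_growth is a hypothetical helper name, and it assumes the replacements are the escaped forms &lt;, &quot; and &amp;. It also shows why a later position on the same line has to be shifted after an earlier replacement.

```php
<?php
// Hypothetical helper: how many characters a replacement adds to the line.
function replacement_growth(string $replaced, string $replacement): int
{
    return strlen($replacement) - strlen($replaced);
}

$ltGrowth   = replacement_growth('<', '&lt;');        // one '<' escaped
$ampGrowth  = replacement_growth('&', '&amp;');       // one '&' escaped
$quotGrowth = 2 * replacement_growth('"', '&quot;');  // two quotes escaped per match

// Two errors on the same line: the second '<' moves once the first is escaped.
$line   = 'x="a<b<c"';
$second = strrpos($line, '<');                        // position in the original line
$line   = substr_replace($line, '&lt;', strpos($line, '<'), 1);
$shifted = $second + $ltGrowth;                       // position in the modified line
```

Deriving the sizes this way keeps them in sync if a replacement string is ever changed.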
Load an invalid XML in PHP DOM
First, check that it's the & that's causing the error and not something else.
One way or another, you'll have to modify the XML to get it parsed. The HTML in loadHTML is loaded from a string; can't you just replace the invalid characters with the correct ones?
If your installation supports the PHP Tidy extension (http://php.net/manual/en/book.tidy.php) you could try to clean it up with that, though in my experience it's far from foolproof.
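For the ampersand case, replacing the invalid characters in the string before parsing could look like the sketch below. The helper name escape_bare_ampersands and the sample markup are assumptions for illustration. The regular expression works on the raw text, so it would also rewrite an & inside a CDATA section or a comment.

```php
<?php
// Hypothetical helper: escape every '&' that does not start an entity or
// character reference such as &name;, &#38; or &#x26;.
function escape_bare_ampersands(string $xml): string
{
    $result = preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
    return $result ?? $xml; // preg_replace returns null on a regex engine error
}

$broken = '<p title="R&D">Fish & Chips &amp; peas &#38; more</p>';
$fixed  = escape_bare_ampersands($broken);

$dom    = new DOMDocument;
$loaded = $dom->loadXML($fixed); // true once the bare ampersands are escaped
```

Existing references are left alone, so running the function twice does not double-escape anything.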
Check if XML is invalid from the response (PHP)
You can use libxml_use_internal_errors(true) to suppress libxml warnings and libxml_get_errors() to fetch error information as needed.
http://php.net/manual/en/function.libxml-use-internal-errors.php
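Put together, a validity check built on those two functions might look like this sketch. The function name xml_errors and the message format are assumptions, not part of any library.

```php
<?php
// Hypothetical helper: returns an empty array when $xml is well-formed,
// otherwise a list of human-readable libxml error messages.
function xml_errors(string $xml): array
{
    if (trim($xml) === '') {
        return ['Empty document'];
    }

    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $messages = [];
    if (simplexml_load_string($xml) === false) {
        foreach (libxml_get_errors() as $error) {
            $messages[] = sprintf(
                'Line %d, column %d: %s',
                $error->line,
                $error->column,
                trim($error->message)
            );
        }
        if (!$messages) {
            $messages[] = 'Unknown parse error';
        }
    }

    // Restore the previous error handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

$response = '<response><status>ok</status>'; // sample input, missing </response>
$problems = xml_errors($response);           // non-empty for this sample
```

Returning the messages instead of a plain boolean lets the caller log why a response was rejected.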
PHP - Won't load invalid XML into DOM Document
Your code is most likely not causing parsing errors. If it were, you would have seen a warning once error logging or reporting was enabled, and I don't think that's the case.
Instead, the document loads fine. Because XML is UTF-8 encoded by default, the entities you use do not have to be transported: the XML can contain the characters those entities stand for directly.
Therefore both the entity definitions and the entity references inside the XML are superfluous. I guess DOMDocument will just remove those.
Also, if you had provided an example XML chunk for testing, you would have gotten a more concrete answer.
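The point about superfluous entities can be illustrated with a small sketch. Both sample documents are assumptions for illustration: one spells a character through an entity defined in the DOCTYPE, the other contains the UTF-8 character itself. The entity version is loaded with LIBXML_NOENT so that the reference is substituted while parsing.

```php
<?php
// Sample documents (assumptions for illustration).
$withEntity = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<!DOCTYPE r [<!ENTITY eacute "&#233;">]>'
    . '<r>caf&eacute;</r>';
$withLiteral = '<?xml version="1.0" encoding="UTF-8"?><r>caf' . "\u{E9}" . '</r>';

$a = new DOMDocument;
// LIBXML_NOENT substitutes entity references during parsing. Avoid it on
// untrusted input, because it also allows external entities to be expanded.
$a->loadXML($withEntity, LIBXML_NOENT);

$b = new DOMDocument;
$b->loadXML($withLiteral);

// Both documents carry the same text once parsed.
$same = $a->documentElement->textContent === $b->documentElement->textContent;
```

Since the parsed content is identical, writing the characters directly in a UTF-8 document makes the entity definitions unnecessary.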