Fix malformed XML in PHP before processing using DOMDocument functions

Try the Tidy extension, which can clean up malformed HTML and XML:
http://php.net/manual/en/book.tidy.php
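As a rough sketch of what that could look like, the snippet below passes a string through tidy_repair_string() with Tidy's XML input and output options turned on. The helper name tidyRepairXml and the sample string are assumptions made for illustration, and the helper simply returns null when the tidy extension is not installed. How well Tidy copes with a given malformed document varies, so check the result before relying on it.

```php
<?php
// Hypothetical helper (not part of the original answer): returns Tidy's
// repaired XML, or null when the tidy extension is not available.
function tidyRepairXml(string $xml): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null;
    }
    $config = [
        'input-xml'  => true, // parse the input as XML rather than HTML
        'output-xml' => true, // write the result as XML
        'wrap'       => 0,    // do not re-wrap long lines
    ];
    $repaired = tidy_repair_string($xml, $config, 'utf8');
    return is_string($repaired) ? $repaired : null;
}

// Hypothetical sample input with a bare ampersand.
$result = tidyRepairXml('<feed><NAME>This & This</NAME></feed>');
```

Feed $result to loadXML() only after checking that it is not null.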

A pure PHP solution to fix some XML like this:

<?xml version="1.0"?>
<feed>
<RECORD>
<ID>117387</ID>
<ADVERTISERNAME>Test < texter</ADVERTISERNAME>
<AID>10544740</AID>
<NAME>This & This</NAME>
<DESCRIPTION>For one day only this is > than this.</DESCRIPTION>
</RECORD>
</feed>

Would be something like this:

  function cleanupXML($xml) {
      $xmlOut = '';
      $inTag = false;
      $xmlLen = strlen($xml);
      for ($i = 0; $i < $xmlLen; ++$i) {
          $char = $xml[$i];
          switch ($char) {
              case '<':
                  if (!$inTag) {
                      // Seek forward for the next tag boundary
                      for ($j = $i + 1; $j < $xmlLen; ++$j) {
                          $nextChar = $xml[$j];
                          switch ($nextChar) {
                              case '<': // Means a < in text
                                  $char = '&lt;';
                                  break 2;
                              case '>': // Means we are in a tag
                                  $inTag = true;
                                  break 2;
                          }
                      }
                  } else {
                      $char = '&lt;';
                  }
                  break;
              case '>':
                  if (!$inTag) { // No need to seek ahead here
                      $char = '&gt;';
                  } else {
                      $inTag = false;
                  }
                  break;
              case '&':
                  if (!$inTag) { // A bare ampersand in text
                      $char = '&amp;';
                  }
                  break;
          }
          $xmlOut .= $char;
      }
      return $xmlOut;
  }

This is a simple state machine that tracks whether we are inside a tag and, when we are not, escapes &, < and > in the text. Only those three characters are escaped: calling htmlentities() on individual bytes can emit HTML-only named entities that XML does not define, and can drop the bytes of multi-byte UTF-8 characters. Note that entities which are already escaped in the input (such as &amp;) will be escaped a second time.

It's worth noting that this builds the whole result in memory, so it will be memory hungry on large files; you may want to rewrite it as a stream filter or a pre-processor.

Modify existing XML file with DomDocument PHP

DOMDocument has no method asXML(). You need to use save():

$dom->save("vocab.xml");
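A minimal sketch of the full load, modify, save round trip follows. The temporary file, the vocab and word element names, and the sample values are assumptions chosen so the snippet runs on its own; substitute your real file path.

```php
<?php
// Hypothetical sample file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'vocab');
file_put_contents($path, '<?xml version="1.0"?><vocab><word>alpha</word></vocab>');

$dom = new DOMDocument();
$dom->load($path);

// Append a new <word> element to the root element.
$dom->documentElement->appendChild($dom->createElement('word', 'beta'));

// save() writes the document to a file and returns the number of bytes
// written (or false on failure); saveXML() returns it as a string instead.
$bytes = $dom->save($path);
$saved = file_get_contents($path);
```

If you only need the serialized string, for example to echo it, saveXML() is the closest equivalent to SimpleXML's asXML().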

Parsing XML-like document with PHP DOMDocument

You're looking for loadXML, not loadHTML.

No need to surround everything with HTML tags, just add a dummy <root> item instead, because any valid XML document must have one (you can also add it to the $xml variable itself).

Also, using @ before function calls should be avoided in 99% of cases: it prevents you from seeing and understanding what's wrong.

The following should do it:

$doc->loadXML('<root>' . $xml . '</root>');

Demo here: https://3v4l.org/s8QvM
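For reference, here is a self-contained sketch of the same idea that also shows one way to inspect parser errors without @, using libxml_use_internal_errors() and libxml_get_errors(). The fragment in $xml is a made-up example, not the asker's data.

```php
<?php
// Hypothetical fragment with two top-level elements; on its own this is
// not a well-formed XML document.
$xml = '<item>one</item><item>two</item>';

// Collect parser errors instead of silencing them with @.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$ok = $doc->loadXML('<root>' . $xml . '</root>');

foreach (libxml_get_errors() as $error) {
    // Each LibXMLError carries a message plus the line it occurred on.
    echo trim($error->message), ' at line ', $error->line, PHP_EOL;
}
libxml_clear_errors();
libxml_use_internal_errors($previous);

$count = $doc->getElementsByTagName('item')->length;
```

With the dummy root in place the fragment parses and both item elements are reachable; if parsing fails, the loop prints what libxml complained about.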

changing xml document label (utf encoding) in php

Not a definitive solution, but a hack that gets around the problem in such situations.

After loading the XML document as a string, and before any further processing, replace utf-16 with utf-8 in the XML declaration using preg_replace. This only changes the label, so it helps when the bytes are really UTF-8 and the declaration is wrong.

$source_xml_file = file_get_contents($incorrect_xml_file, TRUE);
$corrected_xml_file = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $source_xml_file);
$xmldoc->loadXML($corrected_xml_file);

Works in my case.
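A self-contained sketch of the same hack, assuming the input is UTF-8 data whose declaration wrongly claims UTF-16. The sample string is invented for illustration.

```php
<?php
// Hypothetical input: UTF-8 bytes ("\xC3\xA9" is e-acute in UTF-8) behind a
// declaration that wrongly claims UTF-16.
$source = '<?xml version="1.0" encoding="UTF-16"?><word>caf' . "\xC3\xA9" . '</word>';

// Rewrite only the encoding label inside the XML declaration.
$corrected = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $source);

$xmldoc = new DOMDocument();
$loaded = $xmldoc->loadXML($corrected);
$text = $xmldoc->documentElement->textContent;
```

If the file really is UTF-16 encoded, relabeling it will not help; the bytes would need to be converted instead.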

QueryPath or DOMDocument for editing large XML files?

QueryPath is basically just a wrapper around DOMDocument. It adds relatively little overhead to a bare DOMDocument object. For accessing and writing operations -- things like attr(), append(), and such, there should be no noteworthy performance difference.

But then comes the big issue: finding stuff.

Traditionally, traversing a DOMDocument is done by either "walking the tree" or using DOMDocument->getElementsByTagName(). This performs relatively well if you're willing to write the code.

Querying with QueryPath 2.x will be sorta slow on a document that size unless you use very specific selectors (e.g. ':root>foo>bar>baz').

However, QueryPath 3.x, which is about to go into Alpha1, is many, many times faster when querying large objects. Doing qp('foo') is as fast as XPath... which brings me to the last option.

Then there's the built-in XPath processor that also comes with PHP's libxml support. That might give you better performance if you're doing a large XML document, since it is executed at C speed instead of at PHP speed. But you will have to write XPath expressions, which are (IMHO) sort of a pain.

So the bottom line:

  • Basics: Either one will do.
  • Modification: Either one will do.
  • Lots of traversing:
    • DOMDocument will make you traverse manually.
    • QueryPath 2.x is slow
    • QueryPath 3.x is much faster
    • XPath is fastest... but it's XPath
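To make the traversal options concrete, here is a small sketch that finds elements first with getElementsByTagName() and then with PHP's built-in DOMXPath. The sample feed and its values are invented for the example, and it says nothing about relative speed on a real large document.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<feed>'
    . '<RECORD><ID>1</ID><NAME>first</NAME></RECORD>'
    . '<RECORD><ID>2</ID><NAME>second</NAME></RECORD>'
    . '</feed>'
);

// Tree-style lookup: every <NAME> element in document order.
$allNames = [];
foreach ($doc->getElementsByTagName('NAME') as $node) {
    $allNames[] = $node->textContent;
}

// XPath lookup: only the NAME of the record whose ID is 2.
$xpath = new DOMXPath($doc);
$matched = [];
foreach ($xpath->query('/feed/RECORD[ID="2"]/NAME') as $node) {
    $matched[] = $node->textContent;
}
```

The XPath version expresses the filter in one expression, whereas the tag-name version would need extra PHP code to check each record's ID.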

Load an invalid XML in PHP DOM

First, check that it's the & that's causing the error and not something else.

One way or another, you'll have to modify the XML to get it parsed. Since loadHTML takes its input from a string, can't you just replace the invalid characters with the correct ones before loading it?
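If the bare & does turn out to be the culprit, one possible replacement is a regular expression that escapes only ampersands which do not already start an entity or character reference. The helper name escapeBareAmpersands and the sample string are assumptions for this sketch. It does not know about CDATA sections, and named entities that XML does not define (such as &nbsp;) are left alone and will still fail to parse.

```php
<?php
// Hypothetical helper: escape every & that is not already the start of a
// named entity (&name;) or a numeric character reference (&#38; / &#x26;).
function escapeBareAmpersands(string $xml): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

// Hypothetical sample: one bare ampersand, one entity, one numeric reference.
$broken = '<NAME>This & This &amp; that &#38; more</NAME>';
$fixed = escapeBareAmpersands($broken);

$doc = new DOMDocument();
$ok = $doc->loadXML($fixed);
```

Only the first ampersand is rewritten; the existing &amp; and &#38; are left untouched, so nothing gets escaped twice.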

If your installation supports the PHP Tidy extension (http://php.net/manual/en/book.tidy.php) you could try to clean it up with that, though in my experience it's far from foolproof.

DomDocument encoding

It's not a problem with the encoding of the XML document; it's a problem with the encoding of the running PHP script. Make sure the PHP script itself is also saved as UTF-8 and that the correct charset header is sent to the browser.


