Fix malformed XML in PHP before processing using DOMDocument functions
Try the Tidy library, which can clean up bad HTML and XML:
http://php.net/manual/en/book.tidy.php
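A minimal sketch of the Tidy approach, assuming the tidy extension is installed. The helper name repairXmlWithTidy and the sample input are hypothetical; input-xml and output-xml are HTML Tidy configuration options that switch it to XML mode. How much Tidy can repair depends on the damage, so treat the result as a candidate to re-parse, not a guarantee.

```php
<?php
// Hypothetical helper: returns Tidy's repaired output, or null when the
// tidy extension is not available or Tidy fails on the input.
function repairXmlWithTidy(string $xml): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not loaded
    }
    $config = [
        'input-xml'  => true, // parse the input as XML rather than HTML
        'output-xml' => true, // write the result as XML
        'wrap'       => 0,    // do not re-wrap long lines
    ];
    $repaired = tidy_repair_string($xml, $config, 'utf8');
    return is_string($repaired) ? $repaired : null;
}

// Made-up input with a bare ampersand.
$repaired = repairXmlWithTidy('<feed><NAME>This & This</NAME></feed>');
```

If $repaired is null, fall back to another approach, such as the pure PHP one below.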
A pure PHP solution to fix some XML like this:
<?xml version="1.0"?>
<feed>
  <RECORD>
    <ID>117387</ID>
    <ADVERTISERNAME>Test < texter</ADVERTISERNAME>
    <AID>10544740</AID>
    <NAME>This & This</NAME>
    <DESCRIPTION>For one day only this is > than this.</DESCRIPTION>
  </RECORD>
</feed>
Would be something like this:
function cleanupXML($xml) {
    $xmlOut = '';
    $inTag = false;
    $xmlLen = strlen($xml);
    for ($i = 0; $i < $xmlLen; ++$i) {
        $char = $xml[$i];
        switch ($char) {
            case '<':
                if (!$inTag) {
                    // Seek forward for the next tag boundary
                    for ($j = $i + 1; $j < $xmlLen; ++$j) {
                        $nextChar = $xml[$j];
                        switch ($nextChar) {
                            case '<': // Another < comes first, so this one is text
                                $char = '&lt;';
                                break 2;
                            case '>': // A > comes first, so we are in a tag
                                $inTag = true;
                                break 2;
                        }
                    }
                } else {
                    $char = '&lt;';
                }
                break;
            case '>':
                if (!$inTag) { // No need to seek ahead here
                    $char = '&gt;';
                } else {
                    $inTag = false;
                }
                break;
            default:
                // Escape & by hand. Calling htmlentities() on one byte at a
                // time would drop multi-byte UTF-8 characters. Note that an
                // existing entity such as &amp; gets escaped again.
                if (!$inTag && $char === '&') {
                    $char = '&amp;';
                }
                break;
        }
        $xmlOut .= $char;
    }
    return $xmlOut;
}
This is a simple state machine that tracks whether we are inside a tag and, if not, escapes the special characters (&, < and >) in the text.
It's worth noting that this will be memory hungry on large files, so you may want to rewrite it as a stream filter or a pre-processor.
Modify existing XML file with DomDocument PHP
DOMDocument has no method asXML(); that method belongs to SimpleXMLElement. You need to use save():
$dom->save("vocab.xml");
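A fuller sketch of the load, modify, save round trip. It works on a temporary file so it can run on its own; the path and the vocab/word element names are made up for illustration.

```php
<?php
// Hypothetical input file, created here so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'vocab');
file_put_contents($path, '<?xml version="1.0"?><vocab><word>alpha</word></vocab>');

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // let formatOutput re-indent the tree
$dom->formatOutput = true;
$dom->load($path);

// Append a new <word> element to the root element.
$dom->documentElement->appendChild($dom->createElement('word', 'beta'));

// save() writes the document back to disk and returns the number of
// bytes written, or false on failure.
$bytes = $dom->save($path);

$saved = file_get_contents($path);
unlink($path); // clean up the temporary file
```

If you want the XML as a string instead of a file, DOMDocument offers saveXML().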
Parsing XML-like document with PHP DOMDocument
You're looking for loadXML, not loadHTML.
No need to surround everything with HTML tags. Just add a dummy <root> element instead, because any valid XML document must have exactly one root element (you can also add it to the $xml variable itself).
Also, using @ before function calls should be avoided in 99% of cases; it prevents you from seeing and understanding what's wrong.
The following should do it:
$doc->loadXML('<root>' . $xml . '</root>');
Demo here: https://3v4l.org/s8QvM
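For reference, a self-contained sketch of the same idea that collects parser errors through libxml instead of hiding them with @. The fragment and its item elements are hypothetical.

```php
<?php
// Hypothetical fragment: several top-level elements and no single root.
$xml = '<item id="1">first</item><item id="2">second</item>';

// Collect parser errors instead of silencing them with @.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadXML('<root>' . $xml . '</root>');

$errors = libxml_get_errors(); // empty when the wrapped fragment parsed cleanly
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Read the original top-level elements back from under the dummy root.
$values = [];
if ($loaded) {
    foreach ($doc->documentElement->childNodes as $node) {
        if ($node instanceof DOMElement) {
            $values[] = $node->getAttribute('id') . '=' . $node->textContent;
        }
    }
}
```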
Changing XML document label (UTF encoding) in PHP
Not an absolute solution, but a hack that solves the problem in situations like this. Note that it only changes the label; it does not convert the bytes, so it helps only when the content is not really UTF-16.
Just replace utf-16 with utf-8 in the XML declaration, using preg_replace, after loading the XML document as a string and before any further processing.
$source_xml_file = file_get_contents($incorrect_xml_file, TRUE);
$corrected_xml_file = preg_replace('/(<\?xml[^?]+?)utf-16/i', '$1utf-8', $source_xml_file);
$xmldoc->loadXML($corrected_xml_file);
Works in my case.
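A slightly stricter variant of the same hack, as a sketch: it only touches the encoding attribute of a declaration at the very start of the string and replaces at most once. The helper name relabelXmlAsUtf8 and the sample document are hypothetical, and it assumes there is no byte order mark in front of the declaration.

```php
<?php
// Hypothetical helper: rewrite only the encoding label of the XML
// declaration at the start of the string. It does not convert any bytes.
function relabelXmlAsUtf8(string $xml): string
{
    return preg_replace(
        '/\A(<\?xml[^?]*?encoding\s*=\s*["\'])utf-16(["\'])/i',
        '$1utf-8$2',
        $xml,
        1 // replace at most once
    ) ?? $xml; // preg_replace returns null on error; fall back to the input
}

// Made-up document whose label says UTF-16 while the bytes are plain ASCII.
$source = '<?xml version="1.0" encoding="UTF-16"?><note>utf-16 in text stays</note>';
$fixed = relabelXmlAsUtf8($source);

$doc = new DOMDocument();
$loaded = $doc->loadXML($fixed);
```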
QueryPath or DOMDocument for editing large XML files?
QueryPath is basically just a wrapper around DOMDocument. It adds relatively little overhead to a bare DOMDocument object. For access and write operations, things like attr(), append(), and such, there should be no noteworthy performance difference.
But then it comes to the big issue: finding stuff.
Traditionally, traversing a DOMDocument is done by either "walking the tree" or using getElementsByTagName() (available on DOMDocument and DOMElement). This performs relatively well if you're willing to write the code.
Querying with QueryPath 2.x will be sorta slow on a document that size unless you use very specific selectors (e.g. ':root>foo>bar>baz').
However, QueryPath 3.x, which is about to go into Alpha1, is many, many times faster when querying large objects. Doing qp('foo') is as fast as XPath... which brings me to the last option.
Then there's the built-in XPath processor that also comes with PHP's libxml support. That might give you better performance on a large XML document, since it is executed at C speed instead of at PHP speed. But you will have to write XPath expressions, which are (IMHO) sort of a pain.
So the bottom line:
- Basics: Either one will do.
- Modification: Either one will do.
- Lots of traversing:
  - DOMDocument will make you traverse manually.
  - QueryPath 2.x is slow.
  - QueryPath 3.x is much faster.
  - XPath is fastest... but it's XPath.
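To make the manual traversal versus XPath comparison concrete, here is a small sketch using only DOMDocument and DOMXPath. The document and its element names are made up, and it illustrates the two styles, not their relative speed.

```php
<?php
// Hypothetical document; the element names are made up for illustration.
$doc = new DOMDocument();
$doc->loadXML(
    '<library>'
    . '<shelf><book lang="en">A</book><book lang="de">B</book></shelf>'
    . '<shelf><book lang="en">C</book></shelf>'
    . '</library>'
);

// Manual traversal: getElementsByTagName() returns every <book>, and the
// filtering is done in PHP code.
$english = [];
foreach ($doc->getElementsByTagName('book') as $book) {
    if ($book->getAttribute('lang') === 'en') {
        $english[] = $book->textContent;
    }
}

// XPath: the filter is part of the expression and is evaluated by libxml.
$xpath = new DOMXPath($doc);
$viaXPath = [];
foreach ($xpath->query('/library/shelf/book[@lang="en"]') as $book) {
    $viaXPath[] = $book->textContent;
}
```

Both loops collect the same two books; the XPath version moves the selection logic into the query string.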
Load an invalid XML in PHP DOM
First, check that it's the & that's causing the error and not something else.
One way or another, you'll have to modify the XML to get it parsed. The HTML in loadHTML is loaded from a string, so can't you just replace the invalid characters with the correct ones?
If your installation supports the PHP Tidy extension (http://php.net/manual/en/book.tidy.php), you could try to clean it up with that, though in my experience it's far from foolproof.
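A sketch of both steps, using loadXML since the input here is XML: first ask libxml what it actually complains about, then escape only the ampersands that are not already part of an entity. The helper names and the sample string are hypothetical, and the regex is a swapped-in technique, not something the parser provides.

```php
<?php
// Hypothetical helper: escape every & that does not already start an
// entity or character reference (&name;, &#123;, &#x1F;).
function escapeBareAmpersands(string $markup): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $markup
    ) ?? $markup; // fall back to the input if the regex engine fails
}

// Hypothetical helper: return the messages libxml reports for a string.
function xmlParseErrors(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    (new DOMDocument())->loadXML($xml);
    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = trim($error->message);
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

$broken = '<NAME>This & That &amp; more</NAME>';
$before = xmlParseErrors($broken);       // non-empty: the bare & is rejected
$fixed  = escapeBareAmpersands($broken); // only the bare & is escaped
$after  = xmlParseErrors($fixed);        // empty once the input is well-formed
```

Looking at $before first tells you whether the ampersand really is the only problem before you start rewriting the input.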
DomDocument encoding
It's not a problem with the encoding of the XML document; it's a problem with the encoding of the running PHP script. Make sure the PHP script itself is also saved as UTF-8 and that the correct charset header is sent to the browser.
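A minimal sketch of what that looks like, assuming the script file itself is saved as UTF-8. The greeting element, the sample text and the sendXmlResponse helper are hypothetical.

```php
<?php
// The string literal below is UTF-8 only if this script file is saved as UTF-8.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->appendChild($doc->createElement('greeting', 'Grüße'));
$xml = $doc->saveXML();

// Hypothetical helper: send the charset header that matches the document,
// then the document itself.
function sendXmlResponse(string $xml): void
{
    if (!headers_sent()) {
        header('Content-Type: text/xml; charset=utf-8');
    }
    echo $xml;
}

// Only emit a response when running under a web server.
if (PHP_SAPI !== 'cli') {
    sendXmlResponse($xml);
}
```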