How to Skip Invalid Characters in XML File Using PHP

How to skip invalid characters in XML file using PHP

Do you have control over the XML? If so, ensure the data is enclosed in <![CDATA[ .. ]]> blocks.

And you also need to clear the invalid characters:

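/**
 * Replaces bytes that are not allowed in XML 1.0 with a space.
 *
 * Note: ord() returns a single byte (0-255), so in practice only the
 * control-character checks take effect; the bytes of multi-byte UTF-8
 * sequences pass through unchanged.
 *
 * @param string $value
 * @return string
 */
function stripInvalidXml($value)
{
    $ret = "";
    // Do not use empty() here: it would also discard the string "0".
    if ($value === null || $value === "") {
        return $ret;
    }

    $length = strlen($value);
    for ($i = 0; $i < $length; $i++) {
        $current = ord($value[$i]);
        if (($current == 0x9) ||
            ($current == 0xA) ||
            ($current == 0xD) ||
            (($current >= 0x20) && ($current <= 0xD7FF)) ||
            (($current >= 0xE000) && ($current <= 0xFFFD)) ||
            (($current >= 0x10000) && ($current <= 0x10FFFF))) {
            $ret .= chr($current);
        } else {
            $ret .= " ";
        }
    }
    return $ret;
}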
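
Because ord() only ever sees one byte at a time, the loop above cannot test code points beyond 0xFF. A code-point-aware variant could look like the sketch below. It assumes the input is meant to be UTF-8; the function name stripInvalidXmlChars and the choice to throw on malformed UTF-8 are my own assumptions, not part of the original answer.

```php
<?php
/**
 * Replaces every code point outside the XML 1.0 Char production.
 * Hypothetical helper: name and error handling are illustrative only.
 */
function stripInvalidXmlChars(string $value, string $replacement = ' '): string
{
    // The /u modifier makes PCRE match whole UTF-8 code points, not bytes.
    $result = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        $replacement,
        $value
    );

    // With /u, preg_replace() returns null when the subject is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $result;
}
```

Multi-byte characters are kept intact, and the ranges in the character class mirror the conditions in the loop.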

How to skip/remove invalid non-utf8 characters from a xml file

Answer from comment:

Try removing/escaping ampersands in this text or wrap it in a CDATA
block

So before calling

simplexml_load_string($string)

I added this:

$string = str_replace('&', ' ', $string);

and now it works: with no & left in the string, simplexml_load_string() can parse it without errors.
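
Replacing every & with a space also destroys entities that were already correct, such as &amp;. If that matters, one option is to escape only the bare ampersands. This is just a sketch; the helper name escapeBareAmpersands is made up, and it assumes that anything shaped like &name; or &#123; is an intentional reference (names that XML does not define, such as &nbsp;, would still fail to parse).

```php
<?php
/**
 * Turns "&" into "&amp;" unless it already starts something that looks
 * like an entity or character reference. Hypothetical helper.
 */
function escapeBareAmpersands(string $xml): string
{
    return preg_replace(
        '/&(?!(?:#[0-9]+|#x[0-9a-fA-F]+|[A-Za-z_][A-Za-z0-9._-]*);)/',
        '&amp;',
        $xml
    );
}

$string = '<title>Tom & Jerry &amp; friends</title>';
$xml = simplexml_load_string(escapeBareAmpersands($string));
```

The negative lookahead leaves the existing &amp; alone and only rewrites the bare & before "Jerry".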

PHP XMLReader stumbles upon invalid character and stops

Ended up finding a solution after all.

I decided to use fopen to construct & process on the fly. Here's what I ended up with:

$handle = fopen('compress.zlib://' . $file, 'r');
$xml_source = '';
$record = false;
if ($handle) {
    while (($buffer = fgets($handle, 4096)) !== false) {
        // Start a new document when the record's opening tag appears.
        if (strpos($buffer, '<open_tag>') !== false) {
            $xml_source = '<?xml version="1.0" encoding="UTF-8"?>';
            $record = true;
        }
        // The closing tag completes the record: clean it, then parse it.
        if (strpos($buffer, '</close_tag') !== false) {
            $xml_source .= $buffer;
            $record = false;
            $xml = simplexml_load_string(stripInvalidXml($xml_source));

            // ... do stuff here with the xml element

        }
        if ($record) {
            $xml_source .= $buffer;
        }
    }
    fclose($handle);
}

The function stripInvalidXml() is the one quickshiftin provided. Works like a charm.

PHP: removing invalid utf-8 characters in XML using filter

No, I don't think it will work. It will strip valid sequences of code units that happen to be split between several buckets.

It should not consume potentially incomplete sequences in the end (and, if necessary, it should pass nothing and return PSFS_FEED_ME).
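
To make that concrete, here is a rough sketch of a filter that holds back a trailing incomplete sequence until the next call. The class name Utf8StripFilter and the filter name utf8.strip are invented for this example, and it drops malformed bytes rather than replacing them; treat it as an illustration of the buffering idea, not a tested drop-in.

```php
<?php
class Utf8StripFilter extends php_user_filter
{
    // Byte-level pattern matching only well-formed UTF-8 sequences.
    private const VALID_UTF8 = '/[\x00-\x7F]+|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2}/';

    /** @var string Incomplete trailing sequence kept from the previous call. */
    private $pending = '';

    public function filter($in, $out, &$consumed, $closing): int
    {
        $data = $this->pending;
        $this->pending = '';
        $last = null;

        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $data .= $bucket->data;
            $last = $bucket;
        }

        // While more data may follow, keep a possibly incomplete tail for later.
        if (!$closing) {
            $tail = self::incompleteTailLength($data);
            if ($tail > 0) {
                $this->pending = substr($data, -$tail);
                $data = (string) substr($data, 0, -$tail);
            }
        }

        // Keep only the well-formed sequences; everything else is dropped.
        preg_match_all(self::VALID_UTF8, $data, $matches);
        $clean = implode('', $matches[0]);

        if ($last === null || $clean === '') {
            // Nothing to pass on yet: ask for more data unless the stream is closing.
            return $closing ? PSFS_PASS_ON : PSFS_FEED_ME;
        }

        $last->data = $clean;
        stream_bucket_append($out, $last);
        return PSFS_PASS_ON;
    }

    /** Number of trailing bytes that start a sequence but do not finish it. */
    private static function incompleteTailLength(string $data): int
    {
        $len = strlen($data);
        for ($back = 1; $back <= min(4, $len); $back++) {
            $byte = ord($data[$len - $back]);
            if (($byte & 0xC0) === 0x80) {
                continue; // continuation byte, keep looking for the lead byte
            }
            $need = $byte >= 0xF0 ? 4 : ($byte >= 0xE0 ? 3 : ($byte >= 0xC0 ? 2 : 1));
            return $need > $back ? $back : 0;
        }
        return 0;
    }
}

stream_filter_register('utf8.strip', Utf8StripFilter::class);
```

A held-back tail that is never completed is simply discarded when the stream closes, since by then it is known to be malformed.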

How to skip invalid XML file with incomplete closing tags in PHP

You were almost there. Try adding the command libxml_use_internal_errors(true); before everything to tell PHP not to throw errors but to cache them for you to iterate through as your code is doing.
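
A minimal sketch of that pattern, assuming you want to skip a broken document and keep the error messages for logging. The function name loadXmlOrNull is hypothetical.

```php
<?php
/**
 * Returns the parsed document, or null if it is not well-formed.
 * Parser messages are collected in $errors instead of being emitted as warnings.
 */
function loadXmlOrNull(string $xml, array &$errors = []): ?SimpleXMLElement
{
    // Remember the previous setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);

    $element = simplexml_load_string($xml);

    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $element === false ? null : $element;
}

$errors = [];
$xml = loadXmlOrNull('<items><item>unclosed</items>', $errors);
if ($xml === null) {
    // Skip this file; $errors explains why it was rejected.
}
```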

Illegal character in XML feed?

0x03 (aka ^C aka ETX aka end of text) is not an allowed character in XML:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Therefore your data is not XML, and any conformant XML processor must report an error such as the one you received.

You must repair the data by removing any illegal characters by treating it as text, not XML, manually or automatically before using it with any XML libraries.
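
Before repairing anything it can help to see where the illegal characters are. The sketch below treats the feed as plain UTF-8 text and reports the byte offset of each character that falls outside the Char production quoted above; the function name findIllegalXmlChars is an assumption for this example.

```php
<?php
/**
 * Lists characters that the XML 1.0 Char production does not allow.
 * Each entry holds the byte offset and the offending bytes in hex.
 */
function findIllegalXmlChars(string $text): array
{
    $count = preg_match_all(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        $text,
        $matches,
        PREG_OFFSET_CAPTURE
    );

    // preg_match_all() returns false when the text is not valid UTF-8.
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $found = [];
    foreach ($matches[0] as [$char, $offset]) {
        $found[] = ['offset' => $offset, 'bytes' => bin2hex($char)];
    }
    return $found;
}

$report = findIllegalXmlChars("<feed>ab\x03cd</feed>");
```

Here $report has a single entry pointing at the 0x03 byte, which can then be removed or replaced before the text is handed to an XML library.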

How do you make strings XML safe?

By either escaping those characters with htmlspecialchars, or, perhaps more appropriately, using a library for building XML documents, such as DOMDocument or XMLWriter.

Another alternative would be to use CDATA sections, but then you'd have to look out for occurrences of ]]>.

Also take into consideration that you must respect the encoding you define for the XML document (by default UTF-8).
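
The three options could be sketched like this. The sample value and the helper name cdataSection are made up; the ENT_XML1 flag tells htmlspecialchars to use XML rather than HTML escaping rules.

```php
<?php
$value = 'Tom & "Jerry" <cat>';

// Option 1: escape by hand. ENT_QUOTES also covers both quote characters,
// so the result is safe inside attribute values as well.
$escaped = htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
$manualXml = '<name>' . $escaped . '</name>';

// Option 2: let DOMDocument do the escaping when it serializes text nodes.
$doc = new DOMDocument('1.0', 'UTF-8');
$name = $doc->createElement('name');
$name->appendChild($doc->createTextNode($value));
$doc->appendChild($name);
$domXml = $doc->saveXML();

// Option 3: CDATA. A literal "]]>" would end the section early, so it is
// split across two adjacent sections.
function cdataSection(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$cdataXml = '<name>' . cdataSection('a ]]> b') . '</name>';
```

All three documents parse back to the original text; the library route has the advantage that you never assemble markup by string concatenation.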

How to handle invalid unicode with simplexml

I was running into this a lot with incoming user data, and I researched many methods to solve it. There are ways to properly encode the incoming data as UTF-8, without the higher-order (or other) unicode values that often cause these problems.

However, the problem with the sanitizing solutions is that they change the data, and if you just want to be a middle man, you still want the output to contain these values. The only non-destructive way I could come up with to get a SimpleXMLElement reliably not fail, is to do this admittedly double-work solution:

libxml_use_internal_errors(true);
$dom = new DOMDocument("1.0", "UTF-8");
$dom->strictErrorChecking = false;
$dom->validateOnParse = false;
$dom->recover = true;
$dom->loadXML($xmlData);
$xml = simplexml_import_dom($dom);

libxml_clear_errors();
libxml_use_internal_errors(false);

The trick is in looking at the properties of DOMDocument in PHP's documentation and noticing those extra variables that let you set parsing behavior. This method works without fail for me, on all the xml input that used to make SimpleXMLElement fail with character range issues.

My only guess on why it works is that SimpleXMLElement does the strict checking on initialization, but not when being initialized from an existing DOMDocument.

This method allows subsequent asXML() calls, without failure.

What are invalid characters in XML

The only characters that cannot appear unescaped are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed').

They're escaped using XML entities; in this case you want &amp; for &.

Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away for you so you don't have to worry about it.


