Generating an XML Document in PHP (Escape Characters)

Use the DOM classes to generate the whole XML document. They handle character encoding and escaping for you, so you don't have to deal with those details yourself.
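As a rough illustration of that advice, here is a minimal sketch that builds a small document with DOMDocument. The element names and the text are made up for the example; the only point is that raw text handed to createTextNode() is escaped when the document is serialized.

```php
<?php
// Element names and text are invented for this example.
$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('note'));

// createTextNode() takes the raw string; escaping happens on serialization.
$body = $root->appendChild($doc->createElement('body'));
$body->appendChild($doc->createTextNode('Fish & chips: 1 < 2'));

$xml = $doc->saveXML();
echo $xml;
// The body element is serialized as: <body>Fish &amp; chips: 1 &lt; 2</body>
```

Reading the element back with any XML parser should give the original, unescaped text again.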


Edit: This was criticized by @Tchalvak:

The DOM object creates a full XML document; it doesn't easily lend itself to just encoding a string on its own.

That is wrong: DOMDocument can output just a fragment rather than the whole document:

$doc->saveXML($fragment);

which gives:

Test &amp; <b> and encode </b> :)
Test &amp;amp; &lt;b&gt; and encode &lt;/b&gt; :)

as in:

$doc = new DOMDocument();
$fragment = $doc->createDocumentFragment();

// adding XML verbatim:
$xml = "Test &amp; <b> and encode </b> :)\n";
$fragment->appendXML($xml);

// adding text:
$text = $xml;
$fragment->appendChild($doc->createTextNode($text));

// output the result
echo $doc->saveXML($fragment);

Re-escape characters in an XML file

Re-escaping is necessary because, once the XML document has been parsed, the contents of that field would otherwise still contain literal < and > (and likely other) metacharacters.

// the literal string you want to encode.
$string1 = "Now, heres litteral \"<\" and \">\" characters.... <><><<<>>";

// oops but I want to make sure I don't accidentally pass in HTML to RSS readers that might
// accidentally try to render it.
$string2 = htmlentities($string1);

// oh also I am writing XML directly instead of using a proper library to generate the document.
// I know that this is a really bad idea, but I'm sure I have my reasons.
// anywho, I should escape this text to be kludged directly into an XML doc.
$string3 = htmlentities($string2, ENT_XML1);

var_dump($string1, $string2, $string3);

Output:

string(56) "Now, heres litteral "<" and ">" characters.... <><><<<>>"
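string(109) "Now, heres litteral &quot;&lt;&quot; and &quot;&gt;&quot; characters.... &lt;&gt;&lt;&gt;&lt;&lt;&lt;&gt;&gt;"
string(169) "Now, heres litteral &amp;quot;&amp;lt;&amp;quot; and &amp;quot;&amp;gt;&amp;quot; characters.... &amp;lt;&amp;gt;&amp;lt;&amp;gt;&amp;lt;&amp;lt;&amp;lt;&amp;gt;&amp;gt;"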

$string2 would be sufficiently encoded if you were feeding the data into something like an XMLDocument, DOMDocument, or similar object, but since it looks like you're doing things the hard way, you're going to have to go all the way to $string3.

Special Character in XML using PHP

There is no need to encode these characters. XML documents can use UTF-8 or another encoding, and the serializer will encode characters as the chosen encoding requires.

$foo = new SimpleXmlElement('<?xml version="1.0" encoding="UTF-8"?><foo/>');
$foo->addChild('bar', 'μmol/l, x10³ cells/µl');
echo $foo->asXml();

Output (special characters not encoded):

<?xml version="1.0" encoding="UTF-8"?>
<foo><bar>μmol/l, x10³ cells/µl</bar></foo>

To force entities for the special characters, you need to change the encoding:

$foo = new SimpleXmlElement('<?xml version="1.0" encoding="ASCII"?><foo/>');
$foo->addChild('bar', 'μmol/l, x10³ cells/µl');
echo $foo->asXml();

Output (special characters encoded):

<?xml version="1.0" encoding="ASCII"?>
<foo><bar>&#956;mol/l, x10&#179; cells/&#181;l</bar></foo>

I suggest you convert your custom encoding back to UTF-8 so that the XML API can take care of it. If you want to store the string in the custom encoding, you need to work around a bug.

A string like &#120&#49&#48&#60&#115&#117 triggers a bug in SimpleXML/DOM: the second argument of SimpleXMLElement::addChild() and DOMDocument::createElement() is not escaped correctly. You need to create the content as a text node and append it.

Here is a small class that extends SimpleXMLElement and adds a workaround:

class MySimpleXMLElement extends SimpleXMLElement {

    // The third parameter keeps the signature compatible with the parent method.
    public function addChild($nodeName, $content = NULL, $namespace = NULL) {
        // Create the child without content, then append the content as a
        // text node so that the serializer escapes it.
        $child = parent::addChild($nodeName, NULL, $namespace);
        if (isset($content)) {
            $node = dom_import_simplexml($child);
            $node->appendChild($node->ownerDocument->createTextNode($content));
        }
        return $child;
    }
}

$foo = new MySimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><foo/>');
$foo->addChild('bar', '&#120&#49&#48&#60&#115&#117');
echo $foo->asXml();

Output:

<?xml version="1.0" encoding="UTF-8"?>
<foo><bar>&amp;#120&amp;#49&amp;#48&amp;#60&amp;#115&amp;#117</bar></foo>

The & from your custom encoding gets escaped as the entity &amp; because it is a special character in XML. The XML parser will decode it.

$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<foo><bar>&amp;#120&amp;#49&amp;#48&amp;#60&amp;#115&amp;#117</bar></foo>
XML;

$foo = new SimpleXMLElement($xml);
var_dump((string)$foo->bar);

Output:

string(27) "&#120&#49&#48&#60&#115&#117"

How do you make strings XML safe?

By either escaping those characters with htmlspecialchars, or, perhaps more appropriately, using a library for building XML documents, such as DOMDocument or XMLWriter.
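For the library route, a minimal XMLWriter sketch might look like the following. The element name, attribute name and values are invented for illustration; the raw strings go in unescaped and the writer takes care of escaping them.

```php
<?php
$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('item');
$writer->writeAttribute('title', 'Tom & "Jerry"'); // raw attribute value
$writer->text('1 < 2 & 3 > 2');                    // raw element text
$writer->endElement();
$writer->endDocument();
$xml = $writer->outputMemory();
echo $xml;

// Parse the result back to confirm the raw values survive the round trip.
$item = new SimpleXMLElement($xml);
var_dump((string) $item === '1 < 2 & 3 > 2');          // bool(true)
var_dump((string) $item['title'] === 'Tom & "Jerry"'); // bool(true)
```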

Another alternative would be to use CDATA sections, but then you'd have to look out for occurrences of ]]>.
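One common way to deal with ]]> inside CDATA is to end the section in the middle of the sequence and immediately open a new one. The helper below is a hypothetical sketch of that idea (cdataWrap is not a built-in function), followed by a round trip through a parser to check that the text comes back unchanged.

```php
<?php
// Hypothetical helper: wraps text in CDATA and splits any "]]>" across two
// sections so that the first section does not end early.
function cdataWrap(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

$raw = 'if (a[b[0]]>c) { x = "<&>"; }'; // contains "]]>" on purpose
$xml = '<code>' . cdataWrap($raw) . '</code>';
echo $xml, "\n";

// Parse the markup and compare the text content with the original string.
$doc = new DOMDocument();
$doc->loadXML($xml);
var_dump($doc->documentElement->textContent === $raw); // bool(true)
```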

Also bear in mind that you must respect the encoding you declare for the XML document (UTF-8 by default).
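As a small sketch of what respecting the encoding can mean in practice: if the input arrives in another encoding, convert it to UTF-8 before handing it to the XML library. The example assumes the input bytes are ISO-8859-1 and uses iconv() for the conversion; the element name is made up.

```php
<?php
// Assumption: the incoming bytes are ISO-8859-1; "\xE9" is "é" in that encoding.
$latin1 = "caf\xE9";
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// The DOM methods expect UTF-8 strings.
$doc = new DOMDocument('1.0', 'UTF-8');
$name = $doc->appendChild($doc->createElement('name'));
$name->appendChild($doc->createTextNode($utf8));

$xml = $doc->saveXML();
echo $xml;
```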

How to construct XML in PHP with special characters?

Try escaping the input string with htmlspecialchars() before adding it to the XML document; that should give you the encoding you require.
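A minimal sketch of the htmlspecialchars() approach, assuming the markup is being assembled by hand. The ENT_XML1 and ENT_QUOTES flags are my choice here so that both quote characters are escaped using XML entity names; the sample string and element name are made up.

```php
<?php
$input = 'Tom & Jerry\'s <"quoted"> title';

// ENT_XML1 selects XML 1.0 rules; ENT_QUOTES also escapes both quote characters.
$escaped = htmlspecialchars($input, ENT_XML1 | ENT_QUOTES, 'UTF-8');
echo $escaped, "\n";
// Tom &amp; Jerry&apos;s &lt;&quot;quoted&quot;&gt; title

// The escaped string can then be placed into hand-built markup.
$xml = '<title>' . $escaped . '</title>';
$parsed = new SimpleXMLElement($xml);
var_dump((string) $parsed === $input); // bool(true)
```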

However, if you are not constrained in how you create the XML document (i.e. you can do it in some way other than pure string manipulation), I would follow prostynick's solution, as it is much more robust in the long term.

What characters do I need to escape in XML documents?

If you use an appropriate class or library, they will do the escaping for you. Many XML issues are caused by string concatenation.
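To make the string concatenation problem concrete, here is a small sketch. The $name value stands in for arbitrary user input; concatenating it into markup yields something the parser rejects, while building the same element through DOM yields well-formed XML.

```php
<?php
libxml_use_internal_errors(true); // keep parse errors out of the output

$name = 'AT&T <global>'; // hypothetical user-supplied value

// Naive concatenation: the raw & and < make the result malformed.
$concatenated = '<company>' . $name . '</company>';
var_dump(simplexml_load_string($concatenated)); // bool(false)

// Letting DOM build the element escapes the value on output.
$doc = new DOMDocument();
$company = $doc->appendChild($doc->createElement('company'));
$company->appendChild($doc->createTextNode($name));
$built = $doc->saveXML($company);
var_dump((string) simplexml_load_string($built) === $name); // bool(true)
```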

XML escape characters

There are only five:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
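The table above can be written down as a lookup map. The sketch below is only an illustration (the $map variable is my own naming, not part of any library); it escapes a string with strtr() and lets an XML parser decode it again.

```php
<?php
// The five predefined XML entities as a lookup table.
$map = [
    '&'  => '&amp;',
    '<'  => '&lt;',
    '>'  => '&gt;',
    '"'  => '&quot;',
    "'"  => '&apos;',
];

$raw = '"\'<>&';
// strtr() does not rescan replaced text, so "&" is not escaped twice.
$escaped = strtr($raw, $map);
echo $escaped, "\n"; // &quot;&apos;&lt;&gt;&amp;

// An XML parser decodes all five without any DTD.
$el = new SimpleXMLElement('<a>' . $escaped . '</a>');
var_dump((string) $el === $raw); // bool(true)
```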

Escaping characters depends on where the special character is used.

The examples can be validated at the W3C Markup Validation Service.

Text

The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:

<?xml version="1.0"?>
<valid>"'></valid>

Attributes

The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:

<?xml version="1.0"?>
<valid attribute=">"/>

The ' character needn't be escaped in attributes if the quotes are ":

<?xml version="1.0"?>
<valid attribute="'"/>

Likewise, the " needn't be escaped in attributes if the quotes are ':

<?xml version="1.0"?>
<valid attribute='"'/>
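If the attribute is written through a library instead of by hand, you don't have to decide which quote style to use. A minimal sketch with DOM, using a made-up value that contains both kinds of quotes:

```php
<?php
$doc = new DOMDocument();
$el = $doc->appendChild($doc->createElement('valid'));

// Raw value with double quotes, an apostrophe, & and angle brackets.
$value = 'say "hi" & it\'s <ok>';
$el->setAttribute('attribute', $value);

$xml = $doc->saveXML($el);
echo $xml, "\n";

// Parse the element back and compare the attribute with the raw value.
$check = new DOMDocument();
$check->loadXML($xml);
var_dump($check->documentElement->getAttribute('attribute') === $value); // bool(true)
```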

Comments

None of the five special characters should be escaped in comments:

<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>

CDATA

None of the five special characters should be escaped in CDATA sections:

<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>

Processing instructions

None of the five special characters should be escaped in XML processing instructions:

<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>
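The same behaviour can be seen when these node types are created through DOM: the content of comments, CDATA sections and processing instructions is written out as-is. A short sketch, reusing the names from the examples above (here the processing instruction is placed inside the root element for brevity):

```php
<?php
$doc = new DOMDocument();
$root = $doc->appendChild($doc->createElement('valid'));

// None of these three node types has its content escaped on output.
$root->appendChild($doc->createComment(' "\'<>& '));
$root->appendChild($doc->createCDATASection('"\'<>&'));
$root->appendChild($doc->createProcessingInstruction('process', '<"\'&>'));

$xml = $doc->saveXML($root);
echo $xml, "\n";
```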

XML vs. HTML

HTML has its own set of escape codes which cover a lot more characters.
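This difference matters when picking an escaping function in PHP. As a small sketch (the sample text is made up): htmlentities() produces HTML named entities such as &eacute;, which an XML parser does not know without a DTD, while htmlspecialchars() with ENT_XML1 only touches the XML special characters.

```php
<?php
libxml_use_internal_errors(true); // collect parse errors instead of printing warnings

$text = 'café & crème';

// HTML escaping: accented letters become named entities such as &eacute;.
$htmlEscaped = htmlentities($text, ENT_QUOTES, 'UTF-8');
// XML escaping: only the XML special characters are replaced.
$xmlEscaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

$doc = new DOMDocument();
var_dump($doc->loadXML('<p>' . $htmlEscaped . '</p>')); // bool(false)
var_dump($doc->loadXML('<p>' . $xmlEscaped . '</p>'));  // bool(true)
```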


