Is HTMLentities() Sufficient for Creating Xml-Safe Values

Is htmlentities() sufficient for creating xml-safe values?

htmlentities() is not a guaranteed way to build legal XML.

Use htmlspecialchars() instead of htmlentities() if this is all you are worried about. If you have encoding mismatches between the representation of your data and the encoding of your XML document, htmlentities() may serve to work around/cover them up (it will bloat your XML size in doing so). I believe it's better to get your encodings consistent and just use htmlspecialchars().

Also, be aware that if you pump the return value of htmlspecialchars() inside XML attributes delimited with single quotes, you will need to pass the ENT_QUOTES flag as well so that any single quotes in your source string are properly encoded as well. I suggest doing this anyway, as it makes your code immune to bugs resulting from someone using single quotes for XML attributes in the future.

Edit: To clarify:

htmlentities() will convert a number of non-ANSI characters (I assume this is what you mean by UTF-8 data) to entities (which are represented with just ANSI characters). However, it cannot do so for any characters which do not have a corresponding entity, and so cannot guarantee that its return value consists only of ANSI characters. That's why I 'm suggesting to not use it.

If encoding is a possible issue, handle it explicitly (e.g. with iconv()).

Edit 2: Improved answer taking into account Josh Davis's comment belowis .

PHP not have a function for XML-safe entity decode? Not have some xml_entity_decode?

This question is creating, time-by-time, a "false answer" (see answers). This is perhaps because people not pay attention, and because there are NO ANSWER: there are a lack of PHP build-in solution.

... So, lets repeat my workaround (that is NOT an answer!) to not create more confusion:

The best workaround

Pay attention:

The function xml_entity_decode() below is the best (over any other) workaround.
The function below is not an answer to the present question, it is only a workwaround.

  function xml_entity_decode($s) {
  // illustrating how a (hypothetical) PHP-build-in-function MUST work
    static $XENTITIES = array('&','>','<');
    static $XSAFENTITIES = array('#_x_amp#;','#_x_gt#;','#_x_lt#;');
    $s = str_replace($XENTITIES,$XSAFENTITIES,$s); 
    $s = html_entity_decode($s, ENT_HTML5|ENT_NOQUOTES, 'UTF-8'); // PHP 5.3+
    $s = str_replace($XSAFENTITIES,$XENTITIES,$s);
    return $s;
 }

To test and to demonstrate that you have a better solution, please test first with this simple benckmark:

  $countBchMk_MAX=1000;
  $xml = file_get_contents('sample1.xml'); // BIG and complex XML string
  $start_time = microtime(TRUE);
  for($countBchMk=0; $countBchMk<$countBchMk_MAX; $countBchMk++){

    $A = xml_entity_decode($xml); // 0.0002

    /* 0.0014
     $doc = new DOMDocument;
     $doc->loadXML($xml, LIBXML_DTDLOAD | LIBXML_NOENT);
     $doc->encoding = 'UTF-8';
     $A = $doc->saveXML();
    */

  }
  $end_time = microtime(TRUE);
  echo "\n<h1>END $countBchMk_MAX BENCKMARKs WITH ",
     ($end_time  - $start_time)/$countBchMk_MAX, 
     " seconds</h1>";

How do you make strings XML safe?

By either escaping those characters with htmlspecialchars, or, perhaps more appropriately, using a library for building XML documents, such as DOMDocument or XMLWriter.

Another alternative would be to use CDATA sections, but then you'd have to look out for occurrences of ]]>.

Take also into consideration that that you must respect the encoding you define for the XML document (by default UTF-8).

php output xml produces parse error ’

You take it the wrong way - don't look for a parser which doesn't give you errors. Instead try to have a well-formed xml.

How did you get ’ from the user? If he literally typed it in, you are not processing the input correctly - for example you should escape & to &. If it is you who put the entity there (perhaps in place of some apostrophe), either define it in DTD (<!ENTITY rsquo "&x2019;">) or write it using a numeric notation (’), because almost every of the named entities are a part of HTML. XML defines only a few basic ones, as Gumbo pointed out.

EDIT based on additions to the question:

In #1, you escape the content in the way that if user types in ]]> <°)))><, you have a problem.
In #2, ~~you are doing the encoding and decoding which result in the original value of the $content.~~ the decoding should not be necessary (if you don't expect users to post values like & which should be interpreted like &).

If you use htmlspecialchars() with ENT_QUOTES, it should be ok, but see how Drupal does it.

Encoding special chars, DOMDocument XML and PHP

The second argument of DOMDocument::createElement() is broken - it only escapes partly and it is not part of the W3C DOM standard. In DOM the text content is a node. You can just create it and append it to the element node. This works with other node types like CDATA sections or comments as well. DOMNode::appendChild() returns the appended node, so you can nest and chain the calls.

Additionally you can set the DOMElement::$textContent property. This will replace all descendant nodes with a single text node. Do not use DOMElement::$nodeValue - it has the same problems as the argument.

$document = new DOMDocument();
$document->formatOutput = true;
$root = $document->appendChild($document->createElement('foo'));
$root
   ->appendChild($document->createElement('one'))
   ->appendChild($document->createTextNode('"foo" & <bar>'));
$root
   ->appendChild($document->createElement('one'))
   ->textContent = '"foo" & <bar>';
$root
   ->appendChild($document->createElement('two'))
   ->appendChild($document->createCDATASection('"foo" & <bar>'));
$root
   ->appendChild($document->createElement('three'))
   ->appendChild($document->createComment('"foo" & <bar>'));

echo $document->saveXML();

Output:

<?xml version="1.0"?>
<foo>
  <one>"foo" & <bar></one>
  <one>"foo" & <bar></one>
  <two><![CDATA["foo" & <bar>]]></two>
  <three>
    <!--"foo" & <bar>-->
  </three>
</foo>

This will escape special characters (like & and <) as needed. Quotes do need to be escaped so they won't. Other special characters depend on the encoding.

$document = new DOMDocument("1.0", "UTF-8");
$document
   ->appendChild($document->createElement('foo'))
   ->appendChild($document->createTextNode('äöü'));
echo $document->saveXML();

$document = new DOMDocument("1.0", "ASCII");
$document
   ->appendChild($document->createElement('foo'))
   ->appendChild($document->createTextNode('äöü'));
echo $document->saveXML();

Output:

<?xml version="1.0" encoding="UTF-8"?> 
<foo>äöü</foo> 
<?xml version="1.0" encoding="ASCII"?> 
<foo>äöü</foo>

Is folder/.htmlentities($_GET[location])..filetype safe?

NO. As far as I can remember, htmlentities() doesn't escape the null-byte \0, the dot ., the slash / and the backslash \.

The null-byte truncates your path/file name. And the dot makes for relative path traversing (. = current directory, .. = parent directory, \ (Windows) and / = root directory).

Now, consider your inserted variable to be ../config.php\0 and be afraid if you were planning even serving that 'image'.

What do the ENT_HTML5, ENT_HTML401, ... modifiers on html_entity_decode do?

I started wondering what behavior these constants have when I saw these constants at the htmlspecialchars page. The documentation was rubbish, so I started digging in the source code of PHP.

Basically, these constants affect whether certain entities are encoded or not (or decoded for html_entity_decode). The most obvious effect is whether the apostrophe (') is encoded to ' (for ENT_HTML401) or ' (for others). Similarly, it determines whether ' is decoded or not when using html_entity_decode. (' is always decoded).

All usages can be found in ext/standard/html.c and its header file. From ext/standard/html.h:

#define ENT_HTML_DOC_HTML401            0
#define ENT_HTML_DOC_XML1                       16
#define ENT_HTML_DOC_XHTML                      32
#define ENT_HTML_DOC_HTML5                      (16|32)

(replace ENT_HTML_DOC_ by ENT_ to get their PHP constant names)

I started looking for all occurrences of these constants, and can share the following on the behaviour of the ENT_* constants:

It affects which numeric entities will be decoded or not. For example, ﷐ gets decoded to an unreadable/invalid character for ENT_HTML401, and ENT_XHTML and ENT_XML1. For ENT_HTML5 however, this is considered an invalid character and hence it stays ﷐. (C function unicode_cp_is_allowed)
With ENT_SUBSTITUTE enabled, invalid code unit sequences for a specified character set are replaced with �. (does not depend on document type!)
With ENT_DISALLOWED enabled, code points that are disallowed for the specified document type are replaced with �. (does not depend on charset!)
With ENT_IGNORE, the same invalid code unit sequences from ENT_SUBSTITUTE are removed and no replacement is done (depends on choice of "document type", e.g. ENT_HTML5)
Disallow for ENT_HTML5 (line 976)
ENT_XHTML shares the entity map with ENT_HTML401. The only difference is that ' will be converted to an apostrophe with ENT_XHTML while ENT_HTML401 does not convert it (see this line)
ENT_HTML401 and ENT_XHTML use exactly the same entity map (minus the difference from the previous point). ENT_HTML5 uses its own map. Others (currently ENT_XML1) have a very limited decoding map (>, &, <, ', " and their numeric equivalents). (see C function unescape_inverse_map)
Note for the previous point: when only a few entities must be escaped (think of htmlspecialchars), all entities map will use the same one as ENT_XML1, except for ENT_HTML401. That one will not use ', but '.

That covers almost everything. I am not going to list all entity differences, instead I would like to point at https://github.com/php/php-src/tree/php-5.4.11/ext/standard/html_tables for some text files that contain the mappings for each type.

What ENT_* should I use for htmlspecialchars?

When using htmlspecialchars with ENT_COMPAT (default) or ENT_NOQUOTES, it does not matter which one you pick (see below). I saw some answers here on SO that boils down to this:

<input value="<?php echo htmlspecialchars($str, ENT_HTML5);?>" >

This is insecure. It will override the default value ENT_HTML401 | ENT_COMPAT which has as difference that HTML5 entities are used, but also that quotes are not escaped anymore! In addition, this is redundant code. The entities that have to be encoded by htmlspecialchars are the same for all ENT_HTML401, ENT_HTML5, etc.

Just use ENT_COMPAT or ENT_QUOTES instead. The latter also works when you use apostrophes for attributes (value='foo'). If you only have two arguments for htmlspecialchars, do not include the argument at all since it is the default (ENT_HTML401 is 0, remember?).

When you want to print something on the page (between tags, not attributes), it does not matter at all which one you pick as it will have equal effect. It is even sufficient to use ENT_NOQUOTES | ENT_HTML401 which equals to the numeric value 0.

See also below, about ENT_SUBTITUTE and ENT_DISALLOWED.

What ENT_* should I use for htmlentities?

If your text editor or database is so crappy that you cannot include non-US-ASCII characters (e.g. UTF-8), you can use htmlentities. Otherwise, save some bytes and use htmlspecialchars instead (see above).

Whether you need to use ENT_HTML401, ENT_HTML5 or something else depends on how your page is served. When you have a HTML5 page (<!doctype html>), use ENT_HTML5. XHTML or XML? Use the corresponding ENT_XHTML or ENT_XML1. With no doctype or plain ol' HTML4, use ENT_HTML401 (which is the default when omitted).

Should I use ENT_DISALLOWED, ENT_IGNORE or ENT_SUBSTITUTE?

By default, byte sequences that are invalid for the given character set are removed. To have a � in place of an invalid byte sequence, specify ENT_SUBSTITUTE. (note that &#FFFD; is shown for non-UTF-8 charsets). When you specify ENT_IGNORE though, these characters are not shown even if you specified ENT_SUBSTITUTE.

Invalid characters for a document type are substituted by the same replacement character (or its entity) above when ENT_DISALLOWED is specified. This happens regardless of having ENT_IGNORE set (which has nothing to do with invalid chars for doctypes).

When used correctly, is htmlspecialchars sufficient for protection against all XSS?

htmlspecialchars() is enough to prevent document-creation-time HTML injection with the limitations you state (ie no injection into tag content/unquoted attribute).

However there are other kinds of injection that can lead to XSS and:

There are no <script> tags in the document.

this condition doesn't cover all cases of JS injection. You might for example have an event handler attribute (requires JS-escaping inside HTML-escaping):

<div onmouseover="alert('<?php echo htmlspecialchars($xss) ?>')"> // bad!

or, even worse, a javascript: link (requires JS-escaping inside URL-escaping inside HTML-escaping):

<a href="javascript:alert('<?php echo htmlspecialchars($xss) ?>')"> // bad!

It is usually best to avoid these constructs anyway, but especially when templating. Writing <?php echo htmlspecialchars(urlencode(json_encode($something))) ?> is quite tedious.

And... injection issues can happen on the client-side as well (DOM XSS); htmlspecialchars() won't protect you against a piece of JavaScript writing to innerHTML (commonly .html() in poor jQuery scripts) without explicit escaping.

And... XSS has a wider range of causes than just injections. Other common causes are:

allowing the user to create links, without checking for known-good URL schemes (javascript: is the most well-known harmful scheme but there are more)
deliberately allowing the user to create markup, either directly or through light-markup schemes (like bbcode which is invariably exploitable)
allowing the user to upload files (which can through various means be reinterpreted as HTML or XML)

retrieving an xml file from mysql and using it in as a variable with php

I was able to get the result I wanted using html_entity_decode. Thank you

XML Entity not defined in Chrome

Apparently using htmlspecialchars and htmlentities in tandem does the trick.

htmlspecialchars(htmlentities($value));

Is HTMLentities() Sufficient for Creating Xml-Safe Values