Why Doesn't PHP DOM Include Slash on Self Closing Tags

Why doesn't PHP DOM include slash on self closing tags?

DOMDocument->saveHTML() takes your XML DOM infoset and writes it out as old-school HTML, not XML. You should not use saveHTML() together with an XHTML doctype, as its output won't be well-formed XML.

If you use saveXML() instead, you'll get proper XHTML. It's fine to serve this XML output to standards-compliant browsers if you give it a Content-Type: application/xhtml+xml header. But unfortunately IE6-8 won't be able to read that, as they can still only handle old-school HTML, under the text/html media type.
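
To make the difference concrete, here is a minimal sketch that serialises the same node both ways. The markup is a made-up example, and exact whitespace in the HTML output can vary between libxml versions, so treat the comments as what you should typically see rather than a guarantee.

```php
<?php
// Hypothetical input: one void element (<br>) and one empty non-void element (<div>).
$dom = new DOMDocument();
$dom->loadXML('<html><body><br/><div></div></body></html>');

$body = $dom->getElementsByTagName('body')->item(0);

// HTML serialiser: void elements get no slash, empty <div> keeps its end tag.
$asHtml = $dom->saveHTML($body);

// XML serialiser: every empty element collapses to the self-closing form.
$asXml = $dom->saveXML($body);

echo $asHtml, "\n"; // typically contains <br> and <div></div>
echo $asXml, "\n";  // contains <br/> and <div/>
```

The <div/> in the second output is exactly the form that an HTML parser will misread when the page is served as text/html.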

The usual compromise solution is to serve text/html and use ‘HTML-compatible XHTML’ as outlined in Appendix C of the XHTML 1.0 spec. But sadly there is no PHP DOMDocument->saveXHTML() method to generate the correct output for this.

There are some things you can do to persuade saveXML() to produce HTML-compatible output for some common cases. The main one is that you have to ensure that only elements defined by HTML4 as having an EMPTY content model (<img>, <br> etc) actually do have empty content, causing the self-closing syntax (<img/>) to be used. Other elements must not use the self-closing syntax, so if they're empty you should put a space in their text content to stop them being so:

<script src="x.js"/>           <-- no good, confuses HTML parser and breaks page
<script src="x.js"> </script> <-- fine
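
One way to apply the space trick automatically, as a rough sketch: walk every empty element and give the non-void ones a single-space text node before calling saveXML(). The $voids list here is deliberately shortened and the input markup is invented for the example.

```php
<?php
// Hypothetical document with an empty <script> (non-void) and a <br> (void).
$dom = new DOMDocument();
$dom->loadXML('<html><head><script src="x.js"/></head><body><br/></body></html>');

// Shortened list for illustration; a real list would cover all HTML void elements.
$voids = ['area', 'base', 'br', 'col', 'hr', 'img', 'input', 'link', 'meta', 'param'];

// XPath: every element that has no child nodes at all.
$empty = (new DOMXPath($dom))->query('//*[not(node())]');

foreach ($empty as $el) {
    if (!in_array($el->nodeName, $voids, true)) {
        // A single space is enough to stop the XML serialiser from self-closing it.
        $el->appendChild($dom->createTextNode(' '));
    }
}

$out = $dom->saveXML($dom->documentElement);
echo $out, "\n";
// <html><head><script src="x.js"> </script></head><body><br/></body></html>
```

The space does end up in the output, which is harmless for <script> but may matter for elements where whitespace is significant, such as <textarea> or <pre>.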

The other one to look out for is handling of the inline <script> and <style> elements, which are normal elements in XHTML but special CDATA-content elements in HTML. Some /*<![CDATA[*/.../*]]>*/ wrapping is required to make any < or & characters inside them behave mostly-consistently, though note you still have to avoid the ]]> and </ sequences.
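
A sketch of how that wrapping could be produced with the DOM API, assuming the script body is JavaScript (so /* */ comments are valid) and does not itself contain ]]> or </. The $code string is a placeholder.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<html><head/><body/></html>');

// Hypothetical inline script containing characters that are special in XML.
$code = "\nif (a < b && c) { run(); }\n";

$script = $dom->createElement('script');
// Text "/*" + CDATA "*/ code /*" + text "*/" serialises to
// /*<![CDATA[*/ code /*]]>*/ which both an XML and an HTML parser can live with.
$script->appendChild($dom->createTextNode('/*'));
$script->appendChild($dom->createCDATASection('*/' . $code . '/*'));
$script->appendChild($dom->createTextNode('*/'));

$dom->getElementsByTagName('head')->item(0)->appendChild($script);

$out = $dom->saveXML($script);
echo $out, "\n";
```

An XML parser sees the CDATA markers and treats the contents literally; an HTML parser sees them inside JavaScript comments and ignores them. For <style> the same idea should work, since CSS uses the same comment syntax.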

If you want to really do it properly you would have to write your own HTML-compatible-XHTML serialiser. Long-term that would probably be a better option. But for small simple cases, hacking your input so that it doesn't contain anything that would come out the other end of an XML serialiser as incompatible with HTML is probably the quick solution.

That or just suck it up and live with old-school non-XML HTML, obviously.

illegal self closing node notation for empty nodes - outputting XHTML with PHP DOMDocument

function export_html(DOMDocument $dom)
{
    // Void elements: the only ones allowed to use the self-closing syntax.
    // Note that colgroup is not a void element in HTML, so it is not listed here.
    $voids = [
        'area', 'base', 'br', 'col', 'command', 'embed', 'hr', 'img',
        'input', 'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr',
    ];

    // Every empty node. There is no reason to match nodes with content inside.
    $query = '//*[not(node())]';
    $nodes = (new DOMXPath($dom))->query($query);

    foreach ($nodes as $n) {
        if (! in_array($n->nodeName, $voids, true)) {
            // If it is not a void/empty tag,
            // we need to leave the tag open.
            $n->appendChild(new DOMComment('NOT_VOID'));
        }
    }

    // Let's remove the placeholder.
    return str_replace('<!--NOT_VOID-->', '', $dom->saveXML());
}

In your example

$dom = new DOMDocument();
$dom->loadXML(<<<XML
<html>
<textarea id="something"></textarea>
<div id="someDiv" class="whaever"></div>
</html>
XML
);

Running echo export_html($dom); will produce:

<?xml version="1.0"?>
<html>
<textarea id="something"></textarea>
<div id="someDiv" class="whaever"></div>
</html>

DOMDocument escaping end chars in PHP

Please see the answer to this question: Why doesn't PHP DOM include slash on self closing tags?

In short, DOMDocument->saveHTMLFile() outputs its internal structure as regular old HTML instead of XHTML. If you absolutely need XHTML, you can use DOMDocument->save(), which writes the document as XML and will use self-closing tags. The only problem with this method is that some HTML tags, like <script> and <style>, cannot use self-closing tags, so you have to put a space in their content to stop the serialiser from self-closing them.

I would recommend just ignoring the issue unless it is mandatory that you fix it. The trailing slash is a relic of XHTML; in HTML5 it is optional on void elements and has no effect.

Is writing self closing tags for elements not traditionally empty bad practice?

I'm assuming your question has to do with the red trailing slash on self-closing elements when you view source in Firefox. If so, you've stumbled into one of the most vehement, yet simultaneously passive-aggressive, debates in the browser maker vs. web developer wars. XHTML is NOT just about a document's markup. It's also about how documents are meant to be served over the web.

Before I begin: I'm trying hard not to take sides here.

The XHTML 1.1 spec says that a web server should serve XHTML with a Content-Type of application/xhtml+xml. Firefox is singling out those trailing slashes as invalid because your document is being served as text/html rather than application/xhtml+xml. Take these two examples; identical markup, one served as application/xhtml+xml, the other as text/html.

http://alanstorm.com/testbed/xhtml-as-html.php

http://alanstorm.com/testbed/xhtml-as-xhtml.php

Firefox flags the trailing slash in the meta tag as invalid for the document served with text/html, and valid for the document served with application/xhtml+xml.

Why this is Controversial

To a browser developer, the point of XHTML is that you can treat your document as XML, which means that if someone sends you something that's not well-formed, the spec says you don't have to parse it. So, if a document is served as application/xhtml+xml and has non-well-formed content, the developer is allowed to say "not my problem". You can see that in action here:

http://alanstorm.com/testbed/xhtml-not-valid.php

When a document is served as text/html, Firefox treats it as a plain old HTML document and uses the forgiving, fix-it-for-you parsing routines:

http://alanstorm.com/testbed/xhtml-not-valid-as-html.php
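
You can reproduce the same strict-versus-forgiving split outside the browser. This is only an analogy, using PHP's libxml-based parsers rather than Firefox's, and the broken markup is a made-up example.

```php
<?php
// Collect parser errors quietly instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical markup that is fine for an HTML parser but not well-formed XML:
// <br> and <p> are never closed.
$markup = '<html><body><p>unclosed paragraph<br></body></html>';

$strict = new DOMDocument();
$xmlOk = $strict->loadXML($markup);    // XML parser: gives up on the first fatal error

$lenient = new DOMDocument();
$htmlOk = $lenient->loadHTML($markup); // HTML parser: repairs the tree and carries on

libxml_clear_errors();

var_dump($xmlOk);  // false
var_dump($htmlOk); // true
```

The XML parser refusing the document is the "not my problem" behaviour; the HTML parser quietly patching it up is what happens to XHTML served as text/html.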

So, to a browser maker, XHTML served as text/html is ludicrous, because it's never treated as XML by the browser's rendering engine.

A bunch of years ago, web developers looking to be more than tag monkeys (Disclaimer: I include myself as one of them) started looking for ways to develop best practices that didn't involve thrice-nested tables, but still allowed a compelling design experience. They/We latched onto XHTML/CSS, because the W3C said this was the future, and the only other choice was a world where a single vendor (Microsoft) controlled the de facto markup spec. The real evil there being the single vendor, and not so much Microsoft. I swear.

So where's the controversy? There are two problems with application/xhtml+xml. The first is Internet Explorer. There's a legacy bug/feature in IE where content served as application/xhtml+xml will prompt the user to download the document. If you tried to visit the xhtml-as-xhtml.php listed above with IE, that's likely what happened. This means that if you want to use application/xhtml+xml, you have to browser sniff for IE, check the Accept header and only serve application/xhtml+xml to those browsers that accept it. This is not as trivial as it sounds to get right, and it also goes against the "write once" principle that web developers were striving for.

The second problem is the harshness of XML. This is, again, one of those flame-prone issues, but there are some people who think a single bad tag, or a single improperly encoded character, shouldn't result in a user not seeing the document they want to see. In other words, yes, the spec says you should stop processing XML if it's not well-formed, but the user doesn't care about the spec, they care that their cat's website is broken.

Adding even more gasoline to the issue, the XHTML 1.0 (not 1.1) spec says that XHTML documents may be served as text/html, assuming certain compatibility guidelines are followed. Things like the img tag being self-closing and the like. The key word here is may. In RFC speak, may means optional. Firefox has chosen NOT to treat documents served with an XHTML doctype but a content type of text/html as XHTML. However, the W3C validator will happily report these documents as valid.

I'll leave the reader to ponder the simultaneous wonder/horror of a culture that writes a document to define what they mean by the word may.

Moving Forward

Finally, this is what the whole HTML 5 thing is about. XHTML became such a political hot potato that a bunch of people who wanted to move the language forward decided to go in another direction. They produced a spec for HTML 5. This is currently being hashed out in the W3C, and expected to finish sometime in the next decade. In the meantime, browser vendors are picking and choosing features from the in-progress spec and implementing them.

Updates from the Comments

In the comments, Alex points out that if you're going to sniff for something, you should check the Accept header to see if application/xhtml+xml is accepted by the user agent.

This is absolutely correct. In general, if you're going to sniff, sniff for the feature, not for the browser.
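
A minimal sketch of that kind of feature check in PHP. The function name pick_content_type is made up for this example, and the parsing is deliberately simple: it only looks for an explicit application/xhtml+xml entry and ignores wildcards and relative q-value ranking.

```php
<?php
/**
 * Hypothetical helper: choose a Content-Type from the request's Accept header.
 * Returns application/xhtml+xml only when the client lists it explicitly
 * with a non-zero q-value; otherwise falls back to text/html.
 */
function pick_content_type(string $accept): string
{
    foreach (explode(',', $accept) as $range) {
        $parts = array_map('trim', explode(';', $range));
        if (strtolower($parts[0]) !== 'application/xhtml+xml') {
            continue;
        }

        // Default quality is 1.0 unless a q= parameter says otherwise.
        $q = 1.0;
        foreach (array_slice($parts, 1) as $param) {
            if (stripos($param, 'q=') === 0) {
                $q = (float) substr($param, 2);
            }
        }

        if ($q > 0) {
            return 'application/xhtml+xml';
        }
    }

    return 'text/html';
}

echo pick_content_type('text/html,application/xhtml+xml,application/xml;q=0.9'), "\n";
echo pick_content_type('text/html, */*'), "\n";
```

In a real page you would pass $_SERVER['HTTP_ACCEPT'] ?? '' and send the result with header('Content-Type: ...') before any output. Old versions of IE do not list application/xhtml+xml, so they fall through to text/html without any user-agent sniffing.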

Why is a self-closing iframe tag preventing further DOM elements to be displayed?

Because the iframe element isn't a self-closing element. The versions of Firefox and Safari you're using are treating the /> at the end as just > and assuming everything after it is contained within the iframe.

If we attempt to pass the code you've given through W3C's validator we'll see the following errors:

Error: Self-closing syntax (/>) used on a non-void HTML element. Ignoring the slash and treating as a start tag.

<iframe src="http://www.bing.com"/>

Error: End of file seen when expecting text or an end tag.

</html>

Error: Unclosed element iframe.

<iframe src="http://www.bing.com"/>

If you inspect your document with your browser's Element Inspector, you'll see what's going on.

Chrome, which I'm using, converts the invalid <iframe ... /> to <iframe ...></iframe>:

[Screenshot: Chrome Example]

Closing HTML input tag issue

These are void elements. This means they aren't designed to contain text or other elements, and as such do not need (and in fact cannot have) a closing tag in HTML. [1]

However, they should have a <label> associated with them:

<input id="my_id" type="radio" name="radio_name">
<label for="my_id">Radio Label</label>

Radio buttons by nature can't contain text anyway, so it wouldn't make sense for them to accept text or other elements as content. Another issue with a control that does accept text as input: should its textual content then be its value, or its label? To avoid ambiguity we have a <label> element that does exactly what it says on the tin, and we have a value attribute for denoting an input control's value.


[1] XHTML is different; by XML rules, every tag must be opened and closed. This is done with the shortcut syntax instead of a </input> tag, although the latter is equally acceptable:

<input id="my_id" type="radio" name="radio_name" />
<label for="my_id">Radio Label</label>

