PHP, SimpleXML, decoding entities in CDATA

The purpose of CDATA sections in XML is to encapsulate a block of text "as is" which would otherwise require special characters (in particular, >, < and &) to be escaped. A CDATA section containing the character & is the same as a normal text node containing &amp;.

If a parser were to offer to ignore this, and pretend all CDATA nodes were really just text nodes, it would instantly break as soon as someone mentioned "P&O Cruises" - that & simply can't be there on its own (rather than as &amp;, or &somethingElse;).

The LIBXML_NOCDATA option is actually pretty useless with SimpleXML, because (string)$foo neatly combines any sequence of text and CDATA nodes into an ordinary PHP string. (People frequently fail to notice this, because print_r doesn't show it.) This isn't necessarily true of more systematic access methods, such as DOM, where you can manipulate text nodes and CDATA nodes as objects in their own right.

What the option effectively does is go through the document and, wherever it encounters a CDATA section, take the content, escape it, and put it back as an ordinary text node, or "merge" it with any text nodes to either side. The text represented is identical, just stored in the document in a different way; you can see the difference if you export back to XML, as in this example:

$xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";

$person = new SimpleXMLElement($xml_string);
echo 'CDATA retained: ', $person->asXML();
// CDATA retained: <?xml version="1.0"?>
// <person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>

$person = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);
echo 'CDATA merged: ', $person->asXML();
// CDATA merged: <?xml version="1.0"?>
// <person><name>Welcome aboard this P&amp;O Cruises voyage!</name></person>
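
To check the claim that the string cast hides the difference, here is a minimal sketch that reuses the same sample document; the variable names are only illustrative. It parses the XML with and without LIBXML_NOCDATA and compares the resulting strings:

```php
<?php
$xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";

// Parse once with CDATA nodes retained, once with them merged into text nodes.
$retained = new SimpleXMLElement($xml_string);
$merged   = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);

// Casting the element to string concatenates its text and CDATA children.
$nameRetained = (string) $retained->name;
$nameMerged   = (string) $merged->name;

echo $nameRetained, "\n";
var_dump($nameRetained === $nameMerged);
```

Both casts should give the same plain string, with the & unescaped, whichever way the document was parsed.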

If the XML document you're parsing contains a CDATA section which actually contains entities, you need to take that string and unescape it completely independently of the XML. One common reason to do this (other than laziness with poorly understood libraries) is to treat something marked up in HTML as just any old string inside an XML document, like this:

<Comment>
<SubmittedBy>IMSoP</SubmittedBy>
<Text><![CDATA[I'm <em>really</em> bad at keeping my answers brief <tt>;)</tt>]]></Text>
</Comment>
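
As a rough sketch of that separate unescaping step, assume a hypothetical comment whose CDATA section holds HTML-escaped text (the sample content is invented for illustration). The XML parser leaves it untouched, and html_entity_decode() is then applied to the extracted string:

```php
<?php
// Hypothetical sample: the CDATA section holds HTML-escaped text.
$xml = '<Comment><Text><![CDATA[Fish &amp; chips &lt;em&gt;daily&lt;/em&gt;]]></Text></Comment>';
$comment = new SimpleXMLElement($xml);

// The XML parser does not expand anything inside CDATA,
// so the entities are still literal text at this point.
$raw = (string) $comment->Text;

// Decode them as a separate step, independent of the XML layer.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $raw, "\n";
echo $decoded, "\n";
```

Whether html_entity_decode() is the right decoder depends on what the producer of the document actually put inside the CDATA section; it is used here on the assumption that the content is HTML-escaped.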

SimpleXML, CDATA and HTML entities

Actually, this seemed to be exactly what I needed to do:

How to keep DOMDocument from saving < as &lt;

Although poring over the manual for the DOM API has given me something new I'd like to learn for future use.

SimpleXML: handle CDATA tag presence in node value

As far as a parser like SimpleXML is concerned, the <![CDATA[ is not part of the text content of the XML element, it's just part of the serialization of that content. A similar confusion is discussed here: PHP, SimpleXML, decoding entities in CDATA

What you need to look at is the "inner XML" of that element, which is tricky in SimpleXML (->asXML() will give you the "outer XML", e.g. <Dest><![CDATA[some text...]]></Dest>).

Your best bet here is to use the DOM, which gives you access to the detailed structure of the document rather than trying to give you just the content, and so distinguishes "text nodes" from "CDATA nodes". However, it's worth double-checking that you actually need this: for 99.9% of use cases, you shouldn't care whether somebody sent you <foo>bar &amp; baz</foo> or <foo><![CDATA[bar & baz]]></foo>, since by definition they represent the same string.
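
If you do need to tell the two apart, a minimal sketch with the DOM extension could look like the following. The sample document and the $kinds array are made up for illustration:

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<foo>bar <![CDATA[& baz]]></foo>');

$kinds = [];
foreach ($doc->documentElement->childNodes as $node) {
    // DOMCdataSection extends DOMText, so test for CDATA first.
    if ($node instanceof DOMCdataSection) {
        $kinds[] = 'cdata: ' . $node->data;
    } elseif ($node instanceof DOMText) {
        $kinds[] = 'text: ' . $node->data;
    }
}

print_r($kinds);

// The combined text content is the same string either way.
echo $doc->documentElement->textContent, "\n";
```

The loop sees one text node followed by one CDATA node, while textContent still reports the single combined string.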

How to parse CDATA HTML-content of XML using SimpleXML?

I once answered this, but I can't find the answer any longer.

If you take a look at the string (simplified/beautified):

<content:encoded><![CDATA[
<p>Lorem Ipsom</p>
<p>
<a href='laura-bertram-trance-gemini-145-1080.jpg'
title='<br>November 2012 calendar from 5.10 The Test<br> <a href="</a>
</p>]]>
</content:encoded>

You can see that you have HTML encoded inside the node-value of the <content:encoded> element. So first you need to obtain the HTML value, which you already do:

$html = $boo->children('content', true)->encoded;

Then you need to parse the HTML inside $html. The libraries you can use for HTML parsing in PHP are outlined in:

  • How to parse and process HTML/XML with PHP?

If you decide to use the more or less recommended DOMDocument for the job, you only need to get the attribute value of a certain element:

  • PHP DOMDocument getting Attribute of Tag

Or, for its sister library SimpleXML, which you already use (so this is the more recommended route, see the next section as well):

  • How to get an attribute with SimpleXML?

In the context of your question, here is a tip:

You're using SimpleXML. DOMDocument is a sister-library, meaning you can interchange between the two so you don't need to learn a full new library.

For example, you can use only the HTML parsing feature of DOMDocument, and then import the result into SimpleXML. This is useful, because SimpleXML does not support HTML parsing.

That works via simplexml_import_dom().

A simplified step-by-step example:

// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;

// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();

// load the HTML:
$htmlParser->loadHTML($htmlString);

// import it into simplexml:
$html = simplexml_import_dom($htmlParser);

Now you can use $html as a new SimpleXMLElement that represents the HTML document. As your HTML chunks did not have any <body> tags, the parser puts them inside a <body> element, in line with the HTML specification. This allows you, for example, to access the href attribute of the first <a> inside the second <p> element in your example:

// access the element you're looking for:
$href = $html->body->p[1]->a['href'];

Here is the full example from above (Online Demo):

// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;

// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();

// your HTML gives parser warnings, keep them internal:
libxml_use_internal_errors(true);

// load the HTML:
$htmlParser->loadHTML($htmlString);

// import it into simplexml:
$html = simplexml_import_dom($htmlParser);

// access the element you're looking for:
$href = $html->body->p[1]->a['href'];

// output it
echo $href, "\n";

And what it outputs:

laura-bertram-trance-gemini-145-1080.jpg
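
Since $boo is not defined in the snippet, here is a self-contained sketch of the same steps. The feed string is a cut-down, hypothetical stand-in for the real feed, and the namespace URI for the content prefix is assumed to be the usual RSS content module one:

```php
<?php
// Hypothetical, cut-down feed; the namespace URI is an assumption.
$feedXml = <<<'XML'
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <content:encoded><![CDATA[<p>Lorem Ipsum</p><p><a href="laura-bertram-trance-gemini-145-1080.jpg">calendar</a></p>]]></content:encoded>
    </item>
  </channel>
</rss>
XML;

$feed = simplexml_load_string($feedXml);
$boo  = $feed->channel->item;

// Get the HTML string out of the feed.
$htmlString = (string) $boo->children('content', true)->encoded;

// Keep HTML parser warnings internal, then parse the fragment.
libxml_use_internal_errors(true);
$htmlParser = new DOMDocument();
$htmlParser->loadHTML($htmlString);

// Import the parsed HTML into SimpleXML and read the attribute.
$html = simplexml_import_dom($htmlParser);
$href = (string) $html->body->p[1]->a['href'];

echo $href, "\n";
```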

Get a special character in an XML file using simplexml and CDATA

There is an existing question on SO that covers exactly this: PHP, SimpleXML, decoding entities in CDATA

It explains that CDATA sections are used for text which would otherwise require

special characters (in particular, >, < and &) to be escaped. A CDATA
section containing the character & is the same as a normal text node
containing &amp;.

Have a look at it.

Edit

If you always load your example.php content into a string, one workaround is to escape the ampersands before parsing:

$xmlstr = str_replace("&", "&amp;", $xmlstr);
$xml = simplexml_load_string($xmlstr);
$abc = $xml->Book[0]->title[0];
echo $abc;
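
One caveat: a blanket replacement of every & also re-escapes entities that are already valid, turning &amp; into &amp;amp;. A slightly more careful sketch, still only a workaround for malformed input, escapes an ampersand only when it does not already start an entity. The sample string and the regular expression are illustrative, not part of the original answer, and the pattern would also rewrite a bare & inside a CDATA section:

```php
<?php
// Hypothetical input: one bare ampersand and one that is already escaped.
$xmlstr = '<Books><Book><title>Tom & Jerry &amp; friends</title></Book></Books>';

// Escape "&" only when it does not already start a named or numeric entity.
$fixed = preg_replace(
    '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
    '&amp;',
    $xmlstr
);

$xml   = simplexml_load_string($fixed);
$title = (string) $xml->Book[0]->title[0];

echo $title, "\n";
```

Fixing the producer of the XML so that it emits well-formed output remains the cleaner option where that is possible.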

PHP: How to handle <![CDATA[ with SimpleXMLElement?

You're probably not accessing it correctly. You can output it directly or cast it to a string. (In this example, the cast is superfluous, as echo does it automatically anyway.)

$content = simplexml_load_string(
'<content><![CDATA[Hello, world!]]></content>'
);
echo (string) $content;

// or with parent element:

$foo = simplexml_load_string(
'<foo><content><![CDATA[Hello, world!]]></content></foo>'
);
echo (string) $foo->content;

You might have better luck with LIBXML_NOCDATA:

$content = simplexml_load_string(
'<content><![CDATA[Hello, world!]]></content>'
, null
, LIBXML_NOCDATA
);
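
As far as I can tell, the practical difference the flag makes in SimpleXML is mostly in debug output. The sketch below (variable names are illustrative) compares the string cast and print_r() for both parsing modes:

```php
<?php
$source = '<foo><content><![CDATA[Hello, world!]]></content></foo>';

$plain   = simplexml_load_string($source);
$noCdata = simplexml_load_string($source, 'SimpleXMLElement', LIBXML_NOCDATA);

// The string cast returns the text either way.
$castPlain   = (string) $plain->content;
$castNoCdata = (string) $noCdata->content;

// print_r() only shows the text once the CDATA node
// has been merged into an ordinary text node.
$dumpPlain   = print_r($plain, true);
$dumpNoCdata = print_r($noCdata, true);

echo $dumpPlain, "\n", $dumpNoCdata, "\n";
```

So if print_r() shows an empty object, that alone does not mean the content is missing; casting to string is the more reliable check.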

Parsing xml feed with cdata PHP SimpleXML

I think you are just the victim of the browser hiding the tags. Let me explain:
Your input feed doesn't really have <![CDATA[ ]]> tags in it; the < and > characters are actually entity-encoded in the raw source of the RSS stream. Hit Ctrl+U on the RSS link in your browser and you will see:

<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
<channel>
<description>Blog do Garotinho</description>
<item>
<description>&lt;![CDATA[&lt;br&gt;
Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
</description>
<link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
...
<title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
</item>

As you can see, the <title> for example starts with a &lt;, which will turn into a < when SimpleXML returns it for your JSON data.
Now if you are looking at the printed JSON data in a browser, your browser will see the following:

"title":"<![CDATA[A bancada dos caras de pau]]>"

Which will not be rendered, because it looks like it is inside a tag. The description seems to show up because it has a <br> tag in it at some point, which ends the first "tag", and thus you can see the rest of the output.

If you hit Ctrl+U you should see the output printed as expected (I myself used a command-line PHP file and did not notice this at first).

Try this demo:

  • There seems to be an empty "" after the "title":

    http://codepad.viper-7.com/ZYpaS1
  • However, if I put htmlspecialchars() around the json_encode(), they become "visible":

    http://codepad.viper-7.com/1nHqym
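
The effect in the second demo can be reproduced with a small sketch. The $article array is a hypothetical stand-in for the data built from the feed:

```php
<?php
// Hypothetical value, as SimpleXML would return it from the entity-encoded feed.
$article = ['title' => '<![CDATA[A bancada dos caras de pau]]>'];

$json = json_encode($article);

// Escape the JSON for display inside an HTML page, so the browser
// does not treat "<![CDATA[" as the start of markup.
$visible = htmlspecialchars($json, ENT_QUOTES, 'UTF-8');

echo $visible, "\n";
```

This only matters when the JSON is viewed as part of an HTML page; the JSON string itself already contains the full title.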

You could try to get rid of these by stripping them after the parse with a simple preg_replace():

function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

This should take care of the CDATA blocks if they are at the start or the end of the individual tags. You can call this inside the foreach() loop like this:

// ....
$article['title'] = clean_cdata($item->title);
// ....
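
For a quick check outside the loop, here is a self-contained sketch. The function is repeated so the snippet runs on its own, and the one-element document is a hypothetical stand-in for an item of the entity-encoded feed described above:

```php
<?php
function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

// Hypothetical item whose CDATA markers are entity-encoded in the source.
$item = simplexml_load_string(
    '<item><title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title></item>'
);

// SimpleXML decodes the entities, so the markers become part of the text.
$rawTitle   = (string) $item->title;
$cleanTitle = clean_cdata($item->title);

echo $rawTitle, "\n";
echo $cleanTitle, "\n";
```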

Getting cdata content while parsing xml file

SimpleXML has a bit of a problem with CDATA, so use:

$xml = simplexml_load_file('xmlfile', 'SimpleXMLElement', LIBXML_NOCDATA);
$nodes = []; // stays empty if the file could not be loaded
if ($xml !== false) {
    $nodes = $xml->xpath('//xml/events');
}
print_r($nodes);

This will give you:

Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [date] => 01-10-2009
                    [color] => 0x99CC00
                    [selected] => true
                )

            [event] => SimpleXMLElement Object
                (
                    [title] => You can use HTML and CSS
                    [description] => This is the description
                )

        )

)

