PHP: How to handle <![CDATA[ with SimpleXMLElement?

You're probably not accessing it correctly. You can output it directly or cast it to a string. (In this example the cast is superfluous, as echo does it automatically anyway.)

$content = simplexml_load_string(
'<content><![CDATA[Hello, world!]]></content>'
);
echo (string) $content;

// or with parent element:

$foo = simplexml_load_string(
'<foo><content><![CDATA[Hello, world!]]></content></foo>'
);
echo (string) $foo->content;

You might have better luck with LIBXML_NOCDATA:

$content = simplexml_load_string(
    '<content><![CDATA[Hello, world!]]></content>',
    'SimpleXMLElement',
    LIBXML_NOCDATA
);

How to write CDATA using SimpleXmlElement?

Got it! I adapted the code from this great solution (archived version):

<?php

// http://coffeerings.posterous.com/php-simplexml-and-cdata
class SimpleXMLExtended extends SimpleXMLElement {

    public function addCData( $cdata_text ) {
        $node = dom_import_simplexml( $this );
        $no = $node->ownerDocument;

        $node->appendChild( $no->createCDATASection( $cdata_text ) );
    }

}

$xmlFile = 'config.xml';

// instead of $xml = new SimpleXMLElement( '<site/>' );
$xml = new SimpleXMLExtended( '<site/>' );

$xml->title = NULL; // VERY IMPORTANT! We need a node where to append

$xml->title->addCData( 'Site Title' );
$xml->title->addAttribute( 'lang', 'en' );

$xml->saveXML( $xmlFile );

?>

XML file generated:

<?xml version="1.0"?>
<site>
<title lang="en"><![CDATA[Site Title]]></title>
</site>

Thank you Petah

Reading text in `<![CDATA[...]]>` with SimpleXMLElement

SimpleXML reads CDATA nodes absolutely fine. The only problem you're having is that print_r, var_dump, and similar functions don't give an accurate representation of SimpleXML objects, because they are not implemented fully in PHP.

If you run echo $myNode->description you will see the content of the CDATA section just fine. The reason is that when you ask for a SimpleXMLElement to be converted to a string, it automatically combines all the text and CDATA content for you - but until you do, it remembers the distinction.

As a general case, to extract the string content of any element or attribute in SimpleXML, cast to string with (string)$myNode. This also prevents other issues, such as functions complaining about getting an object when they were expecting a string, or failure to serialize when saving to a session.
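As a small illustration of that advice, here is a minimal sketch. The sample document and its element names are made up for the example; only the cast itself comes from the answer above.

```php
<?php
// Hypothetical sample document; the element names are assumptions for illustration.
$myNode = simplexml_load_string(
    '<item><description><![CDATA[Hello <b>world</b>]]></description></item>'
);

// Casting yields a plain PHP string that contains the CDATA content.
$description = (string) $myNode->description;

var_dump(is_string($description));                           // bool(true)
var_dump($description);                                      // string(18) "Hello <b>world</b>"

// Without the cast you are still holding an object, not a string.
var_dump($myNode->description instanceof SimpleXMLElement);  // bool(true)
```

Storing $description rather than $myNode->description is what keeps things like session serialization and strict string comparisons working.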

See also my previous answer at https://stackoverflow.com/a/13830559/157957

simpleXML get value from CDATA

In your simplexml_load_file() call, you need to pass the LIBXML_NOCDATA flag as the third parameter:

$url = "http://www.ss.lv/lv/real-estate/flats/riga/hand_over/rss/";
$result = simplexml_load_file($url, 'SimpleXMLElement', LIBXML_NOCDATA);
//                                                      ^^ here
foreach ($result->channel->item as $item) {
    $title = (string) $item->title;
    $desc = (string) $item->description;
    // The constructor takes a version and an encoding, not markup, so pass nothing.
    $dom = new DOMDocument();
    $dom->loadHTML($desc);
    $bold_tags = $dom->getElementsByTagName('b');
    foreach ($bold_tags as $b) {
        echo $b->nodeValue . '<br/>';
    }
}

PHP, SimpleXML, decoding entities in CDATA

The purpose of CDATA sections in XML is to encapsulate a block of text "as is" which would otherwise require special characters (in particular, >, < and &) to be escaped. A CDATA section containing the character & is the same as a normal text node containing &amp;.

If a parser were to offer to ignore this, and pretend all CDATA nodes were really just text nodes, it would instantly break as soon as someone mentioned "P&O Cruises" - that & simply can't be there on its own (rather than as &amp;, or &somethingElse;).
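To make the "P&O Cruises" point concrete, here is a rough sketch comparing the three spellings. The <name> wrapper element is just an assumption for the example.

```php
<?php
// Collect parser errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

// A bare & in a text node is not well-formed XML, so parsing fails.
$bare = simplexml_load_string('<name>P&O Cruises</name>');
var_dump($bare); // bool(false)

// Escaped as &amp; in a text node, or left as-is inside CDATA, it parses fine.
$escaped = simplexml_load_string('<name>P&amp;O Cruises</name>');
$cdata   = simplexml_load_string('<name><![CDATA[P&O Cruises]]></name>');

// Both forms represent exactly the same string.
var_dump((string) $escaped === (string) $cdata); // bool(true)
echo (string) $cdata, "\n";                      // P&O Cruises
```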

The LIBXML_NOCDATA flag is actually pretty useless with SimpleXML, because (string)$foo neatly combines any sequence of text and CDATA nodes into an ordinary PHP string. (Something which people frequently fail to notice, because print_r doesn't.) This isn't necessarily true of more systematic access methods, such as DOM, where you can manipulate text nodes and CDATA nodes as objects in their own right.

What it effectively does is go through the document, and wherever it encounters a CDATA section, it takes the content, escapes it, and puts it back as an ordinary text node, or "merges" it with any text nodes to either side. The text represented is identical, just stored in the document in a different way; you can see the difference if you export back to XML, as in this example:

$xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";

$person = new SimpleXMLElement($xml_string);
echo 'CDATA retained: ', $person->asXML();
/* Output (a block comment is used because "?>" would end a // comment):
CDATA retained: <?xml version="1.0"?>
<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>
*/

$person = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);
echo 'CDATA merged: ', $person->asXML();
/* Output:
CDATA merged: <?xml version="1.0"?>
<person><name>Welcome aboard this P&amp;O Cruises voyage!</name></person>
*/

If the XML document you're parsing contains a CDATA section which actually contains entities, you need to take that string and unescape it completely independently of the XML. One common reason to do this (other than laziness with poorly understood libraries) is to treat something marked up in HTML as just any old string inside an XML document, like this:

<Comment>
<SubmittedBy>IMSoP</SubmittedBy>
<Text><![CDATA[I'm <em>really</em> bad at keeping my answers brief <tt>;)</tt>]]></Text>
</Comment>
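A minimal sketch of that separate unescaping step, assuming the CDATA content is HTML that carries its own entities. The comment text below is made up, and html_entity_decode() is just one reasonable way to do the decoding; a full HTML parser may suit better if you need the markup itself.

```php
<?php
// Hypothetical document: the CDATA section contains an HTML entity (&amp;).
$xml = <<<'XML'
<Comment>
  <SubmittedBy>IMSoP</SubmittedBy>
  <Text><![CDATA[Fish &amp; chips, <em>really</em>]]></Text>
</Comment>
XML;

$comment = new SimpleXMLElement($xml);

// The XML parser leaves CDATA content untouched, so the entity is still literal text.
$raw = (string) $comment->Text;
var_dump($raw);     // string(33) "Fish &amp; chips, <em>really</em>"

// Decode HTML entities as a separate step, independent of the XML layer.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded); // string(29) "Fish & chips, <em>really</em>"
```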

SimpleXML: handle CDATA tag presence in node value

As far as a parser like SimpleXML is concerned, the <![CDATA[ is not part of the text content of the XML element, it's just part of the serialization of that content. A similar confusion is discussed here: PHP, SimpleXML, decoding entities in CDATA

What you need to look at is the "inner XML" of that element, which is tricky in SimpleXML (->asXML() will give you the "outer XML", e.g. <Dest><![CDATA[some text...]]></Dest>).

Your best bet here is to use the DOM, which gives you more access to the detailed structure of the document rather than trying to give you the content, and so distinguishes "text nodes" from "CDATA nodes". However, it's worth double-checking that you do actually need this, as for 99.9% of use cases, you shouldn't care whether somebody sent you <foo>bar &amp; baz</foo> or <foo><![CDATA[bar & baz]]></foo>, since by definition they represent the same string.
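If you do need to tell the two apart, a sketch along these lines shows how DOM exposes the distinction. The <Dest> sample content is invented for the example.

```php
<?php
$document = new DOMDocument();
// Hypothetical element mixing an ordinary text node and a CDATA section.
$document->loadXML('<Dest>plain <![CDATA[bar & baz]]></Dest>');

$kinds = [];
foreach ($document->documentElement->childNodes as $child) {
    // DOMCdataSection extends DOMText, so it has to be checked first.
    if ($child instanceof DOMCdataSection) {
        $kinds[] = 'cdata: ' . $child->data;
    } elseif ($child instanceof DOMText) {
        $kinds[] = 'text: ' . $child->data;
    }
}

print_r($kinds);
// [0] => text: plain
// [1] => cdata: bar & baz
```

The string content of the whole element ($document->documentElement->textContent) is still "plain bar & baz" either way; only the node types differ.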

How to parse CDATA HTML-content of XML using SimpleXML?

I answered this once before, but I can't find that answer any longer.

If you take a look at the string (simplified/beautified):

<content:encoded><![CDATA[
<p>Lorem Ipsom</p>
<p>
<a href='laura-bertram-trance-gemini-145-1080.jpg'
title='<br>November 2012 calendar from 5.10 The Test<br> <a href="</a>
</p>]]>
</content:encoded>

You can see that you have HTML encoded inside the node-value of the <content:encoded> element. So first you need to obtain the HTML value, which you already do:

$html = $boo->children('content', true)->encoded;

Then you need to parse the HTML inside $html. Which libraries can be used for HTML parsing in PHP is outlined in:

  • How to parse and process HTML/XML with PHP?

If you decide to use the more or less recommended DOMDocument for the job, you only need to get the attribute value of a certain element:

  • PHP DOMDocument getting Attribute of Tag

Or, for its sister library SimpleXML, which you already use (so this is the more recommended option; see also the next section):

  • How to get an attribute with SimpleXML?

In the context of your question, here is a tip:

You're using SimpleXML. DOMDocument is a sister library, meaning you can switch between the two, so you don't need to learn a whole new library.

For example, you can use only the HTML parsing feature of DOMDocument and then import the result into SimpleXML. This is useful because SimpleXML does not support HTML parsing.

That works via simplexml_import_dom().

A simplified step-by-step example:

// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;

// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();

// load the HTML:
$htmlParser->loadHTML($htmlString);

// import it into simplexml:
$html = simplexml_import_dom($htmlParser);

Now you can use $html as a new SimpleXMLElement that represents the HTML document. As your HTML chunks did not have any <body> tags, the parser puts them inside a <body> element, in line with the HTML specification. This allows you, for example, to access the href attribute of the first <a> inside the second <p> element in your example:

// access the element you're looking for:
$href = $html->body->p[1]->a['href'];

Here is the full code from above (Online Demo):

// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;

// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();

// your HTML gives parser warnings, keep them internal:
libxml_use_internal_errors(true);

// load the HTML:
$htmlParser->loadHTML($htmlString);

// import it into simplexml:
$html = simplexml_import_dom($htmlParser);

// access the element you're looking for:
$href = $html->body->p[1]->a['href'];

// output it
echo $href, "\n";

And what it outputs:

laura-bertram-trance-gemini-145-1080.jpg

Modify ![CDATA[]] in PHP? (XML)

That is true for SimpleXML. CDATA sections are a special kind of text node. They exist to make embedded fragments more readable for humans. SimpleXML does not really work at the level of individual XML nodes, so you will have to let it convert them to standard text nodes.

If you have a JS or HTML fragment in XML it is easier to read if the special characters like < are not escaped. And this is what CDATA sections are for (and some backwards compatibility for browsers).

So to modify a CDATA section and keep it, you will have to use DOM. DOM actually knows about the different node types. Here is a small example:

$xml = '<link><![CDATA[https://google.de]]></link>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

foreach ($xpath->evaluate('//link/text()') as $linkValue) {
    $linkValue->data .= '?abc';
}
echo $document->saveXML();

Output:

<?xml version="1.0"?>
<link><![CDATA[https://google.de?abc]]></link>
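If the rest of your code already works with a SimpleXMLElement, one possible variation is to reach the underlying DOM node through dom_import_simplexml() and edit the CDATA section there. This is only a sketch; the <root> wrapper element is an assumption added for the example.

```php
<?php
// Hypothetical document with the link wrapped in a <root> element.
$sxe = new SimpleXMLElement(
    '<root><link><![CDATA[https://google.de]]></link></root>'
);

// dom_import_simplexml() returns the DOMElement for the same underlying node,
// so changes made through DOM are visible through SimpleXML as well.
$linkElement = dom_import_simplexml($sxe->link);

foreach ($linkElement->childNodes as $child) {
    // Only touch CDATA sections, leaving ordinary text nodes alone.
    if ($child instanceof DOMCdataSection) {
        $child->data .= '?abc';
    }
}

echo $sxe->asXML();
```

The serialized result should still contain a CDATA section, now holding https://google.de?abc.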

Getting cdata content while parsing xml file

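print_r() does not show CDATA content in SimpleXML objects unless the document is loaded with LIBXML_NOCDATA, so use:

$xml = simplexml_load_file('xmlfile', 'SimpleXMLElement', LIBXML_NOCDATA);
$nodes = array(); // stays empty if the file could not be loaded
if ($xml !== false) {
    $nodes = $xml->xpath('//xml/events');
}
print_r($nodes);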

This will give you:

Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [date] => 01-10-2009
                    [color] => 0x99CC00
                    [selected] => true
                )

            [event] => SimpleXMLElement Object
                (
                    [title] => You can use HTML and CSS
                    [description] => This is the description
                )

        )

)

PHP - Parse XML with simplexml_load_string - Getting empty values with CDATA?

The problem seems to be that the cast to array you're doing can return results that differ from the actual structure of the XML object.

Something like the following code should give you an array with the correct info:

$array = array_map('strval', (array) $xml->Product);

Take care to cast to string those parts you want to get the data from (done in the example via strval()). By contrast, json_encode() does not work well with SimpleXMLElement.
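As an alternative sketch that does not depend on how the array cast behaves, you can loop over the children and cast each one explicitly. The <Products> document and its element names are assumptions made up for the example.

```php
<?php
// Hypothetical product feed; one child uses CDATA, the other plain text.
$xml = simplexml_load_string(
    '<Products><Product>'
    . '<Name><![CDATA[Widget & Co]]></Name>'
    . '<Price>9.99</Price>'
    . '</Product></Products>'
);

$data = [];
foreach ($xml->Product->children() as $name => $child) {
    // The iteration key is the element name; the cast extracts text and CDATA alike.
    $data[$name] = (string) $child;
}

print_r($data);
// [Name] => Widget & Co
// [Price] => 9.99
```

Note that this simple loop keeps only the last value if an element name is repeated, and it ignores attributes and nested elements.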


