How to Parse CDATA HTML Content of XML Using SimpleXML

How to parse CDATA HTML-content of XML using SimpleXML?

I answered this once before, but I can no longer find that answer.

If you take a look at the string (simplified/beautified):

<content:encoded><![CDATA[
<p>Lorem Ipsom</p>
<p>
<a href='laura-bertram-trance-gemini-145-1080.jpg'
title='<br>November 2012 calendar from 5.10 The Test<br> <a href="</a>
</p>]]>
</content:encoded>

You can see that you have HTML encoded inside the node-value of the <content:encoded> element. So first you need to obtain the HTML value, which you already do:

$html = $boo->children('content', true)->encoded;

Then you need to parse the HTML inside $html. The libraries available for parsing HTML in PHP are outlined in:

  • How to parse and process HTML/XML with PHP?

If you decide to use the more or less recommended DOMDocument for the job, you only need to get the attribute value of a certain element:

  • PHP DOMDocument getting Attribute of Tag

Or, for its sister library SimpleXML, which you already use (so this is the more natural choice here; see also the next section):

  • How to get an attribute with SimpleXML?

In the context of your question, here is a tip:

You're using SimpleXML. DOMDocument is a sister-library, meaning you can interchange between the two so you don't need to learn a full new library.

For example, you can use only the HTML parsing feature of DOMDocument and then import the result into SimpleXML. This is useful because SimpleXML does not support HTML parsing.

That works via simplexml_import_dom().

A simplified step-by-step example:

// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;

// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();

// load the HTML:
$htmlParser->loadHTML($htmlString);

// import it into simplexml:
$html = simplexml_import_dom($htmlParser);

Now you can use $html as a new SimpleXMLElement that represents the HTML document. Because your HTML chunks did not have a <body> tag, the parser puts them inside a <body> element, as the HTML specification requires. This allows you, for example, to access the href attribute of the first <a> inside the second <p> element in your example:

// access the element you're looking for:
$href = $html->body->p[1]->a['href'];

Here is the full example from above (Online Demo):

// get the HTML string out of the feed:
$htmlString = $boo->children('content', true)->encoded;

// create DOMDocument for HTML parsing:
$htmlParser = new DOMDocument();

// your HTML gives parser warnings, keep them internal:
libxml_use_internal_errors(true);

// load the HTML:
$htmlParser->loadHTML($htmlString);

// import it into simplexml:
$html = simplexml_import_dom($htmlParser);

// access the element you're looking for:
$href = $html->body->p[1]->a['href'];

// output it
echo $href, "\n";

And what it outputs:

laura-bertram-trance-gemini-145-1080.jpg
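
The snippets above depend on a $boo item that comes from the asker's feed. As a minimal self-contained sketch of the same steps, the following uses a made-up feed string (the item content, the file name example.jpg and the variable names are assumptions for illustration, not taken from the question):

```php
<?php
// Hypothetical feed with HTML wrapped in CDATA inside <content:encoded>.
$feed = <<<'XML'
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Example</title>
      <content:encoded><![CDATA[<p>Lorem ipsum</p><p><a href="example.jpg" title="Example">link</a></p>]]></content:encoded>
    </item>
  </channel>
</rss>
XML;

$item = simplexml_load_string($feed)->channel->item;

// Cast to string so DOMDocument receives the plain HTML text.
$htmlString = (string) $item->children('content', true)->encoded;

// Keep HTML parser warnings internal instead of printing them.
libxml_use_internal_errors(true);

$htmlParser = new DOMDocument();
$htmlParser->loadHTML($htmlString);
libxml_clear_errors();

// Switch from DOM to SimpleXML; the root element is <html>.
$html = simplexml_import_dom($htmlParser);

// First <a> inside the second <p> of the parsed fragment.
$href = (string) $html->body->p[1]->a['href'];
echo $href, "\n";
```

The explicit (string) casts are a design choice: they make it clear where SimpleXML objects turn into plain PHP strings.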

Parsing xml feed with cdata PHP SimpleXML

I think you are just the victim of the browser hiding the tags. Let me explain:
Your input feed doesn't really have <![CDATA[ ]]> sections in it. The < and > characters are actually entity-encoded in the raw source of the RSS stream. Hit Ctrl+U on the RSS link in your browser and you will see:

<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" version="2.0">
<channel>
<description>Blog do Garotinho</description>
<item>
<description>&lt;![CDATA[&lt;br&gt;
Fico impressionado com a hipocrisia e a falsidade de certos políticos....]]&gt;
</description>
<link>&lt;![CDATA[http://www.blogdogarotinho.com.br/lartigo.aspx?id=16796]]&gt;</link>
...
<title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title>
</item>

As you can see, the <title>, for example, starts with a &lt;, which turns into a < when SimpleXML returns it for your JSON data.
Now, if you look at the printed JSON data in a browser, the browser sees the following:

"title":"<![CDATA[A bancada dos caras de pau]]>"

This will not be rendered because the browser treats it as a tag. The description seems to show up because it contains a <br> tag at some point, which ends the first "tag", so you can see the rest of the output.

If you hit Ctrl+U, you should see the output printed as expected (I used a command-line PHP script myself and did not notice this at first).

Try this demo:

  • There seems to be an empty "" after the "title":

    http://codepad.viper-7.com/ZYpaS1
  • However, if I put htmlspecialchars() around the json_encode(), the values become visible:

    http://codepad.viper-7.com/1nHqym

You could get rid of these markers by stripping them after the parse with a simple preg_replace():

function clean_cdata($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string)$str);
}

This should take care of the CDATA markers if they are at the start or the end of the individual values. You can call this inside the foreach() loop like this:

// ....
$article['title'] = clean_cdata($item->title);
// ....
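
To see the effect in isolation, here is a small sketch that feeds an entity-encoded title through SimpleXML and then strips the markers. The function name strip_cdata_markers and the sample XML are made up for this illustration; the regular expression is the same one used in clean_cdata() above.

```php
<?php
// Same pattern as clean_cdata() above, under a hypothetical name.
function strip_cdata_markers($str) {
    return preg_replace('#(^\s*<!\[CDATA\[|\]\]>\s*$)#sim', '', (string) $str);
}

// The markers are entity-encoded, so for the XML parser they are plain text.
$xml = simplexml_load_string(
    '<item><title>&lt;![CDATA[A bancada dos caras de pau]]&gt;</title></item>'
);

// SimpleXML decodes the entities, so the markers show up literally.
$raw = (string) $xml->title;

// After cleaning, only the text between the markers is left.
$clean = strip_cdata_markers($raw);

// htmlspecialchars() keeps the raw value visible when viewed in a browser.
echo htmlspecialchars(json_encode(array('raw' => $raw, 'clean' => $clean))), "\n";
```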

PHP, SimpleXML, decoding entities in CDATA

The purpose of CDATA sections in XML is to encapsulate a block of text "as is" which would otherwise require special characters (in particular, >, < and &) to be escaped. A CDATA section containing the character & is the same as a normal text node containing &amp;.

If a parser were to offer to ignore this, and pretend all CDATA nodes were really just text nodes, it would instantly break as soon as someone mentioned "P&O Cruises" - that & simply can't be there on its own (rather than as &amp;, or &somethingElse;).

The LIBXML_NOCDATA flag is actually pretty useless with SimpleXML, because (string)$foo neatly combines any sequence of text and CDATA nodes into an ordinary PHP string. (People frequently fail to notice this, because print_r doesn't.) This isn't necessarily true of more systematic access methods, such as DOM, where you can manipulate text nodes and CDATA nodes as objects in their own right.

What it effectively does is go through the document, and wherever it encounters a CDATA section, it takes the content, escapes it, and puts it back as an ordinary text node, or "merges" it with any text nodes to either side. The text represented is identical, just stored in the document in a different way; you can see the difference if you export back to XML, as in this example:

$xml_string = "<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>";

$person = new SimpleXMLElement($xml_string);
echo 'CDATA retained: ', $person->asXML();
// CDATA retained: <?xml version="1.0"?>
// <person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>

$person = new SimpleXMLElement($xml_string, LIBXML_NOCDATA);
echo 'CDATA merged: ', $person->asXML();
// CDATA merged: <?xml version="1.0"?>
// <person><name>Welcome aboard this P&amp;O Cruises voyage!</name></person>

If the XML document you're parsing contains a CDATA section which actually contains entities, you need to take that string and unescape it completely independently of the XML. One common reason to do this (other than laziness with poorly understood libraries) is to treat something marked up in HTML as just any old string inside an XML document, like this:

<Comment>
<SubmittedBy>IMSoP</SubmittedBy>
<Text><![CDATA[I'm <em>really</em> bad at keeping my answers brief <tt>;)</tt>]]></Text>
</Comment>
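
As a rough sketch of that separate unescaping step, assuming the CDATA content uses HTML entities (the sample text below is invented for illustration), you could read the string with SimpleXML and then pass it through html_entity_decode():

```php
<?php
// Hypothetical document: the CDATA section itself contains escaped entities.
$xml = simplexml_load_string(
    '<Comment><Text><![CDATA[Fish &amp; chips for &lt;em&gt;everyone&lt;/em&gt;]]></Text></Comment>'
);

// The XML parser leaves CDATA content untouched, entities included.
$text = (string) $xml->Text;

// Unescape as a separate step, independent of the XML parsing.
$decoded = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $text, "\n";
echo $decoded, "\n";
```

Whether html_entity_decode() is the right decoder depends on how the producer escaped the content; that part is an assumption here.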

PHP: How to handle ![CDATA[ with SimpleXMLElement?

You're probably not accessing it correctly. You can output it directly or cast it to a string. (In this example the cast is superfluous, as echo does it automatically anyway.)

$content = simplexml_load_string(
'<content><![CDATA[Hello, world!]]></content>'
);
echo (string) $content;

// or with parent element:

$foo = simplexml_load_string(
'<foo><content><![CDATA[Hello, world!]]></content></foo>'
);
echo (string) $foo->content;

You might have better luck with LIBXML_NOCDATA:

$content = simplexml_load_string(
'<content><![CDATA[Hello, world!]]></content>'
, null
, LIBXML_NOCDATA
);
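
To compare the two approaches side by side, this small sketch casts the same element to a string with and without the flag. The variable names are chosen for this illustration; the sample XML is the "P&O Cruises" string used earlier on this page.

```php
<?php
$xmlString = '<person><name>Welcome aboard this <![CDATA[P&O Cruises]]> voyage!</name></person>';

// Default parsing: CDATA stays a separate node type inside the document.
$plain = simplexml_load_string($xmlString);

// LIBXML_NOCDATA: CDATA sections are merged into ordinary text nodes.
$merged = simplexml_load_string($xmlString, 'SimpleXMLElement', LIBXML_NOCDATA);

// The string cast collects text and CDATA nodes alike in both cases.
$a = (string) $plain->name;
$b = (string) $merged->name;

echo $a, "\n";
echo $b, "\n";
```

In this sketch both casts yield the same string, so the flag mainly matters for debug output and for code that walks the node tree.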

SimpleXML, CDATA and HTML entities

Actually, this seemed to be exactly what I needed to do:

How to keep DOMDocument from saving < as &lt;

Poring over the manual for the DOM API has also given me something new that I'd like to learn for future use.

Getting cdata content while parsing xml file

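SimpleXML's debug output (print_r(), var_dump()) typically does not show CDATA content, so load the file with LIBXML_NOCDATA:

$xml = simplexml_load_file('xmlfile', 'SimpleXMLElement', LIBXML_NOCDATA);
// simplexml_load_file() returns false on failure
if ($xml !== false)
{
    $nodes = $xml->xpath('//xml/events');
    print_r($nodes);
}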

This will give you:

Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [date] => 01-10-2009
                    [color] => 0x99CC00
                    [selected] => true
                )

            [event] => SimpleXMLElement Object
                (
                    [title] => You can use HTML and CSS
                    [description] => This is the description
                )

        )

)

simpleXML get value from CDATA

In your simplexml_load_file() call, you need to add the LIBXML_NOCDATA flag:

$url = "http://www.ss.lv/lv/real-estate/flats/riga/hand_over/rss/";
// the flag goes in the third argument:
$result = simplexml_load_file($url, 'SimpleXMLElement', LIBXML_NOCDATA);
foreach ($result->channel->item as $item) {
    $title = (string) $item->title;
    $desc = (string) $item->description;
    // the constructor takes a version and an encoding, not the HTML string
    $dom = new DOMDocument();
    $dom->loadHTML($desc);
    $bold_tags = $dom->getElementsByTagName('b');
    foreach ($bold_tags as $b) {
        echo $b->nodeValue . '<br/>';
    }
}
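
The loop above needs the live feed to run. A self-contained sketch of the same idea, using an invented inline feed (the item text and the $values array are assumptions for illustration), might look like this:

```php
<?php
// Hypothetical feed: the description holds an HTML fragment inside CDATA.
$rss = <<<'XML'
<rss version="2.0"><channel><item>
<title><![CDATA[Flat for rent]]></title>
<description><![CDATA[Street: <b>Example iela 1</b><br/>Rooms: <b>2</b>]]></description>
</item></channel></rss>
XML;

$feed = simplexml_load_string($rss, 'SimpleXMLElement', LIBXML_NOCDATA);

// HTML fragments usually trigger parser warnings; keep them internal.
libxml_use_internal_errors(true);

$values = array();
foreach ($feed->channel->item as $item) {
    $dom = new DOMDocument();
    $dom->loadHTML((string) $item->description);

    // Collect the text of every <b> element in the fragment.
    foreach ($dom->getElementsByTagName('b') as $b) {
        $values[] = $b->nodeValue;
    }
}
libxml_clear_errors();

print_r($values);
```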

Reading data from a xml file inside HTML CDATA with PHP

Each <![CDATA[sample content]]> section should be enclosed in an opening and a closing tag; only then can the data be retrieved. Also, to read the CDATA content, you should use the LIBXML_NOCDATA flag.

Because those CDATA sections were not properly enclosed, you were getting an empty array.

The fixed code:

<?php

$content = '<?xml version="1.0" encoding="UTF-8"?>
<body>
<![CDATA[sample content]]><br />
<![CDATA[more content]]><br />
<![CDATA[content]]><br /></body>';

$content = str_replace(array('<br />','<!',']>'),array('','<br><!',']></br>'),$content);
$xml = simplexml_load_string($content, 'SimpleXMLElement', LIBXML_NOCDATA | LIBXML_NOBLANKS);
print_r($xml);

OUTPUT:

SimpleXMLElement Object
(
[br] => Array
(
[0] => sample content
[1] => more content
[2] => content
)

)

Retrieving CDATA contents from XML using PHP and simplexml

When you load the XML file, you'll need to handle the CDATA. This example works:

<?php
$xml = simplexml_load_file('file.xml', NULL, LIBXML_NOCDATA);
$description = $xml->xpath("//item[@title='0x|Beschrijving']");
var_dump($description);
?>

Here's the output:

array(1) {
  [0]=>
  object(SimpleXMLElement)#2 (2) {
    ["@attributes"]=>
    array(4) {
      ["id"]=>
      string(15) "787900813228567"
      ["view"]=>
      string(5) "12000"
      ["title"]=>
      string(15) "0x|Beschrijving"
      ["engtitle"]=>
      string(14) "0x|Description"
    }
    [0]=>
    string(41) "Dit college leert studenten hoe ze een on"
  }
}
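
If you only need the text rather than a dump, a string cast on the XPath result is enough. The sketch below uses an invented inline document (element contents and ids are assumptions) and deliberately loads it without LIBXML_NOCDATA:

```php
<?php
// Hypothetical document modelled on the attribute names in the dump above.
$xml = simplexml_load_string(
    '<items>'
    . '<item id="1" title="0x|Beschrijving"><![CDATA[Dit is een beschrijving]]></item>'
    . '<item id="2" title="0x|Titel"><![CDATA[Een titel]]></item>'
    . '</items>'
);

// Select the item by its title attribute.
$matches = $xml->xpath("//item[@title='0x|Beschrijving']");

// Casting the element to a string returns the CDATA content.
$text = (string) $matches[0];

echo $text, "\n";
```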

