PHP DOM textContent vs nodeValue

PHP DOM textContent vs nodeValue?

I finally wanted to know the difference as well, so I dug into the source and found the answer; in most cases there will be no discernible difference, but there are a bunch of edge cases you should be aware of.

Both ->nodeValue and ->textContent are identical for the following classes (node types):

  • DOMAttr
  • DOMText
  • DOMElement
  • DOMComment
  • DOMCharacterData
  • DOMProcessingInstruction

The ->nodeValue property yields NULL for the following classes (node types):

  • DOMDocumentFragment
  • DOMDocument
  • DOMNotation
  • DOMEntity
  • DOMEntityReference

The ->textContent property is non-existent for the following classes:

  • DOMNameSpaceNode (not documented, but can be found with the //namespace::* selector)

The ->nodeValue property is non-existent for the following classes:

  • DOMDocumentType

See also: dom_node_node_value_read() and dom_node_text_content_read()
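
To make the lists above concrete, here is a minimal sketch (the sample XML and variable names are made up for illustration) that compares the two properties on an element, a comment and the document itself:

```php
<?php
// Hypothetical sample document, chosen only for illustration.
$doc = new DOMDocument();
$doc->loadXML('<root><a>one</a><!-- note --><b>two</b></root>');

$root    = $doc->documentElement;        // DOMElement
$comment = $root->childNodes->item(1);   // DOMComment

// Element: both properties give the concatenated descendant text.
var_dump($root->nodeValue === $root->textContent);       // bool(true)
var_dump($root->textContent);                            // string(6) "onetwo"

// Comment: both properties give the comment data.
var_dump($comment->nodeValue === $comment->textContent); // bool(true)

// Document: nodeValue is NULL, as listed above.
var_dump($doc->nodeValue);                               // NULL

// textContent is still readable on the document (value not asserted here).
var_dump($doc->textContent);
```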

nodeValue when working with PHP DOM

You can recursively walk the node's children, collect the DOMText nodes, and join their trimmed values with a single space:

function getNodeText(DOMNode $node) {
    // A text node contributes its own trimmed value.
    if ($node instanceof DOMText) {
        return trim($node->nodeValue);
    }

    // Otherwise collect the non-empty text of every child, recursively.
    $nodeValues = array();
    foreach ($node->childNodes as $child) {
        $nodeText = getNodeText($child);
        if ($nodeText !== "") {
            $nodeValues[] = $nodeText;
        }
    }
    return trim(implode(" ", $nodeValues));
}

function getHeadingTags($content) {
    $dom = new DOMDocument();
    $dom->loadHTML($content);

    $xpath = new DOMXPath($dom);

    // Select every heading element, in document order.
    $nodes = $xpath->query('//h1|//h2|//h3|//h4|//h5|//h6');

    // Group the heading texts by tag name, e.g. $results['h2'][] = '...'.
    $results = array();
    foreach ($nodes as $node) {
        $results[$node->tagName][] = getNodeText($node);
    }

    return $results;
}

See an example here: https://3v4l.org/pYMFr
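
To see why joining per text node can matter, here is a small self-contained sketch. The sample markup and variable names are made up for illustration, and it uses an XPath `.//text()` query rather than the recursive function above, so treat it as a comparison and not as the answer's own method:

```php
<?php
// Hypothetical sample markup; any heading with inline children would do.
$html = '<h1>Hello<br>World</h1>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$h1 = $xpath->query('//h1')->item(0);

// textContent concatenates descendant text nodes with no separator.
$plain = $h1->textContent;

// Collect each descendant text node separately and join with a space.
$parts = array();
foreach ($xpath->query('.//text()', $h1) as $textNode) {
    $value = trim($textNode->nodeValue);
    if ($value !== '') {
        $parts[] = $value;
    }
}
$joined = implode(' ', $parts);

var_dump($plain);  // string(10) "HelloWorld"
var_dump($joined); // string(11) "Hello World"
```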

PHP DOM get nodevalue html? (without stripping tags)

I have never done what you're attempting to do, but as a stab in the dark, using the API docs, does echo $entry->textContent; work?

Adding an update. This is from the comments located on the docs page for DOMNode:

Hi!

Combining all the comments, the easiest way to get the inner HTML of the node is to use this function:

<?php
function get_inner_html( $node ) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }

    return $innerHTML;
}
?>

Or, maybe a simpler method is just to do the following (note that this also includes the node's own opening and closing tags):

echo $domDocument->saveXML($entry);
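
The difference between the two approaches is easiest to see side by side. A minimal sketch, assuming a made-up `<entry>` element (the markup and variable names are not from the original question):

```php
<?php
// Hypothetical sample element, chosen only for illustration.
$doc = new DOMDocument();
$doc->loadXML('<entry>Some <b>bold</b> text</entry>');
$entry = $doc->documentElement;

// Outer markup: the node itself plus everything inside it.
$outer = $doc->saveXML($entry);

// Inner markup: serialize each child and concatenate the results.
$inner = '';
foreach ($entry->childNodes as $child) {
    $inner .= $doc->saveXML($child);
}

echo $outer, "\n"; // <entry>Some <b>bold</b> text</entry>
echo $inner, "\n"; // Some <b>bold</b> text
```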

Xpath nodeValue/textContent unable to see BR tag

Maybe this'll help you: DOMNode::C14N

It'll return the canonicalized XML of the node, with child tags such as BR preserved.

<?php
$a = '<a href="#">ABC<BR>DEF</a>';
$doc = new DOMDocument();
@$doc->loadHTML($a);
$finder = new DOMXPath($doc);
$nodes = $finder->query("//a");
foreach ($nodes as $node) {
    var_dump($node->C14N());
}
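
For comparison, here is a short sketch of what the different accessors give for the same link. Passing the node to `DOMDocument::saveHTML()` is shown as an alternative to C14N, not something the answer above proposed; the variable names are illustrative:

```php
<?php
$doc = new DOMDocument();
@$doc->loadHTML('<a href="#">ABC<BR>DEF</a>');
$link = $doc->getElementsByTagName('a')->item(0);

// textContent only sees text nodes, so the BR element disappears.
$text = $link->textContent;

// C14N() gives canonical XML, where the empty element gets an explicit end tag.
$c14n = $link->C14N();

// saveHTML($node) serializes the node as HTML.
$html = $doc->saveHTML($link);

var_dump($text);   // string(6) "ABCDEF"
echo $c14n, "\n";  // <a href="#">ABC<br></br>DEF</a>
echo $html, "\n";  // <a href="#">ABC<br>DEF</a>
```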


DOM in PHP: Decoded entities and setting nodeValue

As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node, in particular DOMText and DOMElement differ in this regard.
To illustrate this, an example:

$doc = new DOMDocument();
$doc->formatOutput = True;
$doc->loadXML('<root/>');

$s = 'text &<<"\'&text;&text';

$root = $doc->documentElement;

$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);

$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);

$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);

echo $doc->saveXML();

outputs

Warning: DOMDocument::createElement(): unterminated entity reference            text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
<tag1>text &<<"'&text;</tag1>
<tag2>text &amp;&lt;&lt;"'&amp;text;&amp;text</tag2>
<tag3><![CDATA[text &<<"'&text;&text]]></tag3>
</root>

In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers one gets a quite elegant solution.

$doc = new DOMDocument();
$doc->loadXML(<XML data>);

$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);

$visitTextNode = function (DOMText $node) {
    $text = $node->textContent;
    /*
        do something with $text
    */
    $node->nodeValue = $text;
};

foreach ($node_list as $node) {
    if ($node->nodeType == XML_TEXT_NODE) {
        $visitTextNode($node);
    } else {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType == XML_TEXT_NODE) {
                $visitTextNode($child);
            }
        }
    }
}
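
As a concrete, runnable variant of the pattern above, here is a minimal sketch. The sample XML, the `//p/text()` query and the uppercasing step are all assumptions standing in for the placeholders; the point is only that text read from and written back to a DOMText node round-trips without entity trouble:

```php
<?php
// Hypothetical input; the entities are decoded when the document is parsed.
$doc = new DOMDocument();
$doc->loadXML('<root><p>fish &amp; chips</p><p>a &lt; b</p></root>');

$xpath = new DOMXPath($doc);

// Select the text nodes directly instead of walking childNodes.
foreach ($xpath->query('//p/text()') as $textNode) {
    // textContent is already entity-decoded, e.g. "fish & chips".
    $text = strtoupper($textNode->textContent);

    // Assigning to a DOMText node stores the string literally;
    // it is escaped again when the document is serialized.
    $textNode->nodeValue = $text;
}

$result = $doc->saveXML($doc->documentElement);
echo $result, "\n"; // <root><p>FISH &amp; CHIPS</p><p>A &lt; B</p></root>
```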

Am I able to retrieve a node value without the content of its child nodes?

Inside each div element are DOMText nodes that contain the actual text. Instead of using $div->textContent, collect the nodeValues of each child text node:

foreach ($divs as $div) {
    $text = array();

    foreach ($div->childNodes as $childNode) {
        // Only direct text-node children; skip whitespace-only nodes.
        if ($childNode->nodeType === XML_TEXT_NODE) {
            $value = trim($childNode->nodeValue);
            if ($value !== '') {
                $text[] = $value;
            }
        }
    }

    if ($text) {
        print implode(' ', $text) . '<br>';
    }
}
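
The same idea can also be written with XPath. This is a sketch of an alternative, not the code from the answer above; the sample `<div>` and the variable names are made up:

```php
<?php
// Hypothetical sample element with text on both sides of a child element.
$doc = new DOMDocument();
$doc->loadXML('<div>Hello <span>inner</span> world</div>');
$div = $doc->documentElement;
$xpath = new DOMXPath($doc);

$text = array();
// "text()" relative to $div matches only its direct text-node children.
foreach ($xpath->query('text()', $div) as $textNode) {
    $value = trim($textNode->nodeValue);
    if ($value !== '') {
        $text[] = $value;
    }
}
$own = implode(' ', $text);

var_dump($div->textContent); // string(17) "Hello inner world"
var_dump($own);              // string(11) "Hello world"
```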

PHP Dom Documents: getting textContent ignoring script tags and comments

You have to visit all nodes and return their text. If some contain other nodes, visit those too.

This can be done with this basic recursive algorithm:

extractNode:
    if node is a text node or a cdata node, return its text
    if it is an element node or a document node or a document fragment node:
        if it’s a script node, return an empty string
        return a concatenation of the result of calling extractNode on all the child nodes
    for everything else return nothing

Implementation:

function extractText($node) {
    if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
        return $node->nodeValue;
    } else if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
        // Skip the contents of script elements entirely.
        if ('script' === $node->nodeName) return '';

        $text = '';
        foreach ($node->childNodes as $childNode) {
            $text .= extractText($childNode);
        }
        return $text;
    }
    // Comments, processing instructions, etc. contribute no text.
    return '';
}

This will return the textContent of the given $node, ignoring script tags and comments.

$words = htmlspecialchars(extractText($bodyNodes->item(0)));

Try it here: http://codepad.org/CS3nMp7U
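
If you would rather not write the recursion yourself, roughly the same filtering can be sketched with a single XPath query. This is an alternative under stated assumptions, not the answer's method: the sample HTML is made up, and the query only excludes `script` (add further ancestors such as `style` if you need them):

```php
<?php
// Hypothetical sample page with a script and a comment inside the body.
$html = '<html><body><p>Hello</p><script>alert("x");</script>'
      . '<!-- hidden --><p>World</p></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() never matches comment nodes; the predicate drops
// any text that sits inside a script element.
$text = '';
foreach ($xpath->query('//body//text()[not(ancestor::script)]') as $textNode) {
    $text .= $textNode->nodeValue;
}

var_dump($text); // string(10) "HelloWorld"
```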


