PHP DOM textContent vs nodeValue?
I wanted to know the difference as well, so I dug into the source and found the answer. In most cases there is no discernible difference, but there are several edge cases you should be aware of.
Both ->nodeValue and ->textContent are identical for the following classes (node types):
DOMAttr
DOMText
DOMElement
DOMComment
DOMCharacterData
DOMProcessingInstruction
The ->nodeValue property yields NULL for the following classes (node types):
DOMDocumentFragment
DOMDocument
DOMNotation
DOMEntity
DOMEntityReference
The ->textContent property is non-existent for the following classes:
DOMNameSpaceNode (not documented, but can be found with the //namespace::* selector)
The ->nodeValue property is non-existent for the following classes:
DOMDocumentType
The behaviour described above is implemented in dom_node_node_value_read() and dom_node_text_content_read() in the PHP source.
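A minimal sketch of how some of these cases could be checked, assuming a stock PHP build with the DOM extension. The sample XML and the variable names are made up for illustration:

```php
<?php
// Illustrative document; the element, attribute and comment are arbitrary.
$doc = new DOMDocument();
$doc->loadXML('<root><item id="7">Hello <b>world</b></item><!-- note --></root>');

$item    = $doc->getElementsByTagName('item')->item(0);
$attr    = $item->getAttributeNode('id');
$comment = $doc->documentElement->lastChild;

// For elements, attributes and comments both properties report the same string.
var_dump($item->nodeValue === $item->textContent);
var_dump($attr->nodeValue === $attr->textContent);
var_dump($comment->nodeValue === $comment->textContent);

// For the document itself and for fragments, nodeValue is NULL.
var_dump($doc->nodeValue);
var_dump($doc->createDocumentFragment()->nodeValue);
```

The sketch deliberately does not print textContent for the document node, since that is one of the edge cases where behaviour may depend on the PHP version.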
nodeValue when working with PHP DOM
You can recursively walk the children of the node, collect the DOMText nodes, and join their values with a space:
function getNodeText(DOMNode $node) {
    // A text node contributes its own trimmed value.
    if ($node instanceof DOMText) {
        return trim($node->nodeValue);
    }

    // Otherwise collect the non-empty text of every child, recursively.
    $nodeValues = array();
    foreach ($node->childNodes as $child) {
        $nodeText = getNodeText($child);
        if ($nodeText != "") {
            $nodeValues[] = $nodeText;
        }
    }
    return trim(implode(" ", $nodeValues));
}

function getHeadingTags($content) {
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace("php", "http://php.net/xpath");
    $xpath->registerPHPFunctions();
    $nodes = $xpath->query('//h1|//h2|//h3|//h4|//h5|//h6');

    // Group the heading texts by tag name (h1, h2, ...).
    $results = array();
    foreach ($nodes as $node) {
        $results[$node->tagName][] = getNodeText($node);
    }
    return $results;
}
See an example here: https://3v4l.org/pYMFr
PHP DOM get nodevalue html? (without stripping tags)
I have never done what you're attempting to do, but as a stab in the dark, using the API docs, does echo $entry->textContent; work?
Adding an update. This is from the comments located on the docs page for DOMNode:
Hi!
Combining all the comments, the easiest way to get the inner HTML of a node is to use this function:
<?php
function get_inner_html($node) {
    $innerHTML = '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML($child);
    }
    return $innerHTML;
}
?>
Or, maybe a simpler method is just to do:
echo $domDocument->saveXML($entry);
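For documents loaded with loadHTML(), a similar sketch can serialize the children with saveHTML() instead, which writes void elements such as <br> in HTML style. The helper name innerHtml() and the sample markup are assumptions for illustration, not part of the quoted comment:

```php
<?php
// Hypothetical helper: concatenates the HTML serialization of each child node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() accepts an optional node and serializes only that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><b>bold</b> text<br>more</div>');
$div = $doc->getElementsByTagName('div')->item(0);

echo innerHtml($div), "\n";    // markup of the children, tags kept
echo $div->textContent, "\n";  // same content with the tags stripped
```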
Xpath nodeValue/textContent unable to see BR tag
Maybe this will help you: DOMNode::C14N().
It returns the canonicalized markup of the node, so child elements such as the br tag remain visible.
<?php
$a = '<a href="#">ABC<BR>DEF</a>';
$doc = new DOMDocument();
@$doc->loadHTML($a);
$finder = new DOMXPath($doc);
$nodes = $finder->query("//a");
foreach ($nodes as $node) {
    var_dump($node->C14N());
}
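If only the text is needed and the line break should survive, another option is to swap each br element for a newline text node before reading textContent. This is a sketch of that idea, not something proposed in the answer above:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<a href="#">ABC<BR>DEF</a>');
$xpath = new DOMXPath($doc);

// The HTML parser lowercases tag names, so "//br" matches <BR> as well.
// Copying to an array keeps the loop independent of the tree changes.
foreach (iterator_to_array($xpath->query('//br')) as $br) {
    $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
}

$link = $xpath->query('//a')->item(0);
$text = $link->textContent;
var_dump($text === "ABC\nDEF");
```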
DOM in PHP: Decoded entities and setting nodeValue
As hakre explained, the problem is that in PHP's DOM library, the behaviour of setting nodeValue w.r.t. entities depends on the class of the node; in particular, DOMText and DOMElement differ in this regard.
To illustrate this, an example:
$doc = new DOMDocument();
$doc->formatOutput = True;
$doc->loadXML('<root/>');
$s = 'text &<<"\'&text;&text';
$root = $doc->documentElement;
$node = $doc->createElement('tag1', $s); #line 10
$root->appendChild($node);
$node = $doc->createElement('tag2');
$text = $doc->createTextNode($s);
$node->appendChild($text);
$root->appendChild($node);
$node = $doc->createElement('tag3');
$text = $doc->createCDATASection($s);
$node->appendChild($text);
$root->appendChild($node);
echo $doc->saveXML();
This outputs:
Warning: DOMDocument::createElement(): unterminated entity reference text in /tmp/DOMtest.php on line 10
<?xml version="1.0"?>
<root>
<tag1>text &<<"'&text;</tag1>
<tag2>text &amp;&lt;&lt;"'&amp;text;&amp;text</tag2>
<tag3><![CDATA[text &<<"'&text;&text]]></tag3>
</root>
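To avoid the tag1 warning when building elements from arbitrary strings, one option is to always attach the value as a text node, as done for tag2. The helper name createTextElement() and the sample string below are hypothetical, chosen only to illustrate the idea:

```php
<?php
// Hypothetical helper: builds <$name> with $value as literal text.
function createTextElement(DOMDocument $doc, string $name, string $value): DOMElement
{
    $element = $doc->createElement($name);
    // A text node is not parsed for entity references; it is escaped on output.
    $element->appendChild($doc->createTextNode($value));
    return $element;
}

$doc  = new DOMDocument();
$node = createTextElement($doc, 'tag', 'a & b < c &text');
$xml  = $doc->saveXML($node);
echo $xml, "\n";  // <tag>a &amp; b &lt; c &amp;text</tag>
```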
In this particular case, it is appropriate to alter the nodeValue of DOMText nodes. Combining hakre's two answers one gets a quite elegant solution.
$doc = new DOMDocument();
$doc->loadXML(<XML data>);
$xpath = new DOMXPath($doc);
$node_list = $xpath->query(<some XPath>);
$visitTextNode = function (DOMText $node) {
    $text = $node->textContent;
    /*
      do something with $text
    */
    $node->nodeValue = $text;
};
foreach ($node_list as $node) {
    if ($node->nodeType == XML_TEXT_NODE) {
        $visitTextNode($node);
    } else {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType == XML_TEXT_NODE) {
                $visitTextNode($child);
            }
        }
    }
}
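As a concrete instance of the pattern above, the following sketch upper-cases every text node under the p elements. The sample XML and the use of strtoupper() are assumptions picked for illustration:

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root><p>fish &amp; chips</p><p>salt</p></root>');
$xpath = new DOMXPath($doc);

// text() selects the DOMText children directly, so no nodeType check is needed.
foreach ($xpath->query('//p/text()') as $textNode) {
    // Reading yields the decoded string; a value written to a DOMText node
    // is stored literally and escaped again when the document is saved.
    $textNode->nodeValue = strtoupper($textNode->nodeValue);
}

$xml = $doc->saveXML($doc->documentElement);
echo $xml, "\n";  // <root><p>FISH &amp; CHIPS</p><p>SALT</p></root>
```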
Am I able to retrieve a node value without the content of its child nodes?
Inside each div element are DOMText nodes that contain the actual text. Instead of using $div->textContent, collect the nodeValues of each child text node:
foreach ($divs as $div) {
    $text = array();
    foreach ($div->childNodes as $childNode) {
        if ($childNode->nodeType === XML_TEXT_NODE && $childNode->nodeValue) {
            $text[] = trim($childNode->nodeValue);
        }
    }
    if ($text) {
        print implode(' ', $text) . '<br>';
    }
}
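The same idea can also be sketched with an XPath query relative to the element, which selects only its direct text children. The helper name ownText() and the sample markup are hypothetical:

```php
<?php
// Hypothetical helper: joins only the text nodes that are direct children.
function ownText(DOMNode $node): string
{
    $xpath = new DOMXPath($node->ownerDocument);
    $parts = array();
    // "text()" evaluated relative to $node matches its direct text children only.
    foreach ($xpath->query('text()', $node) as $textNode) {
        $value = trim($textNode->nodeValue);
        if ($value !== '') {
            $parts[] = $value;
        }
    }
    return implode(' ', $parts);
}

$doc = new DOMDocument();
$doc->loadXML('<div>Hello <span>inner</span> world</div>');
$div = $doc->documentElement;

echo ownText($div), "\n";       // Hello world
echo $div->textContent, "\n";   // Hello inner world
```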
PHP Dom Documents: getting textContent ignoring script tags and comments
You have to visit every node and return its text. If a node contains other nodes, visit those too.
This can be done with this basic recursive algorithm:
extractNode:
    if node is a text node or a CDATA node, return its text
    if node is an element node, a document node, or a document fragment node:
        if it's a script node, return an empty string
        return the concatenation of the results of calling extractNode on all the child nodes
    for everything else, return nothing
Implementation:
function extractText($node) {
    if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
        return $node->nodeValue;
    } else if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
        // Skip the contents of script elements entirely.
        if ('script' === $node->nodeName) return '';
        $text = '';
        foreach ($node->childNodes as $childNode) {
            $text .= extractText($childNode);
        }
        return $text;
    }
    // Comments, processing instructions, etc. contribute no text.
    return '';
}
This will return the textContent of the given $node, ignoring script tags and comments.
$words = htmlspecialchars(extractText($bodyNodes->item(0)));
Try it here: http://codepad.org/CS3nMp7U
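As an alternative to the recursive walk, a single XPath query can select the wanted text nodes. This is a sketch under the assumption that the document was loaded with loadHTML(); the sample markup is made up, and the predicate could be extended in the same way for other elements such as style:

```php
<?php
$html = '<html><body><p>Hi</p><script>var x = 1;</script>'
      . '<!-- note --><p>there</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Text nodes anywhere in the document, except those inside <script>.
// Comments are a separate node type and never match text().
$query = '//text()[not(ancestor::script)]';

$text = '';
foreach ($xpath->query($query) as $textNode) {
    $text .= $textNode->nodeValue;
}
echo $text, "\n";  // paragraph text only, without the script body or the comment
```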