Testing text() nodes vs string values in XPath

XPath text() = is different from XPath . =

(Matching text nodes is different than matching string values)

The following XPaths are not the same...

  1. //span[text() = 'Office Hours']

    Says:

    Select the span elements that have an immediate child text node equal to 'Office Hours'.

  2. //span[. = 'Office Hours']

    Says:

    Select the span elements whose string value is equal to 'Office Hours'.

In short, for element nodes:

The string-value of an element node is the concatenation of the
string-values of all text node descendants of the element
node in document order.

Examples

The following span elements would match only #1:

  • <span class="portal-text-medium">Office Hours<br/>8:00-10:00</span>
  • <span class="portal-text-medium">My<br/>Office Hours</span>

The following span elements would match only #2:

  • <span class="portal-text-medium"><b>Office</b> Hours</span>
  • <span class="portal-text-medium"><b><i>Office Hours</i></b></span>

The following span element would match both #1 and #2:

  • <span class="portal-text-medium">Office Hours</span>
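
The matches listed above can be checked with a small sketch. Python's standard library has no full XPath engine, so this emulates the two predicates with ElementTree's .text and .tail attributes rather than running real XPath. The helper names are made up for illustration, and comments or processing instructions between text are ignored.

```python
import xml.etree.ElementTree as ET

def text_node_children(elem):
    """Approximate what text() selects: the text node children of elem."""
    chunks = [elem.text] + [child.tail for child in elem]
    return [c for c in chunks if c]

def string_value(elem):
    """XPath string value: all descendant text, joined in document order."""
    return "".join(elem.itertext())

def matches_text_test(elem, target):
    # Emulates [text() = target]: true if ANY text node child equals target.
    return target in text_node_children(elem)

def matches_dot_test(elem, target):
    # Emulates [. = target]: true if the whole string value equals target.
    return string_value(elem) == target

samples = [
    '<span class="portal-text-medium">Office Hours<br/>8:00-10:00</span>',
    '<span class="portal-text-medium">My<br/>Office Hours</span>',
    '<span class="portal-text-medium"><b>Office</b> Hours</span>',
    '<span class="portal-text-medium"><b><i>Office Hours</i></b></span>',
    '<span class="portal-text-medium">Office Hours</span>',
]

# One (matches #1, matches #2) pair per sample, in the order listed above.
results = [
    (matches_text_test(ET.fromstring(s), "Office Hours"),
     matches_dot_test(ET.fromstring(s), "Office Hours"))
    for s in samples
]
for sample, result in zip(samples, results):
    print(result, sample)
```

The first two samples come out as (True, False), the next two as (False, True), and the last as (True, True).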

XPath - Difference between node() and text()

text() and node() are node tests, in XPath terminology.

Node tests operate on a set (on an axis, to be exact) of nodes and return the ones that are of a certain type. When no axis is mentioned, the child axis is assumed by default.

There are all kinds of node tests:

  • node() matches any node (the least specific node test of them all)
  • text() matches text nodes only
  • comment() matches comment nodes
  • * matches any element node
  • foo matches any element node named "foo"
  • processing-instruction() matches PI nodes (they look like <?name value?>).
  • Side note: The * also matches attribute nodes, but only along the attribute axis. @* is a shorthand for attribute::*. Attributes are not part of the child axis, that's why a normal * does not select them.

This XML document:

<produce>
  <item>apple</item>
  <item>banana</item>
  <item>pepper</item>
</produce>

represents the following DOM (simplified):

root node
  element node (name="produce")
    text node (value="\n  ")
    element node (name="item")
      text node (value="apple")
    text node (value="\n  ")
    element node (name="item")
      text node (value="banana")
    text node (value="\n  ")
    element node (name="item")
      text node (value="pepper")
    text node (value="\n")

So with XPath:

  • / selects the root node
  • /produce selects a child element of the root node if it has the name "produce" (This is called the document element; it represents the document itself. Document element and root node are often confused, but they are not the same thing.)
  • /produce/node() selects any type of child node beneath /produce/ (i.e. all 7 children)
  • /produce/text() selects the 4 (!) whitespace-only text nodes
  • /produce/item[1] selects the first child element named "item"
  • /produce/item[1]/text() selects all child text nodes (there's only one - "apple" - in this case)

And so on.
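
The node counts above can be reproduced with Python's xml.dom.minidom, which exposes text nodes as real DOM nodes. This is only a sketch: minidom has no XPath support, so each path expression is imitated by filtering childNodes by node type, and the two-space indentation in the sample is an assumption.

```python
from xml.dom import minidom

XML = (
    "<produce>\n"
    "  <item>apple</item>\n"
    "  <item>banana</item>\n"
    "  <item>pepper</item>\n"
    "</produce>"
)

doc = minidom.parseString(XML)
produce = doc.documentElement

# /produce/node(): every child node, whatever its type.
all_children = list(produce.childNodes)

# /produce/text(): only the text node children (whitespace-only here).
text_children = [n for n in all_children if n.nodeType == n.TEXT_NODE]

# /produce/item[1]/text(): text node children of the first item element.
first_item = [n for n in all_children if n.nodeType == n.ELEMENT_NODE][0]
first_item_text = [n.data for n in first_item.childNodes
                   if n.nodeType == n.TEXT_NODE]

print(len(all_children))                      # 7
print([repr(n.data) for n in text_children])  # four whitespace-only strings
print(first_item_text)                        # ['apple']
```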

So, to answer your questions:

  • "Select the text of all items under produce": /produce/item/text() (3 nodes selected)
  • "Select all the manager nodes in all departments": //department/manager (1 node selected)

Notes

  • The default axis in XPath is the child axis. You can change the axis by prefixing a different axis name. For example: //item/ancestor::produce
  • Element nodes have string values. When an element node is converted to a string, the concatenated text of all its descendant text nodes is returned. In this example, /produce/item[1]/text() selects a text node and string(/produce/item[1]) returns a string, but both carry the same text: "apple".

XPath OR over text node

This XPath,

//td[starts-with(., "In Care Of Name")]/text()

will return the immediate text node children of the td whose string value starts with In Care Of Name:

text that I want to catch

for both of your XML variations involving b and strong children of the td.

See Testing text() nodes vs string values in XPath for further details on the differences between text nodes and string values in XPath.

XPath: difference between dot and text()

There is a difference between . and text(), but this difference might not surface because of your input document.

If your input document looked like (the simplest document one can imagine given your XPath expressions)

Example 1

<html>
<a>Ask Question</a>
</html>

Then //a[text()="Ask Question"] and //a[.="Ask Question"] indeed return exactly the same result. But consider a different input document that looks like

Example 2

<html>
<a>Ask Question<other/>
</a>
</html>

where the a element also has a child element other that follows immediately after "Ask Question". Given this second input document, //a[text()="Ask Question"] still returns the a element, while //a[.="Ask Question"] does not return anything!


This is because the meaning of the two predicates (everything between [ and ]) is different. [text()="Ask Question"] actually means: return true if any of the text node children of an element is exactly equal to "Ask Question". On the other hand, [.="Ask Question"] means: return true if the string value of an element is identical to "Ask Question".

In the XPath model, text inside XML elements can be partitioned into a number of text nodes if other elements interfere with the text, as in Example 2 above. There, the other element is between "Ask Question" and a newline character that also counts as text content.

To make an even clearer example, consider as an input document:

Example 3

<a>Ask Question<other/>more text</a>

Here, the a element actually contains two text nodes, "Ask Question" and "more text", since both are direct children of a. You can test this by running //a/text() on this document, which will return (individual results separated by ----):

Ask Question
-----------------------
more text

So, in such a scenario, text() returns a set of individual text nodes, while . in a predicate evaluates to the string value of the element, that is, the concatenation of all its descendant text nodes. Again, you can test this claim with the path expression //a[.='Ask Questionmore text'], which will successfully return the a element.
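
To see both views of Example 3 side by side, here is a rough Python sketch. It does not run XPath. It reads the text node children from ElementTree's .text and .tail attributes and builds the string value with itertext().

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>Ask Question<other/>more text</a>")

# Text node children of <a>: its leading text plus the tail after each child.
text_nodes = [t for t in [a.text] + [child.tail for child in a] if t]

# String value of <a>: every descendant text node joined in document order.
string_value = "".join(a.itertext())

print(text_nodes)    # ['Ask Question', 'more text']
print(string_value)  # Ask Questionmore text
```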


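Finally, keep in mind that some XPath functions accept only a single string as input. As LarsH has pointed out in the comments, if such a function (e.g. contains()) is given a node-set in XPath 1.0, it will only process the first node in document order and silently ignore the rest. In XPath 2.0 and later, passing a sequence of more than one item to such a function raises a type error instead, unless XPath 1.0 compatibility mode is enabled.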

Difference between text() and string()

Can someone explain the difference between text() and string() functions?

I. text() isn't a function but a node test.

It is used to select all text-node children of the context node.

So, if the context node is an element named x, then text() selects all text-node children of x.

Other examples:

/a/b/c/text()

selects all text-node children of any c element that is a child of any b element that is a child of the top element a.

II. The string() function

By definition string(exprSelectingASingleNode) returns the string value of the node.

The string value of an element is the concatenation of all of its text-node descendants -- in document order.

Therefore, if in the following XML document:

<a>
<b>2</b>
<c>3
<d>4</d>
</c>
5
</a>

string(/a) returns (without the surrounding quotes):

"
2
3
4

5
"

As we see, the string value reflects three white-space-only text-nodes, which we typically fail to notice and account for.

Some XML parsers have the option of stripping-off white-space-only text nodes. If the above document was parsed with the white-space-only text nodes stripped off, then the same function:

string(/a)

now returns:

"23
4
5
"

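Both string(/a) results can be reproduced approximately in Python. The sketch below assumes the document exactly as printed above, with no indentation. It uses ElementTree's itertext() in place of string(), and imitates a whitespace-stripping parser by dropping whitespace-only chunks.

```python
import xml.etree.ElementTree as ET

XML = "<a>\n<b>2</b>\n<c>3\n<d>4</d>\n</c>\n5\n</a>"
root = ET.fromstring(XML)

# itertext() yields one chunk per text node, in document order.
chunks = list(root.itertext())

# string(/a) with every text node kept.
full = "".join(chunks)

# string(/a) as if the parser had stripped whitespace-only text nodes.
stripped = "".join(c for c in chunks if c.strip())

# The whitespace-only text nodes that are easy to overlook.
whitespace_only = [c for c in chunks if not c.strip()]

print(repr(full))            # '\n2\n3\n4\n\n5\n'
print(repr(stripped))        # '23\n4\n5\n'
print(len(whitespace_only))  # 3
```
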
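XPath Testing string value of child and parent has empty string value

Try this:

//span[text()='630']/parent::div[not(text())]

It selects the div parent of any span that has a text node child equal to '630', provided that div has no text node children of its own. Note that whitespace-only text nodes also count as text node children unless the parser strips them.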

XPath's string-length() of a node without its children

XPath 1.0

XPath 1.0 alone cannot provide a character count of only the text node children of an element. You'll also have to use facilities of the language hosting the XPath library.
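
As one possible way to do that counting in the host language, here is a minimal Python sketch using only the standard library's ElementTree. The function name and the sample document are hypothetical. An element's own text is read from .text and from the .tail of each child, so descendant text is left out.

```python
import xml.etree.ElementTree as ET

def own_text_length(elem):
    """Count characters in elem's text node children only, not descendants."""
    own_chunks = [elem.text] + [child.tail for child in elem]
    return sum(len(chunk) for chunk in own_chunks if chunk)

# Hypothetical sample: "abc" and "de" belong to <node>, "ignored" does not.
node = ET.fromstring("<node>abc<child>ignored</child>de</node>")

own_length = own_text_length(node)             # text node children only
full_length = len("".join(node.itertext()))    # whole string value

print(own_length)   # 5
print(full_length)  # 12
```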

XPath 2.0 and up

This XPath (first posted by @MartinHonnen in comments),

sum(/node/text()/string-length())

will provide a character count of all text node children of /node, as requested.

See also

  • The definition of string value in XPath
  • Testing text() nodes vs string values in XPath

XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode

The <Comment> tag contains two text nodes and two <br> nodes as children.

Your xpath expression was

//*[contains(text(),'ABC')]

To break this down,

  1. * is a selector that matches any element (i.e. tag) -- it returns a node-set.
  2. The [] are a conditional that operates on each individual node in that node set. It matches if any of the individual nodes it operates on match the conditions inside the brackets.
  3. text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node set.
  4. contains is a function that operates on a string. If it is passed a node set, the node set is converted into a string by returning the string-value of the node in the node-set that is first in document order. Hence, it can match only the first text node in your <Comment> element -- namely BLAH BLAH BLAH. Since that doesn't match, you don't get a <Comment> in your results.

You need to change this to

//*[text()[contains(.,'ABC')]]

  1. * is a selector that matches any element (i.e. tag) -- it returns a node-set.
  2. The outer [] are a conditional that operates on each individual node in that node set -- here it operates on each element in the document.
  3. text() is a selector that matches all of the text nodes that are children of the context node -- it returns a node set.
  4. The inner [] are a conditional that operates on each node in that node set -- here each individual text node. Each individual text node is the starting point for any path in the brackets, and can also be referred to explicitly as . within the brackets. It matches if any of the individual nodes it operates on match the conditions inside the brackets.
  5. contains is a function that operates on a string. Here it is passed an individual text node (.). Since it is passed the second text node in the <Comment> tag individually, it will see the 'ABC' string and be able to match it.
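
The difference between the two expressions can be imitated in plain Python. This is a sketch, not real XPath evaluation. The Comment content below is a guessed stand-in for the original document, and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET

def text_node_children(elem):
    """Approximate text(): the element's own text plus each child's tail."""
    return [t for t in [elem.text] + [child.tail for child in elem] if t]

def contains_first_text(elem, needle):
    # Emulates [contains(text(), needle)] in XPath 1.0: the node-set is
    # converted to the string value of its FIRST node in document order.
    nodes = text_node_children(elem)
    first = nodes[0] if nodes else ""
    return needle in first

def contains_any_text(elem, needle):
    # Emulates [text()[contains(., needle)]]: each text node is tested in turn.
    return any(needle in t for t in text_node_children(elem))

# Assumed shape: two text nodes and two <br> elements as children.
comment = ET.fromstring("<Comment>BLAH BLAH BLAH<br/><br/>ABC text</Comment>")

first_only = contains_first_text(comment, "ABC")
any_node = contains_any_text(comment, "ABC")

print(text_node_children(comment))  # ['BLAH BLAH BLAH', 'ABC text']
print(first_only)                   # False
print(any_node)                     # True
```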

XPath - select element with inside text, even text of subelements

Try using normalize-space()...

//div[normalize-space() = 'I need this text']
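
What normalize-space() does to the string value before the comparison can be approximated as follows. The div below is a hypothetical document, and the regular expression is a stand-in for the XPath function, which trims the string and collapses runs of space, tab, CR and LF.

```python
import re
import xml.etree.ElementTree as ET

def normalize_space(s):
    # Approximates XPath normalize-space(): collapse runs of XML whitespace
    # (space, tab, CR, LF) into single spaces, then trim both ends.
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

# Hypothetical div whose text is split across a subelement and line breaks.
div = ET.fromstring("<div>\n  I need <b>this</b>\n  text\n</div>")

string_value = "".join(div.itertext())
plain_match = string_value == "I need this text"
normalized_match = normalize_space(string_value) == "I need this text"

print(repr(string_value))  # '\n  I need this\n  text\n'
print(plain_match)         # False
print(normalized_match)    # True
```

Because normalize-space() with no argument works on the string value of the context node, text inside subelements such as b is included in the comparison.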

