XPath to Select Between Two HTML Comments Is Not Working

XPath to select between two HTML comments?

I would look for elements that are preceded by the first comment and followed by the second comment:

doc.xpath("//*[preceding::comment()[. = ' begin content ']]
[following::comment()[. = ' end content ']]")
#=> <div>some text</div>
#=> <div>
#=> <p>Some more elements</p>
#=> </div>
#=> <p>Some more elements</p>

Note that the above returns every element in between, at any depth. This means that if you iterate over the returned nodes, nested content appears more than once, e.g. the "Some more elements" paragraph shows up both inside its parent <div> and as a result of its own.

I think you might actually want just the top-level nodes in between, i.e. the siblings of the comments. This can be done with the preceding-sibling and following-sibling axes instead.

doc.xpath("//*[preceding-sibling::comment()[. = ' begin content ']]
[following-sibling::comment()[. = ' end content ']]")
#=> <div>some text</div>
#=> <div>
#=> <p>Some more elements</p>
#=> </div>

Update - Including comments

Using //* only returns element nodes, so comments (and other node types, such as text nodes) are excluded. You could change * to node() to return everything.

puts doc.xpath("//node()[preceding-sibling::comment()[. = 'begin content']]
[following-sibling::comment()[. = 'end content']]")
#=> <!--keywords1: first_keyword-->
#=> <div>html</div>

If you just want element nodes and comments (ie not everything), you can use the self axis:

doc.xpath("//node()[self::* or self::comment()]
[preceding-sibling::comment()[. = 'begin content']]
[following-sibling::comment()[. = 'end content']]")
#=> <!--keywords1: first_keyword-->
#=> <div>html</div>

XPath selecting text between comments

You can get the required output with the XPath expressions below:

//p/text()[string-length(normalize-space(.))>0] # for date (skips whitespace-only text nodes)
//p/a/@href # for link
//p/a/text() # for link text

If you still want to use those comments in XPath:

//p/text()[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]] # for date

//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/@href # for links

//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/text() # for link text

XPath selecting between comments multiple times

Add a positional predicate, [1], so that only the first preceding comment and the first following comment are tested.

So, for example, to get the content between the comment containing "comment 1" and the next "end content" comment:

//*[preceding-sibling::comment()[1][contains(., 'comment 1')]]
[following-sibling::comment()[1][contains(., 'end content')]]

Similarly, to get the content between the comment containing "comment 2" and the next "end content" comment:

//*[preceding-sibling::comment()[1][contains(., 'comment 2')]]
[following-sibling::comment()[1][contains(., 'end content')]]

XPath - extracting text between two nodes

You should be able to just test the first preceding sibling h5...


XPATH substring before and after to return text between two html tags

You want the nodes that directly follow a particular <h4>, where "directly follow" can be expressed as "the first preceding <h4> is the one we started at" (well, and the node in question is not an <h4> itself, of course).

This expression(*)

//h4[. = 'Start here']/following-sibling::*[not(self::h4) and preceding-sibling::h4[1][. = 'Start here']]

selects from this document

<h4>Not relevant</h4>
<p>Other stuff</p>
<h4>Start here</h4>
<p>Text stuff 1</p>
<p>Text stuff 2</p>
<h4>Stop here</h4>
<p>Other stuff</p>

these nodes

<p>Text stuff 1</p>
<p>Text stuff 2</p>

You can extract/join their text values in the host application.

(*) could also be written as //*[not(self::h4) and preceding-sibling::h4[1][. = 'Start here']], but that one has to check more nodes, i.e. all nodes in the document, as opposed to only the following-sibling axis of one particular node.

Xpath to select commented code

Use the union operator "|":


This will return the text of the comment. XPath cannot query inside it, even though it contains markup-like text: it is just a string, so you will have to treat it like one.

Xpath. How to select all text between two tags?

<a> is a sibling of <pre>, not of the text() node inside it. You can use preceding::a instead (and similarly following::a).

