XPath to Select Between Two HTML Comments Is Not Working

XPath to select between two HTML comments?

I would look for elements that are preceded by the first comment and followed by the second comment:

doc.xpath("//*[preceding::comment()[. = ' begin content ']]
[following::comment()[. = ' end content ']]")
#=> <div>some text</div>
#=> <div>
#=> <p>Some more elements</p>
#=> </div>
#=> <p>Some more elements</p>

Note that the above gives you every element in between, including nested ones. This means that if you iterate through the returned nodes, you will see some nested nodes twice, e.g. the "Some more elements" paragraph appears once inside its parent div and once on its own.

I think you might actually want just the top-level nodes in between, i.e. the siblings of the comments. This can be done using the preceding-sibling and following-sibling axes instead.

doc.xpath("//*[preceding-sibling::comment()[. = ' begin content ']]
[following-sibling::comment()[. = ' end content ']]")
#=> <div>some text</div>
#=> <div>
#=> <p>Some more elements</p>
#=> </div>

Update - Including comments

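Using //* only returns element nodes, which excludes comments (and other node types, such as text nodes). You can change * to node() to return everything. Note that . = '...' is an exact comparison, so the string has to match the comment text character for character, including any padding spaces.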

puts doc.xpath("//node()[preceding-sibling::comment()[. = 'begin content']]
[following-sibling::comment()[. = 'end content']]")
#=>
#=> <!--keywords1: first_keyword-->
#=>
#=> <div>html</div>
#=>

If you just want element nodes and comments (i.e. not text nodes and the like), you can use the self axis:

doc.xpath("//node()[self::* or self::comment()]
[preceding-sibling::comment()[. = 'begin content']]
[following-sibling::comment()[. = 'end content']]")
#=> <!--keywords1: first_keyword-->
#=> <div>html</div>

XPath selecting text between comments

You can get the required output with the XPath expressions below:

//p/text()[normalize-space()] # for date (skips whitespace-only text nodes; string-length(.)>0 is true for every text node)
//p/a/@href # for link
//p/a/text() # for link text

If you still want to use those comments in XPath:

//p/text()[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]] # for date

//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/@href # for links

//p/a[preceding-sibling::comment()[1][contains(., 'start template: articleLists/indexHeadline.html')]]
[following-sibling::comment()[1][contains(., 'end template: articleLists/indexHeadline.html')]]/text() # for link text

XPath selecting between comments multiple times

Add a predicate that states that you want the first preceding comment and the first following comment.

So, for example, to get the content between the comment containing "comment 1" and the next "end content" comment:

//*[preceding-sibling::comment()[1][contains(., 'comment 1')]]
[following-sibling::comment()[1][contains(., 'end content')]]

Similarly, to get the content between the comment containing "comment 2" and its "end content" comment:

//*[preceding-sibling::comment()[1][contains(., 'comment 2')]]
[following-sibling::comment()[1][contains(., 'end content')]]

XPath - extracting text between two nodes

You should be able to just test the first preceding-sibling h5, i.e. the nearest h5 before each text node:

//text()[preceding-sibling::h5[1][normalize-space()='SecondHeader']]

XPath substring before and after to return text between two html tags

You want the nodes that directly follow a particular <h4>, where "directly follow" can be expressed as "the first preceding <h4> is the one we started at" (well, and the node in question is not an <h4> itself, of course).

This expression(*)

//h4[. = 'Start here']/following-sibling::*[not(self::h4) and preceding-sibling::h4[1][. = 'Start here']]

selects from this document

<body>
<h4>Not relevant</h4>
<p>Other stuff</p>
<h4>Start here</h4>
<p>Text stuff 1</p>
<p>Text stuff 2</p>
<h4>Stop here</h4>
<p>Other stuff</p>
</body>

these nodes

<p>Text stuff 1</p>
<p>Text stuff 2</p>

You can extract/join their text values in the host application.


(*) could also be written as //*[not(self::h4) and preceding-sibling::h4[1][. = 'Start here']], but that one has to check more nodes, i.e. all nodes in the document, as opposed to only the following-sibling axis of one particular node.

XPath to select commented code

Use the union operator "|":

descendant-or-self::link|/*/comment()

The comment() step returns comment nodes whose value is plain text. XPath cannot descend into that text even though it looks like markup. It's just a string, so you'll have to treat it like one.

XPath. How to select all text between two tags?

<a> is a sibling of <pre>, not of the text() node inside it, so a sibling axis evaluated from the text node will not reach it. You can use preceding::a instead (and similarly following::a).


