Xpath Expression for Regex-Like Matching

How about this (updated):

XPath 1.0:

"//div[substring-before(@id, '_') = 'foo' 
       and substring-after(@id, '_') >= 0 
       and substring-after(@id, '_') <= 99999999]"

Edit #2: The OP made a change to the question. The following, even more reduced XPath 1.0 expression works for me:

"//div[substring(@id, 1, 13) = 'post_message_' 
       and substring(@id, 14) >= 0 
       and substring(@id, 14) <= 99999999]"

XPath 2.0 has a convenient matches() function:

"//div[matches(@id, '^foo_\d{1,8}$')]"

Apart from the better portability, I would expect the numerical expression (XPath 1.0 style) to perform better than the regex test, though this would only become noticeable when processing large data sets.
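If only an XPath 1.0 engine is at hand, another option (my assumption, not something the question asked for) is to select the candidate elements and apply the regex in the host language. A minimal sketch in Python with the standard library only; the sample document and the helper name matching_div_ids are made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; the ids are invented for this sketch.
SAMPLE = """<root>
  <div id="foo_123"/>
  <div id="foo_abc"/>
  <div id="bar_123"/>
  <div id="foo_123456789"/>
</root>"""

# Same pattern as in the matches() call; fullmatch() anchors both ends,
# so the explicit ^ and $ are not needed here.
ID_PATTERN = re.compile(r"foo_\d{1,8}")


def matching_div_ids(xml_text):
    """Return the ids of all div elements whose id matches ID_PATTERN."""
    root = ET.fromstring(xml_text)
    return [
        div.get("id")
        for div in root.iter("div")
        if ID_PATTERN.fullmatch(div.get("id", ""))
    ]


print(matching_div_ids(SAMPLE))  # ['foo_123']
```

The nine-digit id is rejected because the pattern allows at most eight digits.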

Original version of the answer:

"//div[substring-before(@id, '_') = 'foo' 
       and number(substring-after(@id, '_')) = substring-after(@id, '_') 
       and number(substring-after(@id, '_')) >= 0 
       and number(substring-after(@id, '_')) <= 99999999]"

The use of the number() function is unnecessary because the mathematical comparison operators implicitly coerce their arguments to numbers. Any non-number becomes NaN, and the greater-than/less-than tests then fail.
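The NaN behaviour can be illustrated outside XPath. The following is only a sketch in Python: xpath_number and in_range are hypothetical helpers, and the regex is my approximation of the XPath 1.0 string-to-number rule (optional whitespace, optional minus sign, digits with an optional fraction), not code from any XPath engine:

```python
import math
import re

# Approximation of the XPath 1.0 Number grammar with optional
# surrounding whitespace and an optional leading minus sign.
_XPATH_NUMBER = re.compile(r"[ \t\r\n]*-?(\d+(\.\d*)?|\.\d+)[ \t\r\n]*")


def xpath_number(text):
    """Hypothetical helper: mimic XPath 1.0 number() applied to a string."""
    if _XPATH_NUMBER.fullmatch(text):
        return float(text)
    return math.nan


def in_range(suffix):
    """Mirror `suffix >= 0 and suffix <= 99999999` with implicit coercion."""
    value = xpath_number(suffix)
    # Every ordered comparison involving NaN is False, so non-numbers fail.
    return value >= 0 and value <= 99999999


print(in_range("123"))  # True
print(in_range("abc"))  # False
print(in_range(""))     # False
```

One consequence worth knowing: a suffix such as 1.5 also converts to a number inside the range, so the numeric test is somewhat looser than a \d{1,8} regex.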

I also removed the encoding of the angle brackets, since this is an XML requirement, not an XPath requirement.

How to use regex in XPath contains function

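XPath 1.0 doesn't handle regex natively. Note that ends-with() only exists from XPath 2.0 onward, so in XPath 1.0 it has to be emulated with substring(). You could try something like

//*[starts-with(@id, 'sometext') and substring(@id, string-length(@id) - string-length('_text') + 1) = '_text']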

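(As pointed out by paul t, if you need to check for the middle digits as well, //*[boolean(number(substring-before(substring-after(@id, "sometext"), "_text")))] could be used to approximate the check your original regex does. Be aware that boolean() is false for the number 0 as well as for NaN, so an id such as sometext0_text would not be selected.)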

In XPath 2.0, try

//*[matches(@id, 'sometext\d+_text')]
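Keep in mind that matches() succeeds if the pattern matches anywhere in the string unless it is anchored with ^ and $. Python's re.search behaves the same way, so the pattern can be tried out there first. A small sketch; the candidate ids are made up:

```python
import re

# Same pattern as in the matches() call, deliberately left unanchored.
PATTERN = re.compile(r"sometext\d+_text")


def id_matches(value):
    """Unanchored test, analogous to matches() without ^ and $."""
    return PATTERN.search(value) is not None


# Hypothetical ids for illustration.
candidates = ["sometext123_text", "xsometext7_textx", "sometext_text"]
print([c for c in candidates if id_matches(c)])
# ['sometext123_text', 'xsometext7_textx']
```

The last candidate fails because \d+ requires at least one digit. Add ^ and $ if surrounding characters should not be allowed.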

Why is my XPath with regex failing to match?

Use matches(), which matches against a regex, rather than contains(), which tests for literal substring containment.

I'd also suggest using . rather than text() as it's the string value of the element that's your real goal to match, not really a text() node child.

Altogether, the XPath for selecting the targeted element would be:

//*[@class='body' and matches(.,'(20\d{2}).(\d{1,2}).(\d{1,2})')]

I would like to get thi1te_t in the final output, probably with regex ^.{8}$ and grep.

You can return that substring by tokenizing the string value of the element matched by the above XPath and then selecting the line that matches your target regex:

tokenize(//*[@class='body' and matches(.,'(20\d{2}).(\d{1,2}).(\d{1,2})')], 
        '\s*\n\s*')[matches(.,'^.{8}$')]

This XPath expression returns thi1te_t, as requested.
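The tokenize-then-filter step maps closely onto re.split plus a list comprehension. The sketch below is in Python, and BODY_TEXT is a hypothetical stand-in for the element's string value, since the original document is not reproduced here:

```python
import re

# Hypothetical string value of the matched element.
BODY_TEXT = """
    2014.05.12
    thi1te_t
    some longer line
"""


def eight_char_lines(text):
    """Split on newlines (with surrounding whitespace) and keep 8-char tokens."""
    tokens = re.split(r"\s*\n\s*", text)
    # fullmatch() plays the role of the ^...$ anchors in the XPath predicate.
    return [t for t in tokens if re.fullmatch(r".{8}", t)]


print(eight_char_lines(BODY_TEXT))  # ['thi1te_t']
```

The empty tokens produced by the leading and trailing newlines are dropped by the length filter.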

Can I use a Regex in an XPath expression?

As other answers have noted, XPath 1.0 does not support regular expressions.

Nonetheless, you have the following options:

Use an XPath 1.0 expression (note the starts-with() and translate() functions) like this:


.//div
   [starts-with(@id, 'foo') 
  and 
   'foo' = translate(@id, '0123456789', '')
  and
   string-length(@id) > 3   
   ]

Use EXSLT.NET -- there is a way to use its functions directly in XPath expressions without having to use XSLT. The EXSLT extension functions that allow RegEx-es to be used are: regexp:match(), regexp:replace() and regexp:test()
Use XPath 2.0/XSLT 2.0 and its inbuilt support for regular expressions (the functions matches(), replace() and tokenize())
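To see why the translate() trick in the first option works, here is the same logic as a Python sketch (the function name is made up): stripping all digits must leave exactly 'foo', and the length check makes sure at least one digit was present.

```python
DIGITS = "0123456789"
# Translation table that deletes every digit, like translate(@id, DIGITS, '').
_STRIP_DIGITS = str.maketrans("", "", DIGITS)


def is_foo_plus_digits(value):
    """Mirror the three XPath 1.0 conditions on the id attribute."""
    return (
        value.startswith("foo")
        and value.translate(_STRIP_DIGITS) == "foo"
        and len(value) > 3
    )


print(is_foo_plus_digits("foo123"))  # True
print(is_foo_plus_digits("foo"))     # False
print(is_foo_plus_digits("foo12x"))  # False
```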

XPath with regex match on an attribute value

I'm trying to get the total number of event nodes that contain the text ' doubles ' in the value of the description attribute.

matches() is a standard XPath 2.0 function. It is not available in XPath 1.0.

You can use:

count(/*/*/event[contains(@description, ' doubles ')])

To verify this, here is a small XSLT transformation which just outputs the result of evaluating the above XPath expression on the provided XML document:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>


 <xsl:template match="/">
  <xsl:value-of select=
  "count(/*/*/event[contains(@description, ' doubles ')])"/>
 </xsl:template>
</xsl:stylesheet>

when this transformation is applied on the provided XML document:

<game id="2009/05/02/arimlb-milmlb-1" pk="244539">
    <team id="109" name="Arizona" home_team="false">
        <event number="9" inning="1" description="Felipe Lopez doubles to left fielder Chris Duffy.  "/>
        <event number="15" inning="1" description="Augie Ojeda flies out to center fielder Mike Cameron.  "/>
        <event number="23" inning="1" description="Chad Tracy doubles to right fielder Joe Sanchez.  "/>
        <event number="52" inning="2" description="Mark Reynolds lines out to left fielder Chris Duffy.  "/>
        <!-- more data here -->
    </team>
</game>

the wanted, correct result is produced (for the fragment shown above):

2
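The count can also be cross-checked without XSLT. This is a sketch using Python's standard library; GAME_XML is a trimmed copy of the fragment above and count_doubles is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document shown above.
GAME_XML = """<game id="2009/05/02/arimlb-milmlb-1" pk="244539">
    <team id="109" name="Arizona" home_team="false">
        <event number="9" inning="1" description="Felipe Lopez doubles to left fielder Chris Duffy.  "/>
        <event number="15" inning="1" description="Augie Ojeda flies out to center fielder Mike Cameron.  "/>
        <event number="23" inning="1" description="Chad Tracy doubles to right fielder Joe Sanchez.  "/>
        <event number="52" inning="2" description="Mark Reynolds lines out to left fielder Chris Duffy.  "/>
    </team>
</game>"""


def count_doubles(xml_text):
    """Count event elements whose description contains ' doubles '."""
    root = ET.fromstring(xml_text)
    # "./*/event" corresponds to /*/*/event: root, any child, then event.
    return sum(
        1
        for event in root.findall("./*/event")
        if " doubles " in event.get("description", "")
    )


print(count_doubles(GAME_XML))  # 2
```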

X-Path + RegEx matching pattern

There are three problems with your expression:

An XPath expression must not end with a semicolon (syntax error).
Your regex is malformed: it matches any input that contains none of the characters in the character class (parentheses, digits, curly brackets, the digit 5, spaces, the star, and the plus character).
fn:matches(xs:string?, xs:string) requires two strings as parameters, but you're passing a sequence of strings as the first one.

To call a function for each node selected by an axis step, append the function call as a further step (XPath 2.0 and above only). Inside its arguments, the dot . refers to the context item.

Try something like

./Supplier/matches(., "^(\d{5}\s*)+$")

which will yield true for the third and fifth row. If a value only needs to contain the repeating pattern of five-digit numbers and spaces (rather than consist entirely of it), remove the ^ and $ from the regular expression.
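The difference between the anchored and the unanchored pattern is easy to try out in Python first. The sample values below are invented, since the original rows are not shown here:

```python
import re

# Anchored: the whole value must consist of five-digit groups and whitespace.
FULL = re.compile(r"^(\d{5}\s*)+$")
# Unanchored: the value only has to contain such a group somewhere.
PARTIAL = re.compile(r"(\d{5}\s*)+")

# Hypothetical Supplier values for illustration.
samples = ["12345", "12345 67890", "abc 12345", "1234"]

print([bool(FULL.search(s)) for s in samples])
# [True, True, False, False]
print([bool(PARTIAL.search(s)) for s in samples])
# [True, True, True, False]
```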
