Encoding XPath Expressions with Both Single and Double Quotes

Wow, you all sure are making this complicated. Why not just do this?

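public static string XpathExpression(string value)
{
    // No single quote in the value: wrap it in single quotes.
    if (!value.Contains("'"))
        return '\'' + value + '\'';

    // No double quote in the value: wrap it in double quotes.
    else if (!value.Contains("\""))
        return '"' + value + '"';

    // Both kinds present: cut at each single quote and rejoin with concat().
    else
        return "concat('" + value.Replace("'", "',\"'\",'") + "')";
}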

Simultaneously escape double and single quotes in XPath

The key here is realising that with xml2 you can write back into the parsed html with html-escaped characters. This function will do the trick. It's longer than it needs to be because I've included comments and some type checking / converting logic.

contains_text <- function(node_set, find_this)
{
  # Ensure we have a nodeset
  if(all(class(node_set) == c("xml_document", "xml_node")))
    node_set %<>% xml_children()

  if(class(node_set) != "xml_nodeset")
    stop("contains_text requires an xml_nodeset or xml_document.")

  # Get all leaf nodes
  node_set %<>% xml_nodes(xpath = "//*[not(*)]")

  # HTML escape the target string
  find_this %<>% {gsub("\"", "&quot;", .)}

  # Extract, HTML escape and replace the nodes
  lapply(node_set, function(node) xml_text(node) %<>% {gsub("\"", "&quot;", .)})

  # Now we can define the xpath and extract our target nodes
  xpath <- paste0("//*[contains(text(), \"", find_this, "\")]")
  new_nodes <- html_nodes(node_set, xpath = xpath)

  # Since the underlying xml_document is passed by pointer internally,
  # we should unescape any text to leave it unaltered
  xml_text(node_set) %<>% {gsub("&quot;", "\"", .)}
  return(new_nodes)
}

Now:

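library(rvest)
library(xml2)
library(magrittr)  # provides %<>%, which contains_text uses

# `html` and `target` are defined in the question (not shown here);
# judging by the output below, target is the string Fat"her's son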

html %>% xml2::read_html() %>% contains_text(target)
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
html %>% xml2::read_html() %>% contains_text(target) %>% xml_text()
#> [1] "Fat\"her's son"

ADDENDUM

This is an alternative method, which is an implementation of the method suggested by @Alejandro but allows arbitrary targets. It has the merit of leaving the xml document untouched, and is a little faster than the above method, but involves the kind of string parsing that an xml library is supposed to prevent. It works by taking the target, splitting it after each " and ', then enclosing each fragment in the opposite type of quote to the one it contains before pasting them all back together with commas and inserting them into an XPath concatenate function.

library(stringr)

safe_xpath <- function(target)
{
  target %<>%
    str_replace_all("\"", "&quot;&break;") %>%
    str_replace_all("'", "&apo;&break;") %>%
    str_split("&break;") %>%
    unlist()

  safe_pieces    <- grep("(&quot;)|(&apo;)", target, invert = TRUE)
  contain_quotes <- grep("&quot;", target)
  contain_apo    <- grep("&apo;", target)

  if(length(safe_pieces) > 0)
    target[safe_pieces] <- paste0("\"", target[safe_pieces], "\"")

  if(length(contain_quotes) > 0)
  {
    target[contain_quotes] <- paste0("'", target[contain_quotes], "'")
    target[contain_quotes] <- gsub("&quot;", "\"", target[contain_quotes])
  }

  if(length(contain_apo) > 0)
  {
    target[contain_apo] <- paste0("\"", target[contain_apo], "\"")
    target[contain_apo] <- gsub("&apo;", "'", target[contain_apo])
  }

  fragment <- paste0(target, collapse = ",")
  return(paste0("//*[contains(text(),concat(", fragment, "))]"))
}

Now we can generate a valid xpath like this:

safe_xpath(target)
#> [1] "//*[contains(text(),concat('Fat\"',\"her'\",\"s son\"))]"

so that

html %>% xml2::read_html() %>% html_nodes(xpath = safe_xpath(target))
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
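
For readers who are not working in R, the split-and-wrap idea from the addendum can be sketched in JavaScript. This is only an illustration under stated assumptions: the function name safeXPathLiteral is hypothetical, and unlike safe_xpath it returns just the string expression (not the whole //*[contains(...)] query) and falls back to a plain literal when there is only one fragment, since XPath's concat() needs at least two arguments.

```javascript
// Hypothetical helper (not from the answers above): builds an XPath 1.0
// string expression for an arbitrary target string.
function safeXPathLiteral(target) {
  // Cut the target after each quote character, so every fragment
  // contains at most one quote, and only as its last character.
  const pieces = target.match(/[^"']*["']|[^"']+/g);

  // An empty target has no fragments; return an empty string literal.
  if (pieces === null) return '""';

  // Wrap each fragment in the kind of quote it does not contain.
  const quoted = pieces.map(p =>
    p.endsWith('"') ? "'" + p + "'" : '"' + p + '"'
  );

  // concat() needs two or more arguments; one fragment is used as-is.
  return quoted.length === 1 ? quoted[0] : 'concat(' + quoted.join(',') + ')';
}

console.log(safeXPathLiteral('Fat"her\'s son'));
// prints: concat('Fat"',"her'","s son")
```

The result for the example target matches the concat(...) argument list that safe_xpath produces above.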

XPath attribute quoting in JavaScript

\ is not an escape character in XPath string literals. (If it was, you could just backslash-escape one of the quotes, and never have to worry about concat!) "\" is a complete string in itself, which is then followed by 'hi..., which doesn't make sense.

So there should be no backslashes in your output, it should look something like:

concat('"', "'hi'", '"')

I suggest:

function xpathStringLiteral(s) {
    if (s.indexOf('"')===-1)
        return '"'+s+'"';
    if (s.indexOf("'")===-1)
        return "'"+s+"'";
    return 'concat("'+s.replace(/"/g, '",\'"\',"')+'")';
}

It's not quite as efficient as it might be (it'll include leading/trailing empty string segments if the first/last character is a double-quote), but that's unlikely to matter.
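
If the empty segments do bother you, one possible variant splits on the double quotes and drops the empty pieces before building the concat call. This is a sketch rather than a tested replacement; the name xpathStringLiteralCompact is made up for illustration.

```javascript
// Hypothetical variant of xpathStringLiteral that skips empty segments.
function xpathStringLiteralCompact(s) {
  if (!s.includes('"')) return '"' + s + '"';
  if (!s.includes("'")) return "'" + s + "'";

  // Split on double quotes, keeping each double quote as its own piece
  // (the capturing group makes split() retain the separators), then
  // drop the empty strings produced by leading/trailing/adjacent quotes.
  const parts = s.split(/(")/).filter(p => p !== '');

  // A lone double quote is wrapped in single quotes; everything else
  // contains no double quote, so it is wrapped in double quotes.
  const args = parts.map(p => (p === '"' ? "'\"'" : '"' + p + '"'));

  return 'concat(' + args.join(',') + ')';
}

console.log(xpathStringLiteralCompact('"\'hi\'"'));
// prints: concat('"',"'hi'",'"')
```

For the same input, the simpler function above yields concat("",'"',"'hi'",'"',""), which is equivalent but carries two empty arguments.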

(Do you really mean let in the above? When this was written it was a non-standard, Mozilla-only language feature and one would typically use var; since ES2015, let is standard JavaScript.)

How to write xpath for this particular webelement which has a double quotes?

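You can escape the double quotes with a backslash in the surrounding string literal (the backslash is handled by the host language, not by XPath):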

//*[@ng-click=\"navigateToNewCustomer('New Customer')\"].click();

Using quotes in XPath

Presumably this is all inside a string itself, so would this work?

"v:MapLink[@Entity=\"TOM'S RESTAURANT\"]"

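To see why this works, note that the backslashes are consumed by the host language's string parser, so the XPath engine only ever receives plain double quotes. A small JavaScript check, assuming the expression is written as a double-quoted string literal as in the line above (the variable name expr is just for illustration):

```javascript
// The \" sequences are JavaScript escapes; they never reach XPath.
const expr = "v:MapLink[@Entity=\"TOM'S RESTAURANT\"]";

console.log(expr);
// prints: v:MapLink[@Entity="TOM'S RESTAURANT"]

console.log(expr.includes("\\"));
// prints: false
```

This only helps when the value contains one kind of quote. A value with both single and double quotes still needs concat, because XPath 1.0 string literals have no escape character.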
