How to Handle Double Quotes in String Before Xpath Evaluation

XPATH query with single quote

Use whatever way there is in your language (I don't know PHP) to escape the quote character inside a quoted string, something like this:

$nodes = $xml->xpath("//item[contains(@catalog,\"Billy's Blogs\")]/title"); 

Encoding XPath Expressions with both single and double quotes

Wow, you all sure are making this complicated. Why not just do this?

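public static string XpathExpression(string value)
{
    // No apostrophe: wrap in single quotes.
    if (!value.Contains("'"))
        return '\'' + value + '\'';

    // No double quote: wrap in double quotes.
    else if (!value.Contains("\""))
        return '"' + value + '"';

    // Both present: split on apostrophes and join the pieces with concat().
    else
        return "concat('" + value.Replace("'", "',\"'\",'") + "')";
}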
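For comparison, here is a minimal Python sketch of the same idea. The name xpath_literal is hypothetical and not part of any library. It assumes XPath 1.0, where a string literal has no escape mechanism, so concat() is the only way to express a value containing both quote types.

```python
def xpath_literal(value: str) -> str:
    """Return `value` as an XPath 1.0 string expression (hypothetical helper)."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    # Both quote types present: single-quote every piece between apostrophes
    # and put each apostrophe back as its own double-quoted literal.
    pieces = ["'" + piece + "'" for piece in value.split("'")]
    return "concat(" + ",\"'\",".join(pieces) + ")"


print(xpath_literal("Billy's Blogs"))   # "Billy's Blogs"
print(xpath_literal("a'b\"c"))          # concat('a',"'",'b"c')
```

The result can be spliced into a larger expression, for example "//item[contains(@catalog," + xpath_literal(name) + ")]/title".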


How to use apostrophe (') in xpath while finding element using webdriver?

Use the xpath as shown below:

driver.findElements(By.xpath("//input[contains(@text,\"WE'd\")]"));

Hope this helps.

PHP DOMXPath works with double quotes fails with single quotes

The style of quotes makes a difference because when a string is enclosed in double-quotes PHP will interpret more escape sequences for special characters - including what you're using for non-breaking space \xC2\xA0, carriage return \r, and newline \n.

When you have these enclosed in single-quotes '\xC2\xA0\r\n', like in your second two queries, PHP treats them as those literal characters - backslash, x, C, 2... etc.


[Image: syntax-highlighted queries, with escape sequences shown in orange]


If your string already has what would be escape sequences in it as literal characters, and there's no way to get that corrected*, you're in the kinda dirty position of replacing them yourself.

This preg_replace_callback() will take care of the sort of sequences in your example, and it's trivial to extend to the rest of the escape sequences supported by double-quotes:

// Known good.
$query1 = "substring-before(//div[@class='sku'],'\xC2\xA0\xC2\xA0\r\n')";

// Known bad.
$query2 = 'substring-before(//div[@class=\'sku\'],\'\xC2\xA0\xC2\xA0\r\n\')';

$query2 = preg_replace_callback(
    '/\\\\(?:[rn]|(?:x[0-9A-Fa-f]{1,2}))/',
    function ($matches) {
        switch (substr($matches[0], 0, 2)) {
            case '\r':
                return "\r";
            case '\n':
                return "\n";
            case '\x':
                // chr(hexdec()) also handles a single hex digit, which hex2bin() rejects.
                return chr(hexdec(substr($matches[0], 2)));
        }
    },
    $query2
);

var_dump($query1 === $query2); // Now equal?

Output:

bool(true)

(*Really, you should get this fixed at the source.)
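The same repair can be sketched in Python, which is only an analogue of the PHP above. The function name unescape_sequences is hypothetical, and the sketch works on bytes because \xC2\xA0 is the two-byte UTF-8 encoding of a non-breaking space rather than two characters.

```python
import re

# Matches a literal backslash followed by r, n, or x plus one or two hex digits.
_ESCAPE = re.compile(rb"\\(?:[rn]|x[0-9A-Fa-f]{1,2})")


def unescape_sequences(raw: bytes) -> bytes:
    """Turn literal \\r, \\n and \\xHH sequences into the bytes they name."""
    def replace(match):
        token = match.group(0)
        if token == rb"\r":
            return b"\r"
        if token == rb"\n":
            return b"\n"
        # token looks like b"\\xC2": parse the hex digits after "\x".
        return bytes([int(token[2:], 16)])

    return _ESCAPE.sub(replace, raw)


# Raw bytes literal: the backslashes below are literal characters, as in the "known bad" query.
bad = rb"substring-before(//div[@class='sku'],'\xC2\xA0\xC2\xA0\r\n')"
good = "substring-before(//div[@class='sku'],'\u00a0\u00a0\r\n')".encode("utf-8")
print(unescape_sequences(bad) == good)  # True
```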

How to properly escape single and double quotes

According to Wikipedia and W3Schools, you should not have ' and " in node content, even though only < and & are said to be strictly illegal. They should be replaced by the corresponding "predefined entity references", which are &apos; and &quot;.

By the way, the Python parsers I use will take care of this transparently: when writing, they are replaced; when reading, they are converted.
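As an illustration, here is a small sketch using the standard library's xml.etree.ElementTree. This is an assumption, since the answer does not say which parsers it means. With this particular module, double quotes are escaped on output only inside attribute values, while quotes in text content are written as-is; on input, both entities are decoded.

```python
import xml.etree.ElementTree as ET

# Build an element whose attribute and text both contain quote characters.
item = ET.Element("item", {"catalog": 'Billy\'s "Blogs"'})
item.text = 'Fat"her\'s son'

# Writing: the serializer inserts &quot; where the attribute value needs it.
serialized = ET.tostring(item, encoding="unicode")
print(serialized)

# Reading: the parser converts entities back, so the round trip is lossless.
parsed = ET.fromstring(serialized)
print(parsed.get("catalog") == item.get("catalog"), parsed.text == item.text)

# Entities in text content are decoded on input as well.
decoded = ET.fromstring("<div>Fat&quot;her&apos;s son</div>").text
print(decoded)  # Fat"her's son
```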

After a second reading of your answer, I tested a few strings containing ' and " in the Python interpreter. And it will escape everything for you!

>>> 'text {0}'.format('blabla "some" bla')
'text blabla "some" bla'
>>> 'ntsnts {0}'.format("ontsi'tns")
"ntsnts ontsi'tns"
>>> 'ntsnts {0}'.format("ontsi'tn' \"ntsis")
'ntsnts ontsi\'tn\' "ntsis'

So we can see that Python escapes things correctly. Could you then copy-paste the error message you get (if any)?

Simultaneously escape double and single quotes in Xpath

The key here is realising that with xml2 you can write back into the parsed html with html-escaped characters. This function will do the trick. It's longer than it needs to be because I've included comments and some type checking / converting logic.

contains_text <- function(node_set, find_this)
{
  # Ensure we have a nodeset
  if(all(class(node_set) == c("xml_document", "xml_node")))
    node_set %<>% xml_children()

  if(class(node_set) != "xml_nodeset")
    stop("contains_text requires an xml_nodeset or xml_document.")

  # Get all leaf nodes
  node_set %<>% xml_nodes(xpath = "//*[not(*)]")

  # HTML escape the target string
  find_this %<>% {gsub("\"", "&quot;", .)}

  # Extract, HTML escape and replace the nodes
  lapply(node_set, function(node) xml_text(node) %<>% {gsub("\"", "&quot;", .)})

  # Now we can define the xpath and extract our target nodes
  xpath <- paste0("//*[contains(text(), \"", find_this, "\")]")
  new_nodes <- html_nodes(node_set, xpath = xpath)

  # Since the underlying xml_document is passed by pointer internally,
  # we should unescape any text to leave it unaltered
  xml_text(node_set) %<>% {gsub("&quot;", "\"", .)}
  return(new_nodes)
}

Now:

library(magrittr)  # provides the %<>% assignment pipe used in contains_text
library(rvest)
library(xml2)

html %>% xml2::read_html() %>% contains_text(target)
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>
html %>% xml2::read_html() %>% contains_text(target) %>% xml_text()
#> [1] "Fat\"her's son"

ADDENDUM

This is an alternative method, which is an implementation of the method suggested by @Alejandro but allows arbitrary targets. It has the merit of leaving the xml document untouched, and is a little faster than the above method, but involves the kind of string parsing that an xml library is supposed to prevent. It works by taking the target, splitting it after each " and ', then enclosing each fragment in the opposite type of quote to the one it contains before pasting them all back together with commas and inserting them into an XPath concatenate function.

library(stringr)

safe_xpath <- function(target)
{
  target %<>%
    str_replace_all("\"", "&quot;&break;") %>%
    str_replace_all("'", "&apo;&break;") %>%
    str_split("&break;") %>%
    unlist()

  safe_pieces    <- grep("(&quot;)|(&apo;)", target, invert = TRUE)
  contain_quotes <- grep("&quot;", target)
  contain_apo    <- grep("&apo;", target)

  if(length(safe_pieces) > 0)
    target[safe_pieces] <- paste0("\"", target[safe_pieces], "\"")

  if(length(contain_quotes) > 0)
  {
    target[contain_quotes] <- paste0("'", target[contain_quotes], "'")
    target[contain_quotes] <- gsub("&quot;", "\"", target[contain_quotes])
  }

  if(length(contain_apo) > 0)
  {
    target[contain_apo] <- paste0("\"", target[contain_apo], "\"")
    target[contain_apo] <- gsub("&apo;", "'", target[contain_apo])
  }

  fragment <- paste0(target, collapse = ",")
  return(paste0("//*[contains(text(),concat(", fragment, "))]"))
}

Now we can generate a valid xpath like this:

safe_xpath(target)
#> [1] "//*[contains(text(),concat('Fat\"',\"her'\",\"s son\"))]"

so that

html %>% xml2::read_html() %>% html_nodes(xpath = safe_xpath(target))
#> {xml_nodeset (1)}
#> [1] <div>Fat"her's son</div>

How to remove single and double quotes from a string

It looks like your original string contained the HTML entity for " (&quot;), so when you attempt to sanitize it, you simply remove the & and ;, leaving the rest of the entity, quot, in the string.

---EDIT---

Probably the easiest way to remove non-alphanumeric characters is to decode the HTML entities with html_entity_decode, then run the result through the regular expression. Since, in this case, nothing is left that needs to be re-encoded, you don't need to call htmlentities afterwards, but it's worth remembering that you had HTML data and now have raw, unencoded data.

Eg:

function string_sanitize($s) {
    $result = preg_replace("/[^a-zA-Z0-9]+/", "", html_entity_decode($s, ENT_QUOTES));
    return $result;
}

Note that ENT_QUOTES flags the function to "...convert both double and single quotes.".
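The decode-then-strip order matters in any language. As a rough Python analogue (not a translation of the PHP internals), the sketch below uses the standard library's html.unescape in place of html_entity_decode. The function name mirrors the PHP one and naive_sanitize is a hypothetical name for the broken variant.

```python
import html
import re

_NON_ALNUM = re.compile(r"[^a-zA-Z0-9]+")


def naive_sanitize(s: str) -> str:
    # Strips & and ; but leaves the entity name "quot" behind.
    return _NON_ALNUM.sub("", s)


def string_sanitize(s: str) -> str:
    # Decode entities first, so &quot; becomes " and is then removed whole.
    return _NON_ALNUM.sub("", html.unescape(s))


sample = "&quot;Hello&quot; it&#039;s me"
print(naive_sanitize(sample))   # quotHelloquotit039sme
print(string_sanitize(sample))  # Helloitsme
```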


