How to Escape Non-Alphanumeric Characters in Nokogiri CSS

Is there a way to escape non-alphanumeric characters in Nokogiri css?

Assuming your HTML looks something like this:

<div id='stuff-morestuff-CHP-1-SECT-2.1'>foo</div>

The string in question, stuff-morestuff-CHP-1-SECT-2.1, is a valid HTML ID, but it isn’t a valid CSS selector — the . character isn’t valid there.

You should be able to escape the . with a slash character, i.e. this is a valid CSS selector:

#stuff-morestuff-CHP-1-SECT-2\.1

Unfortunately this doesn’t seem to work in Nokogiri, there may be a bug in the CSS to XPath translation that it does. (It does work in the browser).

You can get around this by just checking the id attribute directly:

documentFragment.at_css('*[id="stuff-morestuff-CHP-1-SECT-2.1"]')

Even if slash escaping worked, you would probably have to check the id attribute like this if it value started with a digit, which is valid in HTML but cannot be (as far as I can tell) expressed as a CSS selector, even with escaping.

You could also use XPath, which has an id function that you can use here:

documentFragment.xpath("id('stuff-morestuff-CHP-1-SECT-2.1')")

How to scrape URL/text, when the id contains special characters using Nokogiri

I had figured how to solve it. I used XPATH than CSS.

I change this code:

  def marka(css)
@page.css(css).text
end
puts x.marka("a#searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495")

To this:

def marka(css)
@page.xpath(css).text
end
puts x.marka("//*[@id='searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495']")

How to get Nokogiri inner_HTML object to ignore/remove escape sequences

page.at_css("td[custom-attribute='foo']")
.parent
.css('td')
.css('a')
.text # since you need a text, not inner_html
.strip # this will strip a result

String#strip.

Sidenote: css('td a') is likely more efficient than css('td').css('a').

Nokogiri CSS selector with number as class

This question is similar to another recent question. That question asked about ids though, the solution for classes is slightly different.

The problem is that although 0 is valid as an HTML class value, it’s not valid as a CSS class selector as they cannot start with a number.

You can work around this using the [att~=val] attribute selector like this:

pg.css("tr.classA.classB[class~='0']")

This will match all tr elements that are in all the classes classA, classB and 0.

Nokogiri: Parsing Irregular

The "less than" (<) isn't legal HTML, but browsers have a lot of code for figuring out what was meant by the HTML instead of just displaying an error. That's why your invalid HTML sample displays the way you'd want it to in browsers.

So the trick is to make sure Nokogiri does the same work to compensate for bad HTML. Make sure to parse the file as HTML instead of XML:

f = File.open("table.html")
doc = Nokogiri::HTML(f)

This parses your file just fine, but throws away the < 1 g text. Look at how the content of the first 2 TD elements is parsed:

doc.xpath('(//td)[1]/text()').to_s
=> "\n "

doc.xpath('(//td)[2]/text()').to_s
=> "0 %"

Nokogiri threw out your invalid text, but kept parsing the surrounding structure. You can even see the error message from Nokogiri:

doc.errors
=> [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: invalid element name>]
doc.errors[0].line
=> 3

Yup, line 3 is bad.

So it seems like Nokogiri doesn't have the same level of support for parsing invalid HTML as browsers do. I recommend using some other library to pre-process your files. I tried running TagSoup on your sample file and it fixed the < by changing it to < like so:

% java -jar tagsoup-1.1.3.jar foo.html | xmllint --format -
src: foo.html
<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tbody>
<tr>
<th colspan="1" rowspan="1">Total Weight</th>
<td colspan="1" rowspan="1"><1 g</td>
<td colspan="1" rowspan="1" style="text-align: right">0 %</td>
</tr>
<tr>
<td colspan="3" rowspan="1" class="skinny_black_bar"/>
</tr>
</tbody>
</table>
</body>
</html>

CSS attribute selectors: The rules on quotes (, ' or none?)

I’ve written more extensively on the subject here: Unquoted attribute values in HTML and CSS.

I’ve also created a tool to help you answer your question: http://mothereff.in/unquoted-attributes

Unquoted attribute value validator

You can usually omit the quotes as long as the attribute value is alphanumeric (however, there are some exceptions — see the linked article for all the details). Anyhow, I find it to be good practice to add the quotes anyway in case you need them, i.e. a[href^=http://] won’t work, but a[href^="http://"] will.

The article I mentioned links to the appropriate chapters in the CSS spec.



Related Topics



Leave a reply



Submit