Is there a way to escape non-alphanumeric characters in Nokogiri css?
Assuming your HTML looks something like this:
<div id='stuff-morestuff-CHP-1-SECT-2.1'>foo</div>
The string in question, stuff-morestuff-CHP-1-SECT-2.1
, is a valid HTML ID, but it isn’t a valid CSS selector — the .
character isn’t valid there.
You should be able to escape the .
with a slash character, i.e. this is a valid CSS selector:
#stuff-morestuff-CHP-1-SECT-2\.1
Unfortunately this doesn’t seem to work in Nokogiri, there may be a bug in the CSS to XPath translation that it does. (It does work in the browser).
You can get around this by just checking the id
attribute directly:
documentFragment.at_css('*[id="stuff-morestuff-CHP-1-SECT-2.1"]')
Even if slash escaping worked, you would probably have to check the id
attribute like this if it value started with a digit, which is valid in HTML but cannot be (as far as I can tell) expressed as a CSS selector, even with escaping.
You could also use XPath, which has an id
function that you can use here:
documentFragment.xpath("id('stuff-morestuff-CHP-1-SECT-2.1')")
How to scrape URL/text, when the id contains special characters using Nokogiri
I had figured how to solve it. I used XPATH than CSS.
I change this code:
def marka(css)
@page.css(css).text
end
puts x.marka("a#searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495")
To this:
def marka(css)
@page.xpath(css).text
end
puts x.marka("//*[@id='searchByConstraints:form:j_idt491:0:j_idt493:0:j_idt495']")
How to get Nokogiri inner_HTML object to ignore/remove escape sequences
page.at_css("td[custom-attribute='foo']")
.parent
.css('td')
.css('a')
.text # since you need a text, not inner_html
.strip # this will strip a result
String#strip
.
Sidenote: css('td a')
is likely more efficient than css('td').css('a')
.
Nokogiri CSS selector with number as class
This question is similar to another recent question. That question asked about id
s though, the solution for classes is slightly different.
The problem is that although 0
is valid as an HTML class value, it’s not valid as a CSS class selector as they cannot start with a number.
You can work around this using the [att~=val]
attribute selector like this:
pg.css("tr.classA.classB[class~='0']")
This will match all tr
elements that are in all the classes classA
, classB
and 0
.
Nokogiri: Parsing Irregular
The "less than" (<) isn't legal HTML, but browsers have a lot of code for figuring out what was meant by the HTML instead of just displaying an error. That's why your invalid HTML sample displays the way you'd want it to in browsers.
So the trick is to make sure Nokogiri does the same work to compensate for bad HTML. Make sure to parse the file as HTML instead of XML:
f = File.open("table.html")
doc = Nokogiri::HTML(f)
This parses your file just fine, but throws away the < 1 g
text. Look at how the content of the first 2 TD elements is parsed:
doc.xpath('(//td)[1]/text()').to_s
=> "\n "
doc.xpath('(//td)[2]/text()').to_s
=> "0 %"
Nokogiri threw out your invalid text, but kept parsing the surrounding structure. You can even see the error message from Nokogiri:
doc.errors
=> [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: invalid element name>]
doc.errors[0].line
=> 3
Yup, line 3 is bad.
So it seems like Nokogiri doesn't have the same level of support for parsing invalid HTML as browsers do. I recommend using some other library to pre-process your files. I tried running TagSoup on your sample file and it fixed the <
by changing it to <
like so:
% java -jar tagsoup-1.1.3.jar foo.html | xmllint --format -
src: foo.html
<?xml version="1.0" standalone="yes"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tbody>
<tr>
<th colspan="1" rowspan="1">Total Weight</th>
<td colspan="1" rowspan="1"><1 g</td>
<td colspan="1" rowspan="1" style="text-align: right">0 %</td>
</tr>
<tr>
<td colspan="3" rowspan="1" class="skinny_black_bar"/>
</tr>
</tbody>
</table>
</body>
</html>
CSS attribute selectors: The rules on quotes (, ' or none?)
I’ve written more extensively on the subject here: Unquoted attribute values in HTML and CSS.
I’ve also created a tool to help you answer your question: http://mothereff.in/unquoted-attributes
You can usually omit the quotes as long as the attribute value is alphanumeric (however, there are some exceptions — see the linked article for all the details). Anyhow, I find it to be good practice to add the quotes anyway in case you need them, i.e. a[href^=http://]
won’t work, but a[href^="http://"]
will.
The article I mentioned links to the appropriate chapters in the CSS spec.
Related Topics
How to Increase The Width of The Tooltip in Bootstrap-Vue
CSS Column-Count Not Respected
Path-Relative Style Sheet Import Vulnerabilities
How to Adjust Bootstrap's Container Div to 100Px Off The Left Viewport Edge
Aggressive Caching: Do All Browsers Support Url Parameter for Updating
How to Place a Badge in Lower Right Corner of a "Media" in Bootstrap
CSS Pseudo Element (Triangle Outside The Tr) Position Misaligned When Scrollbar Appears
Media Queries Running Weird Because of Non-Integer Width
How to Add Linear-Gradient Color to Slider
Django 1.8 Static Files Doesnt Work
Chrome Print Preview Doesn't Load @Media Only Print Font-Face
Ie 11 Ignores Min-Width When Using Flex Width
Universal CSS Selector to Match Any and All HTML Data-* Attributes
How to Create a Row of Justified Elements with Fluid Spacing Using CSS
Why Does a Rect Require Width and Height Attribute in Firefox
Mix-Blend-Mode Doesn't Work on Chrome
-Webkit-Text-Fill-Color: Transparent; Not Working in Safari 7.1.7