How to Create a Nokogiri Case Insensitive Xpath Selector

How can I create a nokogiri case insensitive Xpath selector?

Wrapped for legibility:

puts page.parser.xpath("
  //meta[
    translate(
      @name,
      'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
      'abcdefghijklmnopqrstuvwxyz'
    ) = 'keywords'
  ]
").to_html

There is no "to lower case" function in XPath 1.0, so you have to use translate() for this kind of thing. Add accented letters as necessary.
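To make the "add accented letters" advice concrete, here is a minimal sketch of a helper that builds the translate() call as a string. The name xpath_downcase and its keyword arguments are hypothetical (they are not part of Nokogiri), and the umlauts are only an arbitrary illustration; the resulting string would be passed to xpath in the same way as the query above.

```ruby
# Hypothetical helper (not part of Nokogiri): wraps an XPath 1.0 expression
# in translate() so that it is compared in lowercase.
def xpath_downcase(expr, extra_upper: '', extra_lower: '')
  # translate() maps characters by position, so both lists must line up.
  unless extra_upper.length == extra_lower.length
    raise ArgumentError, 'extra_upper and extra_lower must have the same length'
  end

  upper = ('A'..'Z').to_a.join + extra_upper
  lower = ('a'..'z').to_a.join + extra_lower
  "translate(#{expr}, '#{upper}', '#{lower}')"
end

# Example: also fold a few German umlauts.
name_test = xpath_downcase('@name', extra_upper: 'ÄÖÜ', extra_lower: 'äöü')
puts "//meta[#{name_test} = 'keywords']"
```

Keeping the two letter lists in one place makes it harder for them to drift out of alignment, which would silently map characters to the wrong replacements.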

How do I make my Nokogiri :contains case insensitive?

As far as I know, this is not possible with CSS selector rules. XPath 2.0 would be able to match text case-insensitively, either by transforming the text content with upper-case() or by using matches() with 'i' as the third parameter instead of contains(), which matches against a case-insensitive regular expression. Nokogiri internally transforms CSS selectors into an XPath query, so your example becomes //a[contains(., "MY TEXT")]. However, Nokogiri's XML features are based on libxml2 (MRI Ruby) or javax.xml.xpath (JRuby), neither of which supports XPath 2.0.

If XPath 2.0 were supported, you could just replace the CSS selector with this XPath query:

//a[contains(upper-case(.), "MY TEXT")]

But you can just implement the text comparison directly in Ruby like this:

a_elt = doc.xpath('//a').detect { |node| /MY TEXT/i === node.text }

How can I create a nokogiri case insensitive text * search?

The lower-case() XPath function is not available (it belongs to XPath 2.0), but you can use the XPath 1.0 translate() function to convert your text to lowercase, e.g. for the English alphabet:

translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')

I couldn't seem to use this in combination with the *= operator, but you can use contains() to do a substring search instead (with the search term written in lowercase), making the full thing:

doc.search("//*[contains(translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'philip morris')]")
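If the search term comes from user input, it has to be lowercased too and quoted safely. The following is a minimal sketch of building that query as a string. The helper names xpath_literal and insensitive_contains_query are hypothetical, and the sketch assumes only ASCII letters need folding. XPath 1.0 string literals have no escape character, so a term containing both quote styles is assembled with concat().

```ruby
# Hypothetical helper: quote an arbitrary Ruby string as an XPath 1.0 literal.
def xpath_literal(str)
  return "'#{str}'" unless str.include?("'")
  return "\"#{str}\"" unless str.include?('"')

  # Both quote styles are present: split on single quotes and re-join the
  # pieces with a double-quoted single quote inside concat().
  parts = str.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q{, "'", })})"
end

# Hypothetical helper: build the case-insensitive substring query shown above.
def insensitive_contains_query(needle)
  upper = ('A'..'Z').to_a.join
  lower = ('a'..'z').to_a.join
  # Lowercase only ASCII letters, mirroring what translate() does to the text.
  lowered = needle.tr('A-Z', 'a-z')
  "//*[contains(translate(text(), '#{upper}', '#{lower}'), #{xpath_literal(lowered)})]"
end

puts insensitive_contains_query('Philip Morris')
```

The resulting string can be handed to doc.search in place of the hand-written query.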

How do I write a CSS selector that looks for an element starting with text in a case-insensitive way?

Summary

It's ugly. You're better off just using Ruby:

doc.css('select#select_id > option').select{ |opt| opt.text =~ /^ABC/i }

Details

Nokogiri uses libxml2, which uses XPath to search XML and HTML documents. Nokogiri transforms CSS-like expressions into XPath. For example, for your CSS-like selector, this is what Nokogiri actually searches for:

Nokogiri::CSS.xpath_for("#select_id option:starts-with('ABC')")
#=> ["//*[@id = 'select_id']//option[starts-with(., 'ABC')]"]

The expression you wrote is not actually CSS. There is no :starts-with() pseudo-class in CSS, not even proposed in Selectors 4. What does exist is the starts-with() function in XPath; Nokogiri (somewhat surprisingly) lets you mix XPath functions into your CSS and carries them over to the XPath it uses internally.

The libxml2 library is limited to XPath 1.0, and in XPath 1.0 case-insensitive searches are done by translating all characters to lowercase. The XPath expression you'd want is thus:

//select[@id='select_id']/option[starts-with(translate(.,'ABC','abc'),'abc')]

(Assuming you only care about those characters!)

I'm not sure that you CAN write CSS+XPath in a way that Nokogiri would produce that expression. You'd need to use the xpath method and feed it that query.
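As a rough sketch of building that query for an arbitrary prefix, the helper below translates only the letters that actually occur in the prefix, as in the 'ABC' example. The name insensitive_starts_with_xpath is hypothetical, and the sketch assumes the prefix contains no single quotes and that only ASCII letters matter.

```ruby
# Hypothetical helper: build a starts-with() test that ignores case for the
# ASCII letters appearing in `prefix`.
# Assumption: `prefix` contains no single quotes.
def insensitive_starts_with_xpath(prefix)
  lowered = prefix.tr('A-Z', 'a-z')
  # Only the distinct letters of the prefix need to be translated.
  lower = lowered.scan(/[a-z]/).uniq.join
  upper = lower.tr('a-z', 'A-Z')
  "starts-with(translate(., '#{upper}', '#{lower}'), '#{lowered}')"
end

query = "//select[@id='select_id']/option[#{insensitive_starts_with_xpath('ABC')}]"
puts query
```

The string in query would then be passed to doc.xpath.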

Finally, you can create your own custom CSS pseudo-classes and implement them in Ruby. For example:

class MySearch
  # Keep only the nodes whose text starts with str, ignoring case.
  def insensitive_starts_with(nodes, str)
    nodes.find_all { |n| n.text =~ /^#{Regexp.escape(str)}/i }
  end
end

doc.css( "select#select_id > option:insensitive_starts_with('ABC')", MySearch.new )

...but all this gives you is re-usability of your search code.

How can I make all XML tags lowercase in Nokogiri?

If you want to transform your XML document by downcasing all tag names, here's one way to do it:

parsed = Nokogiri::XML.parse(xml_content)
parsed.traverse do |node|
  node.name = node.name.downcase if node.kind_of?(Nokogiri::XML::Element)
end

How to match a case insensitive value with XPath

Scrapy Selectors are built on the libxml2 library, which, AFAIK, doesn't support XPath 2.0. libxslt, at least, certainly does not.

You can use XPath 1.0 translate() to solve this. In general it will look like:

translate(yourString,
          'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
          'abcdefghijklmnopqrstuvwxyz')

