How to Do a Regex Search in Nokogiri for Text That Matches a Certain Beginning

How do I do a regex search in Nokogiri for text that matches a certain beginning?

Use the XPath function starts-with. It is a plain prefix test rather than a regex, which is all this case needs:

value.xpath('//p[starts-with(@id, "para-")]').each { |x| puts x['id'] }

Using a regex to get a Nokogiri node

There are two problems with your code: first off, there is no =~ operator in XPath. The way to test whether text matches a regex is using the matches function:

//Phase[matches(@text, 'STER')]

Secondly, regex matching is a feature of XPath 2.0, but Nokogiri implements XPath 1.0.

Luckily, you are not actually using any regex features; you are simply checking for a fixed string, which can be done in XPath 1.0 with the contains function:

//Phase[contains(@text, 'STER')]

Editing Text in a Nokogiri Element or Using Regex

#!/usr/bin/env ruby

require 'nokogiri'

html = <<EOS
<ul>
<li>: blah blah blah</li>
<li>: foo bar baz</li>
</ul>
EOS

doc = Nokogiri::HTML.parse(html)

# Select the text node inside each <li> and strip the leading ": " from it.
doc.xpath('//li/text()').each do |text_node|
  text_node.content = text_node.content.sub(/\A: */, '')
end

puts doc.to_html

# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# => <html><body><ul>
# => <li>blah blah blah</li>
# => <li>foo bar baz</li>
# => </ul></body></html>

Regex parse using Nokogiri

To get a better answer, you would have to clarify exactly what format the AB, CD and Dollar values take, but here is a solution based on the example given. It uses regexp groups, written with parentheses (), to capture the information we're interested in (see the end of this answer for more details).

text = p.css(".some_class").text

# one or more digits followed by a space followed by AB, capture the digits
ab = text.match(/(\d+) AB/).captures[0] # => "12"

# one or more digits followed by a literal + followed by a space and CD, capture the digits and the +
cd = text.match(/(\d+\+) CD/).captures[0] # => "4+"

# digits or commas followed by "Dollars"
dollars = text.match(/([\d,]+) Dollars/).captures[0] # => "2,600"

Note that String#match returns nil when there is no match, so if the values might not exist you need a check, e.g.

if match = text.match(/([\d,]+) Dollars/)
  dollars = match.captures[0]
end
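
A shorter alternative, sketched here with sample text invented for illustration, is String#[] with a regex and a capture index. It returns the captured string, or nil when the pattern does not match, so no separate MatchData check is needed.

```ruby
# Sample text made up for this example, shaped like the values above.
text = "12 AB, 4+ CD, 2,600 Dollars"

# String#[] with a regex and a capture index returns that capture,
# or nil if the regex does not match at all.
dollars = text[/([\d,]+) Dollars/, 1]
euros   = text[/([\d,]+) Euros/, 1]

puts dollars.inspect
puts euros.inspect
```

Here dollars holds the captured amount, while euros is nil because the text contains no "Euros" value.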

Additional explanation of captures

To match the amount of AB we need a pattern such as /\d+ AB/ to identify the right part of the text. However, we're only interested in the numeric part, so we surround it with parentheses so that we can extract it, e.g.

irb(main):027:0> match = text.match(/(\d+) AB/)
=> #<MatchData:0x2ca3440> # the match method returns MatchData if there is a match, nil if not
irb(main):028:0> match.to_s # match.to_s gives us the entire text that matched the pattern
=> "12 AB"
irb(main):029:0> match.captures
=> ["12"]
# match.captures gives us an array of the parts of the pattern that were enclosed in ()
# in our example there is just 1 but there could be multiple
irb(main):030:0> match.captures[0]
=> "12" # the first capture - the bit we want

Take a look at the documentation for MatchData, in particular the captures method for more details.
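
Named groups can make the captures easier to read. The sketch below assumes, purely for illustration, that the three values appear together in the form "12 AB, 4+ CD, 2,600 Dollars"; the real format is not known from the question.

```ruby
# Assumed input format; adjust the pattern to the real text.
text = "12 AB, 4+ CD, 2,600 Dollars"

# (?<name>...) gives each capture group a name.
pattern = /(?<ab>\d+) AB, (?<cd>\d+\+) CD, (?<dollars>[\d,]+) Dollars/

match = text.match(pattern)
if match
  # MatchData#named_captures returns a Hash keyed by group name (as strings).
  values = match.named_captures
  puts values.inspect
end
```

Individual groups are also available as match[:ab], match[:cd] and match[:dollars].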

nokogiri select paragraph with text match

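You could use Enumerable#find in combination with the regexp match operator =~ to get the desired element's content. Note that find returns nil when no paragraph matches, in which case calling .text on the result raises NoMethodError.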

html.css(".reviewfold p").find { |e| e.text =~ /On Snow Feel/ }.text

Dealing with special character in Nokogiri / Regex

Unicode character space

You can use:

body.scan(/exhibit\p{Zs}99/i)

From the documentation about Unicode character’s General Category:

/\p{Z}/ - 'Separator'
/\p{Zs}/ - 'Separator: Space'

It matches an ordinary space or a non-breaking space, but not a tab or a newline. The string should be encoded in UTF-8.
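
To see which separators \p{Zs} accepts, here is a small sketch over a few sample strings (made up for illustration) that differ only in the character between "Exhibit" and "99":

```ruby
# Each sample uses a different separator between "Exhibit" and "99".
samples = {
  'space'   => "Exhibit 99",
  'nbsp'    => "Exhibit\u00A099",
  'tab'     => "Exhibit\t99",
  'newline' => "Exhibit\n99"
}

# Keep only the labels whose string matches the \p{Zs} pattern.
matched = samples.select { |_label, s| s.match?(/exhibit\p{Zs}99/i) }.keys
puts matched.inspect
```

Only the plain space and the non-breaking space are matched; the tab and newline variants are not.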

non-word character

A more permissive regex would be:

body.scan(/exhibit\W99/i)

This allows exactly one character other than an ASCII letter, a digit or an underscore between exhibit and 99. It would match a space, a non-breaking space, a tab, a dash and so on.
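
A quick sketch of what \W does and does not accept, using a sample string invented for the example:

```ruby
# Variants separated by a space, a dash, a tab, an underscore, and nothing.
body = "Exhibit 99, exhibit-99, EXHIBIT\t99, exhibit_99, exhibit99"

# \W matches a single non-word character, so the underscore variant
# (a word character) and the variant with no separator are skipped.
hits = body.scan(/exhibit\W99/i)
puts hits.inspect
```

Because \W is this permissive, it may also pick up separators you did not intend, such as punctuation.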
