How do I do a regex search in Nokogiri for text that matches a certain beginning?
Use the XPath function starts-with:
value.xpath('//p[starts-with(@id, "para-")]').each { |x| puts x['id'] }
Using a regex to get a Nokogiri node
There are two problems with your code. First, there is no =~ operator in XPath. The way to test whether text matches a regex is to use the matches function:
//Phase[matches(@text, 'STER')]
Secondly, regex matching is a feature of XPath 2.0, but Nokogiri implements XPath 1.0.
Luckily, you are not actually using any regex features. You are simply checking for a fixed string, which can be done in XPath 1.0 with the contains function:
//Phase[contains(@text, 'STER')]
Editing Text in a Nokogiri Element or Using Regex
#!/usr/bin/env ruby
require 'nokogiri'
html = <<EOS
<ul>
<li>: blah blah blah</li>
<li>: foo bar baz</li>
</ul>
EOS
doc = Nokogiri::HTML.parse(html)
# Strip the leading ": " from the text node of each list item
doc.xpath('//li/text()').each do |li|
  li.content = li.content.gsub(/^: */, '')
end
puts doc.to_html
# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# => <html><body><ul>
# => <li>blah blah blah</li>
# => <li>foo bar baz</li>
# => </ul></body></html>
Regex parse using Nokogiri
To get a better answer you would have to clarify exactly what format the AB, CD and Dollar values take, but here is a solution based on the example given. It uses a regexp group, written with (), to capture the information we're interested in. See the end of this answer for more details.
text = p.css(".some_class").text
# one or more digits followed by a space followed by AB, capture the digits
ab = text.match(/(\d+) AB/).captures[0] # => "12"
# one or more digits followed by a literal +, then a space and CD; capture the digits and the +
cd = text.match(/(\d+\+) CD/).captures[0] # => "4+"
# digits or commas followed by "Dollars"
dollars = text.match(/([\d,]+) Dollars/).captures[0] # => "2,600"
Note that String#match returns nil if there is no match, so if the values might not exist you need a check, e.g.
if match = text.match(/([\d,]+) Dollars/)
  dollars = match.captures[0]
end
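A more compact way to handle the no-match case is String#[] with a regexp and a capture index, which returns the captured text or nil. This is only a sketch; the sample text is an assumption reconstructed from the values in the comments above.

```ruby
# Assumed sample text, pieced together from the example values.
text = "12 AB 4+ CD 2,600 Dollars"

# String#[] with a regexp and a capture index returns that capture,
# or nil when the pattern does not match.
dollars = text[/([\d,]+) Dollars/, 1]
euros   = text[/([\d,]+) Euros/, 1]

# Safe navigation keeps the chain nil when there was no match.
amount = dollars&.delete(',')&.to_i

puts dollars.inspect
puts euros.inspect
puts amount.inspect
```

The safe navigation operator (&.) needs Ruby 2.3 or later.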
Additional explanation of captures
To match the amount of AB we need a pattern, /\d+ AB/, to identify the right part of the text. However, we're really only interested in the numeric part, so we surround it with parentheses so that we can extract it, e.g.
irb(main):027:0> match = text.match(/(\d+) AB/)
=> #<MatchData:0x2ca3440> # the match method returns MatchData if there is a match, nil if not
irb(main):028:0> match.to_s # match.to_s gives us the entire text that matched the pattern
=> "12 AB"
irb(main):029:0> match.captures
=> ["12"]
# match.captures gives us an array of the parts of the pattern that were enclosed in ()
# in our example there is just 1 but there could be multiple
irb(main):030:0> match.captures[0]
=> "12" # the first capture - the bit we want
Take a look at the documentation for MatchData, in particular the captures method for more details.
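When a pattern has several groups, named captures can be easier to read than positional indexes. The sketch below assumes the same sample text as the example values above; the group names ab, cd and dollars are made up for illustration.

```ruby
# Assumed sample text, pieced together from the example values.
text = "12 AB 4+ CD 2,600 Dollars"

# (?<name>...) gives each group a name instead of a position.
pattern = /(?<ab>\d+) AB (?<cd>\d+\+) CD (?<dollars>[\d,]+) Dollars/
match = text.match(pattern)

if match
  puts match[:ab]
  puts match[:cd]
  puts match[:dollars]
  puts match.names.inspect # the group names, in pattern order
end
```

MatchData#[] accepts a symbol or a string for a named group, and match.captures still returns the same values as an array.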
nokogiri select paragraph with text match
You could use Enumerable#find in combination with the regexp match operator =~ to get the desired element content.
html.css(".reviewfold p").find { |e| e.text =~ /On Snow Feel/ }.text
Dealing with special character in Nokogiri / Regex
Unicode character space
You can use:
body.scan(/exhibit\p{Zs}99/i)
From the documentation on the Unicode General Category property:
/\p{Z}/ - 'Separator'
/\p{Zs}/ - 'Separator: Space'
It matches an ordinary space or a non-breaking space, but not a tab or a newline. The string should be encoded in UTF-8.
non-word character
A more permissive regex would be:
body.scan(/exhibit\W99/i)
This allows any character other than a letter, a digit or an underscore between exhibit and 99. It would match a space, a non-breaking space, a tab, a dash, and so on.
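To see how the two patterns differ, the following sketch runs both against a few hand-made strings. The sample strings and the variable names are assumptions for illustration only.

```ruby
# Hand-made samples; each has a different separator between the words.
samples = {
  'plain space'        => "Exhibit 99",
  'non-breaking space' => "Exhibit\u00A099",
  'tab'                => "Exhibit\t99",
  'dash'               => "Exhibit-99"
}

strict     = /exhibit\p{Zs}99/i # space separators only
permissive = /exhibit\W99/i     # any single non-word character

# For each sample, record [matches strict?, matches permissive?]
results = samples.transform_values do |body|
  [body.match?(strict), body.match?(permissive)]
end

results.each do |label, (s, p)|
  puts format('%-20s strict=%-5s permissive=%s', label, s, p)
end
```

Only the plain space and the non-breaking space satisfy the \p{Zs} pattern, while all four samples satisfy the \W pattern.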