How to make Nokogiri not convert &nbsp; to a space
I encountered a similar situation, and what I came up with is a bit of a hack, but it seems to work well.
nbsp = Nokogiri::HTML("&nbsp;").text
text.gsub(nbsp, " ")
In my case, I wanted the nbsp to be a regular space. I think in your case, you want them turned back into "&nbsp;", so you could do something like:
nbsp = Nokogiri::HTML("&nbsp;").text
html.gsub(nbsp, "&nbsp;")
How to deal with &nbsp; when using Nokogiri
First, don't use search unless you want a NodeSet returned. A NodeSet acts like an array of Nodes, so you have to be prepared to iterate over it, or you can get some really weird results.
Instead, start with something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="total_count">
<b>Returned:</b>&nbsp;97&nbsp;results
</div>
EOT
doc.at('div').text.scan(/\d+/) # => ["97"]
doc.at('div').text[/\d+/] # => "97"
at returns the first node matching the selector. In this case it's the <div>. I can use class selectors too:
doc.at('.total_count').text[/\d+/] # => "97"
Next, instead of trying to use gsub to remove what you don't want, use a regular expression to match what you DO want. I repeatedly see code that gets that concept wrong, so make that a mantra. When using a regular expression, if you're trying to find or capture something, use a match. If you're removing or changing stuff, use sub or gsub. Very occasionally you'll have to mix the two, but it should be a rare exception.
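The difference between matching and substituting can be shown in plain Ruby. This is only a sketch; the sample string is a stand-in for text extracted from a page, with no-break spaces written as \u00A0.

```ruby
# Sample string containing no-break spaces (U+00A0).
text = "\n Returned:\u00A097\u00A0results\n"

# Finding or capturing something: use a match.
count = text[/\d+/]

# Removing or changing something: use sub or gsub.
normalized = text.gsub("\u00A0", ' ').strip

p count
p normalized
```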
The current version of Nokogiri (1.6.0) using libxml (2.8.0), on a current version of Ruby (2.0.0) returns the <div> text node:
doc.at('div').text # => "\n Returned:\u00A097\u00A0results\n"
There is no 4, so if you are seeing anything different then you need to upgrade Ruby, Nokogiri and maybe even your libXML2.
You can check the version information using nokogiri -v at the command-line. You should see something like:
# Nokogiri (1.6.0)
---
warnings: []
nokogiri: 1.6.0
ruby:
  version: 2.0.0
  platform: x86_64-darwin12.4.0
  description: ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: /Users/tinman/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxml2/2.8.0
  libxslt_path: /Users/tinman/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxslt/1.1.26
  compiled: 2.8.0
  loaded: 2.8.0
Unable to strip a stubborn space from a Ruby string (Nokogiri is involved)
I tried to reproduce your error and created this example:
require 'nokogiri'
html = Nokogiri::HTML(<<-html
<td width='400' valign=top>
<b><u>Jenny ID:</u>&nbsp;8675309</b><br />
Name of Place<br />
Street Address<br />
City, State, Zip<br />
Contact: Jenny Jenny<br />
Phone: 867-5309<br />
Fax:
</td>
html
)
el = html.css('b').first
txt = el.content.split(':').last
puts txt    # ' 8675309'
p txt       # "\u00A08675309"
p txt.strip # "\u00A08675309"
The leading character is not a space but \u00A0 (the Unicode character 'NO-BREAK SPACE', U+00A0), and strip does not remove it.
If you remove the no-break space explicitly, you get the result you want. If you replace \u00A0 with ' ' (a normal space), strip can then remove it from the ends of the string while leaving spaces inside the string alone.
Code:
p txt.gsub("\u00A0", ' ').strip #-> "8675309"
Alternatively, you can use (thanks to mu is too short):
p txt.gsub(/\p{Space}/, ' ').strip
This requires UTF-8 encoded source code. Without it you may get an Encoding::CompatibilityError.
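If you only want to trim the ends and leave the inside of the string alone, a single anchored substitution is another option. This is a sketch in plain Ruby, not part of the original answer; strip_unicode is a hypothetical helper name, and it relies on the POSIX bracket class [[:space:]] matching U+00A0 in UTF-8 strings.

```ruby
# Hypothetical helper: strip Unicode whitespace (including U+00A0) from both
# ends of a string while leaving whitespace inside the string untouched.
def strip_unicode(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

txt = "\u00A08675309"

p txt.strip            # String#strip leaves the no-break space in place
p strip_unicode(txt)
p strip_unicode("\u00A0Jenny\u00A0Jenny\u00A0")
```

Unlike the gsub-then-strip approach, this keeps any no-break spaces that sit between words.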
Using Nokogiri, how to convert html to text respecting block elements (ensuring they result in line breaks)
You can use #before and #after to add newlines:
doc.search('p,div,br').each{ |e| e.after "\n" }
How to unescape HTML in Nokogiri Ruby, so & remains & and not &amp;
Use content instead of inner_html to get the content as plain text instead of (X)HTML.
irb(main):011:0> doc.at('head/title').content
=> "Foo & Bar"