How to Make Nokogiri Not Convert &nbsp; to Space

How to make Nokogiri not convert &nbsp; to space

I encountered a similar situation, and what I came up with was a bit of a hack, but it seems to work well.

nbsp = Nokogiri::HTML("&nbsp;").text
text.gsub(nbsp, " ")

In my case, I wanted the nbsp to be a regular space. I think in your case, you want them turned back into a literal "&nbsp;", so you could do something like:

nbsp = Nokogiri::HTML("&nbsp;").text
html.gsub(nbsp, "&nbsp;")

How to deal with &nbsp; when using Nokogiri

First, don't use search unless you want a NodeSet returned. A NodeSet acts like an array of Nodes, so you have to be prepared to iterate over them, or you can get some really weird results.

Instead, start with something like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="total_count">
    <b>Returned:</b>&nbsp;97&nbsp;results
</div>
EOT

doc.at('div').text.scan(/\d+/) # => ["97"]
doc.at('div').text[/\d+/] # => "97"

at returns the first node matching the selector. In this case it's the <div>. I can use class selectors too:

doc.at('.total_count').text[/\d+/] # => "97"

Next, instead of trying to use gsub to remove what you don't want, use a regular expression to match what you DO want. I repeatedly see code that gets that concept wrong, so make that a mantra. When using a regular expression, if you're trying to find or capture something, use a match. If you're removing or changing stuff, use sub or gsub. Very, very occasionally you'll have to mix the two, but it should be a rare exception.
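
Here is a small pure-Ruby sketch of that difference. The "(page 2)" suffix is an invented detail, added only to show how the remove-everything-else approach breaks as soon as the text contains something unexpected.

```ruby
# Text as Nokogiri would return it, with no-break spaces (U+00A0) around the number.
text = "\n    Returned:\u00A097\u00A0results (page 2)\n"

matched  = text[/\d+/]                           # first run of digits
anchored = text[/Returned:[[:space:]]*(\d+)/, 1] # digits right after the label
stripped = text.gsub(/\D/, "")                   # delete everything that is not a digit

p matched
p anchored
p stripped
```

The two matches return "97", while the gsub version glues the unrelated "2" onto the result and returns "972".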

The current version of Nokogiri (1.6.0) using libxml (2.8.0), on a current version of Ruby (2.0.0) returns the <div> text node:

doc.at('div').text # => "\n    Returned:\u00A097\u00A0results\n"

There is no 4, so if you are seeing anything different, you need to upgrade Ruby, Nokogiri, and maybe even your libxml2.

You can check the version information using nokogiri -v at the command-line. You should see something like:


# Nokogiri (1.6.0)
    ---
    warnings: []
    nokogiri: 1.6.0
    ruby:
      version: 2.0.0
      platform: x86_64-darwin12.4.0
      description: ruby 2.0.0p247 (2013-06-27 revision 41674) [x86_64-darwin12.4.0]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: /Users/tinman/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxml2/2.8.0
      libxslt_path: /Users/tinman/.rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/nokogiri-1.6.0/ports/i686-apple-darwin11/libxslt/1.1.26
      compiled: 2.8.0
      loaded: 2.8.0

Unable to strip a stubborn space from a Ruby string (Nokogiri is involved)

I tried to reproduce your error and created this example:

require 'nokogiri'

html = Nokogiri::HTML(<<-html
<td width='400' valign=top>
  <b><u>Jenny ID:</u>&nbsp;8675309</b><br />
        Name of Place<br />
        Street Address<br />
        City, State, Zip<br />
        Contact: Jenny Jenny<br />
        Phone: 867-5309<br />
        Fax: 
</td>
html
)

el = html.css('b').first
txt = el.content.split(':').last
puts txt    # ' 8675309'
p txt         #"\u00A08675309"
p txt.strip #"\u00A08675309"

The leading character is not a regular space but \u00A0 (the Unicode character 'NO-BREAK SPACE', U+00A0). strip does not remove it, because it only strips ASCII whitespace.

If you remove the no-break space explicitly, you get the result you want. If you replace \u00A0 with ' ' (a normal space) first, strip can then remove the leading space without touching spaces inside the string.

Code:

p txt.gsub("\u00A0", ' ').strip   #-> "8675309"

Alternatively, you can use (thanks to mu is too short):

p txt.gsub(/\p{Space}/, ' ').strip

This requires the code to be UTF-8 encoded. Without that, you may get an Encoding::CompatibilityError.
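
If you only want to trim the ends and keep any no-break spaces inside the string, a variation on the same idea is to anchor the pattern. This is a sketch, not part of the original answer; unicode_strip is a hypothetical helper name, and it assumes a UTF-8 string.

```ruby
# Hypothetical helper: like String#strip, but also trims Unicode whitespace
# such as U+00A0 from both ends. Whitespace inside the string is left alone.
def unicode_strip(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

txt = "\u00A08675309"

plain   = txt.strip                                      # no-break space survives
trimmed = unicode_strip(txt)                             # no-break space removed
inner   = unicode_strip("\u00A0 Jenny\u00A0Jenny \u00A0")  # inner U+00A0 kept

p plain
p trimmed
p inner
```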

Using Nokogiri, how to convert html to text respecting block elements (ensuring they result in line breaks)

You can use #before and #after to add newlines:

doc.search('p,div,br').each{ |e| e.after "\n" }

How to unescape HTML in Nokogiri Ruby, so & remains & and not &amp;

Use content instead of inner_html to get the content as plain text instead of (X)HTML.

irb(main):011:0> doc.at('head/title').content
=> "Foo & Bar"