I Can't Remove Whitespaces from a String Parsed by Nokogiri

strip only removes ASCII whitespace, and the character you've got here is a Unicode non-breaking space (U+00A0).

Removing the character is easy. You can use gsub by providing a regex with the character code:

gsub(/\u00a0/, '')

You could also call

gsub(/[[:space:]]/, '')

to remove all Unicode whitespace. For details, check the Regexp documentation.
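To make the difference concrete, here is a minimal sketch in plain Ruby. The sample string and the unicode_strip helper are made up for illustration; they are not from the question. It compares strip with the two gsub calls above, and adds an anchored variant for the case where you only want to trim the ends:

```ruby
# U+00A0 is the non-breaking space, the character HTML writes as &nbsp;.
NBSP = "\u00a0"

# Hypothetical helper: trim Unicode whitespace from both ends only,
# leaving whitespace inside the string alone.
def unicode_strip(text)
  text.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

raw = "#{NBSP} 1 234#{NBSP}"

p raw.strip == raw             # => true (strip leaves the NBSPs in place)
p raw.gsub(/\u00a0/, "")       # => " 1 234"
p raw.gsub(/[[:space:]]/, "")  # => "1234"
p unicode_strip(raw)           # => "1 234"
```

Note that gsub(/[[:space:]]/, '') also removes the ordinary space inside the value, which may or may not be what you want.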

I can't remove white spaces from nokogiri node content

Two things to try:

If you're checking the population variable, your method never actually stores the result of the substitution in it. gsub returns a new string and leaves the original untouched. Change the last line to:

population << value.gsub(/\s+/, "")

If that still doesn't work, perhaps there is some non-space character that looks like a space in your terminal? Try replacing non-digits instead:

population << value.gsub(/\D/, "")
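Both suggestions can be tried outside Nokogiri. The following is only a sketch: the value string is invented, and digits_only is a hypothetical helper name, since the original method isn't shown here.

```ruby
# Hypothetical helper: keep only the ASCII digits of a scraped number.
def digits_only(value)
  value.gsub(/\D/, "")
end

# Made-up sample: a non-breaking space, a normal space and a newline.
value = "1\u00a0234 567\n"

value.gsub(/\s+/, "")                # result is thrown away; value is unchanged
population = []
population << value.gsub(/\s+/, "")  # stored, but the NBSP survives: \s is ASCII-only
population << digits_only(value)     # everything that is not 0-9 is dropped

p population.last  # => "1234567"
```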

How to remove white space from HTML text

Consider this:

require 'nokogiri'

doc = Nokogiri::HTML('<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>')
doc.search('div').text
# => "Kühlungsborner Straße\n 10\n "
puts doc.search('div').text

# >> Kühlungsborner Straße
# >> 10
# >>

The given HTML doesn't replicate the problem you're having. It's really important to present valid input that duplicates the problem. Moving on....

Don't call text directly on the NodeSet returned by xpath, css or search. You usually won't get what you expect:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<span>foo</span>
<span>bar</span>
</div>
</body>
</html>
EOT

doc.search('span').class # => Nokogiri::XML::NodeSet
doc.search('span') # => [#<Nokogiri::XML::Element:0x3fdb6981bcd8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981b5d0 "foo">]>, #<Nokogiri::XML::Element:0x3fdb6981aab8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981a054 "bar">]>]

doc.search('span').text
# => "foobar"

Note that text returned the concatenated text of all nodes found.

Instead, walk the NodeSet and grab the individual node's text:

doc.search('span').map(&:text)
# => ["foo", "bar"]
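Once you have one string per node, each can be cleaned up on its own. Here is a small sketch that uses a plain array as a stand-in for the result of map(&:text); the array contents and the join_words helper are assumptions for illustration. Be aware that split with no arguments only recognizes ASCII whitespace, so a non-breaking space would be left alone.

```ruby
# Hypothetical helper: split on runs of ASCII whitespace, then rejoin
# the words with single spaces. Leading and trailing runs disappear.
def join_words(text)
  text.split.join(" ")
end

# Stand-in for doc.search('div').map(&:text); these strings are made up.
texts = ["Kühlungsborner Straße\n  10\n  ", "  foo", "bar\t"]

cleaned = texts.map { |t| join_words(t) }
# cleaned == ["Kühlungsborner Straße 10", "foo", "bar"]
puts cleaned
```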

Using nokogiri how do I remove all elements with a certain classname

Should be:

doc.css('a.target').remove
puts doc.at('html').to_s

Rails - strip xml import from whitespace and line break

You could use XSLT to remove all the unnecessary characters.

remove whitespace from xml document using ruby

The following should give you what you are looking for:

string.gsub(/\\n/, '').gsub(/>\s*/, ">").gsub(/\s*</, "<")
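To see what each part of that chain does, here is a sketch with two invented inputs; compact_xml is just a hypothetical wrapper name. The first pattern, /\\n/, matches a literal backslash followed by n (as found in an escaped dump of the string), not a real newline. Real newlines are removed by the two \s* patterns, which also trim whitespace at the edges of text content.

```ruby
# Hypothetical wrapper around the chained gsub calls shown above.
def compact_xml(string)
  string.gsub(/\\n/, "").gsub(/>\s*/, ">").gsub(/\s*</, "<")
end

real_newlines    = "<a>\n  <b> x </b>\n</a>\n"  # actual newline characters
escaped_newlines = '<a>\n<b>x</b>\n</a>'        # single quotes: backslash + n

puts compact_xml(real_newlines)     # prints <a><b>x</b></a>
puts compact_xml(escaped_newlines)  # prints <a><b>x</b></a>
```

Because this is plain string substitution, it does not know about CDATA sections or whitespace-sensitive elements, so treat it as a quick cleanup rather than a general XML normalizer.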

How to remove a node using Nokogiri

1st problem

To remove all the script nodes:

require 'nokogiri'

html = "<div>
This is
<p> very
<script>
some code
</script>
</p>
important.
</div>"

doc = Nokogiri::HTML(html)

doc.xpath("//script").remove

p doc.text
#=> "\n This is\n very\n \n \n important.\n"

Thanks to @theTinMan for his tip (calling remove on one NodeSet instead of each Node).

2nd problem

To remove the unneeded whitespace, you can use:

  • strip to remove whitespace (spaces, tabs, newlines, ...) at the beginning and end of the string
  • gsub to replace each run of multiple whitespace characters with a single space


p doc.text.strip.gsub(/[[:space:]]+/,' ')
#=> "This is very important."
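One caveat, sketched here with an invented string: strip does not treat a non-breaking space (U+00A0) as whitespace, so if the text starts or ends with one, calling strip first leaves a stray space behind. Collapsing first and stripping afterwards avoids that. The collapse_whitespace name is hypothetical.

```ruby
# Hypothetical helper: collapse every run of Unicode whitespace to one
# ASCII space first, then strip the ends.
def collapse_whitespace(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

# Made-up input with a non-breaking space at each end.
text = "\u00a0 This is\n   very   important.\u00a0"

p text.strip.gsub(/[[:space:]]+/, " ")  # => " This is very important. "
p collapse_whitespace(text)             # => "This is very important."
```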

