Modifying text inside html nodes - nokogiri
What about this code?
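doc.traverse do |x|
  if x.text?
    # Insert "*" after each ".", "!" or "?" that is not already followed by one.
    # The pattern has no capture group, so $1 is not needed in the replacement.
    x.content = x.content.gsub(/(?<=[.!?])(?!\*)/, '*')
  end
end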
The traverse method visits every node in the document, text nodes included, which is roughly what search("*").each does for elements only. For each node you check whether it is a Nokogiri::XML::Text and, if so, change the content as you wished.
Replacing part of the text in a Nokogiri node while preserving markup in contents
It looks like this works pretty well:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
<head>
<title>Title</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div>
<p class="header"><<2>>Header</p>
<p class="paragraph">
<p class="text_style">Lorem ipsum. <<3>> more content. <span class="style">Preserve this.</span> extra text.</p>
</div>
</body>
</html>
EOT
doc.search("//text()[contains(.,'<<')]").each do |node|
  node.replace(node.content.gsub(/<<(\d+)>>/, '<a id="[\1]" />'))
end
Which results in:
puts doc.to_html
# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >> <title>Title</title>
# >> <link href="style.css" rel="stylesheet" type="text/css">
# >> </head>
# >> <body>
# >> <div>
# >> <p class="header"><a id="[2]"></a>Header</p>
# >> <p class="paragraph">
# >> <p class="text_style">Lorem ipsum. <a id="[3]"></a> more content. <span class="style">Preserve this.</span> extra text.</p>
# >> </p>
# >> </div>
# >> </body>
# >> </html>
Nokogiri is adding the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> line, probably because the markup is defined as XML.
The selector "//text()[contains(.,'<<')]" is only looking for text nodes containing '<<'. You might want to make it more specific if false positives are possible. See "XPath: using regex in contains function" for the syntax.
replace is performing the trick. You were trying to modify a Nokogiri::XML::Text node to contain an <a.../>, but it can't, because the < and > must be encoded. Changing the node to a Nokogiri::XML::Element, which is what Nokogiri parses <a id="[2]"> into by default, lets it store the markup as you want.
How do I modify the node content with Nokogiri?
Nokogiri returns a NodeSet from an xpath query (also from search and css). This is an Enumerable object of Nodes.
If you know your element is the only one:
recipename = @page.xpath("//body/h1").first
Or you can loop through the NodeSet with .each if needed:
recipename = @page.xpath("//body/h1")
recipename.each do |node|
  puts node.content
end
Getting only the text with Nokogiri from a certain HTML structure
I would delete the other nodes that are in this section if you're not using the document any further.
nokogiri_object.css("div.line1 *").each(&:remove)
nokogiri_object.at_css("div.line1").text.strip # => "text I need"
Editing Text in a Nokogiri Element or Using Regex
#!/usr/bin/env ruby
require 'nokogiri'
html = <<EOS
<ul>
<li>: blah blah blah</li>
<li>: foo bar baz</li>
</ul>
EOS
doc = Nokogiri::HTML.parse(html)
# Strip the leading ": " from each list item's text node.
doc.xpath('//li/text()').each do |li|
  li.content = li.content.gsub(/^: */, '')
end
puts doc.to_html
# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# => <html><body><ul>
# => <li>blah blah blah</li>
# => <li>foo bar baz</li>
# => </ul></body></html>
How to avoid joining all text from Nodes when scraping
This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:
Get the inner text of all contained Node objects
Which is what we're seeing happen with:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
which can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
The same things apply whether we're working with HTML or XML, since Nokogiri's HTML documents are built from the same Node and NodeSet classes as its XML documents.
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the contents for this Node.