Modifying Text Inside HTML Nodes - Nokogiri

What about this code?

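doc.traverse do |x|
  if x.text?
    # Append "*" after each sentence-ending mark that is not already followed by one.
    # The pattern has no capture group, so the replacement is just the literal "*".
    x.content = x.content.gsub(/(?<=[.!?])(?!\*)/, "*")
  end
end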

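The traverse method walks every node in the document, text nodes included. It is similar to iterating with search("*").each, except that search("*") matches only elements. The text? check confirms that the node is a Nokogiri::XML::Text and, if so, the block changes its content as you wished.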

Replacing part of the text in a Nokogiri node while preserving markup in contents

It looks like this works pretty well:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
<head>
<title>Title</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div>
<p class="header">&lt;&lt;2&gt;&gt;Header</p>
<p class="paragraph">
<p class="text_style">Lorem ipsum. &lt;&lt;3&gt;&gt; more content. <span class="style">Preserve this.</span> extra text.</p>
</div>
</body>
</html>
EOT

# Find text nodes containing "<<" and swap each marker for an <a> element.
doc.search("//text()[contains(.,'<<')]").each do |node|
  node.replace(node.content.gsub(/<<(\d+)>>/, '<a id="[\1]" />'))
end

Which results in:

puts doc.to_html

# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >> <title>Title</title>
# >> <link href="style.css" rel="stylesheet" type="text/css">
# >> </head>
# >> <body>
# >> <div>
# >> <p class="header"><a id="[2]"></a>Header</p>
# >> <p class="paragraph">
# >> <p class="text_style">Lorem ipsum. <a id="[3]"></a> more content. <span class="style">Preserve this.</span> extra text.</p>
# >> </p>
# >> </div>
# >> </body>
# >> </html>

Nokogiri is adding the

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

line, probably because the markup is defined as XML.

The selector "//text()[contains(.,'<<')]" only looks for text nodes containing '<<'. You might want to make it more specific if it could produce false positives. See "XPath: using regex in contains function" for the syntax.

replace is what does the trick. You were trying to modify a Nokogiri::XML::Text node so that it contains an <a.../>, but it can't: inside a text node the < and > must be encoded. Replacing the node lets Nokogiri parse the string as markup, so <a id="[2]"> becomes a Nokogiri::XML::Element and is stored the way you want.

How do I modify the node content with Nokogiri?

Nokogiri returns a NodeSet from an xpath query (and likewise from search and css). A NodeSet is an Enumerable collection of Nodes.

If you know your element is the only one:

recipename = @page.xpath("//body/h1").first

Or you can loop through the NodeSet with .each if needed:

recipename = @page.xpath("//body/h1")
recipename.each do |node|
  puts node.content
end

Getting only the text with Nokogiri for a certain HTML structure

If you're not using the document any further, I would delete the other nodes in this section and then read the text that is left:

nokogiri_object.css("div.line1 *").each(&:remove)
nokogiri_object.at_css("div.line1").text.strip # => "text I need"

Editing Text in a Nokogiri Element or Using Regex

#!/usr/bin/env ruby

require 'nokogiri'

html = <<EOS
<ul>
<li>: blah blah blah</li>
<li>: foo bar baz</li>
</ul>
EOS

doc = Nokogiri::HTML.parse(html)
# Strip the leading ": " from the text node of each list item.
doc.xpath('//li/text()').each do |li|
  li.content = li.content.gsub(/^: */, '')
end
puts doc.to_html

# => <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# => <html><body><ul>
# => <li>blah blah blah</li>
# => <li>foo bar baz</li>
# => </ul></body></html>

How to avoid joining all text from Nodes when scraping

This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).

The NodeSet documentation says text will:

Get the inner text of all contained Node objects

Which is what we're seeing happen with:

doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT

doc.search('p').text # => "foobarbaz"

because:

doc.search('p').class # => Nokogiri::XML::NodeSet

Instead, we want to get each Node and extract its text:

doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"

which can be done using map:

doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]

Ruby allows us to write that more concisely using:

doc.search('p').map(&:text) # => ["foo", "bar", "baz"]

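The same things apply whether we're working with HTML or XML, because Nokogiri uses the same Node and NodeSet classes for both (note the Nokogiri::XML::Element above, even though the document was parsed as HTML).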

A Node has several aliased methods for getting at its embedded text. From the documentation:

#content ⇒ Object

Also known as: text, inner_text

Returns the contents for this Node.


