Nokogiri Text Node Contents

You want only the text?

doc.search('//text()').map(&:text)

Maybe you don't want all the whitespace and noise. If you want only the text nodes containing a word character,

doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}

Edit: It appears you only wanted the text content of a single node:

some_node.at_xpath(".//whatever").text

Search for text nodes in Nokogiri

I have not used Nokogiri, but in standard XPath, you should be able to just use the union operator:

doc.xpath('.//text() | text()')

How to create text.../text node in Nokogiri?

From the docs:

The builder works by taking advantage of method_missing. Unfortunately some methods are defined in ruby that are difficult or dangerous to remove. You may want to create tags with the name “type”, “class”, and “id” for example. In that case, you can use an underscore to disambiguate your tag name from the method call.

Appending an underscore also works for “text”, i.e. use text_ instead:

builder = Nokogiri::XML::Builder.new(:encoding => 'UTF-8') do |xml|
  xml.job {
    xml.text_ {
      xml.cdata 'foo bar baz'
    }
  }
end

puts builder.to_xml

Output:

<?xml version="1.0" encoding="UTF-8"?>
<job>
<text><![CDATA[foo bar baz]]></text>
</job>

Replacing part of the text in a Nokogiri node while preserving markup in contents

It looks like this works pretty well:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
<head>
<title>Title</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div>
<p class="header">&lt;&lt;2&gt;&gt;Header</p>
<p class="paragraph">
<p class="text_style">Lorem ipsum. &lt;&lt;3&gt;&gt; more content. <span class="style">Preserve this.</span> extra text.</p>
</div>
</body>
</html>
EOT

doc.search("//text()[contains(.,'<<')]").each do |node|
node.replace(node.content.gsub(/<<(\d+)>>/, '<a id="[\1]" />'))
end

Which results in:

puts doc.to_html

# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >> <title>Title</title>
# >> <link href="style.css" rel="stylesheet" type="text/css">
# >> </head>
# >> <body>
# >> <div>
# >> <p class="header"><a id="[2]"></a>Header</p>
# >> <p class="paragraph">
# >> <p class="text_style">Lorem ipsum. <a id="[3]"></a> more content. <span class="style">Preserve this.</span> extra text.</p>
# >> </p>
# >> </div>
# >> </body>
# >> </html>

Nokogiri is adding the

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

line, probably because the markup is defined as XML.

The selector "//text()[contains(.,'<<')]" only looks for text nodes containing '<<'. You might want to make it more specific if false positives are possible. See "XPath: using regex in contains function" for the syntax.

replace does the trick. You were trying to modify a Nokogiri::XML::Text node to contain an <a.../>, but it can't; in a text node the < and > must be encoded. Changing the node to a Nokogiri::XML::Element, which is what Nokogiri defaults <a id="[2]"> to, lets it store the markup as you want.

Get text directly inside a tag in Nokogiri

To get all the direct children with text, but not any further sub-children, you can use XPath like so:

doc.xpath('//dt/text()')

Or if you wish to use search:

doc.search('dt').xpath('text()')

Nokogiri: Handling text nodes

In this case, call the text method on the elements directly.

xml  = Nokogiri::XML(File.open('test.xml'))
id   = xml.at_css('TestCase ID').text
code = xml.at_css('TestCase Code').text

How to use Nokogiri to get the full HTML without any text content

NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.

If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:

require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"

# Parse HTML
doc = Nokogiri::HTML.parse(html)

puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"

# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }

puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

How to get node text without children?

XPath includes the text() node test for selecting text nodes, so you could do:

page.xpath('//p[@class="parent"]/text()')

Using XPath to select HTML classes can become quite tricky if the element in question could belong to more than one class, so this might not be ideal.

Fortunately Nokogiri adds the text() selector to CSS, so you can use:

page.css('p.parent > text()')

to get the text nodes that are direct children of p.parent. This will also return some nodes that are whitespace only, so you may have to filter them out.


