Nokogiri text node contents
You want only the text?
doc.search('//text()').map(&:text)
Maybe you don't want all the whitespace and noise. To keep only the text nodes that contain a word character:
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
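Edit: It appears you only wanted the text content of a single node. Note the leading dot, which keeps the search relative to some_node; a bare // would search the whole document:
some_node.at_xpath('.//whatever').text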
Search for text nodes in Nokogiri
I have not used Nokogiri, but in standard XPath, you should be able to just use the union operator:
doc.xpath('.//text() | text()')
How to create a <text>...</text> node in Nokogiri?
From the docs:
The builder works by taking advantage of method_missing. Unfortunately some methods are defined in ruby that are difficult or dangerous to remove. You may want to create tags with the name “type”, “class”, and “id” for example. In that case, you can use an underscore to disambiguate your tag name from the method call.
Appending an underscore also works for “text”, i.e. use text_ instead:
builder = Nokogiri::XML::Builder.new(:encoding => 'UTF-8') do |xml|
  xml.job {
    xml.text_ {
      xml.cdata 'foo bar baz'
    }
  }
end
puts builder.to_xml
Output:
<?xml version="1.0" encoding="UTF-8"?>
<job>
  <text><![CDATA[foo bar baz]]></text>
</job>
Replacing part of the text in a Nokogiri node while preserving markup in contents
It looks like this works pretty well:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<html>
<head>
<title>Title</title>
<link href="style.css" rel="stylesheet" type="text/css" />
</head>
<body>
<div>
<p class="header">&lt;&lt;2&gt;&gt;Header</p>
<p class="paragraph">
<p class="text_style">Lorem ipsum. &lt;&lt;3&gt;&gt; more content. <span class="style">Preserve this.</span> extra text.</p>
</div>
</body>
</html>
EOT
doc.search("//text()[contains(.,'<<')]").each do |node|
node.replace(node.content.gsub(/<<(\d+)>>/, '<a id="[\1]" />'))
end
Which results in:
puts doc.to_html
# >> <html>
# >> <head>
# >> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
# >> <title>Title</title>
# >> <link href="style.css" rel="stylesheet" type="text/css">
# >> </head>
# >> <body>
# >> <div>
# >> <p class="header"><a id="[2]"></a>Header</p>
# >> <p class="paragraph">
# >> <p class="text_style">Lorem ipsum. <a id="[3]"></a> more content. <span class="style">Preserve this.</span> extra text.</p>
# >> </p>
# >> </div>
# >> </body>
# >> </html>
Nokogiri is adding the <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> line, probably because the markup is defined as XML.
The selector "//text()[contains(.,'<<')]" only looks for text nodes containing '<<'. If that could produce false positives, you might want to make it more specific. See "XPath: using regex in contains function" for the syntax.
replace is what does the trick. You were trying to modify a Nokogiri::XML::Text node to contain an <a.../>, but it can't: in a text node the < and > must be encoded. Replacing the node with a Nokogiri::XML::Element, which is what Nokogiri turns <a id="[2]"> into by default, lets the document store it the way you want.
Get text directly inside a tag in Nokogiri
To get all the direct children with text, but not any further sub-children, you can use XPath like so:
doc.xpath('//dt/text()')
Or if you wish to use search:
doc.search('dt').xpath('text()')
Nokogiri: Handling text nodes
In this case, call the text method on the elements directly:
xml = Nokogiri::XML(File.open('test.xml'))
id = xml.at_css('TestCase ID').text
code = xml.at_css('TestCase Code').text
How to use Nokogiri to get the full HTML without any text content
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out, depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"
How to get node text without children?
XPath includes the text() node test for selecting text nodes, so you could do:
page.xpath('//p[@class="parent"]/text()')
Using XPath to select HTML classes can become quite tricky if the element in question could belong to more than one class, so this might not be ideal.
Fortunately Nokogiri adds the text() selector to CSS, so you can use:
page.css('p.parent > text()')
to get the text nodes that are direct children of p.parent. This will also return some nodes that are whitespace only, so you may have to filter them out.