Get Text Directly Inside a Tag in Nokogiri

Get text directly inside a tag in Nokogiri

To get only the text nodes that are direct children of an element, without the text of any nested elements, you can use XPath like so:

doc.xpath('//dt/text()')

Or if you wish to use search:

doc.search('dt').xpath('text()')

How do I get all the text within a tag using a Nokogiri CSS selector?

You can call the #text method on the target element (or on the node set returned by css), and it will include the text of all descendant text nodes:

doc = Nokogiri::HTML(your_html_snippet)
str = doc.css('td').text
str # => "\n\nsome text\n\n\nmore text\n\n"

Nokogiri: Get text which is not inside the a tag

This is straightforward using XPath and the text() node test. If you have already extracted the li elements into a node set, you can get the text with:

nodeset.xpath('./text()')

Or you can get it directly from the whole doc:

doc.xpath('//li/text()')

This uses the text() node test as part of the XPath expression, not the Ruby text method. It extracts only the text nodes that are direct children of the li node, so it doesn't include the contents of the a element.

Getting text only from a certain HTML structure with Nokogiri

If you're not using the document any further, I would delete the other nodes in this section and read what is left:

nokogiri_object.css("div.line1 *").each(&:remove)
nokogiri_object.at_css("div.line1").text.strip # => "text I need"

Get content after header tag with Nokogiri

You can get the ul elements that come after an h4 using the following-sibling axis:

require 'nokogiri'

html = <<-EOF
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li><a href="//auburn.craigslist.org/">auburn</a></li>
<li><a href="//bham.craigslist.org/">birmingham</a></li>
<li><a href="//dothan.craigslist.org/">dothan</a></li>
<li><a href="//shoals.craigslist.org/">florence / muscle shoals</a></li>
<li><a href="//gadsden.craigslist.org/">gadsden-anniston</a></li>
<li><a href="//huntsville.craigslist.org/">huntsville / decatur</a></li>
<li><a href="//mobile.craigslist.org/">mobile</a></li>
<li><a href="//montgomery.craigslist.org/">montgomery</a></li>
<li><a href="//tuscaloosa.craigslist.org/">tuscaloosa</a></li>
</ul>
<h4>Alaska</h4>
<ul>
<li><a href="//anchorage.craigslist.org/">anchorage / mat-su</a></li>
<li><a href="//fairbanks.craigslist.org/">fairbanks</a></li>
<li><a href="//kenai.craigslist.org/">kenai peninsula</a></li>
<li><a href="//juneau.craigslist.org/">southeast alaska</a></li>
</ul>
EOF

doc = Nokogiri::HTML(html)
doc.xpath('//h4/following-sibling::ul').each do |node|
  puts node.to_html
end

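To select the ul after an h4 with exact text, match on the heading's text. The expression returns every later sibling ul, so [0] picks the nearest one: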

puts doc.xpath("//h4[text()='Alabama']/following-sibling::ul")[0].to_html
