Nokogiri Recursively Get All Children

Nokogiri recursively get all children

The traverse method yields the current node and all of its descendants to a block, recursively.

# if you would like it to be returned as an array, rather than each node being yielded to a block, you can do this
result = []
doc.traverse {|node| result << node }
result

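# or, without the temporary array (enum_for needs no require on modern Ruby,
# and to_a is what turns the enumerator into an array)
result = doc.enum_for(:traverse).to_a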

Nokogiri, fetch all classes from page

#classes returns only classes of the node itself. It doesn't deal with the child nodes. You need to scan all the child nodes recursively.

require 'nokogiri'

# Returns the node's descendant elements (deepest first), followed by the node itself.
def flatten_dom(node)
  node.elements.flat_map { |child| flatten_dom(child) } << node
end

page = Nokogiri::HTML.parse('<html><body class="a b"><b class="c">x</b></body></html>')

flatten_dom(page).flat_map(&:classes)
# => ["c", "a", "b"]

You may also want to add .uniq in order to get rid of the duplicates.

Accessing Nokogiri element children

name, number, date, title = *content[1].css('td').map(&:text)

If content[1] is a tr, content[1].css('td') finds all td elements beneath it, and .map(&:text) calls td.text on each one and collects the results into an array, which we then splat with * for multiple assignment. (The splat is optional here: Ruby destructures an array on the right-hand side automatically.)

(Note: next time, please include the original HTML fragment, not the Nokogiri node inspect result.)

I can't get Nokogiri to loop through children nodes

Change this line of code:

content_options = xmldoc.xpath("//content_options")

to this:

content_options = xmldoc.xpath("//content_option")

Of course it will only show you one entry: your XML has only one content_options element, but two content_option elements.

find first level children in nokogiri rails

When you say this:

table = page.css('table')

you're grabbing both tables rather than just the top-level table. You can either go back to the document root and use a selector that matches only the rows in the first table, as mosch says, or you can make table refer to only the outer table with something like this:

table = page.css('table').first
trs = table.xpath('./tr')

or even this (depending on the HTML's real structure):

table = page.xpath('/html/body/table')
trs = table.xpath('./tr')

or perhaps one of these for table (thanks Phrogz, again):

table = page.at('table')
table = page.at_css('table')
# or various other CSS and XPath incantations

Nokogiri: Merge neighbour text nodes recursively?

Okay, finally I got it right myself:

def merge_text_nodes(node)
  prev_is_text = false

  newnodes = []
  node.children.each do |element|
    if element.text?
      if prev_is_text
        newnodes[-1].content += element.text
      else
        newnodes << element
      end
      element.remove
      prev_is_text = true
    else
      newnodes << merge_text_nodes(element)
      element.remove
      prev_is_text = false
    end
  end

  node.children.remove
  newnodes.each do |item|
    node.add_child(item)
  end

  return node
end

How do I find direct children and not nested children using Rails and Nokogiri?

You can do it in a couple of steps using XPath. First you need to find the “level” of the table (i.e. how nested it is in other tables), then find all descendant tr that have the same number of table ancestors:

tables = doc.xpath('//table')
tables.each do |table|
  level = table.xpath('count(ancestor-or-self::table)')
  rows = table.xpath(".//tr[count(ancestor::table) = #{level}]")
  # do what you want with rows...
end

In the more general case, where you might have tr elements nested directly inside other trs, you could do something like this (this would be invalid HTML, but you might have XML or some other tags):

tables.each do |table|
  # Find the first descendant tr, and determine its level. This
  # will be a "top-level" tr for this table. "level" here means how
  # many tr elements (including itself) are between it and the
  # document root.
  level = table.xpath("count(descendant::tr[1]/ancestor-or-self::tr)")
  # Now find all descendant trs that have that same level. Since
  # the table itself is at a fixed level, this means all these nodes
  # will be "top-level" rows for this table.
  rows = table.xpath(".//tr[count(ancestor-or-self::tr) = #{level}]")
  # handle rows...
end

The first step could be broken into two separate queries, which may be clearer:

first_tr = table.at_xpath(".//tr")
level = first_tr.xpath("count(ancestor-or-self::tr)")

(This will fail if there is a table with no trs though, as first_tr will be nil. The combined XPath above handles that situation correctly.)


