How to Parse Consecutive Tags with Nokogiri

How to parse consecutive tags with Nokogiri?

First of all, your HTML should have the <dt> and <dd> elements inside a <dl>:

<div id="first">
<dl>
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</dl>
</div>

but that won't change how you parse it. You want to find the <dt>s and iterate over them, then at each <dt> you can use next_element to get the <dd>; something like this:

require 'nokogiri'

doc = Nokogiri::HTML('<div id="first"><dl>...')
doc.css('#first dt').each do |node|
  puts "#{node.text}: #{node.next_element.text}"
end

That should work as long as the structure matches your example.

How do I parse and scrape the meta tags of a URL with Nokogiri?

Here's how I'd go about it:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<meta name="description" content="I design and develop websites and applications.">
<meta name="keywords" content="web designer,web developer">
EOT

contents = %w[description keywords].map { |name|
  doc.at("meta[name='#{name}']")['content']
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]

Or:

contents = doc.search("meta[name='description'], meta[name='keywords']").map { |n|
  n['content']
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]

How do I match successive nodes with Nokogiri?

I'd write it something like:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
EOT

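found_nodes = doc.search('div.propsBar').map { |node|
  nodes = [node]
  # Walk forward through the siblings (Text nodes included) until the
  # closing oddsInfoBottom div, or until the sibling chain runs out.
  while (node = node.next_sibling)
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}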

(Notice that I stripped out the <input> tags as those only clutter the input HTML. When you supply input data, remove everything that is noise.)

Running that returns the nodes found as an array of arrays. Each sub-array contains the individual nodes found after sequentially walking the sibling chains:

require 'pp'
pp found_nodes
# >> [[#(Element:0x3ff00a4936a0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a037c28 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49363c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a03629c { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4935b0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4668f8 { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a49354c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45c808 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4934e8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45b084 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49345c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a8710ec { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a4933f8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4979d0 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493394 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a47e188 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493308 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a458f00 { name = "class", value = "oddsInfoBottom" })]
# >> })]]

Remember that after parsing, the document is a tree in which each node's siblings are linked together like a list. If there is a line break in the original XML or HTML, there'll be a Text node containing at least a newline character ("\n"). Because the siblings form a list, we can move forward and backward using next_sibling and previous_sibling respectively. That makes it really easy to grab little chunks, even when no single block tag wraps the content you want.

If you want the returned values to resemble the output of a search, css or xpath method, the inner variable nodes will need to change from an Array to a NodeSet:

found_nodes = doc.search('div.propsBar').map { |node|
  nodes = Nokogiri::XML::NodeSet.new(doc, [node])
  # Same walk as before, collecting into a NodeSet instead of an Array.
  while (node = node.next_sibling)
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}

require 'pp'
pp found_nodes.map(&:to_html)

Running that results in:

# >> ["<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>"]

Finally, notice I used CSS selectors rather than XPath. I prefer them because they are usually more readable and succinct. XPath is more powerful and, because it's made for dissecting XML, can often do all the heavy lifting that we'd have to do in Ruby after a CSS selector only gets us close to what we wanted. Use whichever gets the job done for you, with consideration for what is easier to read and maintain.

How to correctly fix unclosed HTML tags with Nokogiri

You could look at using Nokogumbo, which attaches Google's Gumbo HTML5 parser to Nokogiri. It uses the HTML5 error-correcting algorithms when parsing malformed HTML, rather than the default parsing performed by Nokogiri and libxml, and results in parsed HTML closer to what you would expect to see from a browser.

Here’s an example irb session showing how it handles your example HTML and produces the result you are after. Note the method name is HTML5, and it is still called on the Nokogiri module.

>> require 'nokogumbo'
=> true
>> s = <<EOT
<div>
<li>
<div>
<div>
test
</div>

<li>
<div>
test
</div>
EOT
=> "<div>\n <li>\n <div>\n <div>\n test\n </div>\n\n <li>\n <div>\n test \n </div>\n"
>> puts Nokogiri.HTML5(s).to_html
<html>
<head></head>
<body><div>
<li>
<div>
<div>
test
</div>

</div>
</li>
<li>
<div>
test
</div>
</li>
</div></body>
</html>
=> nil

How to remove repeated nested tags using Nokogiri

Normally I'm not a huge fan of mutable structures like the ones Nokogiri uses, but in this case I think it works to your advantage. Something like this might work:

def recurse(node)
  # depth first so we don't accidentally modify a collection while
  # we're iterating through it.
  node.elements.each do |child|
    recurse(child)
  end

  # replace this element's children with its grandchildren
  # assuming it meets all the criteria
  if merge_candidate?(node)
    node.children = node.elements.first.children
  end
end

def merge_candidate?(node, name: 'div')
  return false unless node.element?
  return false unless node.attributes.empty?
  return false unless node.name == name
  return false unless node.elements.length == 1
  return false unless node.elements.first.name == name
  return false unless node.elements.first.attributes.empty?

  true
end

[18] pry(main)> file = File.read('test.html')
[19] pry(main)> doc = Nokogiri.parse(file)
[20] pry(main)> puts doc
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<div>
<div>
<p>Some text</p>
</div>
</div>
</div>
</body>
</html>
[21] pry(main)> recurse(doc)
[22] pry(main)> puts doc
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<p>Some text</p>
</div>
</body>
</html>
=> nil
[23] pry(main)>

How to parse an HTML table with Nokogiri?

The key to the problem is that calling #text on a set of results returns the concatenation of the #text of each individual element.

Let's examine what each step does:

# Finds all <table>s with class open
# I'm assuming you have only one <table> so
# you don't actually have to loop through
# all tables, instead you can just operate
# on the first one. If that is not the case,
# you can use a loop the way you did
tables = doc.css('table.open')

# Below, table stands for one element of tables (your loop variable).
# The text of all <th>s in the first <tr> of the table
title = table.css('tr[1] > th').text

# The text of all <td>s in all <tr>s in the table
# You obviously wanted just the <td>s in one <tr>
cell_data = table.css('tr > td').text

# The text of all <th>s in all <tr>s in the table
# You obviously wanted just the <th>s in one <tr>
raw_name = table.css('tr > th').text

Now that we know what is wrong, here is a possible solution:

html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>1001</td>
<td>1002</td>
<td>1003</td>
<td>1004</td>
<td>1005</td>
</tr>
<tr>
<th>Raw name 2</th>
<td>2001</td>
<td>2002</td>
<td>2003</td>
<td>2004</td>
<td>2005</td>
</tr>
<tr>
<th>Raw name 3</th>
<td>3001</td>
<td>3002</td>
<td>3003</td>
<td>3004</td>
<td>3005</td>
</tr>
</table>
EOT

doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Fetches only the first <table>. If you have
# more than one, you can loop the way you
# originally did.
table = doc.css('table.open').first

# Fetches all rows (<tr>s)
rows = table.css('tr')

# The column names are the first row (shift returns
# the first element and removes it from the array).
# On that row we get the text of each individual <th>
# This will be Table name, Column name 1, Column name 2...
column_names = rows.shift.css('th').map(&:text)

# On each of the remaining rows
text_all_rows = rows.map do |row|
  # We get the name (<th>)
  # On the first row this will be Raw name 1
  # on the second - Raw name 2, etc.
  row_name = row.css('th').text

  # We get the text of each individual value (<td>)
  # On the first row this will be 1001, 1002, 1003...
  # on the second - 2001, 2002, 2003... etc
  row_values = row.css('td').map(&:text)

  # We map the name, followed by all the values
  [row_name, *row_values]
end

p column_names # => ["Table name", "Column name 1", "Column name 2",
# "Column name 3", "Column name 4", "Column name 5"]
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
# ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
# ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]

# If you want to combine them
text_all_rows.each do |row_as_text|
  p column_names.zip(row_as_text).to_h
end # =>
# {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
# {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
# {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}

Extract links (URLs) with Nokogiri in Ruby from a href HTML tags?

You can do it like this:

doc = Nokogiri::HTML.parse(<<-HTML_END)
<div class="heat">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
<div class="wave">
<a href='http://example.org/site/4/'>site 4</a>
<a href='http://example.org/site/5/'>site 5</a>
<a href='http://example.org/site/6/'>site 6</a>
</div>
HTML_END

links = doc.css('div.heat a').map { |link| link['href'] }

This solution finds the anchor elements inside div.heat using a CSS selector and collects their href attributes.


