How to parse consecutive tags with Nokogiri?
First of all, your HTML should have the <dt> and <dd> elements inside a <dl>:
<div id="first">
<dl>
<dt>Label1</dt>
<dd>Value1</dd>
<dt>Label2</dt>
<dd>Value2</dd>
...
</dl>
</div>
but that won't change how you parse it. You want to find the <dt>s and iterate over them; at each <dt> you can use next_element to get the <dd>, something like this:
doc = Nokogiri::HTML('<div id="first"><dl>...')
doc.css('#first').search('dt').each do |node|
  puts "#{node.text}: #{node.next_element.text}"
end
That should work as long as the structure matches your example.
How do I parse and scrape the meta tags of a URL with Nokogiri?
Here's how I'd go about it:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<meta name="description" content="I design and develop websites and applications.">
<meta name="keywords" content="web designer,web developer">
EOT
contents = %w[description keywords].map { |name|
  doc.at("meta[name='#{name}']")['content']
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]
Or:
contents = doc.search("meta[name='description'], meta[name='keywords']").map { |n|
  n['content']
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]
How do I match successive nodes with Nokogiri?
I'd write it something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
<div class="propsBar"></div>
<div class="oddsInfoTop"></div>
<div class="oddsInfoBottom"></div>
<!--
BUY POINTS
-->
<div class="timeBar"></div>
EOT
found_nodes = doc.search('div.propsBar').map{ |node|
  nodes = [node]
  loop do
    node = node.next_sibling
    # Stop if we run out of siblings before finding the closing div.
    break if node.nil?
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}
(Notice that I stripped out the <input> tags, as those only clutter the input HTML. When you supply input data, remove everything that is noise.)
Running that returns the nodes found as an array of arrays. Each sub-array contains the individual nodes found after sequentially walking the sibling chains:
require 'pp'
pp found_nodes
# >> [[#(Element:0x3ff00a4936a0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a037c28 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49363c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a03629c { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4935b0 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4668f8 { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a49354c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45c808 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a4934e8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a45b084 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a49345c {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a8710ec { name = "class", value = "oddsInfoBottom" })]
# >> })],
# >> [#(Element:0x3ff00a4933f8 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a4979d0 { name = "class", value = "propsBar" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493394 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a47e188 { name = "class", value = "oddsInfoTop" })]
# >> }),
# >> #(Text "\n"),
# >> #(Element:0x3ff00a493308 {
# >> name = "div",
# >> attributes = [
# >> #(Attr:0x3ff00a458f00 { name = "class", value = "oddsInfoBottom" })]
# >> })]]
Remember that after parsing, the nodes at each level of the document form a linked list. If there is a line break in the original XML or HTML, there will be a Text node containing at least a newline character ("\n"). Because it is a list, we can move forward and backward using next_sibling and previous_sibling respectively. That makes it really easy to grab little chunks, even when no block tag wraps the content you want.
If you want the returned values to resemble the output of a search, css or xpath method, the inner variable nodes will need to change from an Array to a NodeSet:
found_nodes = doc.search('div.propsBar').map{ |node|
  nodes = Nokogiri::XML::NodeSet.new(doc, [node])
  loop do
    node = node.next_sibling
    # Stop if we run out of siblings before finding the closing div.
    break if node.nil?
    nodes << node
    break if node['class'] == 'oddsInfoBottom'
  end
  nodes
}
require 'pp'
pp found_nodes.map(&:to_html)
Running that results in:
# >> ["<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>",
# >> "<div class=\"propsBar\"></div>\n<div class=\"oddsInfoTop\"></div>\n<div class=\"oddsInfoBottom\"></div>"]
Finally, notice I used CSS selectors rather than XPath. I prefer them because they are usually more readable and succinct. XPath is more powerful and, because it's made for dissecting XML, can often do all the heavy lifting that we'd have to do in Ruby after a CSS selector only gets us close to what we wanted. Use whichever gets the job done for you, with consideration for what is easier to read and maintain.
How to correctly fix unclosed HTML tags with Nokogiri
You could look at using Nokogumbo, which attaches Google's Gumbo HTML5 parser to Nokogiri. It then uses the HTML5 error-correcting algorithms when parsing malformed HTML, rather than the default parsing performed by Nokogiri and libxml, and results in parsed HTML closer to what you would expect to see from a browser.
Here's an example irb session showing how it handles your example HTML and produces the result you are after. Note the method name is HTML5, and it is still called on the Nokogiri module.
>> require 'nokogumbo'
=> true
>> s = <<EOT
<div>
<li>
<div>
<div>
test
</div>
<li>
<div>
test
</div>
EOT
=> "<div>\n <li>\n <div>\n <div>\n test\n </div>\n\n <li>\n <div>\n test \n </div>\n"
>> puts Nokogiri.HTML5(s).to_html
<html>
<head></head>
<body><div>
<li>
<div>
<div>
test
</div>
</div>
</li>
<li>
<div>
test
</div>
</li>
</div></body>
</html>
=> nil
How to remove repeated nested tags using Nokogiri
Normally I'm not a huge fan of mutable structures like the ones Nokogiri uses, but in this case I think it works to your advantage. Something like this might work:
def recurse node
  # depth first so we don't accidentally modify a collection while
  # we're iterating through it.
  node.elements.each do |child|
    recurse(child)
  end
  # replace this element's children with its grandchildren
  # assuming it meets all the criteria
  if merge_candidate?(node)
    node.children = node.elements.first.children
  end
end

def merge_candidate? node, name: 'div'
  return false unless node.element?
  return false unless node.attributes.empty?
  return false unless node.name == name
  return false unless node.elements.length == 1
  return false unless node.elements.first.name == name
  return false unless node.elements.first.attributes.empty?
  true
end
[18] pry(main)> file = File.read('test.html')
[19] pry(main)> doc = Nokogiri.parse(file)
[20] pry(main)> puts doc
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<div>
<div>
<p>Some text</p>
</div>
</div>
</div>
</body>
</html>
[21] pry(main)> recurse(doc)
[22] pry(main)> puts doc
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div>
<p>Some text</p>
</div>
</body>
</html>
=> nil
[23] pry(main)>
How to parse an HTML table with Nokogiri?
The key to the problem is that calling #text on multiple results returns the concatenation of the #text of each individual element.
Let's examine what each step does:
# Finds all <table>s with class open
# I'm assuming you have only one <table> so
# you don't actually have to loop through
# all tables, instead you can just operate
# on the first one. If that is not the case,
# you can use a loop the way you did
tables = doc.css('table.open')
# The text of all <th>s in the first <tr> of the table
title = table.css('tr[1] > th').text
# The text of all <td>s in all <tr>s in the table
# You obviously wanted just the <td>s in one <tr>
cell_data = table.css('tr > td').text
# The text of all <th>s in all <tr>s in the table
# You obviously wanted just the <th>s in one <tr>
raw_name = table.css('tr > th').text
Now that we know what is wrong, here is a possible solution:
html = <<EOT
<table class="open">
<tr>
<th>Table name</th>
<th>Column name 1</th>
<th>Column name 2</th>
<th>Column name 3</th>
<th>Column name 4</th>
<th>Column name 5</th>
</tr>
<tr>
<th>Raw name 1</th>
<td>1001</td>
<td>1002</td>
<td>1003</td>
<td>1004</td>
<td>1005</td>
</tr>
<tr>
<th>Raw name 2</th>
<td>2001</td>
<td>2002</td>
<td>2003</td>
<td>2004</td>
<td>2005</td>
</tr>
<tr>
<th>Raw name 3</th>
<td>3001</td>
<td>3002</td>
<td>3003</td>
<td>3004</td>
<td>3005</td>
</tr>
</table>
EOT
doc = Nokogiri::HTML(html, nil, 'UTF-8')
# Fetches only the first <table>. If you have
# more than one, you can loop the way you
# originally did.
table = doc.css('table.open').first
# Fetches all rows (<tr>s)
rows = table.css('tr')
# The column names are the first row (shift returns
# the first element and removes it from the array).
# On that row we get the text of each individual <th>
# This will be Table name, Column name 1, Column name 2...
column_names = rows.shift.css('th').map(&:text)
# On each of the remaining rows
text_all_rows = rows.map do |row|
  # We get the name (<th>)
  # On the first row this will be Raw name 1
  # on the second - Raw name 2, etc.
  row_name = row.css('th').text
  # We get the text of each individual value (<td>)
  # On the first row this will be 1001, 1002, 1003...
  # on the second - 2001, 2002, 2003... etc
  row_values = row.css('td').map(&:text)
  # We map the name, followed by all the values
  [row_name, *row_values]
end
p column_names # => ["Table name", "Column name 1", "Column name 2",
# "Column name 3", "Column name 4", "Column name 5"]
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"],
# ["Raw name 2", "2001", "2002", "2003", "2004", "2005"],
# ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]]
# If you want to combine them
text_all_rows.each do |row_as_text|
  p column_names.zip(row_as_text).to_h
end # =>
# {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"}
# {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"}
# {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"}
Extract links (URLs) with Nokogiri in Ruby from a href HTML tags?
You can do it like this:
doc = Nokogiri::HTML.parse(<<-HTML_END)
<div class="heat">
<a href='http://example.org/site/1/'>site 1</a>
<a href='http://example.org/site/2/'>site 2</a>
<a href='http://example.org/site/3/'>site 3</a>
</div>
<div class="wave">
<a href='http://example.org/site/4/'>site 4</a>
<a href='http://example.org/site/5/'>site 5</a>
<a href='http://example.org/site/6/'>site 6</a>
</div>
HTML_END
l = doc.css('div.heat a').map { |link| link['href'] }
This solution finds the anchor elements inside div.heat using a CSS selector and collects their href attributes.