What are some examples of using Nokogiri?
Using IRB and Ruby 1.9.2:
Load Nokogiri:
> require 'nokogiri'
#=> true
Parse a document:
> doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
#=> #<Nokogiri::HTML::Document:0x1012821a0
@node_cache = [],
attr_accessor :errors = [],
attr_reader :decorators = nil
Nokogiri likes well-formed docs. Note that it added the DOCTYPE because I parsed the input as a document. It's possible to parse as a document fragment too, but that is pretty specialized.
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"
Search the document to find the first <p> node using CSS and grab its content:
> doc.at('p').text
#=> "foobar"
Use a different method name to do the same thing:
> doc.at('p').content
#=> "foobar"
Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. search returns a NodeSet, which is like an array of nodes.
> doc.search('body p').first.text
#=> "foobar"
This is an important point, and one that trips up almost everyone when first using Nokogiri. search and its css and xpath variants return a NodeSet. Calling text (or content) on a NodeSet concatenates the text of all the returned nodes into a single String, which can make it very difficult to take apart again.
Using slightly different HTML helps illustrate this:
> doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>foo</p>
<p>bar</p>
</body></html>
> doc.search('p').text
#=> "foobar"
> doc.search('p').map(&:text)
#=> ["foo", "bar"]
Returning to the original HTML...
Change the content of the node:
> doc.at('p').content = 'bar'
#=> "bar"
Emit a parsed document as HTML:
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"
Remove a node:
> doc.at('p').remove
#=> #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"
As for scraping, there are a lot of questions on SO about using Nokogiri for tearing apart HTML from sites. Searching StackOverflow for "nokogiri and open-uri" should help.
Simple XML parsing example for Nokogiri
The solution can be easier to see if you try it in steps.
Example XML:
<?xml version="1.0" encoding="UTF-8"?>
<xml>
<foo>
<goo>a</goo>
<hoo>b</hoo>
</foo>
<foo>
<goo>c</goo>
<hoo>d</hoo>
</foo>
</xml>
The syntax //foo selects all the foo elements.
> puts doc.xpath("//foo")
<foo>
<goo>a</goo>
<hoo>b</hoo>
</foo>
<foo>
<goo>c</goo>
<hoo>d</hoo>
</foo>
Nokogiri returns nodes as a NodeSet like this:
> puts doc.xpath("//foo").class
Nokogiri::XML::NodeSet
A NodeSet is enumerable; you can use methods such as each, map, etc.
> puts doc.xpath("//foo").kind_of?(Enumerable)
true
This NodeSet contains two foo elements:
> doc.xpath("//foo").each{|e| puts e.class }
Nokogiri::XML::Element
Nokogiri::XML::Element
The syntax //foo/* selects the foo elements' child elements:
> puts doc.xpath("//foo/*")
<goo>a</goo>
<hoo>b</hoo>
<goo>c</goo>
<hoo>d</hoo>
To print an element's info, see the Nokogiri::XML::Node documentation; the two methods you'll likely want are name and text.
Solution for you:
> doc.xpath("//foo/*").each{|e|
puts "#{e.name}:#{e.text}"
}
goo:a
hoo:b
goo:c
hoo:d
For your second question, you're essentially asking:
- for each foo element, get its child elements
- for each child element, print the name and text
Solution for you:
> doc.xpath("//foo").each_with_index{|parent_elem, parent_count|
puts "Parent #{parent_count + 1}"
parent_elem.elements.each{|child_elem|
puts "#{child_elem.name}:#{child_elem.text}"
}
}
XPath along with nokogiri; tutorials/examples?
The biggest trick to finding an element, or group of elements, using Nokogiri or any XML/HTML parser, is to start with a short accessor to get into the general vicinity of what you're looking for, then iteratively add to it, fine-tuning as you go, until you have what you want.
The second trick is to remember to use // to start your XPath, not /, unless you're absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command-line in Linux. It searches everywhere.
Also, don't trust the XPath or CSS accessor provided by a browser. Browsers do all sorts of fixups to the HTML source, including inserting tbody, like you saw. Instead, use Ruby's OpenURI, curl, or wget to retrieve the raw source, and look at it with an editor like vi or vim, or use less or cat to print it to the screen. That way there's no chance the file has been changed.
Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.
Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two different methods: search and at. Both take either a CSS or XPath selector. search, along with its sibling methods xpath and css, returns a NodeSet, which is basically an array of nodes that you can iterate over. at, at_css and at_xpath return the first node that matches the CSS or XPath accessor. In all those methods, the ...xpath variants accept an XPath, and the ...css ones take a CSS accessor.
Once you have a node, generally you'll want to do one of two things to it: either extract an attribute or get its text/content. You can easily get an attribute using [attribute_to_get] and the text using text.
Using those methods we can search for all the links in a page and return their text and related href, using something like:
require 'awesome_print'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.example.com'))
ap doc.search('a').map{ |a| [a['href'], a.text] }[0, 5]
Which outputs:
[
[0] [
[0] "/",
[1] ""
],
[1] [
[0] "/domains/",
[1] "Domains"
],
[2] [
[0] "/numbers/",
[1] "Numbers"
],
[3] [
[0] "/protocols/",
[1] "Protocols"
],
[4] [
[0] "/about/",
[1] "About IANA"
]
]
How to navigate the DOM using Nokogiri
I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.
It's a single statement with XPath:
start = doc.at('div.block#X2')
start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>
start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>
This accommodates either direct previous siblings or children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.
In need of an explanation of Web scraping with Nokogiri in Rails
I am hoping to help you with a real-world example. Let's get some data from Reuters, for example.
In your console try this:
# require your tools; make sure you have run: gem install nokogiri
pry(main)> require 'nokogiri'
pry(main)> require 'open-uri'
# set the url
pry(main)> url = "http://www.reuters.com/finance/stocks/overview?symbol=0005.HK"
# load and assign to a variable
pry(main)> doc = Nokogiri::HTML(URI.open(url))
# take the piece of the page that has the CSS class .sectionQuote; you can use ids too
pry(main)> quote = doc.css(".sectionQuote")
Now if you look at quote you will see that it holds Nokogiri elements. Let's have a look inside:
pry(main)> quote.size
=> 6
pry(main)> quote.first
=> #(Element:0x43ff468 {
name = "div",
attributes = [ #(Attr:0x43ff404 { name = "class", value = "sectionQuote nasdaqChange" })],
children = [
#(Text "\n\t\t\t"),
#(Element:0x43fef18 {
name = "div",
attributes = [ #(Attr:0x43feeb4 { name = "class", value = "sectionQuoteDetail" })],
children = [
#(Text "\n\t\t\t\t"),
#(Element:0x43fe9c8 { name = "span", attributes = [ #(Attr:0x43fe964 { name = "class", value = "nasdaqChangeHeader" })], children = [ #(Text "0005.HK on Hong Kong Stock")] }),
.....
}),
#(Text "\n\t\t")]
})
You can see that Nokogiri has essentially encapsulated each DOM element, so that you can search and access it quickly.
If you simply want to display this div element you can:
pry(main)> quote.first.to_html
=> "<div class=\"sectionQuote nasdaqChange\">\n\t\t\t<div class=\"sectionQuoteDetail\">\n\t\t\t\t<span class=\"nasdaqChangeHeader\">0005.HK on Hong Kong Stock</span>\n\t\t\t\t<br class=\"clear\"><br class=\"clear\">\n\t\t\t\t<span style=\"font-size: 23px;\">\n\t\t\t\t82.85</span><span>HKD</span><br>\n\t\t\t\t<span class=\"nasdaqChangeTime\">14 Aug 2014</span>\n\t\t\t</div>\n\t\t</div>"
It is possible to use this directly in the view of a Rails application.
If you want to be more specific and take individual components, you can go one level down by looping over the elements in the quote variable:
pry(main)> quote.each{|p| puts p.inspect}
Or be very specific and get the value of a single element, i.e. the name of the stock in our example:
pry(main)> quote.at_css(".nasdaqChangeHeader").content
=> "0005.HK on Hong Kong Stock"
This is a very useful link: http://nokogiri.org/tutorials/searching_a_xml_html_document.html
Really hope this helps.
PS: A tip for looking inside objects (http://ruby-doc.org/core-2.1.1/Object.html#method-i-inspect):
puts quote.inspect
Nokogiri: how to find a div by id and see what text it contains?
html = <<-HTML
<html>
<body>
<div id="verify" style="display: none;">foobar</div>
</body>
</html>
HTML
doc = Nokogiri::HTML html
puts 'verified!' if doc.at_css('[id="verify"]').text.eql? 'foobar'
How can I use Nokogiri to write a HUGE XML file?
Nokogiri is designed to build in memory because you build a DOM and it converts it to XML on the fly. It's easy to use, but there are trade-offs, and doing it in memory is one of them.
You might want to look into using Erubis to generate the XML. Rather than gathering all the data before processing and keeping the logic in a controller, as we'd do with Rails, you can save memory by putting the logic in the template and having it iterate over your data, which should help with the resource demands.
If you need the XML in a file you might need to do that using redirection:
erubis options templatefile.erb > xmlfile
This is a very simple example, but it shows you could easily define a template to generate XML:
<%
asdf = (1..5).to_a
%>
<xml>
<element>
<% asdf.each do |i| %>
<subelement><%= i %></subelement>
<% end %>
</element>
</xml>
which, when I call erubis test.erb, outputs:
<xml>
<element>
<subelement>1</subelement>
<subelement>2</subelement>
<subelement>3</subelement>
<subelement>4</subelement>
<subelement>5</subelement>
</element>
</xml>
EDIT:
The string concatenation was taking forever...
Yes, it can, simply because of garbage collection. You don't show any code example of how you're building your strings, but Ruby works better when you use << to append one string to another than when using +, because + allocates a new String each time while << modifies the receiver in place.
It also might work better to not try to keep everything in a string, but instead to write it immediately to disk, appending to an open file as you go.
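Since I can't see your code, this is only a sketch of the write-as-you-go idea using plain Ruby file I/O. The output path and the 1..5 data are placeholders, and values containing characters such as & or < would need escaping, which is not handled here:

```ruby
require 'tmpdir'

# Hypothetical output path; a real script would write wherever the file is needed.
path = File.join(Dir.tmpdir, "streamed_example_#{Process.pid}.xml")

File.open(path, 'w') do |file|
  file << "<xml>\n<element>\n"
  # Each record goes straight to the open file, so no large String accumulates in memory.
  (1..5).each do |i|
    file << "<subelement>#{i}</subelement>\n"
  end
  file << "</element>\n</xml>\n"
end

written = File.read(path)  # read back only to show the result; skip this for huge files
puts written
File.delete(path)
```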
Again, without code examples I'm shooting in the dark about what you might be doing or why things run slow.
How to use Nokogiri to get the full HTML without any text content
NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out, depending on your use case.
If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:
require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"
# Parse HTML
doc = Nokogiri::HTML.parse(html)
puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"
# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }
puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"
Nokogiri get all HTML nodes
You could split each element's outer XML on its inner XML to separate the opening tag from the closing tag (for elements that are not self-closing), store the closing tag so it can be retrieved later, and walk the document with the Nokogiri reader to build the list in document order.
It requires that your document is a valid XML fragment as it is using the XML parser and not the HTML one.
require 'nokogiri'
[ "<html><body><h1>Header1</h1></body></html>",
"<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class=\"style\">d</span>olor</p></div></body></html>", <<END
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
END
].each { |string_page|
elem_all = Array.new
elem_ends = Hash.new
reader = Nokogiri::XML::Reader(string_page)
reader.each { |node|
if node.node_type.eql?(1)
if node.self_closing?
elem_all << node.outer_xml
else
elem_tags = node.outer_xml.split(node.inner_xml)
elem_all << elem_tags.first
elem_ends[node.local_name] = elem_tags[1] unless elem_tags.one?
end
end
elem_all << elem_ends[node.local_name] if node.node_type.eql?(15) and elem_ends.has_key?(node.local_name)
}
puts string_page
puts elem_all.to_s
puts
}
Outputs:
<html><body><h1>Header1</h1></body></html>
["<html>", "<body>", "<h1>", "</h1>", "</body>", "</html>"]
<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class="style">d</span>olor</p></div></body></html>
["<html>", "<body>", "<div>", "<h1>", "</h1>", "<hr/>", "</div>", "<div>", "<p>", "<br/>", "<span class=\"style\">", "</span>", "</p>", "</div>", "</body>", "</html>"]
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
["<html>", "<body>", "<h1>", "</h1>", "<p>", "<strong>", "</strong>", "</p>", "</body>", "</html>"]