What Are Some Examples of Using Nokogiri

Using IRB and Ruby 1.9.2:

Load Nokogiri:

> require 'nokogiri'
#=> true

Parse a document:

> doc = Nokogiri::HTML('<html><body><p>foobar</p></body></html>')
#=> #<Nokogiri::HTML::Document:0x1012821a0
      @node_cache = [],
      @errors = [],
      @decorators = nil,
      ...>

Nokogiri likes well-formed documents. Note that it added the DOCTYPE because I parsed the string as a document. It's possible to parse as a document fragment too, but that is a more specialized use.

> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foobar</p></body></html>\n"

Search the document to find the first <p> node using CSS and grab its content:

> doc.at('p').text
#=> "foobar"

Use a different method name to do the same thing:

> doc.at('p').content
#=> "foobar"

Search the document for all <p> nodes inside the <body> tag, and grab the content of the first one. search returns a NodeSet, which behaves like an array of nodes.

> doc.search('body p').first.text
#=> "foobar"

This is an important point, and one that trips up almost everyone when first using Nokogiri: search and its css and xpath variants return a NodeSet. Calling text (or content) on a NodeSet concatenates the text of all the returned nodes into a single String, which can be very difficult to take apart again.

Slightly different HTML helps illustrate this:

> doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
> puts doc.to_html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>foo</p>
<p>bar</p>
</body></html>

> doc.search('p').text
#=> "foobar"

> doc.search('p').map(&:text)
#=> ["foo", "bar"]

Returning to the original HTML...

Change the content of the node:

> doc.at('p').content = 'bar'
#=> "bar"

Emit a parsed document as HTML:

> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>bar</p></body></html>\n"

Remove a node:

> doc.at('p').remove
#=> #<Nokogiri::XML::Element:0x80939178 name="p" children=[#<Nokogiri::XML::Text:0x8091a624 "bar">]>
> doc.to_html
#=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body></body></html>\n"

As for scraping, there are a lot of questions on Stack Overflow about using Nokogiri to tear apart HTML from sites. Searching Stack Overflow for "nokogiri and open-uri" should help.

Simple XML parsing example for Nokogiri

The solution can be easier to see if you try it in steps.

Example XML:

<?xml version="1.0" encoding="UTF-8"?>
<xml>
<foo>
<goo>a</goo>
<hoo>b</hoo>
</foo>
<foo>
<goo>c</goo>
<hoo>d</hoo>
</foo>
</xml>

The syntax //foo selects all the foo elements.

> puts doc.xpath("//foo")
<foo>
<goo>a</goo>
<hoo>b</hoo>
</foo>
<foo>
<goo>c</goo>
<hoo>d</hoo>
</foo>

Nokogiri returns nodes as a NodeSet like this:

> puts doc.xpath("//foo").class
Nokogiri::XML::NodeSet

A NodeSet is enumerable; you can use methods such as each, map, etc.

> puts doc.xpath("//foo").kind_of?(Enumerable)
true

This NodeSet contains two foo elements:

> doc.xpath("//foo").each{|e| puts e.class }
Nokogiri::XML::Element
Nokogiri::XML::Element

The syntax //foo/* selects the foo elements' child elements:

> puts doc.xpath("//foo/*")
<goo>a</goo>
<hoo>b</hoo>
<goo>c</goo>
<hoo>d</hoo>

To print an element's info, see Nokogiri/XML/Node documentation; the two methods you'll likely want are name and text.

Solution for you:

> doc.xpath("//foo/*").each{|e|
    puts "#{e.name}:#{e.text}"
  }
goo:a
hoo:b
goo:c
hoo:d

For your second question, you're essentially asking:

  1. for each foo element, get its child elements
  2. for each child element, print the name and text

Solution for you:

> doc.xpath("//foo").each_with_index{|parent_elem, parent_count|
    puts "Parent #{parent_count + 1}"
    parent_elem.elements.each{|child_elem|
      puts "#{child_elem.name}:#{child_elem.text}"
    }
  }
Parent 1
goo:a
hoo:b
Parent 2
goo:c
hoo:d

XPath along with nokogiri; tutorials/examples?

The biggest trick to finding an element, or group of elements, with Nokogiri or any XML/HTML parser is to start with a short accessor that gets you into the general vicinity of what you're looking for, then iteratively refine it until you have exactly what you want.

The second trick is to remember to use // to start your XPath, not /, unless you're absolutely sure you want to start at the root of the document. // is like a '**/*' wildcard at the command-line in Linux. It searches everywhere.

Also, don't trust the XPath or CSS accessor provided by a browser. Browsers do all sorts of fixups to the HTML source, including adding tbody, as you saw. Instead, use Ruby's OpenURI, or curl or wget, to retrieve the raw source, and look at it with an editor like vi or vim, or page through it with less or cat it to the screen. That way you see exactly the markup the parser will receive.

Finally, it's often easier/faster to break the search into chunks with XPath, then let Ruby iterate through things, than to try to come up with a complex XPath that's harder to maintain or more fragile.

Nokogiri itself is pretty easy. The majority of things you'll want to do are simple combinations of two methods: search and at. Both take either a CSS or an XPath selector. search, along with its sibling methods css and xpath, returns a NodeSet, which is basically an array of nodes you can iterate over. at, at_css and at_xpath return the first node that matches the selector. The ...css variants accept only a CSS accessor and the ...xpath variants only an XPath, while search and at accept either.

Once you have a node, generally you'll want to do one of two things with it: extract an attribute, or get its text content. You can get an attribute with [attribute_to_get] and the text with text.

Using those methods we can search for all the links in a page and return their text and related href, using something like:

require 'awesome_print'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.example.com'))
ap doc.search('a').map{ |a| [a['href'], a.text] }[0, 5]

Which outputs:

[
    [0] [
        [0] "/",
        [1] ""
    ],
    [1] [
        [0] "/domains/",
        [1] "Domains"
    ],
    [2] [
        [0] "/numbers/",
        [1] "Numbers"
    ],
    [3] [
        [0] "/protocols/",
        [1] "Protocols"
    ],
    [4] [
        [0] "/about/",
        [1] "About IANA"
    ]
]

How to navigate the DOM using Nokogiri

I came across this a few years too late I suppose, but felt compelled to post because all the other solutions are way too complicated.

It's a single statement with XPath:

start = doc.at('div.block#X2')

start.at_xpath('(preceding-sibling::h1 | preceding-sibling::*//h1)[last()]')
#=> <h1>Foo</h1>

start.at_xpath('(preceding-sibling::h2 | preceding-sibling::*//h2)[last()]')
#=> <h2>Bar</h2>

This accommodates either direct previous siblings or children of previous siblings. Regardless of which one matches, the last() predicate ensures that you get the closest previous match.

In need of an explanation of Web scraping with Nokogiri in Rails

I am hoping to help you with a real-world example. Let's get some data from Reuters.

In your console try this:

# Require your tools; make sure you have run `gem install nokogiri`
pry(main)> require 'nokogiri'
pry(main)> require 'open-uri'

# Set the URL
pry(main)> url = "http://www.reuters.com/finance/stocks/overview?symbol=0005.HK"

# Load the page and assign it to a variable
pry(main)> doc = Nokogiri::HTML(open(url))

# Take the piece of the site that has the class .sectionQuote (you can use ids too)
pry(main)> quote = doc.css(".sectionQuote")

Now, if you have a look inside quote, you will see Nokogiri elements. Let's look:

pry(main)> quote.size
=> 6

pry(main)> quote.first
=> #(Element:0x43ff468 {
     name = "div",
     attributes = [ #(Attr:0x43ff404 { name = "class", value = "sectionQuote nasdaqChange" })],
     children = [
       #(Text "\n\t\t\t"),
       #(Element:0x43fef18 {
         name = "div",
         attributes = [ #(Attr:0x43feeb4 { name = "class", value = "sectionQuoteDetail" })],
         children = [
           #(Text "\n\t\t\t\t"),
           #(Element:0x43fe9c8 { name = "span", attributes = [ #(Attr:0x43fe964 { name = "class", value = "nasdaqChangeHeader" })], children = [ #(Text "0005.HK on Hong Kong Stock")] }),
           .....
         ] }),
       #(Text "\n\t\t")]
   })

You can see that Nokogiri has essentially encapsulated each DOM element, so that you can search and access it quickly.

If you simply want to display this div element, you can:

pry(main)> quote.first.to_html
=> "<div class=\"sectionQuote nasdaqChange\">\n\t\t\t<div class=\"sectionQuoteDetail\">\n\t\t\t\t<span class=\"nasdaqChangeHeader\">0005.HK on Hong Kong Stock</span>\n\t\t\t\t<br class=\"clear\"><br class=\"clear\">\n\t\t\t\t<span style=\"font-size: 23px;\">\n\t\t\t\t82.85</span><span>HKD</span><br>\n\t\t\t\t<span class=\"nasdaqChangeTime\">14 Aug 2014</span>\n\t\t\t</div>\n\t\t</div>"

It is possible to use this directly in the view of a Rails application.

If you want to be more specific and take individual components, you can loop over the quote variable and inspect the elements one level down:

pry(main)> quote.each{|p| puts p.inspect}

Or be very specific and get the value of an element, e.g. the name of the stock in our example:

pry(main)> quote.at_css(".nasdaqChangeHeader").content
=> "0005.HK on Hong Kong Stock"

This is a very useful link: http://nokogiri.org/tutorials/searching_a_xml_html_document.html

I really hope this helps.

PS: A tip for looking inside objects
(http://ruby-doc.org/core-2.1.1/Object.html#method-i-inspect)

puts quote.inspect

Nokogiri: how to find a div by id and see what text it contains?

html = <<-HTML
<html>
<body>
<div id="verify" style="display: none;">foobar</div>
</body>
</html>
HTML
doc = Nokogiri::HTML html
puts 'verified!' if doc.at_css('[id="verify"]').text.eql? 'foobar'

How can I use Nokogiri to write a HUGE XML file?

Nokogiri is designed to build the document in memory: you construct a DOM, and it is serialized to XML on demand. That's easy to use, but there are trade-offs, and keeping everything in memory is one of them.

You might want to look into using Erubis to generate the XML. Rather than gathering all the data before processing and keeping the logic in a controller, as we'd do with Rails, you can put the logic in the template and have it iterate over your data as it writes, which should ease the memory demands.

If you need the XML in a file you might need to do that using redirection:

erubis options templatefile.erb > xmlfile

This is a very simple example, but it shows you could easily define a template to generate XML:

<% 
asdf = (1..5).to_a
%>
<xml>
<element>
<% asdf.each do |i| %>
<subelement><%= i %></subelement>
<% end %>
</element>
</xml>

which, when I call erubis test.erb outputs:

<xml>
<element>
<subelement>1</subelement>
<subelement>2</subelement>
<subelement>3</subelement>
<subelement>4</subelement>
<subelement>5</subelement>
</element>
</xml>

EDIT:

The string concatenation was taking forever...

Yes, it can, simply because of garbage collection. You don't show any code for how you're building your strings, but Ruby performs much better when you append to an existing string with << than when you build a new string each time with +.

It also might work better to not try to keep everything in a string, but instead to write it immediately to disk, appending to an open file as you go.
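Combining those two suggestions, here is a minimal sketch; the filename and the loop are stand-ins for your real data:

```ruby
# Stream each record to disk as it is generated, appending with <<,
# instead of accumulating one giant string in memory
File.open('huge.xml', 'w') do |f|
  f << "<xml>\n<element>\n"
  (1..5).each do |i|                 # stand-in for the real data source
    f << "<subelement>#{i}</subelement>\n"
  end
  f << "</element>\n</xml>\n"
end
```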

Again, without code examples I'm shooting in the dark about what you might be doing or why things run slow.

How to use Nokogiri to get the full HTML without any text content

NOTE: This is a very aggressive approach. Tags like <script>, <style>, and <noscript> also have child text() nodes containing CSS, HTML, and JS that you might not want to filter out depending on your use case.

If you operate on the parsed document instead of capturing the return value of your iterator, you'll be able to remove the text nodes, and then return the document:

require 'nokogiri'
html = "<html> <body> <div class='example'><span>Hello</span></div></body></html>"

# Parse HTML
doc = Nokogiri::HTML.parse(html)

puts doc.inner_html
# => "<html> <body> <div class=\"example\"><span>Hello</span></div>\n</body>\n</html>"

# Remove text nodes from parsed document
doc.xpath("//text()").each { |t| t.remove }

puts doc.inner_html
# => "<html><body><div class=\"example\"><span></span></div></body></html>"

Nokogiri get all HTML nodes

You could split each element's outer_xml on its inner_xml to recover its opening tag (storing the corresponding closing tag, if any, for later), and walk the document with Nokogiri's Reader to build the list in document order.

This requires that your document be a valid XML fragment, since it uses the XML parser rather than the HTML one.

require 'nokogiri'

[ "<html><body><h1>Header1</h1></body></html>",
  "<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class=\"style\">d</span>olor</p></div></body></html>",
  <<END
<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
END
].each { |string_page|
  elem_all  = Array.new
  elem_ends = Hash.new
  reader = Nokogiri::XML::Reader(string_page)
  reader.each { |node|
    if node.node_type.eql?(Nokogiri::XML::Reader::TYPE_ELEMENT)
      if node.self_closing?
        elem_all << node.outer_xml
      else
        # Splitting outer_xml on inner_xml yields the opening and closing tags
        elem_tags = node.outer_xml.split(node.inner_xml)
        elem_all << elem_tags.first
        elem_ends[node.local_name] = elem_tags[1] unless elem_tags.one?
      end
    end
    # On an end-element event, emit the stored closing tag
    elem_all << elem_ends[node.local_name] if node.node_type.eql?(Nokogiri::XML::Reader::TYPE_END_ELEMENT) and elem_ends.has_key?(node.local_name)
  }

  puts string_page
  puts elem_all.to_s
  puts
}

Outputs:

<html><body><h1>Header1</h1></body></html>
["<html>", "<body>", "<h1>", "</h1>", "</body>", "</html>"]

<html><body><div><h1>Title</h1><hr /></div><div><p>Lorem Ipsum<br />sit <span class="style">d</span>olor</p></div></body></html>
["<html>", "<body>", "<div>", "<h1>", "</h1>", "<hr/>", "</div>", "<div>", "<p>", "<br/>", "<span class=\"style\">", "</span>", "</p>", "</div>", "</body>", "</html>"]

<html>
<body>
<h1>Test</h1>
<p>test <strong> Jojo </strong></p>
</body>
</html>
["<html>", "<body>", "<h1>", "</h1>", "<p>", "<strong>", "</strong>", "</p>", "</body>", "</html>"]

