Method to Parse HTML Document in Ruby

Method to parse HTML document in Ruby?

There is no built-in HTML parser (yet), but some very good ones are available, in particular Nokogiri.

Meta-answer: For common needs like these, I'd recommend checking out the Ruby Toolbox site. You'll notice that Nokogiri is the top recommendation for HTML parsers.

Parse HTML string into array

Here's how you could do it with a SAX parser:

require 'nokogiri'

html = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"

class ArraySplitParser < Nokogiri::XML::SAX::Document
  attr_reader :array
  def initialize; @array = []; end
  def start_element(name, attrs=[])
    tag = "<" + name
    attrs.each { |k,v| tag += " #{k}=\"#{v}\"" }
    @array << tag + ">"
  end
  def end_element(name); @array << "</#{name}>"; end
  def characters(str); @array += str.gsub(/\s/, '\0|').split('|'); end
end

parser = ArraySplitParser.new
Nokogiri::XML::SAX::Parser.new(parser).parse(html)
puts parser.array.inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>"]

Note that you'll have to wrap your HTML in a root element so that the XML parser doesn't miss the second paragraph in your example. Something like this should work:

# ...
Nokogiri::XML::SAX::Parser.new(parser).parse('<x>' + html + '</x>')
# ...
puts parser.array[1..-2].inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>", "<p>", "The ", "second ", "paragraph.", "</p>"]

[Edit] Updated to demonstrate how to retain element attributes in the "start_element" method.
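One thing to watch in the characters callback above: it uses | as a temporary marker, so text that already contains a | gets split there too. A possible alternative, sketched here in plain Ruby with a made-up sample string, is to split at every position that follows a whitespace character:

```ruby
# Made-up sample text containing a literal pipe and a double space.
text = "a|b c  d"

# Approach from the parser above: mark each whitespace character, then split.
marked = text.gsub(/\s/, '\0|').split('|')

# Alternative: split at each position preceded by whitespace (zero-width lookbehind).
lookbehind = text.split(/(?<=\s)/)

puts marked.inspect      # => ["a", "b ", "c ", " ", "d"]
puts lookbehind.inspect  # => ["a|b ", "c ", " ", "d"]
```

For text without a | the two approaches give the same result, so this only matters if your content can contain that character.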

Parse HTML string with Nokogiri

The problem is that you have double-quotes within your string which are confusing the parser, because you're also using double-quotes to surround the string. To illustrate:

puts "foo"bar"
# => SyntaxError: unexpected tIDENTIFIER, expecting end-of-input
#    puts "foo"bar"
#                 ^

You might intend for this to print foo"bar, but when the parser gets to the second " (after foo) it thinks the string is over, and so the stuff after it causes a syntax error. (Stack Overflow's syntax highlighting even gives you a hint—see how on the first line "foo" is colored differently from bar"? A good syntax-highlighting text editor will do the same thing.)

One solution is to use a single-quote instead:

puts 'bar"baz'
# => bar"baz

That fixes the problem in this case, but won't actually help you because your string also has single-quotes inside it!

Another solution is to escape your quotation marks by preceding them with a \, like so:

puts "foo\"bar"
# => foo"bar

...but that gets a little tedious (and sometimes tricky) for long strings like yours. A better solution is to use a special kind of string called a "heredoc" (for "here document," for what it's worth):

str = <<-END_OF_HTML
  <html>
  <meta content="text/html; charset=UTF-8"/>
  <body style='margin:20px'>
    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>
    <ul style='list-style-type:none; margin:25px 15px;'>
      <li><b>User name:</b> Test User</li>
      <li><b>User email:</b> test@abc.com</li>
      <li><b>Identifier:</b> abc123def132afd1213afas</li>
      <li><b>Description:</b> Tom's iPad</li>
      <li><b>Model:</b> iPad 3</li>
      <li><b>Platform:</b> </li>
      <li><b>App:</b> Test app name</li>
      <li><b>UserID:</b> </li>
    </ul>
    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>
    <hr style='height=2px; color:#aaa'/>
    <p>We hope you enjoy the app store experience!</p>
    <p style='font-size:18px; color:#999'>Powered by App47</p>
    <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/>
  </body>
  </html>
END_OF_HTML

html_doc = Nokogiri::HTML(str)

The delimiter "END_OF_HTML" is arbitrary. You could use EOF or XYZZY or whatever suits your fancy instead, although it's a good idea to use something meaningful. (You'll notice that Stack Overflow's syntax highlighting has a little trouble with heredocs; most code editors do fine with them, though.)

You can make this a little more compact like this:

Nokogiri::HTML <<-END_OF_HTML
  <html>
  <meta content="text/html; charset=UTF-8"/>
  <body style='margin:20px'>
    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>
    <ul style='list-style-type:none; margin:25px 15px;'>
      <li><b>User name:</b> Test User</li>
      <li><b>User email:</b> test@abc.com</li>
      <li><b>Identifier:</b> abc123def132afd1213afas</li>
      <li><b>Description:</b> Tom's iPad</li>
      <li><b>Model:</b> iPad 3</li>
      <li><b>Platform:</b> </li>
      <li><b>App:</b> Test app name</li>
      <li><b>UserID:</b> </li>
    </ul>
    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>
    <hr style='height=2px; color:#aaa'/>
    <p>We hope you enjoy the app store experience!</p>
    <p style='font-size:18px; color:#999'>Powered by App47</p>
    <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/>
  </body>
  </html>
END_OF_HTML

Or with parentheses (it looks a little odd, but it works, and is sometimes necessary):

Nokogiri::HTML(<<-END_OF_HTML)
  <html>
  <meta content="text/html; charset=UTF-8"/>
  <body style='margin:20px'>
    <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>
    <ul style='list-style-type:none; margin:25px 15px;'>
      <li><b>User name:</b> Test User</li>
      <li><b>User email:</b> test@abc.com</li>
      <li><b>Identifier:</b> abc123def132afd1213afas</li>
      <li><b>Description:</b> Tom's iPad</li>
      <li><b>Model:</b> iPad 3</li>
      <li><b>Platform:</b> </li>
      <li><b>App:</b> Test app name</li>
      <li><b>UserID:</b> </li>
    </ul>
    <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>
    <hr style='height=2px; color:#aaa'/>
    <p>We hope you enjoy the app store experience!</p>
    <p style='font-size:18px; color:#999'>Powered by App47</p>
    <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/>
  </body>
  </html>
END_OF_HTML

You can read more about heredocs, and other ways to represent strings, in the Literals section of the Ruby documentation.
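As a quick tour of a few of those literal forms, here is a small sketch. The one-line HTML snippet is made up for brevity, and <<~ assumes Ruby 2.3 or later.

```ruby
# <<-ID lets the closing delimiter be indented; the body keeps its indentation.
dash = <<-END_OF_HTML
  <p class="note">Tom's iPad</p>
  END_OF_HTML

# <<~ID (Ruby 2.3+) also strips the common leading indentation from the body.
squiggly = <<~END_OF_HTML
  <p class="note">Tom's iPad</p>
END_OF_HTML

# Quoting the delimiter turns off interpolation and escapes, like single quotes.
raw = <<~'END_OF_HTML'
  <p>#{not_interpolated}</p>
END_OF_HTML

# %q(...) is another literal that tolerates both kinds of quotes.
percent = %q(<p class="note">Tom's iPad</p>)

puts dash.inspect
puts squiggly.inspect
puts raw.inspect
puts percent.inspect
```

Whichever form you pick, both the " and ' characters inside the markup are left alone, which is the whole point here.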

Simple parsing in Ruby

html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><link rel="stylesheet" type="text/css" href="http://2.ai/styles/hello.css" media="screen"/><title>Welcome to Dotgeek.org * 1.ai</title></head>'
html.match(/<title>(.*)<\/title>/)[1] #=> "Welcome to Dotgeek.org * 1.ai"
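That works for this sample, but match returns nil when there is no title, so the [1] would raise, and the greedy (.*) does not cross line breaks. A slightly more defensive sketch follows; extract_title is a hypothetical helper name, not part of any library. A regex is still only suitable for simple, predictable markup like this.

```ruby
# Hypothetical helper: returns the title text, or nil when no <title> is found.
def extract_title(html)
  # Non-greedy capture, case-insensitive (i), and . matches newlines (m).
  # String#[] with a regexp and a capture index returns nil on no match.
  title = html[/<title[^>]*>(.*?)<\/title>/im, 1]
  title && title.strip
end

one_line   = extract_title('<head><title>Welcome to Dotgeek.org * 1.ai</title></head>')
multi_line = extract_title("<TITLE>\n  Spread over lines\n</TITLE>")
missing    = extract_title('<p>no title here</p>')

puts one_line.inspect    # => "Welcome to Dotgeek.org * 1.ai"
puts multi_line.inspect  # => "Spread over lines"
puts missing.inspect     # => nil
```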

Parse HTML (without HTML semantics being followed) using Nokogiri

I ended up using the Nokogiri::XML parser to parse the HTML document, which meant changing my script in numerous places.

Parsing code

@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!

Changes made

Changed the attribute method to attr.
Chaining attr with the text method is not needed here.
I still need to check how invalid HTML5 tags are handled, though.
Some more parsing logic changes were needed.
node.to_html works like a charm here, so I was able to store the complete HTML in the database.

HTML Parser into DOM in Ruby

Despite your remark, Nokogiri is the way to go:

doc = Nokogiri::HTML('<body><p>Hello, worlds!</body>')

It parses even invalid HTML and returns a DOM tree:

>> doc.class
=> Nokogiri::HTML::Document
>> doc.root.class
=> Nokogiri::XML::Element
>> doc.root.children.class
=> Nokogiri::XML::NodeSet
>> doc.root.children.first.content
=> "Hello, worlds!"

Parse HTML using ruby core libraries? (ie, no gems required)

There is no HTML parser in the Ruby stdlib. HTML parsers have to be more forgiving of bad markup than XML parsers.

You could run the HTML through tidy (http://tidy.sourceforge.net) to clean it up and produce valid markup. The result can then be read via REXML :-) which is in the stdlib.

REXML was much slower than Nokogiri when I last checked in 2009, although Sam Ruby had been working on making REXML faster.
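Here is a minimal sketch of the REXML half of that idea. It assumes the markup has already been cleaned up into well-formed XHTML (the sample string below is made up and stands in for tidy's output). Note that since Ruby 3.0, REXML ships as a bundled gem rather than a default library, so it may need to be listed in your Gemfile.

```ruby
require 'rexml/document'

# Made-up sample, standing in for the well-formed output of a cleanup step.
xhtml = <<~XHTML
  <html>
    <head><title>Hello</title></head>
    <body>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# XPath.first returns the first matching element; XPath.match returns all of them.
title      = REXML::XPath.first(doc, '//title').text
paragraphs = REXML::XPath.match(doc, '//p').map(&:text)

puts title
puts paragraphs.inspect

# REXML is a strict XML parser: mismatched tags raise instead of being repaired.
strict = false
begin
  REXML::Document.new('<p><b>mismatched</p></b>')
rescue REXML::ParseException
  strict = true
end
puts strict
```

The rescue at the end is the reason for the cleanup step: markup that a browser would shrug off makes REXML give up.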

A better way would be to have a better deployment process. Take a look at http://gembundler.com/bundle_package.html and consider using Capistrano (or some such) to provision servers.
