Method to parse HTML document in Ruby?
There is no built-in HTML parser (yet), but some very good ones are available, in particular Nokogiri.
Meta-answer: For common needs like these, I'd recommend checking out the Ruby Toolbox site. You'll notice that Nokogiri is the top recommendation for HTML parsers.
Parse HTML string into array
Here's how you could do it with a SAX parser:
require 'nokogiri'
html = "<p>Here is a paragraph. A sentence with <strong>bold text</strong>.</p><p>The second paragraph.</p>"
class ArraySplitParser < Nokogiri::XML::SAX::Document
  attr_reader :array
  def initialize; @array = []; end
  def start_element(name, attrs=[])
    tag = "<" + name
    attrs.each { |k,v| tag += " #{k}=\"#{v}\"" }
    @array << tag + ">"
  end
  def end_element(name); @array << "</#{name}>"; end
  def characters(str); @array += str.gsub(/\s/, '\0|').split('|'); end
end
parser = ArraySplitParser.new
Nokogiri::XML::SAX::Parser.new(parser).parse(html)
puts parser.array.inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>"]
Note that you'll have to wrap your HTML in a root element so that the XML parser doesn't miss the second paragraph in your example. Something like this should work:
# ...
Nokogiri::XML::SAX::Parser.new(parser).parse('<x>' + html + '</x>')
# ...
puts parser.array[1..-2].inspect
# ["<p>", "Here ", "is ", "a ", "paragraph. ", "A ", "sentence ", "with ", "<strong>", "bold ", "text", "</strong>", ".", "</p>", "<p>", "The ", "second ", "paragraph.", "</p>"]
[Edit] Updated to demonstrate how to retain element attributes in the "start_element" method.
Parse HTML string with Nokogiri
The problem is that you have double-quotes within your string which are confusing the parser, because you're also using double-quotes to surround the string. To illustrate:
puts "foo"bar"
# => SyntaxError: unexpected tIDENTIFIER, expecting end-of-input
# puts "foo"bar"
# ^
You might intend for this to print foo"bar, but when the parser gets to the second " (after foo) it thinks the string is over, and so the stuff after it causes a syntax error. (Stack Overflow's syntax highlighting even gives you a hint: see how on the first line "foo" is colored differently from bar"? A good syntax-highlighting text editor will do the same thing.)
One solution is to use a single-quote instead:
puts 'bar"baz'
# => bar"baz
That fixes the problem in this case, but won't actually help you because your string also has single-quotes inside it!
Another solution is to escape your quotation marks by preceding them with a \, like so:
puts "foo\"bar"
# => foo"bar
...but that gets a little tedious (and sometimes tricky) for long strings like yours. A better solution is to use a special kind of string called a "heredoc" (for "here document," for what it's worth):
str = <<-END_OF_HTML
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
html_doc = Nokogiri::HTML(str)
The delimiter "END_OF_HTML" is arbitrary. You could use EOF or XYZZY or whatever suits your fancy instead, although it's a good idea to use something meaningful. (You'll notice that Stack Overflow's syntax highlighting has a little trouble with heredocs; most code editors do fine with them, though.)
You can make this a little more compact like this:
Nokogiri::HTML <<-END_OF_HTML
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
Or with parentheses (it looks a little odd, but it works, and is sometimes necessary):
Nokogiri::HTML(<<-END_OF_HTML)
<html> <meta content="text/html; charset=UTF-8"/> <body style='margin:20px'> <p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p> <ul style='list-style-type:none; margin:25px 15px;'> <li><b>User name:</b> Test User</li> <li><b>User email:</b> test@abc.com</li> <li><b>Identifier:</b> abc123def132afd1213afas</li> <li><b>Description:</b> Tom's iPad</li> <li><b>Model:</b> iPad 3</li> <li><b>Platform:</b> </li> <li><b>App:</b> Test app name</li> <li><b>UserID:</b> </li> </ul> <p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/> <p>We hope you enjoy the app store experience!</p> <p style='font-size:18px; color:#999'>Powered by App47</p> <img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
You can read more about heredocs, and other ways to represent strings, in the Literals section of the Ruby documentation.
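As a small illustration of the options discussed above, the following sketch writes the same quote-laden string three ways: with escapes, with a "squiggly" heredoc (<<~, available since Ruby 2.3, which also strips the common leading indentation), and with a %q literal. The markup and variable names are made up for the example.

```ruby
# Three ways to write a string that contains both kinds of quotes.
# The HTML snippet is a made-up example, not the asker's document.

escaped = "<p class=\"note\">Tom's iPad</p>"

# Squiggly heredoc: removes the common leading indentation and
# keeps a trailing newline, hence the chomp.
heredoc = <<~END_OF_HTML.chomp
  <p class="note">Tom's iPad</p>
END_OF_HTML

# %q behaves like a single-quoted string with a delimiter of your choice.
percent = %q{<p class="note">Tom's iPad</p>}

puts escaped == heredoc   # true
puts heredoc == percent   # true
```

The heredoc tends to be the most readable choice for multi-line markup, while %q is handy for short one-liners.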
simple parsing in ruby
html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><link rel="stylesheet" type="text/css" href="http://2.ai/styles/hello.css" media="screen"/><title>Welcome to Dotgeek.org * 1.ai</title></head>'
html.match(/<title>(.*)<\/title>/)[1] #=> "Welcome to Dotgeek.org * 1.ai"
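Note that the one-liner above raises NoMethodError when the string has no title (match returns nil), and its greedy .* can overrun if there is more than one closing tag. A slightly more defensive sketch follows; the helper name extract_title is hypothetical, and this is still a regex shortcut rather than a substitute for a real parser on arbitrary HTML.

```ruby
# Hypothetical helper: return the text of the first <title> element, or nil.
# String#[] with a regex and a group index returns nil when nothing matches.
def extract_title(html)
  # .*? is non-greedy; /m lets . span newlines; /i tolerates <TITLE>.
  title = html[/<title[^>]*>(.*?)<\/title>/mi, 1]
  title&.strip
end

html = '<html><head><title>Welcome to Dotgeek.org * 1.ai</title></head>'
puts extract_title(html).inspect                    # "Welcome to Dotgeek.org * 1.ai"
puts extract_title('<p>no title here</p>').inspect  # nil
```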
Parse HTML (without HTML semantics being followed) using Nokogiri
I ended up using the Nokogiri::XML parser for parsing the HTML doc. I had to change my script at numerous places.
Parsing code
@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!
Changes done:
- Changed the attribute method to attr.
- Chaining attr with the text method is not needed here (I still need to check how invalid HTML5 tags are handled, though).
- Some more parsing logic changes were needed.
node.to_html works like a charm here, so I was able to store the complete HTML in the db.
HTML Parser into DOM in Ruby
Despite your remark, Nokogiri is the way to go:
doc = Nokogiri::HTML('<body><p>Hello, worlds!</body>')
It parses even invalid HTML and returns a DOM tree:
>> doc.class
=> Nokogiri::HTML::Document
>> doc.root.class
=> Nokogiri::XML::Element
>> doc.root.children.class
=> Nokogiri::XML::NodeSet
>> doc.root.children.first.content
=> "Hello, worlds!"
Parse HTML using ruby core libraries? (ie, no gems required)
There is no HTML parser in the Ruby stdlib. HTML parsers have to be more forgiving of bad markup than XML parsers.
You could run the HTML through tidy (http://tidy.sourceforge.net) to clean it up and produce valid markup. The result can then be read with REXML, which is in the stdlib.
REXML is much slower than Nokogiri (last checked in 2009), although Sam Ruby had been working on making REXML faster.
A better way would be to have a better deployment process. Take a look at http://gembundler.com/bundle_package.html and at using Capistrano (or some such) to provision servers.
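For the REXML route, a minimal sketch might look like the following. It assumes the markup is already well-formed (for example after a pass through tidy) and that the rexml library is loadable, which it is in a default Ruby install (it has shipped as a bundled gem since Ruby 3.0). The sample document is made up.

```ruby
require 'rexml/document'

# Made-up, well-formed markup. REXML raises REXML::ParseException on
# mismatched tags, which is why sloppy HTML needs tidying first.
xhtml = '<html><head><title>Example</title></head>' \
        '<body><p>First</p><p>Second</p></body></html>'

doc = REXML::Document.new(xhtml)

# XPath.first returns the first matching element; XPath.match returns all of them.
title = REXML::XPath.first(doc, '//title').text
paragraphs = REXML::XPath.match(doc, '//p').map(&:text)

puts title               # Example
puts paragraphs.inspect  # ["First", "Second"]
```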