Best Rails HTML Parser

You are probably thinking about Nokogiri.
I have not used it myself, but "everyone" is talking about it and the benchmarks do look interesting:

                        user    system     total        real
hpricot:html:doc   48.930000  3.640000 52.570000 ( 52.900035)
hpricot2:html:doc   4.500000  0.020000  4.520000 (  4.518984)
nokogiri:html:doc   3.640000  0.130000  3.770000 (  3.770642)
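
The rows above have the shape of output from Ruby's standard Benchmark module (user, system, total, and real time in seconds). As a rough sketch of how such a table could be produced, the snippet below times two placeholder workloads; the labels and the work done in each lambda are invented for illustration and are not the parsers measured above.

```ruby
require 'benchmark'

# Placeholder workloads (hypothetical). In a real comparison each lambda
# would parse the same HTML document with a different library.
WORKLOADS = {
  'workload_a' => -> { (1..10_000).map { |i| i * 2 } },
  'workload_b' => -> { (1..10_000).sum }
}.freeze

# Benchmark.bm prints one row per report and returns the measurements
# as an array of Benchmark::Tms objects. The argument is the label width.
results = Benchmark.bm(12) do |bm|
  WORKLOADS.each do |label, work|
    bm.report(label) { 100.times { work.call } }
  end
end
```

To compare parsers this way, each lambda would need to run against the same input so that the timings are comparable.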

Loading a webpage for parsing in Rails

You should try gems like Hpricot or Nokogiri.

Hpricot example:

require 'open-uri'
require 'rubygems'
require 'hpricot'

html = Hpricot(open(an_url).read)
# This would search for any images inside a paragraph (XPath)
html.search('/html/body//p//img')
# This would search for any images with the class "test" (CSS selector)
html.search('img.test')

Nokogiri example:

require 'open-uri'
require 'rubygems'
require 'nokogiri'

# On Ruby 3.0+, open-uri no longer patches Kernel#open, so use URI.open
html = Nokogiri::HTML(URI.open(an_url).read)
# This would search for any images inside a paragraph (XPath)
html.xpath('/html/body//p//img')
# This would search for any images with the class "test" (CSS selector)
html.css('img.test')

Nokogiri is generally faster. Both libraries feature a lot of functionality.

Quickest/easiest way to parse HTML of a website?

Any of the languages you mentioned can do that, as long as you use the correct third-party libraries to help you.

You'll need something that crawls the site. This could be a completely separate program that just downloads the .html files to your computer, where you then run the parser on them. Such tools exist: wget can download a site recursively and has a spider mode, and curl can fetch individual pages, although it does not crawl links by itself.

You'll need a parser for the site. Don't use regular expressions to parse HTML; use an HTML or XML parser (like Perl's HTML::Parser). Then you'll have to convert the resulting data structure into usable data (for example, in the first table>tr the first td is the monster name, the second td is the race, etc.).

Finally, you'll need to store the results in your database in a way that lets you retrieve them later to serve on your site.

Actually, writing the code won't be the hardest thing, but the mapping on "which item on the page means what and should be stored where and how" will be.
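
The mapping step can be kept separate from the parsing step. Here is a minimal sketch, assuming the parser has already produced the text of each td as an array of strings per row; the column layout, the helper name row_to_record, and the sample rows are all hypothetical.

```ruby
# Hypothetical column layout: position in the row -> field name.
COLUMNS = %i[name race].freeze

# Turn one row of cell texts into a record hash, trimming whitespace.
# Cells beyond the known columns are ignored; missing cells become nil.
def row_to_record(cells)
  COLUMNS.each_with_index.map { |field, index| [field, cells[index]&.strip] }.to_h
end

# Sample data standing in for the text of each <td>, as an HTML parser
# might return it (invented for illustration).
rows = [
  ['Goblin ', 'Greenskin'],
  ['Dragon', ' Reptile', 'extra cell']
]

records = rows.map { |cells| row_to_record(cells) }
records.each { |record| puts record.inspect }
```

Keeping the layout in one constant means that when the page changes, only the mapping needs to be updated, not the crawling or storage code.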

XML = HTML with Hpricot and Rails

Model, model, model, model, model. Skinny controllers, simple views.

Have the RedHandedHomePage model do the parsing on initialization, then call its render method from the controller, assign the output to an instance variable, and print that in the view.

How to extract style content from HTML file in Ruby?

You can use the Nokogiri gem:

require 'nokogiri'

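# Read the whole file and parse it (File.read closes the file for you)
page = Nokogiri::HTML(File.read("filepath/index.html"))
# at_css returns the first matching node, or nil if there is none
first_style_tag = page.at_css('style')
puts first_style_tag.text if first_style_tag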

See this tutorial: http://ruby.bastardsbook.com/chapters/html-parsing/

Not tested; please try it out.

Ruby on Rails: how to render a string as HTML?

UPDATE

For security reasons, it is recommended to use sanitize instead of html_safe.

<%= sanitize @str %>

What's happening is that, as a security measure, Rails is escaping your string for you because it might have malicious code embedded in it. But if you tell Rails that your string is html_safe, it'll pass it right through.

@str = "<b>Hi</b>".html_safe
<%= @str %>

OR

@str = "<b>Hi</b>"
<%= @str.html_safe %>

Using raw works fine, but all it does is call to_s on its argument and then call html_safe. When I know I have a string, I prefer calling html_safe directly, because it skips an unnecessary step and makes it clearer what's going on. Details about string escaping and XSS protection are in this Asciicast.
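
To see what the default escaping does to a string, here is a small sketch that runs outside Rails. It uses ERB::Util.html_escape from Ruby's standard library as a stand-in for the escaping Rails applies in views; html_safe and sanitize are Rails methods and are not shown here. The input strings are made up for illustration.

```ruby
require 'erb'

# Untrusted input, e.g. something a user typed into a form.
user_input = '<script>alert("x")</script>'

# html_escape replaces the characters that are special in HTML with
# entities, so a browser displays the text instead of interpreting it.
escaped = ERB::Util.html_escape(user_input)
puts escaped

# Markup you wrote yourself gets escaped the same way, which is why it
# shows up as literal text unless it is marked as safe.
trusted = '<b>Hi</b>'
puts ERB::Util.html_escape(trusted)
```

This is why the script tag from untrusted input is harmless by default, and why marking a string as html_safe should be reserved for markup you control.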

I can't parse the page and get links Nokogiri

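I found a more interesting solution. For example:
require 'open-uri' # adds #read to URI objects
# Take the attribute values (the href) of the first link whose text contains "mac"
link_driver = Nokogiri::HTML(page.source).at('a:contains("mac")').values.join('')
chromedriver_storage_page = 'https://chromedriver.storage.googleapis.com/'
# Write in binary mode, since the download is a zip archive
File.binwrite('filename.zip', URI.parse(chromedriver_storage_page + link_driver).read)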

contains("mac") can be changed to contains("linux") or contains("win"); pick whichever matches your operating system.

A second solution is to parse the page chromedriver.chromium.org and get information about all versions. If the version on the site is more recent than mine, I substitute the new version number into the download link:

require 'open-uri' # adds #read to URI objects

chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
chromedriver = '79.0.3945.36/' # get using Capybara and cut only the version

zip = 'chromedriver_mac64.zip'
link = chromedriver_storage + chromedriver + zip
# Write in binary mode, since the download is a zip archive
File.binwrite('filename.zip', URI.parse(link).read)

It turns out that the parser, running in headless mode, can be added as a crontab task that keeps the driver up to date with the current browser version.
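
The "is the site's version newer than mine" check is easy to get wrong with plain string comparison, since "9.0" sorts after "10.0" as text. Here is a minimal sketch using Gem::Version, which compares dotted versions segment by segment. The helper name update_url and the older version number are hypothetical; the storage URL and zip name are taken from the snippet above.

```ruby
require 'rubygems' # provides Gem::Version (loaded by default in modern Ruby)

STORAGE = 'https://chromedriver.storage.googleapis.com/'

# Returns the download URL when `latest` is newer than `installed`,
# otherwise nil. Both arguments are dotted version strings.
def update_url(installed, latest, zip = 'chromedriver_mac64.zip')
  return nil unless Gem::Version.new(latest) > Gem::Version.new(installed)

  "#{STORAGE}#{latest}/#{zip}"
end

# Example version numbers; in practice `latest` would come from parsing
# the site and `installed` from the local driver.
puts update_url('78.0.3904.105', '79.0.3945.36').inspect
puts update_url('79.0.3945.36', '79.0.3945.36').inspect
```

A cron job could call a helper like this and download only when it returns a URL.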


