Best Rails HTML Parser
You are probably thinking about Nokogiri.
I have not used it myself, but "everyone" is talking about it and the benchmarks do look interesting:
hpricot:html:doc 48.930000 3.640000 52.570000 ( 52.900035)
hpricot2:html:doc 4.500000 0.020000 4.520000 ( 4.518984)
nokogiri:html:doc 3.640000 0.130000 3.770000 ( 3.770642)
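The four numbers in each row follow the layout that Ruby's standard Benchmark module prints: user CPU time, system CPU time, total CPU time, and elapsed real time in parentheses. Here is a minimal sketch of how one such row can be produced. The label and the workload are placeholders, not the original benchmark, which would call each parser on the same HTML document.

```ruby
require 'benchmark'

# Placeholder workload (an assumption for illustration); a real comparison
# would parse the same HTML document with each library here.
def workload
  100_000.times.map { |i| i.to_s }.join.length
end

# Benchmark.measure returns a Benchmark::Tms holding the timings of the block.
timing = Benchmark.measure('stand-in:html:doc') { workload }

# Tms#format: %n = label, %u = user CPU, %y = system CPU,
# %t = total CPU, %r = elapsed real time (printed in parentheses).
row = timing.format('%n %u %y %t %r')
puts row
```

Running each parser through the same `Benchmark.measure` call gives rows that can be compared column by column.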
Loading a webpage for parsing in Rails
You should try gems like Hpricot or Nokogiri.
Hpricot example:
require 'open-uri'
require 'rubygems'
require 'hpricot'
html = Hpricot(open(an_url).read)
# This would search for any images inside a paragraph (XPath)
html.search('/html/body//p//img')
# This would search for any images with the class "test" (CSS selector)
html.search('img.test')
Nokogiri example:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs
html = Nokogiri::HTML(URI.open(an_url).read)
# This would search for any images inside a paragraph (XPath)
html.xpath('/html/body//p//img')
# This would search for any images with the class "test" (CSS selector)
html.css('img.test')
Nokogiri is generally faster. Both libraries feature a lot of functionality.
Quickest/easiest way to parse HTML of a website?
Any of the languages you mentioned can do that, as long as you use the correct third-party libraries to help you.
You'll need something that crawls the site. This could be a completely separate program that just downloads the .html files to your computer, on which you'd then run the parser. Such robots exist: consider wget, which has recursive-download and spider options, or curl for fetching individual pages.
You'll need a parser for the site. Don't use regexps to parse HTML; use an HTML or XML parser (like Perl's HTML::Parser). Then you'll have to convert the resulting data structure to usable data (for example, the first table>tr>td is the monster name, the second td is the race, etc.).
Finally, you'll need to store those in your database in a way that lets you retrieve them later to serve on your site.
Writing the code won't be the hardest part; the mapping of "which item on the page means what and should be stored where and how" will be.
XML = HTML with Hpricot and Rails
Model, model, model, model, model. Skinny controllers, simple views.
The RedHandedHomePage model does the parsing on initialization. In the controller, call the model's render method, assign the output to an instance variable, and print that in a view.
How to extract style content from HTML file in Ruby?
You can use the Nokogiri gem:
require 'nokogiri'
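# File.read opens, reads, and closes the file in one call
page = Nokogiri::HTML(File.read("filepath/index.html"))
# at_css returns the first matching node (nil if there is no style tag)
first_style_tag = page.at_css('style')
puts first_style_tag.text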
See this tutorial: http://ruby.bastardsbook.com/chapters/html-parsing/
Not tested, so please try it out.
Ruby on Rails: how to render a string as HTML?
UPDATE
For security reasons, it is recommended to use sanitize instead of html_safe.
<%= sanitize @str %>
What's happening is that, as a security measure, Rails is escaping your string for you because it might have malicious code embedded in it. But if you tell Rails that your string is html_safe, it'll pass it right through.
@str = "<b>Hi</b>".html_safe
<%= @str %>
OR
@str = "<b>Hi</b>"
<%= @str.html_safe %>
Using raw works fine, but all it's doing is converting its argument to a string and then calling html_safe. When I know I have a string, I prefer calling html_safe directly, because it skips an unnecessary step and makes clearer what's going on. Details about string-escaping and XSS protection are in this Asciicast.
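To see what escaping does to a string outside of Rails, here is a small sketch using `ERB::Util.html_escape` from Ruby's standard library. It only illustrates the effect of escaping; `html_safe`, `raw`, and `sanitize` are Rails helpers and are not exercised here.

```ruby
require 'erb'

str = '<b>Hi</b>'

# Escaping turns markup characters into entities, so a browser shows the
# tags as text instead of interpreting them.
escaped = ERB::Util.html_escape(str)

puts escaped # prints &lt;b&gt;Hi&lt;/b&gt;
```

A string marked with html_safe skips this step, which is why it should only be used on content you trust.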
I can't parse the page and get links Nokogiri
I found a more interesting solution. For example:
link_driver = Nokogiri::HTML(page.source).at('a:contains("mac")').values.join('')
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
# URI#read requires open-uri
File.binwrite('filename.zip', URI.parse(chromedriver_storage + link_driver).read)
contains("mac") can be changed to contains("linux") or contains("win"); choose whichever operating system you need.
A second solution is to parse the page chromedriver.chromium.org and get information about all versions. If the version on the site is more recent than mine, I substitute the version number into a new download link:
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
chromedriver = '79.0.3945.36/' # get it using Capybara and cut out only the version
zip = 'chromedriver_mac64.zip'
link = chromedriver_storage + chromedriver + zip
File.binwrite('filename.zip', URI.parse(link).read)
It turns out that the parser, in headless mode, can be put into a crontab task to update the version of the current browser.
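The "more recent than mine" check is not spelled out in the answer. One way to sketch it is with `Gem::Version`, which ships with RubyGems and compares dotted version numbers segment by segment. The helper name `newer_version?` and the local version number are hypothetical.

```ruby
require 'rubygems' # Gem::Version ships with RubyGems (loaded by default in modern Ruby)

# Hypothetical helper: true when the site's version is newer than the local one.
def newer_version?(site_version, local_version)
  Gem::Version.new(site_version) > Gem::Version.new(local_version)
end

chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
zip = 'chromedriver_mac64.zip'

local = '79.0.3945.16' # made-up local version
site  = '79.0.3945.36' # version number from the answer

# Build the download link only when an update is actually available.
link = "#{chromedriver_storage}#{site}/#{zip}" if newer_version?(site, local)
puts link
```

Comparing the raw strings would be unreliable, since a plain string comparison ranks '79.0.3945.100' below '79.0.3945.36'.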