What is the difference between Ruby's 'open-uri' and 'net/http' standard libraries?
They look like they perform similar tasks because OpenURI is a wrapper for Net::HTTP, Net::HTTPS, and Net::FTP.
Usually, unless you need a lower-level interface, OpenURI is the better choice because you can get by with less code. With OpenURI you can open a URL/URI and treat it as a file.
See: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/open-uri/rdoc/OpenURI.html
and http://ruby-doc.org/stdlib-1.9.3//libdoc/net/http/rdoc/Net.html
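A minimal sketch of that file-like usage (the URL is just an example; the URL form of Kernel#open was removed in Ruby 3.0, so URI.open is used here, and the fetch is guarded so the sketch degrades gracefully without network access):

```ruby
require 'open-uri'

# URI.open yields an IO-like object that you read like a file.
begin
  URI.open('https://www.ruby-lang.org/en/') do |io|
    puts io.read[0, 200]   # first 200 bytes of the response body
    puts io.content_type   # OpenURI also adds metadata helpers
  end
rescue StandardError => e
  warn "fetch failed (offline?): #{e.message}"
end
```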
Ruby: what is the difference between using open() and NET::HTTP module to fetch web content?
Rule of thumb: use OpenURI whenever you can.
The reason is that OpenURI is just a wrapper around Net::HTTP, so it requires less code. If all you are doing is performing simple GET requests, go for it.
On the other hand, prefer Net::HTTP if you want lower-level functionality that OpenURI does not give you. It is not a better approach, but it provides more flexibility in terms of configuration.
As the official documentation states:
If you are only performing a few GET requests you should try OpenURI.
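As a sketch of what that lower-level control looks like, here is a Net::HTTP request with explicit timeouts; the timeout values and URL are arbitrary assumptions, and the request is guarded so the sketch runs without network access:

```ruby
require 'net/http'

uri  = URI('https://www.ruby-lang.org/en/')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl      = uri.scheme == 'https'
http.open_timeout = 5    # seconds to establish the TCP connection
http.read_timeout = 10   # seconds to wait for the response

begin
  res = http.get(uri.request_uri)
  puts res.code            # HTTP status, e.g. "200"
rescue StandardError => e
  warn "request failed (offline?): #{e.message}"
end
```

OpenURI offers no direct equivalent of this per-connection configuration, which is the kind of flexibility the quote above is referring to.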
Ruby/Rails Performance: OpenURI vs Net::HTTP vs Curb vs Rest-Client
I get the following results with the benchmark below, retrieving data from Google.
Warming up --------------------------------------
OpenURI 3.000 i/100ms
Net::HTTP 3.000 i/100ms
curb 3.000 i/100ms
rest_client 3.000 i/100ms
Calculating -------------------------------------
OpenURI 34.848 (±11.5%) i/s - 687.000 in 20.013469s
Net::HTTP 35.433 (±14.1%) i/s - 594.000 in 20.006947s
curb 31.612 (±19.0%) i/s - 465.000 in 20.021108s
rest_client 34.331 (±11.7%) i/s - 675.000 in 20.044486s
Comparison:
Net::HTTP: 35.4 i/s
OpenURI: 34.8 i/s - same-ish: difference falls within error
rest_client: 34.3 i/s - same-ish: difference falls within error
curb: 31.6 i/s - same-ish: difference falls within error
And here is the source code of the benchmark:
require 'benchmark/ips'
require 'open-uri'
require 'net/http'
require 'curb'
require 'rest-client'
google_uri = URI('http://www.google.com/')
google_uri_string = google_uri.to_s
Benchmark.ips do |x|
  x.config(time: 20, warmup: 10)

  x.report('OpenURI')     { open(google_uri_string) }
  x.report('Net::HTTP')   { Net::HTTP.get(google_uri) }
  x.report('curb')        { Curl.get(google_uri_string) }
  x.report('rest_client') { RestClient.get(google_uri_string) }

  x.compare!
end
ENVIRONMENT:
- AWS EC2 server
- Ruby version - 2.5.1p57
- Gems: curb-0.9.6, rest-client-2.0.2, benchmark-ips-2.7.2
NOTES:
Don't forget to install the gems before running this benchmark:
gem install curb rest-client benchmark-ips
For more accurate results, run it in a stable network environment, such as a production server.
Retrieve contents of URL as string
The open method passes an IO representation of the resource to your block when it yields. You can read from it using IO#read:
open([mode [, perm]] [, options]) [{|io| ... }]
open(path) { |io| data = io.read }
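The same idea can be collapsed into a one-liner that returns the body as a String, by passing &:read as the block; a sketch using an example URL, guarded so it degrades gracefully without network access:

```ruby
require 'open-uri'

# &:read turns the block form into "read the whole body as a String".
html = begin
  URI.open('https://www.ruby-lang.org/en/', &:read)
rescue StandardError => e
  warn "fetch failed: #{e.message}"
  nil
end

puts html ? "got #{html.bytesize} bytes" : 'no body'
```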
503 error when using ruby's open-uri to access a specific site
There are workarounds, but the best idea is to be a good citizen according to their terms.
You might want to confirm that you are following their Terms of Service:
If you operate a search engine or robot, or you republish a significant fraction of all Quora Content (as we may determine in our reasonable discretion), you must additionally follow these rules:
- You must use a descriptive user agent header.
- You must follow robots.txt at all times.
- You must make it clear how to contact you, either in your user agent string, or on your website if you have one.
You can set your user-agent header easily using OpenURI:
Additional header fields can be specified by an optional hash argument.
open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}
Robots.txt can be retrieved from http://www.quora.com/robots.txt. You'll need to parse it and honor its settings or they'll ban you again.
Also, you might want to restrict the speed of your code by sleeping between loops.
Also, if you are spidering their site for content, you might want to look into caching pages locally, or using one of the spidering packages. It's easy to write a spider; it's more work to write one that plays nicely with a site, but better that than not being able to spider their site at all.
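Putting those suggestions together, a throttled fetch loop might look like the following sketch; the one-second delay, User-Agent string, and contact address are illustrative assumptions, not values taken from Quora's terms:

```ruby
require 'open-uri'

HEADERS = {
  'User-Agent' => 'MyBot/1.0 (+http://example.com/bot)', # descriptive UA
  'From'       => 'foo@bar.invalid'                      # contact info
}.freeze

urls = %w[https://www.quora.com/robots.txt]

urls.each do |url|
  begin
    body = URI.open(url, HEADERS, &:read)
    puts "#{url}: #{body.bytesize} bytes"
  rescue StandardError => e
    warn "#{url}: #{e.message}"   # 503, network error, etc.
  end
  sleep 1   # throttle between requests so we don't hammer the site
end
```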
how do I include a header in an http request in ruby
get_response is a shorthand for making a request; when you need more control, build the full request yourself.
There's an example in the Ruby standard library documentation:
uri = URI.parse("http://www.ncdc.noaa.gov/cdo-web/api/v2/datasets/")
req = Net::HTTP::Get.new(uri)
req['token'] = 'fjhKJFSDHKJHjfgsdfdsljh'
res = Net::HTTP.start(uri.hostname, uri.port) {|http|
  http.request(req)
}
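If the endpoint is HTTPS, the same pattern additionally needs use_ssl: true passed to Net::HTTP.start; here is a sketch with a hypothetical host and a placeholder token (the request is guarded because the .invalid domain never resolves):

```ruby
require 'net/http'

uri = URI('https://api.example.invalid/v2/datasets/')  # hypothetical host
req = Net::HTTP::Get.new(uri)
req['token'] = 'YOUR_TOKEN_HERE'                       # placeholder credential

begin
  res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
    http.request(req)
  end
  puts res.code
rescue StandardError => e
  warn "request failed: #{e.message}"
end
```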