What is the difference between Ruby's 'open-uri' and 'net/http' standard libraries?
They look like they perform similar tasks because OpenURI is a wrapper for Net::HTTP, Net::HTTPS, and Net::FTP.
Usually, unless you need a lower-level interface, OpenURI is the better choice because you can get by with less code. With OpenURI you can open a URL/URI and treat it as a file.
See: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/open-uri/rdoc/OpenURI.html
and http://ruby-doc.org/stdlib-1.9.3//libdoc/net/http/rdoc/Net.html
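A minimal sketch of that file-like usage (the URL is just an example; the URL form of Kernel#open was removed in Ruby 3.0, so URI.open is used here, and the fetch is guarded so the sketch degrades gracefully without network access):

```ruby
require 'open-uri'

# URI.open yields an IO-like object that you read like a file.
begin
  URI.open('https://www.ruby-lang.org/en/') do |io|
    puts io.read[0, 200]   # first 200 bytes of the response body
    puts io.content_type   # OpenURI also adds metadata helpers
  end
rescue StandardError => e
  warn "fetch failed (offline?): #{e.message}"
end
```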
Ruby: what is the difference between using open() and NET::HTTP module to fetch web content?
Rule of thumb: use OpenURI whenever you can.
The reason is that OpenURI is just a wrapper around Net::HTTP, so it requires less code. If all you are doing is performing simple GET requests, go for it.
On the other hand, prefer Net::HTTP if you want lower-level functionality that OpenURI does not give you. It is not a better approach, but it provides more flexibility in terms of configuration.
As the official documentation states:
If you are only performing a few GET requests you should try OpenURI.
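As a sketch of what that lower-level control looks like, here is a Net::HTTP request with explicit timeouts; the timeout values and URL are arbitrary assumptions, and the request is guarded so the sketch runs without network access:

```ruby
require 'net/http'

uri  = URI('https://www.ruby-lang.org/en/')
http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl      = uri.scheme == 'https'
http.open_timeout = 5    # seconds to establish the TCP connection
http.read_timeout = 10   # seconds to wait for the response

begin
  res = http.get(uri.request_uri)
  puts res.code            # HTTP status, e.g. "200"
rescue StandardError => e
  warn "request failed (offline?): #{e.message}"
end
```

OpenURI offers no direct equivalent of this per-connection configuration, which is the kind of flexibility the quote above is referring to.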
Ruby/Rails Performance: OpenURI vs Net::HTTP vs Curb vs Rest-Client
I get the following results with the benchmark below, retrieving data from Google.
Warming up --------------------------------------
OpenURI 3.000 i/100ms
Net::HTTP 3.000 i/100ms
curb 3.000 i/100ms
rest_client 3.000 i/100ms
Calculating -------------------------------------
OpenURI 34.848 (±11.5%) i/s - 687.000 in 20.013469s
Net::HTTP 35.433 (±14.1%) i/s - 594.000 in 20.006947s
curb 31.612 (±19.0%) i/s - 465.000 in 20.021108s
rest_client 34.331 (±11.7%) i/s - 675.000 in 20.044486s
Comparison:
Net::HTTP: 35.4 i/s
OpenURI: 34.8 i/s - same-ish: difference falls within error
rest_client: 34.3 i/s - same-ish: difference falls within error
curb: 31.6 i/s - same-ish: difference falls within error
And here is the source code of the benchmark:
require 'benchmark/ips'
require 'open-uri'
require 'net/http'
require 'curb'
require 'rest-client'
google_uri = URI('http://www.google.com/')
google_uri_string = google_uri.to_s
Benchmark.ips do |x|
  x.config(time: 20, warmup: 10)

  x.report('OpenURI')     { open(google_uri_string) }
  x.report('Net::HTTP')   { Net::HTTP.get(google_uri) }
  x.report('curb')        { Curl.get(google_uri_string) }
  x.report('rest_client') { RestClient.get(google_uri_string) }

  x.compare!
end
ENVIRONMENT:
- AWS EC2 server
- Ruby version - 2.5.1p57
- Gems: curb-0.9.6, rest-client-2.0.2, benchmark-ips-2.7.2
NOTES:
Don't forget to install the gems before running this benchmark:
gem install curb rest-client benchmark-ips
For more accurate results, run it in a stable network environment, such as a production server.
Retrieve contents of URL as string
The open method passes an IO representation of the resource to your block when it yields. You can read from it using IO#read:
open([mode [, perm]] [, options]) [{|io| ... }]
open(path) { |io| data = io.read }
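The same idea can be collapsed into a one-liner that returns the body as a String, by passing &:read as the block; a sketch using an example URL, guarded so it degrades gracefully without network access:

```ruby
require 'open-uri'

# &:read turns the block form into "read the whole body as a String".
html = begin
  URI.open('https://www.ruby-lang.org/en/', &:read)
rescue StandardError => e
  warn "fetch failed: #{e.message}"
  nil
end

puts html ? "got #{html.bytesize} bytes" : 'no body'
```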
503 error when using ruby's open-uri to access a specific site
There are workarounds, but the best idea is to be a good citizen according to their terms.
You might want to confirm that you are following their Terms of Service:
If you operate a search engine or robot, or you republish a significant fraction of all Quora Content (as we may determine in our reasonable discretion), you must additionally follow these rules:
- You must use a descriptive user agent header.
- You must follow robots.txt at all times.
- You must make it clear how to contact you, either in your user agent string, or on your website if you have one.
You can set your user-agent header easily using OpenURI:
Additional header fields can be specified by an optional hash argument.
open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}
Robots.txt can be retrieved from http://www.quora.com/robots.txt. You'll need to parse it and honor its settings or they'll ban you again.
Also, you might want to restrict the speed of your code by sleeping between loops.
Also, if you are spidering their site for content, you might want to look into caching pages locally, or using one of the spidering packages. It's easy to write a spider; it's more work to write one that plays nicely with a site, but better that than not being able to spider their site at all.
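Putting those suggestions together, a throttled fetch loop might look like the following sketch; the one-second delay, User-Agent string, and contact address are illustrative assumptions, not values taken from Quora's terms:

```ruby
require 'open-uri'

HEADERS = {
  'User-Agent' => 'MyBot/1.0 (+http://example.com/bot)', # descriptive UA
  'From'       => 'foo@bar.invalid'                      # contact info
}.freeze

urls = %w[https://www.quora.com/robots.txt]

urls.each do |url|
  begin
    body = URI.open(url, HEADERS, &:read)
    puts "#{url}: #{body.bytesize} bytes"
  rescue StandardError => e
    warn "#{url}: #{e.message}"   # 503, network error, etc.
  end
  sleep 1   # throttle between requests so we don't hammer the site
end
```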
how do I include a header in an http request in ruby
get_response is a shorthand for making a request; when you need more control, build the full request yourself.
There's an example in the Ruby standard library documentation:
uri = URI.parse("http://www.ncdc.noaa.gov/cdo-web/api/v2/datasets/")
req = Net::HTTP::Get.new(uri)
req['token'] = 'fjhKJFSDHKJHjfgsdfdsljh'
res = Net::HTTP.start(uri.hostname, uri.port) {|http|
  http.request(req)
}
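If the endpoint is HTTPS, the same pattern additionally needs use_ssl: true passed to Net::HTTP.start; here is a sketch with a hypothetical host and a placeholder token (the request is guarded because the .invalid domain never resolves):

```ruby
require 'net/http'

uri = URI('https://api.example.invalid/v2/datasets/')  # hypothetical host
req = Net::HTTP::Get.new(uri)
req['token'] = 'YOUR_TOKEN_HERE'                       # placeholder credential

begin
  res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
    http.request(req)
  end
  puts res.code
rescue StandardError => e
  warn "request failed: #{e.message}"
end
```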