How to Make a Post Request with Open-Uri

How do I make a POST request with open-uri?

Unfortunately, open-uri only supports the GET verb.

You can either drop down a level and use net/http, or use rest-open-uri, which was designed to support POST and other verbs. Install it with gem install rest-open-uri.
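
If you take the net/http route, a POST might look roughly like the sketch below. The helper names build_form_post and send_request are hypothetical, and the example.com endpoint and form fields in the usage comment are placeholders invented for illustration, not anything from the question.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds a form-encoded POST request for the URI.
# set_form_data URL-encodes the params into the body and sets the
# Content-Type header to application/x-www-form-urlencoded.
def build_form_post(uri, params)
  request = Net::HTTP::Post.new(uri.request_uri)
  request.set_form_data(params)
  request
end

# Hypothetical helper: opens a connection (with TLS for https URLs),
# sends the prepared request and returns the Net::HTTPResponse.
def send_request(uri, request)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# Usage sketch (placeholder endpoint, assumed for illustration only):
# uri = URI.parse('http://example.com/search')
# response = send_request(uri, build_form_post(uri, { 'query' => 'foo' }))
# puts response.code
```

Building the request separately from sending it makes it easy to inspect the body and headers before anything goes over the wire.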

How to specify http request header in OpenURI

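According to the documentation, you can pass a hash of HTTP headers as the second argument to open (URI.open on current Ruby versions):

require "open-uri"

# Kernel#open no longer accepts URLs as of Ruby 3.0, so call URI.open.
URI.open("http://www.ruby-lang.org/en/",
  "User-Agent" => "Ruby/#{RUBY_VERSION}",
  "From" => "foo@bar.invalid",
  "Referer" => "http://www.ruby-lang.org/") {|f|
  # ...
}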

Only OpenURI succeeds at Reddit API request

Working with your example, and focussing on Net::HTTP for simplicity, the first example doesn't work as written (Net::HTTP.get only accepts a headers hash as its second argument from Ruby 3.0 onwards):

require 'net/http'
reddit_url = URI.parse('https://www.reddit.com/r/pixelart/top.json')
Net::HTTP.get(reddit_url, 'User-Agent' => 'My agent')
# => TypeError: no implicit conversion of URI::HTTPS into String

Instead I used this as my starting point:

require 'net/http'
reddit_url = URI.parse('https://www.reddit.com/r/pixelart/top.json')
http = Net::HTTP.new(reddit_url.host, reddit_url.port)
http.use_ssl = true
result = http.get(reddit_url.request_uri, 'User-Agent' => 'My agent')
puts result
# => #<Net::HTTPOK:0x00007fc3ea8e7320>
puts result.body.size
# => 167394

With that working we can try the second URL. Interestingly, I get different results depending on whether I re-use the initial connection or make a new one:

require 'net/http'
reddit_url = URI.parse('https://www.reddit.com/r/pixelart/top.json')
reddit_url_two = URI.parse('https://reddit.com/r/PixelArt/comments/lkaiqf/another_watercolour_pixelart_tree.json')

http = Net::HTTP.new(reddit_url.host, reddit_url.port)
http.use_ssl = true
result = http.get(reddit_url.request_uri, 'User-Agent' => 'My agent')
puts result
# => #<Net::HTTPOK:0x00007f931a143390>
puts result.body.size
# => 174615

http_two = Net::HTTP.new(reddit_url_two.host, reddit_url_two.port)
http_two.use_ssl = true
result_two = http_two.get(reddit_url_two.request_uri, 'User-Agent' => 'My agent')
puts result_two
# => #<Net::HTTPMovedPermanently:0x00007f931a148818>
puts result_two.body.size
# => 0

result_reusing_connection = http.get(reddit_url_two.request_uri, 'User-Agent' => 'My agent')
puts result_reusing_connection
# => #<Net::HTTPOK:0x00007f931a0fb3b0>
puts result_reusing_connection.body.size
# => 141575

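So I suspect you're sometimes getting a 301 redirect and that's causing the confusion. Note that the second URL uses the host reddit.com rather than www.reddit.com: a fresh connection to reddit.com is answered with the 301, while the re-used connection is still talking to www.reddit.com and gets the content directly, which probably explains the difference. How to follow redirects with Net::HTTP is covered in a separate question and answer.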
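
For reference, following redirects with Net::HTTP can be sketched as below. This is only a minimal illustration: fetch_following_redirects is a hypothetical helper name, and the default limit of 5 hops is an arbitrary assumption.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: performs a GET and follows redirects, giving up
# after `limit` hops so a redirect loop cannot run forever.
def fetch_following_redirects(url, headers = {}, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  uri = URI.parse(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri, headers)
  end

  if response.is_a?(Net::HTTPRedirection) && response['location']
    # The Location header may be relative, so resolve it against the
    # URL that was just requested.
    next_url = URI.join(url, response['location']).to_s
    fetch_following_redirects(next_url, headers, limit - 1)
  else
    response
  end
end

# Usage sketch with the second URL from above:
# result = fetch_following_redirects(
#   'https://reddit.com/r/PixelArt/comments/lkaiqf/another_watercolour_pixelart_tree.json',
#   { 'User-Agent' => 'My agent' }
# )
# puts result.code
```

Each hop opens a new connection to whatever host the Location header names, so the result no longer depends on which connection happened to be re-used.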

Ruby Proxy Authentication GET/POST with OpenURI or net/http

Try:

require "open-uri"
proxy_uri = URI.parse("http://proxy.com:8000")

# Use URI.open here: Kernel#open no longer accepts URLs as of Ruby 3.0.
data = URI.open("http://www.whatismyipaddress.com/",
  :proxy_http_basic_authentication => [proxy_uri, "username", "password"]).read
puts data

As for Net::HTTP, I recently implemented support for proxies with HTTP authentication in a Net::HTTP wrapper library called http. If you look at my last pull request, you'll see the basic implementation.

EDIT: Hopefully this will get you moving in the right direction.

Net::HTTP::Proxy(proxy_uri.host, proxy_uri.port, "username", "password").start('whatismyipaddress.com') do |http|
  puts http.get('/').body
end

EDIT 11/24/2020: Net::HTTP::Proxy is now considered obsolete. You can now configure proxies when creating a new instance of Net::HTTP. See the documentation for Net::HTTP.new for more details.
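
A minimal sketch of that newer style, assuming the same placeholder proxy address and credentials as above. build_proxied_http is a hypothetical helper name; the positional proxy arguments to Net::HTTP.new are part of the standard library.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns a Net::HTTP session for host:port that
# is routed through the given proxy. The arguments after the target
# host and port are proxy address, proxy port, proxy user, proxy password.
def build_proxied_http(host, port, proxy_uri, user, password)
  Net::HTTP.new(host, port, proxy_uri.host, proxy_uri.port, user, password)
end

# Usage sketch (placeholder proxy and credentials):
# proxy_uri = URI.parse('http://proxy.com:8000')
# http = build_proxied_http('whatismyipaddress.com', 80, proxy_uri, 'username', 'password')
# puts http.get('/').body
```

No connection is opened when the object is created; the proxy is only contacted once a request is made.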

POST with ruby: best practise and how to?

http://ruby-doc.org/stdlib-2.0/libdoc/net/http/rdoc/Net/HTTP.html#method-i-post

From the docs, a summary:

response = http.post('/cgi-bin/search.rb', 'query=foo')

Use case:

# using block
File.open('result.txt', 'w') {|f|
  http.post('/cgi-bin/search.rb', 'query=foo') do |str|
    f.write str
  end
}

Trying to use open-uri in ruby, some HTML contents are coming in as Loading...


The problem

So, open-uri just makes HTTP requests and gives you access to the body. In this case, the body is HTML. That HTML contains a placeholder for this data, which is what you're seeing. The HTML then loads some JavaScript that makes another request to the server to get the data, and when the data comes in, the script replaces the placeholder with the real data. So, to handle this, you ultimately need whatever comes back from the request that the JavaScript makes.

Three solutions

Ordered from my least favourite to my most favourite.

  1. You can try to evaluate the JavaScript and have it operate on the HTML. This is going to be painful, so I don't recommend it, but if you want to go down that path, I think there's a gem called "the ruby racer" or something (IIRC, it wraps V8).
  2. You can launch a web browser, let the browser handle all the craziness, and then ask the browser for the HTML after it's been updated. This is what Rahul's solution does, and it's a really nice solution. It's not my favourite because it's pretty heavy and you're limited to the information displayed in the HTML. This is called "scraping", and it's pretty fragile (some designer moves something around the page and your script breaks), and the information is in human presentation format, which means you usually have to do a lot of fiddly parsing.
  3. You can open your browser's devtools, go to the network tab, filter to the XHR requests, and reload the page. One of these requests fetched the data that was used to fill in the placeholder. Figure out which one it is and then you can make that request yourself. There are ways this can be fragile, too, e.g. sometimes you have to have the right cookies, and you often have to experiment with what the browser sent to figure out how much of it you need (usually it's way less than was sent, which is true for your case). Protip: when you do this, separate requesting the data from parsing and exploring it (i.e. save it to a file and then, while looking through the data, read it from the file rather than making a new request every time; this way it won't change on you and you won't get rate limited).
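
The protip in option 3 can be sketched as a tiny file cache. This is only an illustration: cached_body is a hypothetical helper name, and the file name and URL in the usage comment are placeholders.

```ruby
require 'json'

# Hypothetical helper: returns the contents of `path`. The block, which
# should perform the real HTTP request and return the body as a string,
# only runs when the file does not exist yet.
def cached_body(path)
  File.write(path, yield) unless File.exist?(path)
  File.read(path)
end

# Usage sketch: the first run hits the network, later runs read the file.
# require 'net/http'
# body = cached_body('listings.json') { Net::HTTP.get(URI('https://example.com/data.json')) }
# data = JSON.parse(body)
```

Delete the file whenever you want fresh data; until then every exploration run sees exactly the same response.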

Solution #3

So, I was curious and went ahead and tried solution number 3 myself, and it worked pretty admirably, check it out:

require 'uri'
require 'net/http'

# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)

# set some headers
req['origin'] = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache' # no caching, just in case,
req['pragma'] = 'no-cache' # we prob don't want stale data

# looks like you can pass it an awful lot of filters to use
req.set_form_data(
"page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
"distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
"entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
"minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
"filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
"displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
"inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
"daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
"minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
"minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
"maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)

# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"

# parse the response
require 'json'
json = JSON.parse res.body

# we're on page 1 of 1, and there are 48 results on this page
json['page'] # => 1
json['listings'].size # => 48
json['remainingResults'] # => false

# apparently we're looking at some sort of car or smth
json['modelId'] # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"

# a bunch of places sell this car
json['sellers'].size # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"

# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl'] # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString'] # => "$8,972"
listing['priceString'] # => "$6,890"
listing['daysOnMarket'] # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear'] # => 2006
listing['mileageString'] # => "81,803"

# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] } # => 0
json['listings'].count { |l| l['salvage'] } # => 0

# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
# ["Fair Deal", 11],
# ["No Price Analysis", 23],
# ["High Price", 8],
# ["Overpriced", 2]]

