Ruby Madness Downloading same file with Nokogiri, Mechanize and OpenUri to get different information
agent = Mechanize.new
thumbs.each do |thumb|
  imgUrl = thumb.css('.t_img').first['src']
  imgTitle = thumb.css('.t_img').first['alt']
  image = agent.get(imgUrl)
  p image
  puts "1 Driver : prowl.rb"
  puts "1 Source : " + pageURL
  puts "1 Title : " + imgTitle
  puts "1 File Source : " + imgUrl
  puts "1 File Type : " + image.header['content-type']
  puts "1 File Name : " + image.filename
  puts "1 Last Modified : " + image.header["last-modified"]
  puts "1 Image Size : " + image.header["content-length"]
  puts "1 MD5 : " + GetMD5(image.content.to_s)
  puts "1 SHA256 : " + GetSha256(image.content.to_s)
end
Here it is. Reuse the agent; there is no point in creating a new one every time. Get the page directly from Mechanize; there's no need to open and read the file and then pass the content around. All the header information you are looking for is in the header attribute of your page.
Scraping a webpage with Mechanize and Nokogiri and storing data in XML doc
The file is processed, I think, but it doesn't create an XML file at the specified path.
There is nothing in your code that creates a file. You print some output, but don't do anything to open or write a file. Perhaps you should read the IO and File documentation and review how you are using your filepath variable?
The second problem is that you never call your method. Though it's defined and Ruby will see and parse it, nothing happens unless you invoke the method:
def mechanize_club
...
end
mechanize_club()
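Calling the method still isn't enough on its own; something has to open and write the file. Here is a minimal, self-contained sketch using REXML from Ruby's standard library; the filepath and element names are placeholders, since the original scraping code isn't shown:

```ruby
require 'rexml/document'
require 'tmpdir'

# Placeholder path standing in for your filepath variable.
filepath = File.join(Dir.tmpdir, 'clubs.xml')

# Build a small XML document; in your method these values would come
# from whatever you scraped with Mechanize/Nokogiri.
doc = REXML::Document.new
doc << REXML::XMLDecl.new('1.0', 'UTF-8')
club = doc.add_element('clubs').add_element('club')
club.add_element('name').text = 'Example Club'

# File.open with a block ensures the handle is flushed and closed.
File.open(filepath, 'w') { |f| doc.write(f, 2) }
```

The block form of File.open is the important part: without an explicit open/write step, printed output never reaches disk.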
How to scrape script tags with Nokogiri and Mechanize
Mechanize is overkill if all you are using it for is to retrieve a page. There are many HTTP client gems that'll easily do that, or use OpenURI which is part of Ruby's standard library.
This is the basics for retrieving the information. You'll need to figure out which particular script you want but Nokogiri's tutorials will give you the basics:
require 'json'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://www.foodpantries.org/li/st._paul_the_apostle_11368'))
At this point Nokogiri has a DOM created of the page in memory.
Find the <script> node you want and extract the text of the node:
js = doc.at('script[type="application/ld+json"]').text
at and search are the workhorses for parsing a page. There are CSS- and XPath-specific variants, but generally you can use the generic versions and Nokogiri will figure out which to use. All are documented on the same page as at and search, along with the tutorials.
JSON is smart and allows us to use the shorthand JSON[...] to parse or generate a JSON string. In this case it's parsing a string back into a Ruby object, which in this instance is a hash:
JSON[js]
# => {"@context"=>"https://schema.org",
# "@type"=>"Organization",
# "url"=>"https://www.foodpantries.org/",
# "sameAs"=>[],
# "contactPoint"=>
# [{"@type"=>"ContactPoint",
# "contactType"=>"customer service",
# "url"=>"https://www.foodpantries.org/ar/about",
# "email"=>"webmaster@foodpantries.org"}]}
Accessing a particular key/value pair is simple, just as with any other hash:
foo = JSON[js]
foo['url'] # => "https://www.foodpantries.org/"
The page you're referring to has multiple scripts that match the selector I used, so you'll want to filter using a more exact selector, or iterate over the matches and pick the one you want. How to do that is well documented here on SO using CSS, XPath and by Nokogiri's documentation.
extracting runs of text with Mechanize/Nokogiri
Borrowing from an answer in "Nokogiri recursively get all children":
require 'nokogiri'
doc = Nokogiri::HTML.fragment('Here is <b>some</b> text')
result = []
doc.traverse { |node| result << node.text if node.text? }
That should give you the array ["Here is ", "some", " text"].

"Getting Mugged by Nokogiri" discusses traverse.
Rails fetching price using nokogiri and mechanize
Try the following scraping statement. Hopefully it works for you:
doc.css("div #buyPriceBox .pdp-e-i-PAY div.pdp-e-i-PAY-r span span.payBlkBig").text
Nokogiri and Mechanize help (navigating to pages via div class and scraping)
It's important to make sure the a[:href] values are converted to absolute URLs first, though. Therefore, maybe:
page.search('.subtitleLink a').map{|a| page.uri.merge a[:href]}.each do |uri|
page2 = agent.get uri
end
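The merge step relies on URI#merge from Ruby's standard library, which resolves a relative href against the page's base URI. A quick illustration with made-up URLs:

```ruby
require 'uri'

# page.uri on a Mechanize page is a URI object, so merging a relative
# href against it produces an absolute URL.
base     = URI('https://example.com/news/index.html')
absolute = base.merge('/articles/42')
absolute.to_s # => "https://example.com/articles/42"
```

Without this step, agent.get on a relative href like "/articles/42" has no host to resolve it against.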