Save Image with Mechanize and Nokogiri

Ruby Madness: downloading the same file with Nokogiri, Mechanize and OpenURI to get different information

require 'mechanize'

agent = Mechanize.new
thumbs.each do |thumb|
  imgUrl = thumb.css('.t_img').first['src']
  imgTitle = thumb.css('.t_img').first['alt']
  image = agent.get(imgUrl)
  p image
  puts "1 Driver : prowl.rb"
  puts "1 Source : " + pageURL
  puts "1 Title : " + imgTitle
  puts "1 File Source : " + imgUrl
  puts "1 File Type : " + image.header['content-type']
  puts "1 File Name : " + image.filename
  puts "1 Last Modified : " + image.header["last-modified"]
  puts "1 Image Size : " + image.header["content-length"]
  puts "1 MD5 : " + GetMD5(image.content.to_s)
  puts "1 SHA256 : " + GetSha256(image.content.to_s)
end

Here it is. Reuse the agent; there is no point in creating a new one every time.

Get the page directly from Mechanize; there is no need to open and read the page, then pass the content around. All the header information you are looking for is in the header attribute of your page.

Scraping a webpage with Mechanize and Nokogiri and storing data in XML doc

The file is processed, I think, but it doesn't create an XML file at the specified path.

There is nothing in your code that creates a file. You print some output, but don't do anything to open or write a file.

Perhaps you should read the IO and File documentation and review how you are using your filepath variable?
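As a minimal sketch of that missing step (the filepath value and the XML string here are assumptions, since the question's code isn't shown), writing the document out is a single File.write call:

```ruby
# Hypothetical stand-ins for the question's variables.
filepath = "club.xml"
xml = "<club><name>Mechanize Club</name></club>"

# File.write opens the file, writes the string, and closes it.
File.write(filepath, xml)

# The file now exists on disk and contains the XML string.
puts File.read(filepath)
```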

The second problem is that you don't call your method anywhere. Though it's defined and Ruby will see it and parse the method, it has no idea what you want to do with it unless you invoke the method:

def mechanize_club
...
end

mechanize_club()

How to scrape script tags with Nokogiri and Mechanize

Mechanize is overkill if all you are using it for is to retrieve a page. There are many HTTP client gems that'll easily do that, or use OpenURI, which is part of Ruby's standard library.

This is the basics for retrieving the information. You'll need to figure out which particular script you want but Nokogiri's tutorials will give you the basics:

require 'json'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://www.foodpantries.org/li/st._paul_the_apostle_11368'))

At this point Nokogiri has a DOM created of the page in memory.

Find the <script> node you want, and extract the text of the node:

js = doc.at('script[type="application/ld+json"]').text

at and search are the workhorses for parsing a page. There are CSS- and XPath-specific variants, but generally you can use the generic versions and Nokogiri will figure out which to use. All are documented on the same page as at and search, and in the tutorials.

JSON is smart and allows us to use a shorthand of JSON[...] to parse or generate a JSON string. In this case it's parsing a string back into a Ruby object, which in this instance is a hash:

JSON[js]
# => {"@context"=>"https://schema.org",
#     "@type"=>"Organization",
#     "url"=>"https://www.foodpantries.org/",
#     "sameAs"=>[],
#     "contactPoint"=>
#      [{"@type"=>"ContactPoint",
#        "contactType"=>"customer service",
#        "url"=>"https://www.foodpantries.org/ar/about",
#        "email"=>"webmaster@foodpantries.org"}]}

Accessing a particular key/value pair is simple, just as with any other hash:

foo = JSON[js]
foo['url'] # => "https://www.foodpantries.org/"
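Nested values, such as the contactPoint entries shown above, can be reached with Hash#dig (the JSON string here just mirrors a slice of the parsed output):

```ruby
require 'json'

js = '{"url":"https://www.foodpantries.org/","contactPoint":[{"email":"webmaster@foodpantries.org"}]}'
foo = JSON[js]

foo['url']                          # => "https://www.foodpantries.org/"
# dig walks the hash, the array index, then the inner hash in one call.
foo.dig('contactPoint', 0, 'email') # => "webmaster@foodpantries.org"
```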

The page you're referring to has multiple scripts that match the selector I used, so you'll want to filter using a more exact selector, or iterate over the matches and pick the one you want. How to do that is well documented here on SO using CSS, XPath and by Nokogiri's documentation.

Extracting runs of text with Mechanize/Nokogiri

Borrowing from an answer in "Nokogiri recursively get all children":

result = []
doc.traverse { |node| result << node.text if node.text? }

That should give you the array ["Here is ", "some", " text"].

"Getting Mugged by Nokogiri" discusses traverse.

Rails fetching price using nokogiri and mechanize

Try the following scraping statement. Hopefully this works for you:

 doc.css("div #buyPriceBox .pdp-e-i-PAY div.pdp-e-i-PAY-r span span.payBlkBig").text

Nokogiri and Mechanize help (navigating to pages via div class and scraping)

It's important to make sure the a[:href] values are converted to absolute URLs first, though.
Therefore, maybe:

page.search('.subtitleLink a').map { |a| page.uri.merge a[:href] }.each do |uri|
  page2 = agent.get uri
end

