How to Get The Page Source with Mechanize/Nokogiri

Use .body:

puts jobShortListPg.body

Scraping a webpage with Mechanize and Nokogiri and storing data in XML doc

The file is processed, I think, but it doesn't create an XML file in the specified path.

There is nothing in your code that creates a file. You print some output, but don't do anything to open or write a file.

Perhaps you should read the IO and File documentation and review how you are using your filepath variable?

The second problem is that you don't call your method anywhere. Though it's defined and Ruby will see it and parse the method, it has no idea what you want to do with it unless you invoke the method:

def mechanize_club
...
end

mechanize_club()

How to scrape script tags with Nokogiri and Mechanize

Mechanize is overkill if all you are using it for is to retrieve a page. There are many HTTP client gems that'll easily do that, or use OpenURI which is part of Ruby's standard library.

This is the basic approach for retrieving the information. You'll need to work out which particular script you want, but Nokogiri's tutorials cover the fundamentals:

require 'json'
require 'nokogiri'
require 'open-uri'

# Since Ruby 3.0, Kernel#open no longer opens URLs; use URI.open from open-uri.
doc = Nokogiri::HTML(URI.open('https://www.foodpantries.org/li/st._paul_the_apostle_11368'))

At this point Nokogiri has a DOM created of the page in memory.

Find the <script> node you want, and extract the text of the node:

js = doc.at('script[type="application/ld+json"]').text

at and search are the workhorses for parsing a page. There are CSS and XPath specific variants, but generally you can use the generic versions and Nokogiri will figure out which to use. All are documented on the same page as at and search and the tutorials.

JSON is smart and allows us to use a shorthand of JSON[...] to parse or generate a JSON string. In this case it's parsing a string back into a Ruby object, which in this instance is a hash:

JSON[js]
# => {"@context"=>"https://schema.org",
# "@type"=>"Organization",
# "url"=>"https://www.foodpantries.org/",
# "sameAs"=>[],
# "contactPoint"=>
# [{"@type"=>"ContactPoint",
# "contactType"=>"customer service",
# "url"=>"https://www.foodpantries.org/ar/about",
# "email"=>"webmaster@foodpantries.org"}]}

Accessing a particular key/value pair is simple, just as with any other hash:

foo = JSON[js]
foo['url'] # => "https://www.foodpantries.org/"
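The JSON[...] shorthand works in both directions, which a short round trip can illustrate. The json_round_trip helper and the sample string below are made up for the example:

```ruby
require 'json'

# Hypothetical helper: parses a JSON string, then serializes the result again.
def json_round_trip(json_text)
  parsed = JSON[json_text]   # String in, Ruby object out
  regenerated = JSON[parsed] # Ruby object in, JSON string out
  [parsed, regenerated]
end

parsed, regenerated = json_round_trip('{"url":"https://example.com/","sameAs":[]}')
puts parsed['url']
puts regenerated
```

If you prefer to be explicit, JSON.parse and JSON.generate do the same jobs under their own names.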

The page you're referring to has multiple scripts that match the selector I used, so you'll want to filter using a more exact selector, or iterate over the matches and pick the one you want. How to do that is well documented here on SO using CSS, XPath and by Nokogiri's documentation.

Getting link from Mechanize/Nokogiri

First, tell Nokogiri to get a node, rather than a NodeSet. at_css will retrieve the Node and css retrieves a NodeSet, which is like an Array.

Instead of:

website = business.css('.website-feature')

Try:

website = business.at_css('a.track-visit-website.no-tracks')

to retrieve the first <a> node that carries both the track-visit-website and no-tracks classes. In a CSS selector, multiple classes on one element are chained with dots; a space would instead mean a descendant element. If it's not the first instance you want, then you'll need to narrow it down by grabbing the NodeSet and then indexing into it. Without the surrounding HTML it's difficult to help more.

To get the href attribute from a Node, simply treat the node like a hash:

website['href']

should return:

http://urlofsite.com

Here's a little sample from IRB:

irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0>
irb(main):003:0* html = '<a class="this_node" href="http://example.com">'
=> "<a class=\"this_node\" href=\"http://example.com\">"
irb(main):004:0> doc = Nokogiri::HTML.parse(html)
=> #<Nokogiri::HTML::Document:0x8041e2ec name="document" children=[#<Nokogiri::XML::DTD:0x8041d20c name="html">, #<Nokogiri::XML::Element:0x805a2a14 name="html" children=[#<Nokogiri::XML::Element:0x805df8b0 name="body" children=[#<Nokogiri::XML::Element:0x8084c5d0 name="a" attributes=[#<Nokogiri::XML::Attr:0x80860170 name="class" value="this_node">, #<Nokogiri::XML::Attr:0x8086047c name="href" value="http://example.com">]>]>]>]>
irb(main):005:0>
irb(main):006:0* doc.at_css('a.this_node')['href']
=> "http://example.com"
irb(main):007:0>

How do I scrape data through Mechanize and Nokogiri?

Using Nokogiri, I would do it as follows:

Using CSS Selectors

require 'nokogiri'
require 'open-uri'

# Since Ruby 3.0, Kernel#open no longer opens URLs; use URI.open from open-uri.
doc = Nokogiri::HTML(URI.open('http://www.screener.in/company/?q=Atul+Auto+Ltd.'))

doc.class
# => Nokogiri::HTML::Document
doc.css('.table.draggable.table-striped.table-hover tr.strong td').class
# => Nokogiri::XML::NodeSet

row_data = doc.css('.table.draggable.table-striped.table-hover tr.strong td').map do |tdata|
  tdata.text
end

#From the webpage I took the below value from the table
#*Peer Comparison Top 7 companies in the same business*

row_data
# => ["6.",
# "Atul Auto Ltd.",
# "193.45",
# "8.36",
# "216.66",
# "3.04",
# "7.56",
# "81.73",
# "96.91",
# "17.24",
# "2.92"]

Looking at the table on the web page, I can see that CMP is the third column and CMP/BV is the last one. In the array row_data, that puts CMP at index 2 and CMP/BV at the end of the array.

row_data[2] # => "193.45" #CMP
row_data.last # => "2.92" #CMP/BV

Using XPATH

require 'nokogiri'
require 'open-uri'

# Since Ruby 3.0, Kernel#open no longer opens URLs; use URI.open from open-uri.
doc = Nokogiri::HTML(URI.open('http://www.screener.in/company/?q=Atul+Auto+Ltd.'))

p doc.at_xpath("//*[@id='peers']/table/tbody/tr[6]/td[3]").text
p doc.at_xpath("//*[@id='peers']/table/tbody/tr[6]/td[last()]").text
# >> "193.45" #CMP
# >> "2.92" #CMP/BV

Nokogiri and Mechanize help (navigating to pages via div class and scraping)

It's important to convert each a[:href] to an absolute URL first, though.
So, maybe:

page.search('.subtitleLink a').map { |a| page.uri.merge a[:href] }.each do |uri|
  page2 = agent.get uri
end
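What merge does with different kinds of href can be seen without any network access, using only Ruby's standard URI library. The absolutize helper and the example.com URLs are made up for the illustration:

```ruby
require 'uri'

# Hypothetical helper: resolves each href against the page's own URL.
def absolutize(base, hrefs)
  base_uri = URI(base)
  hrefs.map { |href| base_uri.merge(href).to_s }
end

puts absolutize('http://example.com/forum/index.html',
                ['topic/1', '/about', 'http://other.example/x'])
# Prints:
# http://example.com/forum/topic/1
# http://example.com/about
# http://other.example/x
```

Relative paths are resolved against the page's directory, root-relative paths against the host, and hrefs that are already absolute pass through unchanged.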

How to parse forms with Mechanize or Nokogiri from a string

You have to unescape your example HTML string, then search for the input named IW_SessionID_.

This sample code works for me:

#!/usr/bin/ruby

require 'pp'
require 'nokogiri'
require 'mechanize'

r = '<HTML><HEAD><TITLE></TITLE><meta http-equiv="cache-control" content="no-cache">\r\n<meta http-equiv="pragma" content="no-cache">\r\n<NOSCRIPT><HTML><BODY>Your browser does not seem to support JavaScript. Please make sure it is supported and activated</BODY></HTML></NOSCRIPT>\r\n<SCRIPT>\r\nvar ie4 = (document.all)? true:false;\r\nvar ns6 = (document.getElementById)? true && !ie4:false;\r\nfunction Initialize() {\r\nvar lWidth;\r\nvar lHeight;\r\nif (ns6) {\r\n lWidth = window.innerWidth - 30;\r\n lHeight = window.innerHeight - 30;\r\n} else {\r\n lWidth = document.body.clientWidth;\r\n lHeight = document.body.clientHeight;\r\n if (lWidth == 0) { lWidth = undefined;}\r\n if (lHeight == 0) { lHeight = undefined;}\r\n}\r\ndocument.forms[0].elements["IW_width"].value = lWidth;\r\ndocument.forms[0].elements["IW_height"].value = lHeight;\r\ndocument.forms[0].submit();\r\n}</SCRIPT></HEAD><BODY onload="Initialize()">\r\n<form method=post action="/bwtem">\r\n<input type=hidden name="IW_width">\r\n<input type=hidden name="IW_height">\r\n<input type=hidden name="IW_SessionID_" value="1wqzj1f0vec57r1apfqg51wzs88c">\r\n<input type=hidden name="IW_TrackID_" value="0">\r\n</form></BODY></HTML>'

page = Nokogiri::HTML r
input = page.css('input[name="IW_SessionID_"]').first
puts input[:value]

How to extract text from script tag by using nokogiri and mechanize?

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.booking.com/hotel/us/solera-by-stay-alfred.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmcgV1c19ueYgBAZgBMbgBBMgBBNgBAegBAfgBAg;sid=695d6598485cb1a8fd9e39c5de3878ba;dcid=4;checkin=2015-10-20;checkout=2015-10-21;dist=0;group_adults=2;room1=A%2CA;sb_price_type=total;srfid=cf5d76283b73d34a1d7e0d61cad6974e38a94351X1;type=total;ucfs=1&')

match = page.search('script').text.scan(/^booking\.env\.b_hotel_id = '.*'/)
puts match
puts match[0].split("'")[1]

Output:

booking.env.b_hotel_id = '1202411'
1202411
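A capture group can pull the value out in one step instead of splitting on quotes afterwards. The sketch below runs on an invented stand-in for the script text; the hotel_id helper and the surrounding JavaScript lines are assumptions:

```ruby
# Invented stand-in for the text gathered from the page's <script> tags.
SAMPLE_SCRIPT_TEXT = <<~JS
  var booking = { env: {} };
  booking.env.b_hotel_id = '1202411';
  booking.env.b_hotel_name = 'Example';
JS

# Hypothetical helper: returns the quoted value assigned to
# booking.env.b_hotel_id, or nil if the assignment is not present.
def hotel_id(script_text)
  # Dots are escaped so they match literally; [^']* stops at the closing quote.
  script_text[/^booking\.env\.b_hotel_id = '([^']*)'/, 1]
end

puts hotel_id(SAMPLE_SCRIPT_TEXT)
```

Returning nil when nothing matches avoids the NoMethodError that match[0].split would raise on an empty result.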

Pages that helped me figure this out:

http://robdodson.me/crawling-pages-with-mechanize-and-nokogiri/

Parsing javascript function elements with nokogiri

Regular expression - starting and ending with a character string

http://www.rubular.com


