Can Nokogiri Interpret JavaScript? - Web Scraping

Scraping with Nokogiri and Ruby before and after JavaScript changes the value

If you don't want to use PhantomJS, you can also use the network tab in the Firefox or Chrome developer tools, where you'll see that the HTML table data is returned by a JavaScript POST request to the server.

Then, instead of opening the original page URL with Nokogiri, you'd run this POST request from your Ruby script and parse that response instead. It looks like it's just JSON data with HTML embedded in it, so you can extract the HTML and feed that to Nokogiri.

It requires a bit of extra detective work, but I've used this method many times when scraping JavaScript-heavy pages. It works well for most simple tasks, though you do have to dig into the inner workings of the page and its network traffic.

Here's an example of the JSON data from the JavaScript POST request:

Bonds:

https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ

CDS:

https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=cds&ClientCode=WSJ

Here's a quick-and-dirty solution just so you get the idea. It grabs the cookie from the initial page, uses it in the request for the JSON data, then parses the JSON and feeds the extracted HTML to Nokogiri:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'json'

# Open the initial page to grab the cookie from it
# (URI.open is required on Ruby 3+; plain open also works on older Rubies)
p1 = URI.open('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj')

# Save the session cookie (drop the attributes after the first ';')
cookie = p1.meta['set-cookie'].split('; ', 2)[0]

# Open the JSON data page using the cookie we just obtained
p2 = URI.open('https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ',
  'Cookie' => cookie)

# Get the raw JSON
json = p2.read

# Parse it
data = JSON.parse(json)

# Feed the HTML portion to Nokogiri
doc = Nokogiri::HTML(data['html'])

# Extract the values
values = doc.css('td.col2 span')
puts values.map(&:text).inspect

=> ["0.02%", "0.02%", "n.a.", "-0.03%", "0.02%", "0.04%",
"0.01%", "0.02%", "0.08%", "-0.01%", "0.03%", "0.01%", "0.05%", "0.04%"]

Scraping data using Nokogiri

The . for the class main-header is missing from the CSS selector, so it's treated as an element name instead of a class. It should be:

doc.at_css('.main-header span').text
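
For reference, without the leading dot the selector is treated as an element name, not a class. A quick illustration with a made-up HTML fragment:

require 'nokogiri'

doc = Nokogiri::HTML('<div class="main-header"><span>Title</span></div>')

doc.at_css('main-header span')        # => nil, looks for a <main-header> element
doc.at_css('.main-header span').text  # => "Title", matches class="main-header"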

How to scrape data using Ruby which is generated by a JavaScript function?

Looking through the HTML for that page, the table is generated from JSON received as the result of a JavaScript request.

You can figure out what's going on by tracing backwards through the source code of the page. Here's some of what you'll need if you want to retrieve the JSON outside of their JavaScript, though you'll still need to do some work to actually use it:

  1. Starting with this code:

    require 'open-uri'
    require 'nokogiri'

    doc = Nokogiri::HTML(URI.open('http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data'))
    scripts = doc.css('script').map(&:text)

    puts scripts.select{ |s| s['sgxTableGrid'] }

    Look at the text output in an editor. Search for sgxTableGrid. You'll see a line like:

    var tableHeader =  "<table width='100%' class='sgxTableGrid'>"

    Look down a little farther and you'll see:

    var totalRows = data.items.length - 1;

    data comes from the parameter to the function being called, so that's where we start.

  2. Take a unique part of the containing function's name, loadGridns_, and search for it. Each time you find it, look for the parameter data, then see where data is defined. If it's passed into that function, search for what calls it. Repeat that process until you reach a function the variable isn't passed into; at that point you know you've found the function that creates it.

  3. I ended up in a function whose name starts with loadGridDatans, inside a block that does an xhrPost call to retrieve a URL. That URL is the target you're after, so grab the name of the containing function and trace the calls where the URL is passed in, as in the previous step.

  4. That search ended up on a line that looks like:

    var url = viewByDailyns_7_2AA4H0C090FIE0I1OH2JFH20K1_...

  5. At that point you can start reconstructing the URL you need. Open a JavaScript debugger, like Firebug, and put a breakpoint on that line. Reload the page and JavaScript should stop executing at that line. Single-step, or set breakpoints, and watch the url variable be built until it's in its final form. At that point you have something you can use with OpenURI, which should retrieve the JSON you want; see the sketch after this list.
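
Once you have the final URL, retrieving and parsing the JSON is the easy part. Here's a minimal sketch, assuming the reconstructed URL returns the JSON the page's JavaScript consumes (the URL below is a placeholder, not the real endpoint):

require 'open-uri'
require 'json'

# Placeholder -- substitute the URL you captured in the debugger
url = 'http://www.sgx.com/placeholder/reconstructed/url'

data = JSON.parse(URI.open(url).read)

# The page's own code used data.items, so the rows should be here
puts data['items'].length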

Note that their function names might be generated dynamically; I didn't check, so relying on the full name of a function might fail.

They might also be embedding a serialized timestamp or session key in the function names to make them unique or more opaque; there are a number of reasons to do that.
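
If the names do turn out to be dynamic, one workaround is to search the script text for the stable prefix instead of the full name. A minimal sketch, assuming the loadGridDatans prefix survives across page loads (an assumption, not something I verified), and reusing the scripts array from step 1:

# Hypothetical: locate the dynamically named function by its stable prefix
script = scripts.find { |s| s =~ /loadGridDatans\w*/ }
puts script[/loadGridDatans\w*/] if script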

Even though it's a pain to take this stuff apart, it's also a good lesson in how dynamic pages work.

How to scrape script tags with Nokogiri and Mechanize

Mechanize is overkill if all you're using it for is retrieving a page. There are many HTTP client gems that'll easily do that, or you can use OpenURI, which is part of Ruby's standard library.

Here are the basics for retrieving the information. You'll need to figure out which particular script you want, but Nokogiri's tutorials cover the rest:

require 'json'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://www.foodpantries.org/li/st._paul_the_apostle_11368'))

At this point Nokogiri has created a DOM of the page in memory.

Find the <script> node you want, and extract the text of the node:

js = doc.at('script[type="application/ld+json"]').text

at and search are the workhorses for parsing a page. There are CSS- and XPath-specific variants, but generally you can use the generic versions and Nokogiri will figure out which to use. All of them are documented alongside at and search, and in the tutorials.
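
For instance, all three lookups below find the same node; a quick illustration reusing the doc from above:

doc.at('script[type="application/ld+json"]')          # generic, Nokogiri treats this as CSS
doc.at_css('script[type="application/ld+json"]')      # explicitly CSS
doc.at_xpath('//script[@type="application/ld+json"]') # explicitly XPath

# search returns every match as a NodeSet instead of just the first node
doc.search('script').size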

JSON is smart and gives us the shorthand JSON[...] to parse or generate a JSON string. Here it's parsing a string back into a Ruby object, in this case a hash:

JSON[js]
# => {"@context"=>"https://schema.org",
#     "@type"=>"Organization",
#     "url"=>"https://www.foodpantries.org/",
#     "sameAs"=>[],
#     "contactPoint"=>
#      [{"@type"=>"ContactPoint",
#        "contactType"=>"customer service",
#        "url"=>"https://www.foodpantries.org/ar/about",
#        "email"=>"webmaster@foodpantries.org"}]}

Accessing a particular key/value pair is simple, just as with any other hash:

foo = JSON[js]
foo['url'] # => "https://www.foodpantries.org/"

The page you're referring to has multiple scripts that match the selector I used, so you'll want to filter using a more exact selector, or iterate over the matches and pick the one you want; one way to do that is sketched below. How to do that is well documented here on SO, using CSS and XPath, and in Nokogiri's documentation.
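
For example, one way to pick the right script is to parse each candidate and keep the one whose @type matches what you're after. A minimal sketch, assuming the Organization block shown above is the one you want:

org = doc.css('script[type="application/ld+json"]')
         .map { |node| JSON[node.text] rescue nil }   # skip any scripts with invalid JSON
         .compact
         .find { |h| h.is_a?(Hash) && h['@type'] == 'Organization' }

org['url'] if org # => "https://www.foodpantries.org/"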

Using Nokogiri to parse JavaScript hidden HTML

Used the Google inspector tools to log the XMLHttpRequests and was easily able to figure out where the data was actually being loaded from. Thanks to @NickVeys!

Is it possible to plug a JavaScript engine with Ruby and Nokogiri?

You are looking for Watir, which drives a real browser and allows you to perform every action you can think of on a web page. There's a similar project called Selenium.

You can even use Watir with a so-called 'headless' browser on a Linux machine.

Watir headless example

Suppose we have this HTML:

<p id="hello">Hello from HTML</p>

and this JavaScript:

document.getElementById('hello').innerHTML = 'Hello from JavaScript';

(Demo: http://jsbin.com/ivihur)

and you wanted to get the dynamically inserted text. First, you need a Linux box with xvfb and Firefox installed; for example, on Ubuntu:

$ apt-get install xvfb firefox

You will also need the watir-webdriver and headless gems so go ahead and install them as well:

$ gem install watir-webdriver headless

Then you can read the dynamic content from the page with something like this:

require 'rubygems'
require 'watir-webdriver'
require 'headless'

# Start a virtual X display so the browser can run without a screen
headless = Headless.new
headless.start
browser = Watir::Browser.new

browser.goto 'http://jsbin.com/ivihur' # our example

# goto waits for the page to load, so the JavaScript has already run here
el = browser.element :css => '#hello'
puts el.text

browser.close
headless.destroy

If everything went right, this will output:

Hello from JavaScript

I know this runs a browser in the background as well, but it's the easiest solution to your problem I could come up with. It takes quite a while to start the browser, but subsequent requests are quite fast. (Running goto and then fetching the dynamic text above multiple times took about 0.5 sec per request on my Rackspace Cloud Server.)
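
To amortize that startup cost, you can keep a single browser open and reuse it across pages. A minimal sketch, assuming all the pages you want are reachable from one session (the URL list is a placeholder):

require 'rubygems'
require 'watir-webdriver'
require 'headless'

# Hypothetical list of pages -- start the browser once, reuse it for each
urls = ['http://jsbin.com/ivihur']

headless = Headless.new
headless.start
browser = Watir::Browser.new

urls.each do |url|
  browser.goto url
  puts browser.element(:css => '#hello').text
end

browser.close
headless.destroy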

Source: http://watirwebdriver.com/headless/


