How to Plug a JavaScript Engine with Ruby and Nokogiri

Is it possible to plug a JavaScript engine with Ruby and Nokogiri?

You are looking for Watir, which runs a real browser and lets you perform every action you can think of on a web page. There's a similar project called Selenium.

You can even use Watir with a so-called 'headless' browser on a Linux machine.

Watir headless example

Suppose we have this HTML:

<p id="hello">Hello from HTML</p>

and this JavaScript:

document.getElementById('hello').innerHTML = 'Hello from JavaScript';

(Demo: http://jsbin.com/ivihur)

and you want to get the dynamically inserted text. First, you need a Linux box with Xvfb and Firefox installed; on Ubuntu, for example:

$ apt-get install xvfb firefox

You will also need the watir-webdriver and headless gems so go ahead and install them as well:

$ gem install watir-webdriver headless

Then you can read the dynamic content from the page with something like this:

require 'rubygems'
require 'watir-webdriver'
require 'headless'

headless = Headless.new # wraps Xvfb so the browser can run without a real display
headless.start
browser = Watir::Browser.new

browser.goto 'http://jsbin.com/ivihur' # our example
el = browser.element :css => '#hello'
puts el.text

browser.close
headless.destroy

If everything went right, this will output:

Hello from JavaScript

I know this runs a browser in the background as well, but it's the easiest solution to your problem I could come up with. It takes quite a while to start the browser, but subsequent requests are quite fast. (Running goto and then fetching the dynamic text above multiple times took about 0.5 seconds per request on my Rackspace Cloud Server.)
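
Since the question mentions Nokogiri: once the browser has executed the page's JavaScript, you can hand the rendered HTML over to Nokogiri and keep your existing parsing code. A minimal sketch, reusing the browser object from the example above (run it before browser.close):

require 'nokogiri'

# browser.html returns the DOM after the JavaScript has run,
# so Nokogiri sees the dynamically inserted text.
doc = Nokogiri::HTML(browser.html)
puts doc.at_css('#hello').text
#=> Hello from JavaScript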

Source: http://watirwebdriver.com/headless/
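
As an aside, the selenium-webdriver gem mentioned above can do the same job. This is only a rough sketch of what that might look like, not code from the original answer; it assumes Firefox and the gem are installed:

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox
driver.get 'http://jsbin.com/ivihur'
puts driver.find_element(:css, '#hello').text
#=> Hello from JavaScript
driver.quit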

Scraping data using Nokogiri

The . for the class main-header is missing from the selector. It should be:

doc.at_css('.main-header span').text
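
To see the difference with a made-up fragment (the HTML below is just for illustration): without the dot, at_css looks for a <main-header> element instead of an element with the class main-header, and returns nil.

require 'nokogiri'

doc = Nokogiri::HTML('<div class="main-header"><span>Breaking news</span></div>')

doc.at_css('main-header span')        #=> nil, there is no <main-header> element
doc.at_css('.main-header span').text  #=> "Breaking news"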

How do I access HTML elements that are rendered in JavaScript using XPath?

Using the capybara-webkit gem is a viable way of driving this website with full JavaScript rendering.

Here is a rough example of what a capybara-webkit script might look like:

#!/usr/bin/env ruby
require "rubygems"
require "pp"
require "bundler/setup"
require "capybara"
require "capybara/dsl"
require "capybara-webkit"

Capybara.run_server = false
Capybara.current_driver = :webkit
Capybara.app_host = "http://www.goalzz.com/"

module Test
  class Goalzz
    include Capybara::DSL

    def get_results
      visit('/default.aspx?c=8358')
      all(:xpath, '//td[@class="m_g"]').each { |node| pp node.to_s }
    end
  end
end

spider = Test::Goalzz.new
spider.get_results

Because the page is built dynamically, finding the example XPath in this case requires a fully functional JavaScript-capable web-driving engine.
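
If the actual extraction should stay in Nokogiri, one option (a sketch, not from the original answer) is to let capybara-webkit render the page and then hand Capybara's HTML snapshot of the DOM to Nokogiri. Reopening the class from the script above (get_results_with_nokogiri is a made-up method name):

require "nokogiri"

module Test
  class Goalzz
    # page.html returns the DOM after the JavaScript has run,
    # so Nokogiri can query the dynamically created cells.
    def get_results_with_nokogiri
      visit('/default.aspx?c=8358')
      doc = Nokogiri::HTML(page.html)
      doc.xpath('//td[@class="m_g"]').each { |node| pp node.text }
    end
  end
end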

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.

As already said in the linked answer, the data is loaded asynchronously. This means the data is not present on the initial page but is fetched later by JavaScript code running in the browser.

Open the browser developer tools and go to the "Network" tab. Clear out all requests, then refresh the page to get a list of every request made. When you're looking for asynchronously loaded data, the most interesting requests are usually those of type "json" or "xml".

When browsing through the requests you'll find that the data you're looking for is located at:

https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json

(Screenshot: browser developer tools, Network tab)

Since this is JSON, you don't need Nokogiri to parse it.

require 'httparty'
require 'json'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)

When executing the above you'll get the exception:

JSON::ParserError ...

This seems to be caused by a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.

response.body[0]
#=> "\uFEFF"
format '%X', response.body[0].ord
#=> "FEFF"

To correctly handle the BOM, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.

require 'httparty'
require 'json'
require 'stringio'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...

If you're not yet using Ruby 2.7 you can strip the BOM manually instead, although set_encoding_by_bom is probably the safer option:

data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\uFEFF/, ''))
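
Once parsed, data is just an array of hashes, so no further HTML parsing is needed. For example, to look up a single row (field names taken from the sample output above):

# Find the row for one jurisdiction and print its reported case count.
alabama = data.find { |row| row['Jurisdiction'] == 'Alabama' }
puts alabama['Cases Reported']
#=> 10145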

