Is it possible to plug a JavaScript engine into Ruby and Nokogiri?
You are looking for Watir, which drives a real browser and lets you perform every action you can think of on a web page. There's a similar project called Selenium.
You can even use Watir with a so-called 'headless' browser on a Linux machine.
Watir headless example
Suppose we have this HTML:
<p id="hello">Hello from HTML</p>
and this JavaScript:
document.getElementById('hello').innerHTML = 'Hello from JavaScript';
(Demo: http://jsbin.com/ivihur)
and you wanted to get the dynamically inserted text. First, you need a Linux box with xvfb and firefox installed; on Ubuntu, for example, do:
$ apt-get install xvfb firefox
You will also need the watir-webdriver and headless gems, so go ahead and install them as well:
$ gem install watir-webdriver headless
Then you can read the dynamic content from the page with something like this:
require 'rubygems'
require 'watir-webdriver'
require 'headless'

# Start a virtual X display so the browser can run without a screen
headless = Headless.new
headless.start

browser = Watir::Browser.new
browser.goto 'http://jsbin.com/ivihur' # our example

# The text was inserted by JavaScript after the page loaded
el = browser.element :css => '#hello'
puts el.text

browser.close
headless.destroy
If everything went right, this will output:
Hello from JavaScript
I know this runs a browser in the background as well, but it's the easiest solution to your problem I could come up with. Starting the browser takes quite a while, but subsequent requests are fast: running goto and then fetching the dynamic text above multiple times took about 0.5 seconds per request on my Rackspace Cloud Server.
Source: http://watirwebdriver.com/headless/
Scraping data using Nokogiri
The . (the CSS class selector prefix) for the class main-header is missing. It should be:
doc.at_css('.main-header span').text
How do I access HTML elements that are rendered in JavaScript using XPath?
Using the "capybara-webkit" gem is a viable way of driving this website with full JavaScript rendering.
Here is a rough example of what a capybara-webkit script might look like:
#!/usr/bin/env ruby
require "rubygems"
require "pp"
require "bundler/setup"
require "capybara"
require "capybara/dsl"
require "capybara-webkit"

Capybara.run_server = false
Capybara.current_driver = :webkit
Capybara.app_host = "http://www.goalzz.com/"

module Test
  class Goalzz
    include Capybara::DSL

    def get_results
      visit('/default.aspx?c=8358')
      all(:xpath, '//td[@class="m_g"]').each { |node| pp node.to_s }
    end
  end
end

spider = Test::Goalzz.new
spider.get_results
What is required to find the example XPath in this case (because the page is built dynamically) is a fully functional JavaScript-capable web-driving engine.
How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby
Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.
As already said in the linked answer, the data is loaded asynchronously. This means the data is not present on the initial page and is loaded afterwards by the JavaScript engine executing code.
When you open up the browser development tools, go to the "Network" tab. You can clear out all requests, then refresh the page. You'll get to see a list of all requests made. If you're looking for asynchronously loaded data the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON, you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
The culprit is a Byte Order Mark (BOM) that is not removed by HTTParty, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> "" (looks empty because the BOM character is invisible)
format '%X', response.body[0].ord
#=> "FEFF"
To correctly handle the BOM, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet using Ruby 2.7 you can strip the BOM yourself instead, though the former is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
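Both approaches can be tried without hitting the network. The sketch below fakes the response body by prepending a BOM to a one-record JSON array; the record is abbreviated from the real feed:

```ruby
require 'json'
require 'stringio'

# Synthetic body: a UTF-8 BOM followed by JSON, standing in for response.body.
raw = "\xEF\xBB\xBF" + '[{"Jurisdiction":"Alabama","Cases Reported":10145}]'

# Ruby 2.7+: let StringIO consume the BOM and set the encoding.
body = StringIO.new(raw)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
puts data.first['Jurisdiction']       #=> Alabama

# Pre-2.7 fallback: strip the BOM bytes manually.
data2 = JSON.parse(raw.force_encoding('UTF-8').sub(/\A\xEF\xBB\xBF/, ''))
puts data2.first['Cases Reported']    #=> 10145
```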