How to Scrape Data Using Ruby Which Is Generated by a JavaScript Function

How to scrape data using Ruby which is generated by a JavaScript function?

Looking through the HTML for that page, the table is generated by JSON received as the result of a JavaScript request.

You can figure out what's going on by tracing backwards through the source code of the page. Here's some of what you'll need if you want to retrieve the JSON outside of their JavaScript; there will still be work needed to actually do something with it:

  1. Starting with this code:

    require 'open-uri'
    require 'nokogiri'

    # URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
    doc = Nokogiri::HTML(URI.open('http://www.sgx.com/wps/portal/sgxweb/home/marketinfo/historical_data/derivatives/daily_data'))
    scripts = doc.css('script').map(&:text)

    puts scripts.select{ |s| s['sgxTableGrid'] }

    Look at the text output in an editor. Search for sgxTableGrid. You'll see a line like:

    var tableHeader =  "<table width='100%' class='sgxTableGrid'>"

    Look down a little farther and you'll see:

    var totalRows = data.items.length - 1;

    data comes from the parameter to the function being called, so that's where we start.

  2. Get a unique part of the containing function's name loadGridns_ and search for it. Each time you find it, look for the parameter data, then look to see where data is defined. If it's passed into that method, then search to see what calls it. Repeat that process until you find that the variable isn't passed into the function, and at that point you'll know you're at the method that creates it.

  3. I found myself in a function that starts with loadGridDatans, where it's part of a block that does an xhrPost call to retrieve a URL. That URL is the target you're after, so grab the name of the containing function, and loop through the calls where the URL is passed in, like you did in the above step.

  4. That search ended up on a line that looks like:

    var url = viewByDailyns_7_2AA4H0C090FIE0I1OH2JFH20K1_...

  5. At that point you can start reconstructing the URL you need. Open a JavaScript debugger, like Firebug, and put a breakpoint on that line. Reload the page and JavaScript should stop executing at that line. Single-step, or set breakpoints, and watch the url variable be created until it's in its final form. At that point you have something you can use in OpenURI, which should retrieve the JSON you want.
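
The searching in steps 1 to 4 can also be done from Ruby instead of an editor. This is only a minimal sketch: the sample script and its `_ABC` suffixes are hypothetical stand-ins for the real page's generated names, and `lines_mentioning` is a helper made up for illustration.

```ruby
# Hypothetical stand-in for the text of one <script> element; the real page's
# function names carry a long generated suffix instead of "_ABC".
SAMPLE_SCRIPT = <<~JS
  function loadGridns_ABC(data) {
    var tableHeader = "<table width='100%' class='sgxTableGrid'>";
    var totalRows = data.items.length - 1;
  }
  function viewByDailyns_ABC() {
    var url = "/hypothetical/path?view=daily";
    loadGridDatans_ABC(url);
  }
JS

# Return [line_number, stripped_line] pairs for lines containing any needle.
def lines_mentioning(script, *needles)
  script.each_line.with_index(1).filter_map do |line, number|
    [number, line.strip] if needles.any? { |needle| line.include?(needle) }
  end
end

# Print every line that mentions one of the names traced in the steps above.
lines_mentioning(SAMPLE_SCRIPT, 'loadGridns_', 'loadGridDatans', 'var url').each do |number, line|
  puts format('%3d: %s', number, line)
end
```

Against the real page you would pass each string from `scripts` instead of `SAMPLE_SCRIPT`. Searching on the name prefix rather than the full name keeps the search working if the suffix changes.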

Note that their function names might be generated dynamically; I didn't check, so relying on the full name of a function might fail.

They might also be embedding a serialized timestamp or session key to make the function names unique or more opaque, which they could be doing for any number of reasons.

Even though it's a pain to take this stuff apart, it's also a good lesson in how dynamic pages work.

Scrape a URL for data that is loaded with JavaScript using Ruby

What I managed to do is use Watir, a Ruby wrapper for Selenium, to open the page in a browser and then pass the loaded HTML into Nokogiri for parsing.

How to scrape a web page with dynamic content added by JavaScript?

To get the lazily loaded content, scrape the following pages:

http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=31&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=46&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=61&ajax=true
...

require 'nokogiri'
require 'open-uri'

# "number" is both the running product count and the start offset of the next request
number = 1
loop do
  url = "http://www.flipkart.com/mens-footwear/shoes" \
        "/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&" \
        "sid=osp%2Ccil%2Cnit%2Ce1f&start=#{number}&ajax=true"

  # The product markup is the text of the #ajax element, so parse it a second time
  doc = Nokogiri::HTML(URI.open(url))
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)

  products = doc.css(".browse-product")
  break if products.empty?

  products.each do |item|
    title = item.at_css(".fk-display-block,.title").text.strip
    # at_css returns nil when no price element exists, hence the safe navigation
    price = (item.at_css(".pu-final")&.text || '').strip
    link = item.at_xpath(".//a[@class='fk-display-block']/@href")
    image = item.at_xpath(".//div/a/img/@src")

    puts number
    puts "#{title} - #{price}"
    puts "http://www.flipkart.com#{link}"
    puts image
    puts "========================"

    number += 1
  end
end

Scraping with Nokogiri and Ruby before and after JavaScript changes the value

If you don't want to use PhantomJS, you can also use the network panel in the Firefox or Chrome developer tools, and you will see that the HTML table data is returned by a JavaScript POST request to the server.

Then instead of opening the original page URL with Nokogiri, you'd instead run this POST from your Ruby script and parse and interpret that data instead. It looks like it's just JSON data with HTML embedded into it. You could extract the HTML and feed that to Nokogiri.

It requires a bit of extra detective work, but I've used this method many times with JavaScript web pages and scraping. It works OK for most simple tasks, but it requires a bit of digging into the inner workings of the page and network traffic.
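
Replaying such a request from Ruby could look like the sketch below. open-uri only issues GET requests, so this uses Net::HTTP from the standard library. The helper names, the example.com URL, the cookie and the form fields are all placeholders; whether a given endpoint wants its parameters in the body or the query string is something you have to read off the network panel.

```ruby
require 'net/http'
require 'uri'

# Keep only the first "name=value" pair of a Set-Cookie header.
def cookie_pair(set_cookie_header)
  set_cookie_header.split(';', 2).first.strip
end

# Build (but do not send) a form-encoded POST request carrying the cookie.
def build_post(url, cookie:, form: {})
  uri = URI(url)
  request = Net::HTTP::Post.new(uri)
  request['Cookie'] = cookie
  request.set_form_data(form)
  [uri, request]
end

# Send a prepared request and return the response body.
def perform(uri, request)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request).body
  end
end

# Offline demonstration with placeholder values; nothing is sent here.
uri, request = build_post('https://example.com/api/table',
                          cookie: cookie_pair('session=abc123; Path=/; HttpOnly'),
                          form: { 'page' => '1' })
puts request.method
puts request['Cookie']
puts request.body
```

Calling `perform(uri, request)` would send the request; its return value is the raw body that you would then parse as JSON.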

Here are the URLs that return the JSON data for the JavaScript POST request:

Bonds:

https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ

CDS:

https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=cds&ClientCode=WSJ

Here's the quick and dirty solution just so you get an idea. This will grab the cookie from the initial page and use it in the request to get the JSON data, then parse the JSON data and feed the extracted HTML to Nokogiri:

require 'nokogiri'
require 'open-uri'
require 'json'

# Open the initial page to grab the cookie from it
# (URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs)
p1 = URI.open('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj')

# Save the cookie
cookie = p1.meta['set-cookie'].split('; ',2)[0]

# Open the JSON data page using the cookie we just obtained
p2 = URI.open('https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ',
              'Cookie' => cookie)

# Get the raw JSON
json = p2.read

# Parse it
data = JSON.parse(json)

# Feed the html portion to Nokogiri
doc = Nokogiri.parse(data['html'])

# Extract the values
values = doc.css('td.col2 span')
puts values.map(&:text).inspect

=> ["0.02%", "0.02%", "n.a.", "-0.03%", "0.02%", "0.04%",
"0.01%", "0.02%", "0.08%", "-0.01%", "0.03%", "0.01%", "0.05%", "0.04%"]

How to scrape data from a website using Nokogiri

Go to that page and open your browser's developer tools. Under the Network tab, find the response to the request for KM89.xml; you'll see that it's not returning HTML, but XML like this:

<?xml version="1.0" encoding="ISO-8859-1"?> 
<?xml-stylesheet href="latest_ob.xsl" type="text/xsl"?>
<current_observation version="1.0"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://www.weather.gov/view/current_observation.xsd">
<credit>NOAA's National Weather Service</credit>
<credit_URL>http://weather.gov/</credit_URL>
<image>
<url>http://weather.gov/images/xml_logo.gif</url>
<title>NOAA's National Weather Service</title>
<link>http://weather.gov</link>
</image>
<suggested_pickup>15 minutes after the hour</suggested_pickup>
<suggested_pickup_period>60</suggested_pickup_period>
<location>Dexter B Florence Memorial Field Airport, AR</location>
<station_id>KM89</station_id>
<latitude>34.1</latitude>
<longitude>-93.07</longitude>
<observation_time>Last Updated on Nov 23 2012, 7:56 am CST</observation_time>
<observation_time_rfc822>Fri, 23 Nov 2012 07:56:00 -0600</observation_time_rfc822>
<weather>Light Rain</weather>
<temperature_string>57.0 F (13.8 C)</temperature_string>
<temp_f>57.0</temp_f>
<temp_c>13.8</temp_c>
<relative_humidity>87</relative_humidity>
<wind_string>Northeast at 8.1 MPH (7 KT)</wind_string>
<wind_dir>Northeast</wind_dir>
<wind_degrees>30</wind_degrees>
<wind_mph>8.1</wind_mph>
<wind_kt>7</wind_kt>
<pressure_string>1027.5 mb</pressure_string>
<pressure_mb>1027.5</pressure_mb>
<pressure_in>30.30</pressure_in>
<dewpoint_string>52.9 F (11.6 C)</dewpoint_string>
<dewpoint_f>52.9</dewpoint_f>
<dewpoint_c>11.6</dewpoint_c>
<windchill_string>55 F (13 C)</windchill_string>
<windchill_f>55</windchill_f>
<windchill_c>13</windchill_c>
<visibility_mi>10.00</visibility_mi>
<icon_url_base>http://forecast.weather.gov/images/wtf/small/</icon_url_base>
<two_day_history_url>http://www.weather.gov/data/obhistory/KM89.html</two_day_history_url>
<icon_url_name>ra1.png</icon_url_name>
<ob_url>http://www.weather.gov/data/METAR/KM89.1.txt</ob_url>
<disclaimer_url>http://weather.gov/disclaimer.html</disclaimer_url>
<copyright_url>http://weather.gov/disclaimer.html</copyright_url>
<privacy_policy_url>http://weather.gov/notice.html</privacy_policy_url>
</current_observation>

So you can scrape it like this:

require 'open-uri'
require 'nokogiri'

url = 'http://w1.weather.gov/xml/current_obs/KM89.xml'
# The response is XML, so use the XML parser rather than the HTML one
doc = Nokogiri::XML(URI.open(url))

p doc.at_css('station_id').text

Scraping data using Nokogiri

The . for the class main-header is missing. It should be:

doc.at_css('.main-header span').text

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.

As the linked answer already says, the data is loaded asynchronously. This means that the data is not present in the initial page and is only loaded when the browser's JavaScript engine executes the page's code.

When you open up the browser development tools, go to the "Network" tab. You can clear out all requests, then refresh the page. You'll get to see a list of all requests made. If you're looking for asynchronously loaded data the most interesting requests are often those of type "json" or "xml".

When browsing through the requests you'll find that the data you're looking for is located at:

https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json

Since this is JSON you don't need "nokogiri" to parse it.

require 'httparty'
require 'json'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)

When executing the above you'll get the exception:

JSON::ParserError ...

The cause seems to be a byte order mark (BOM) at the start of the body that HTTParty does not remove, most likely because the response doesn't specify a UTF-8 charset.

response.body[0]
#=> ""
format '%X', response.body[0].ord
#=> "FEFF"

To handle the BOM correctly, Ruby 2.7 added the set_encoding_by_bom method to IO; it is also available on StringIO.

require 'httparty'
require 'json'
require 'stringio'

response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...

If you're not yet using Ruby 2.7, you can strip the BOM with a substitution instead, although the former is probably the safer option:

data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
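
The StringIO approach can be tried without any network access. The following is a small sketch under that assumption: `parse_json_skipping_bom` is a helper name made up here, and the sample body is a hypothetical stand-in for `response.body`.

```ruby
require 'json'
require 'stringio'

# Parse JSON from a raw response body, skipping a leading BOM if there is one.
def parse_json_skipping_bom(raw)
  io = StringIO.new(raw.dup) # dup so the caller's string is left untouched
  io.set_encoding_by_bom     # consumes the BOM and sets the encoding; nil if absent
  JSON.parse(io.read)
end

# Hypothetical body: a UTF-8 BOM followed by JSON, tagged as binary.
sample = "\xEF\xBB\xBF[{\"Jurisdiction\":\"Alabama\",\"Cases Reported\":10145}]".b
p parse_json_skipping_bom(sample)
```

Because `set_encoding_by_bom` returns nil and leaves the position alone when there is no BOM, the same helper also works for bodies that start directly with JSON.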

How to get the div tag that is loaded after sometime using Nokogiri

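The content you are looking for is probably added by JavaScript (jQuery or AJAX) after the initial page load. Nokogiri only parses the markup it is given and does not execute JavaScript, so it never sees that content.

You should look at the "Watir" gem and use it to open the URL in a browser; you can then pass the rendered HTML to Nokogiri for parsing.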


