How to Parse JavaScript Using Nokogiri and Ruby


If I read you correctly, you're trying to parse the JavaScript and end up with a Ruby array of your image URLs, yes?

Nokogiri only parses HTML/XML, so you're going to need a different library. A cursory search turns up the RKelly library, which has a parse function that takes a JavaScript string and returns a parse tree.

Once you have a parse tree, you'll need to traverse it, find the nodes of interest by name (e.g. _arPic), then get the string content on the other side of the assignment.

Alternatively, if the solution doesn't have to be very robust (and a regex-based one won't be), you can simply search the JavaScript with a regular expression:

/^\s*_arPic\[\d+\] = "(.+)";$/

might be a good starter regex.
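A minimal sketch of the regex approach. The JavaScript snippet below is an invented stand-in for the page's real script; only the _arPic name comes from the question:

```ruby
# Invented sample of the kind of JavaScript being scraped
js = <<~JS
  var _arPic = new Array();
  _arPic[0] = "http://example.com/pic0.jpg";
  _arPic[1] = "http://example.com/pic1.jpg";
JS

# scan returns one array per match, one element per capture group
urls = js.scan(/^\s*_arPic\[\d+\] = "(.+)";$/).flatten
urls # => ["http://example.com/pic0.jpg", "http://example.com/pic1.jpg"]
```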

Scraping with Nokogiri and Ruby before and after JavaScript changes the value

If you don't want to use PhantomJS, you can also use the network panel in Firefox's or Chrome's developer tools, where you will see that the HTML table data is returned by a JavaScript-initiated POST request to the server.

Then, instead of opening the original page URL with Nokogiri, you'd issue that POST from your Ruby script and parse and interpret the response. It looks like it's just JSON data with HTML embedded in it. You can extract the HTML and feed that to Nokogiri.

It takes a bit of extra detective work, but I've used this method many times when scraping JavaScript-heavy web pages. It works well for most simple tasks, though it does require digging into the inner workings of the page and its network traffic.

Here's an example of the JSON data from the JavaScript POST request:

Bonds:

https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ

CDS:

https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=cds&ClientCode=WSJ

Here's the quick and dirty solution just so you get an idea. This will grab the cookie from the initial page and use it in the request to get the JSON data, then parse the JSON data and feed the extracted HTML to Nokogiri:

require 'nokogiri'
require 'open-uri'
require 'json'

# Open the initial page to grab the cookie from it
p1 = URI.open('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj')

# Save the cookie (the first "name=value" pair of the Set-Cookie header)
cookie = p1.meta['set-cookie'].split('; ', 2)[0]

# Open the JSON data page, sending along the cookie we just obtained
p2 = URI.open('https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ',
              'Cookie' => cookie)

# Read and parse the raw JSON
data = JSON.parse(p2.read)

# Feed the HTML portion to Nokogiri
doc = Nokogiri::HTML(data['html'])

# Extract the values
values = doc.css('td.col2 span')
puts values.map(&:text).inspect

=> ["0.02%", "0.02%", "n.a.", "-0.03%", "0.02%", "0.04%",
"0.01%", "0.02%", "0.08%", "-0.01%", "0.03%", "0.01%", "0.05%", "0.04%"]

How can I parse a page with Nokogiri when content is loaded using Javascript?

The data is JSON, so you don't need Nokogiri for it. For example:

require 'open-uri'
require 'json'

hash = JSON.parse(URI.open('http://api.twitch.tv/kraken/games/top?limit=10&on_site=1').read)
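The overall shape is the same for any JSON endpoint. Here's a self-contained sketch using an inline sample payload (the field names below are invented for illustration, not the actual Twitch schema):

```ruby
require 'json'

# Invented sample payload standing in for an API response body
body = '{"top": [{"game": {"name": "Tetris"}, "viewers": 120}]}'

# JSON.parse turns the string into plain Ruby hashes and arrays
hash = JSON.parse(body)
names = hash['top'].map { |entry| entry['game']['name'] }
names # => ["Tetris"]
```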

How to scrape script tags with Nokogiri and Mechanize

Mechanize is overkill if all you're using it for is retrieving a page. There are many HTTP client gems that will do that easily, or you can use OpenURI, which is part of Ruby's standard library.

Here are the basics for retrieving the information. You'll still need to figure out which particular script you want, and Nokogiri's tutorials will cover that:

require 'json'
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('https://www.foodpantries.org/li/st._paul_the_apostle_11368'))

At this point Nokogiri has a DOM created of the page in memory.

Find the <script> node you want, and extract the text of the node:

js = doc.at('script[type="application/ld+json"]').text

at and search are the workhorses for searching a page. There are CSS- and XPath-specific variants (at_css/at_xpath and css/xpath), but generally you can use the generic versions and Nokogiri will figure out which to use. All are documented on the same page as at and search and in the tutorials.

JSON is smart and allows us to use a shorthand of JSON[...] to parse or generate a JSON string. In this case it's parsing a string back into a Ruby object, which in this instance is a hash:

JSON[js]
# => {"@context"=>"https://schema.org",
# "@type"=>"Organization",
# "url"=>"https://www.foodpantries.org/",
# "sameAs"=>[],
# "contactPoint"=>
# [{"@type"=>"ContactPoint",
# "contactType"=>"customer service",
# "url"=>"https://www.foodpantries.org/ar/about",
# "email"=>"webmaster@foodpantries.org"}]}

Accessing a particular key/value pair is simple, just as with any other hash:

foo = JSON[js]
foo['url'] # => "https://www.foodpantries.org/"

The page you're referring to has multiple scripts that match the selector I used, so you'll want to filter using a more exact selector, or iterate over the matches and pick the one you want. How to do that is well documented on Stack Overflow and in Nokogiri's documentation, using either CSS or XPath.

Parsing javascript function elements with nokogiri

Nokogiri doesn't do JavaScript parsing, but this is not too hard to parse with a regular expression:

element = agent.page.search('script')[7]
text = element.text # the raw JavaScript source of that script tag
Hash[text.scan(/countdownFactory\.create\('(\d+)', '(\d+)', ''\)/)]
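Here's a self-contained sketch of how Hash plus scan pairs up the two capture groups; the script text below is invented to mirror the pattern:

```ruby
# Invented stand-in for the <script> tag's text
text = <<~JS
  countdownFactory.create('1001', '30', '');
  countdownFactory.create('1002', '45', '');
JS

# scan yields [[id, seconds], ...]; Hash[] turns the pairs into a hash
pairs = text.scan(/countdownFactory\.create\('(\d+)', '(\d+)', ''\)/)
Hash[pairs] # => {"1001"=>"30", "1002"=>"45"}
```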

Nokogiri - find the value inside a javascript array

You can do this pretty easily:

require 'nokogiri'

doc = Nokogiri::HTML('<script>
var foo = [bar, [a, b, c , d], value, some value, . . ]
</script>
')

js = doc.at('script').text
right_side = js.split('=', 2).last
b = right_side.split(',')[2]
b # => " b"

Testing with a real value:

require 'nokogiri'

doc = Nokogiri::HTML('<script>
var foo = [bar, [a, 123, c , d], value, some value, . . ]
</script>
')

js = doc.at('script').text
right_side = js.split('=', 2).last
b = right_side.split(',')[2]
b # => " 123"
b.to_i # => 123

The downside is it's susceptible to changes in the JavaScript string formatting, which makes it fragile. You get to decide whether you want to go down that path.

Remember, all content in HTML source is a string, so you can tear things up using normal string processing once you've narrowed down what you want to look at.

I can't parse the page and get links Nokogiri

I found a more interesting solution. For example:

chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
link = Nokogiri::HTML(page.source).at('a:contains("mac")').values.join('')
File.new('filename.zip', 'w') << URI.open(chromedriver_storage + link).read

You can change contains("mac") to contains("linux") or contains("win"), depending on which operating system build you want.

A second solution is to parse the page chromedriver.chromium.org and get information about all versions. If the version on the site is more recent than mine, I substitute the version number into a new download link:

chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
chromedriver = '79.0.3945.36/' # obtained with Capybara, trimmed down to just the version
zip = 'chromedriver_mac64.zip'

link = chromedriver_storage + chromedriver + zip
File.new('filename.zip', 'w') << URI.open(link).read

This way the scraper, run headless from a crontab task, can keep the driver in sync with the current browser version.

How to parse a JavaScript-based page

You need to use the Watir gem for this one, since the content is loaded through Ajax.
Also, it seems they have an API; you may want to take a look at that instead.

ruby nokogiri restclient to scrape javascript variable

Not sure if this fits exactly, but you could retrieve it as follows:

irb(main):017:0> string
=> "<script type=\"text/javascript\"> $(function(){$.Somenamespace.theCurrency = \"EUR\"}); "

irb(main):018:0> string.scan(/\$\.Somenamespace\.(.*)}\);/)
=> [["theCurrency = \"EUR\""]]
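If you only want the value itself, a tighter capture group does it. A sketch against the same sample string:

```ruby
# Same sample string as in the irb session above
string = '<script type="text/javascript"> $(function(){$.Somenamespace.theCurrency = "EUR"}); '

# String#[] with a regex and a capture index returns just that group
currency = string[/\$\.Somenamespace\.theCurrency = "([^"]+)"/, 1]
currency # => "EUR"
```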

