How to parse JavaScript using Nokogiri and Ruby
If I read you correctly, you're trying to parse the JavaScript and get a Ruby array with your image URLs, yes?
Nokogiri only parses HTML/XML, so you're going to need a different library. A cursory search turns up the RKelly library, which has a parse function that takes a JavaScript string and returns a parse tree.
Once you have a parse tree, you're going to need to traverse it and find the nodes of interest by name (e.g. _arPic), then get the string content on the other side of the assignment.
Alternatively, if it doesn't have to be too robust (and it wouldn't be), you can just use a regex to search the JavaScript:
/^\s*_arPic\[\d+\] = "(.+)";$/
might be a good starter regex. The \d+ lets it match indexes of 10 and above as well.
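A minimal sketch of the regex approach, assuming the script text looks like the sample below. The sample data and the helper name extract_pic_urls are made up for illustration; the pattern is a slightly loosened variant of the starter regex, not a tested match for the real page.

```ruby
# Sample JavaScript standing in for the page's script text (hypothetical data).
JS_SOURCE = <<~JS
  var _arPic = new Array();
  _arPic[0] = "http://example.com/a.jpg";
  _arPic[1] = "http://example.com/b.jpg";
  _arPic[10] = "http://example.com/c.jpg";
JS

# Collect every string assigned to an _arPic[n] slot.
# \d+ accepts multi-digit indexes; [^"]+ stops at the closing quote.
def extract_pic_urls(js)
  js.scan(/^\s*_arPic\[\d+\]\s*=\s*"([^"]+)";\s*$/).flatten
end

p extract_pic_urls(JS_SOURCE)
```

scan returns one single-element array per match because the pattern has one capture group, so flatten gives a plain array of URL strings.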
Scraping with Nokogiri and Ruby before and after JavaScript changes the value
If you don't want to use PhantomJS, you can also use the network sniffer in the Firefox or Chrome developer tools, and you will see that the HTML table data is returned by a JavaScript POST request to the server.
Then, rather than opening the original page URL with Nokogiri, you run this POST from your Ruby script and parse and interpret that data. It looks like it's just JSON data with HTML embedded into it. You could extract the HTML and feed that to Nokogiri.
It requires a bit of extra detective work, but I've used this method many times with JavaScript web pages and scraping. It works OK for most simple tasks, but it requires a bit of digging into the inner workings of the page and network traffic.
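The two mechanical steps in this approach can be tried offline before touching the network. The sketch below assumes the response is a JSON object with an "html" key and that the server sets a session cookie on the first page load; the helper names, the sample header and the sample body are all hypothetical stand-ins.

```ruby
require 'json'

# Keep only the "name=value" pair from a Set-Cookie header, dropping attributes.
def cookie_pair(set_cookie_header)
  set_cookie_header.split('; ', 2)[0]
end

# Pull the embedded HTML fragment out of the JSON response body.
def embedded_html(json_body)
  JSON.parse(json_body)['html']
end

# Hypothetical stand-ins for what the server would send.
sample_header = 'SessionId=abc123; path=/; HttpOnly'
sample_body   = '{"html": "<table><tr><td class=\"col2\"><span>0.02%</span></td></tr></table>"}'

puts cookie_pair(sample_header)
puts embedded_html(sample_body)
```

The string returned by embedded_html is what you would then hand to Nokogiri.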
Here's an example of the JSON data from the JavaScript POST request:
Bonds:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ
CDS:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=cds&ClientCode=WSJ
Here's the quick and dirty solution just so you get an idea. This will grab the cookie from the initial page and use it in the request to get the JSON data, then parse the JSON data and feed the extracted HTML to Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'json'
# Open the initial page to grab the cookie from it
# (URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs)
p1 = URI.open('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj')
# Save the cookie
cookie = p1.meta['set-cookie'].split('; ',2)[0]
# Open the JSON data page using our cookie we just obtained
p2 = URI.open('https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ',
'Cookie' => cookie)
# Get the raw JSON
json = p2.read
# Parse it
data = JSON.parse(json)
# Feed the html portion to Nokogiri
doc = Nokogiri.parse(data['html'])
# Extract the values
values = doc.css('td.col2 span')
puts values.map(&:text).inspect
# => ["0.02%", "0.02%", "n.a.", "-0.03%", "0.02%", "0.04%",
#     "0.01%", "0.02%", "0.08%", "-0.01%", "0.03%", "0.01%", "0.05%", "0.04%"]
How can I parse a page with Nokogiri when content is loaded using Javascript?
The data is JSON, so you don't need Nokogiri. For example:
require 'open-uri'
require 'json'
hash = JSON.parse(URI.open('http://api.twitch.tv/kraken/games/top?limit=10&on_site=1').read)
How to scrape script tags with Nokogiri and Mechanize
Mechanize is overkill if all you are using it for is to retrieve a page. There are many HTTP client gems that'll easily do that, or use OpenURI which is part of Ruby's standard library.
These are the basics for retrieving the information. You'll need to figure out which particular script you want, but Nokogiri's tutorials will cover the rest:
require 'json'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('https://www.foodpantries.org/li/st._paul_the_apostle_11368'))
At this point Nokogiri has a DOM created of the page in memory.
Find the <script> node you want, and extract the text of the node:
js = doc.at('script[type="application/ld+json"]').text
at and search are the workhorses for parsing a page. There are CSS and XPath specific variants, but generally you can use the generic versions and Nokogiri will figure out which to use. All are documented on the same page as at and search, and in the tutorials.
JSON is smart and allows us to use a shorthand of JSON[...] to parse or generate a JSON string. In this case it's parsing a string back into a Ruby object, which in this instance is a hash:
JSON[js]
# => {"@context"=>"https://schema.org",
# "@type"=>"Organization",
# "url"=>"https://www.foodpantries.org/",
# "sameAs"=>[],
# "contactPoint"=>
# [{"@type"=>"ContactPoint",
# "contactType"=>"customer service",
# "url"=>"https://www.foodpantries.org/ar/about",
# "email"=>"webmaster@foodpantries.org"}]}
Accessing a particular key/value pair is simple, just as with any other hash:
foo = JSON[js]
foo['url'] # => "https://www.foodpantries.org/"
The page you're referring to has multiple scripts that match the selector I used, so you'll want to filter using a more exact selector, or iterate over the matches and pick the one you want. How to do that is well documented here on SO, for both CSS and XPath, and in Nokogiri's documentation.
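Iterating over the matches and picking one can be sketched as below. The array of strings stands in for what doc.search('script[type="application/ld+json"]').map(&:text) would return; the sample bodies and the helper names parse_or_nil and find_ld_json are hypothetical, and selecting on "@type" is just one possible criterion.

```ruby
require 'json'

# Parse a script body, returning nil instead of raising when it is not valid JSON.
def parse_or_nil(text)
  JSON.parse(text)
rescue JSON::ParserError
  nil
end

# Return the first parsed hash whose "@type" equals the requested type, or nil.
def find_ld_json(script_texts, type)
  script_texts
    .map { |text| parse_or_nil(text) }
    .find { |data| data.is_a?(Hash) && data['@type'] == type }
end

# Hypothetical script bodies, including one that is not JSON.
scripts = [
  '{"@type": "WebSite", "url": "https://example.com/"}',
  'not json at all',
  '{"@type": "Organization", "url": "https://example.com/org"}'
]

org = find_ld_json(scripts, 'Organization')
puts org['url']
```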
Parsing javascript function elements with nokogiri
Nokogiri doesn't do JavaScript parsing, but this is not too hard to parse with a regular expression:
element = agent.page.search("script")[7]
text = element.text # not 100% sure on this line. Just need the script text though.
Hash[text.scan(/countdownFactory\.create\('(\d+)', '(\d+)', ''\)/)]
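To see what that expression produces without a live page, here is a small sketch run against made-up script text. The argument values and the helper name countdown_pairs are assumptions; only the call shape comes from the snippet above.

```ruby
# Hypothetical script text; the real page's arguments may differ.
SCRIPT_TEXT = <<~JS
  countdownFactory.create('101', '3600', '');
  countdownFactory.create('102', '7200', '');
JS

# Each match yields a [first_argument, second_argument] pair,
# and Hash[] turns the list of pairs into a hash.
def countdown_pairs(text)
  Hash[text.scan(/countdownFactory\.create\('(\d+)', '(\d+)', ''\)/)]
end

p countdown_pairs(SCRIPT_TEXT)
```

Both keys and values come back as strings, so call to_i on them if you need numbers.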
Nokogiri - find the value inside a javascript array
You can do this pretty easily:
require 'nokogiri'
doc = Nokogiri::HTML('<script>
var foo = [bar, [a, b, c , d], value, some value, . . ]
</script>
')
js = doc.at('script').text
right_side = js.split('=', 2).last
b = right_side.split(',')[2]
b # => " b"
Testing with a real value:
require 'nokogiri'
doc = Nokogiri::HTML('<script>
var foo = [bar, [a, 123, c , d], value, some value, . . ]
</script>
')
js = doc.at('script').text
right_side = js.split('=', 2).last
b = right_side.split(',')[2]
b # => " 123"
b.to_i # => 123
The downside is it's susceptible to changes in the JavaScript string formatting, which makes it fragile. You get to decide whether you want to go down that path.
Remember, all content in HTML source is a string, so you can tear things up using normal string processing once you've narrowed down what you want to look at.
I can't parse the page and get links Nokogiri
I found a more interesting solution. For example:
link_driver = Nokogiri::HTML(page.source).at('a:contains("mac")').values.join('')
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
File.new('filename.zip', 'w') << URI.parse(chromedriver_storage + link_driver).read
contains("mac") can be changed to contains("linux") or contains("win"), depending on which operating system's build you need.
A second solution is to parse the page chromedriver.chromium.org and get information about all versions. If the version on the site is more recent than mine, I substitute the new version number into the download link:
chromedriver_storage = 'https://chromedriver.storage.googleapis.com/'
chromedriver = '79.0.3945.36/' # get this using Capybara and keep only the version
zip = 'chromedriver_mac64.zip'
link = chromedriver_storage + chromedriver + zip
File.new('filename.zip', 'w') << URI.parse(link).read
It turns out that the parser, running in headless mode, can be added to a crontab task to keep up with the version of the current browser.
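The "more recent than mine" check from the second solution can be sketched with Gem::Version, which ships with RubyGems and compares dotted version strings numerically. The helper names and the installed version below are hypothetical; the storage URL and archive name are the ones used in the answer.

```ruby
# Compare two dotted version strings numerically rather than as plain text.
def newer_version?(remote, local)
  Gem::Version.new(remote) > Gem::Version.new(local)
end

# Build the download URL for a given version and archive name.
def chromedriver_link(version, zip = 'chromedriver_mac64.zip')
  "https://chromedriver.storage.googleapis.com/#{version}/#{zip}"
end

installed = '78.0.3904.105' # hypothetical locally installed version
latest    = '79.0.3945.36'  # version taken from the answer

puts chromedriver_link(latest) if newer_version?(latest, installed)
```

A plain string comparison would rank '9.0' above '10.0', which is why the sketch goes through Gem::Version.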
How to parse a JavaScript-based page
You need to use the Watir gem for this one, since the content is loaded through Ajax.
Also, it seems they have an API, so you may want to take a look at that.
ruby nokogiri restclient to scrape javascript variable
Not sure if that fits, but you could retrieve it as follows:
irb(main):017:0> string
=> "<script type=\"text/javascript\"> $(function(){$.Somenamespace.theCurrency = \"EUR\"}); "
irb(main):018:0> string.scan(/\$\.Somenamespace\.(.*)}\);/)
=> [["theCurrency = \"EUR\""]]
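If only the value is needed, a capture group around the quoted part avoids a second round of string handling. This is a sketch under the assumption that the assignment always has the form shown in the irb session; the helper name currency_from is made up.

```ruby
# Return the quoted value assigned to theCurrency, or nil when it is absent.
def currency_from(html)
  html[/\$\.Somenamespace\.theCurrency\s*=\s*"([^"]+)"/, 1]
end

string = '<script type="text/javascript"> $(function(){$.Somenamespace.theCurrency = "EUR"}); '
puts currency_from(string)
```

String#[] with a regex and a group index returns just that group, so there is no nested array to unpack.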