Phantomjs Page.Content Isn't Retrieving the Page Content

Why is phantomJS unable to get markup from this page?

CasperJS is a helper library for running PhantomJS scripts. PhantomJS has a quite outdated web engine that doesn't support modern JavaScript, so more and more sites break in it. The target site is one of those that won't work fully in PhantomJS, or even in Internet Explorer 11.

But with polyfills, libraries that emulate newer JavaScript features in older browsers, we can keep using CasperJS a little longer. Here I inject the excellent core.js library right after a page is created in PhantomJS, but before navigating to the site. This way our old browser gains a new set of JavaScript features.

var casper = require('casper').create({
// it's better to blend with the crowd
pageSettings: {
userAgent: "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
loadImages: false
},
viewportSize : { width: 1280, height: 720 },
verbose: true,
});

// Apply the bandaid in the form of core.js polyfill
casper.on('page.initialized', function() {
casper.page.injectJs('./core.js');
});

casper.start('https://betyetu.co.ke/sportsbook/SOCCER/');

casper.waitFor(function check() {
return this.evaluate(function () {
return document.querySelectorAll('div.events-app__group').length > 1;
});
}, function then() {
var count = this.evaluate(function () {
return document.querySelectorAll('div.events-app__group').length;
});
this.echo('Found elements: ' + count);
casper.capture('screen.jpg');
}, function timeout() {
this.echo('Still timing out before returning element count');
}, 5000);

casper.run();

Found elements: 28
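
To see which features a polyfill actually has to supply, one option is a small feature check. The sketch below is an illustration and not part of the original answer: missingFeatures is a hypothetical helper, and the list of checked features is an arbitrary sample. It takes the global object as a parameter so it can also be tried outside PhantomJS.

```javascript
// Hypothetical helper: returns the names of features that are absent
// from the given global object (e.g. `window` inside a page).
function missingFeatures(root) {
    var checks = {
        'Promise': typeof root.Promise === 'function',
        'Map': typeof root.Map === 'function',
        'Set': typeof root.Set === 'function',
        'Symbol': typeof root.Symbol === 'function',
        'Object.assign': !!root.Object &&
            typeof root.Object.assign === 'function',
        'Array.prototype.includes': !!root.Array &&
            typeof root.Array.prototype.includes === 'function'
    };
    var missing = [];
    for (var name in checks) {
        if (checks.hasOwnProperty(name) && !checks[name]) {
            missing.push(name);
        }
    }
    return missing;
}
```

In a page context it would be called as missingFeatures(window). On an old engine some of these names would be expected in the result, and fewer of them once the polyfill has been injected.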

PhantomJS can't retrieve a website's entire html / can't find web elements

PhantomJS is a headless browser, so many things you can do in Firefox, IE and Chrome are impossible in it. For example:

Unsupported Features

Support for plugins (such as Flash) was dropped a long time ago. The primary reasons:

  • Pure headless (no X11) makes it impossible to have windowed plugin

  • Issues and bugs are hard to debug, due to the proprietary nature of such plugins (binary blobs)

The following features, due to the nature of PhantomJS, are irrelevant:

  • WebGL would require an OpenGL-capable system. Since the goal of PhantomJS is to become 100% headless and self-contained, this is not acceptable. Using OpenGL emulation via Mesa can overcome the limitation, but then the performance would degrade.

  • Video and Audio would require shipping a variety of different codecs.

  • CSS 3-D needs a perspective-correct implementation of texture mapping. It can’t be implemented without a penalty in performance.

Use phantomJS in R to scrape page with dynamically loaded content

I've split the PhantomJS code into two parts, which avoids the error messages. I'm quite confident it is possible to first read and store the website and afterwards click on the "next page" button and output the new URL in a single script, but unfortunately that didn't work without an error message.

The following R code is the innermost scraping loop (it retrieves info from the pages of one sub-sub-category and calls / changes the PhantomJS scripts accordingly).

   for (i3 in 1:num_prod_pages) {

system('phantomjs readhtml.js') # return html of current page via PhantomJS

### Use Rvest to scrape the downloaded website.

html_filename <- paste0(as.character(i3), '.html') # file generated in PhantomJS readhtml.js
web <- read_html(html_filename)
content <- html_nodes(web, 'div.article-pricing') # %>% html_attr('href')
content <- html_text(content) %>% as.data.frame()

### generate URL of next page

url_i3 <- capture.output(system("phantomjs nextpage.js", intern = TRUE)) %>%
.[length(.)] %>% # last line of output contains the URL
str_sub(str_locate(., 'http')[1], -2) # cut '[1] \' at start and ' \" ' at end

# Adapt PhantomJS scripts to new url

lines <- readLines("readhtml.js")
lines[2] <- paste0("var url ='", url_i3 ,"';")
lines[11] <- paste0(" fs.write('", as.character(i3), ".html', page.content, 'w');")
writeLines(lines, "readhtml.js")

lines <- readLines("nextpage.js")
lines[2] <- paste0("var url ='", url_i3 ,"';")
writeLines(lines, "nextpage.js")
}

The following PhantomJS script, "readhtml.js", stores the page at the current URL locally:

var webPage = require('webpage');
var url ='http://www.hornbach.de/shop/Badarmaturen/S476/artikelliste.html';
var fs = require('fs');
var page = webPage.create();
var system = require('system');

//page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0'
page.settings.userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

page.open(url, function(status) {
if (status === 'success') {
fs.write('1.html', page.content, 'w');
console.log('htmlfile ready');
phantom.exit();
}
})
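
Rewriting script lines from R works, but it depends on exact line positions in the file. One possible alternative, sketched here under assumptions and not taken from the original answer, is to pass the URL and the output file name as command-line arguments. The script name readhtml_args.js and the helper parseArgs are hypothetical; system.args is the PhantomJS argument array, whose first element is the script name.

```javascript
// Hypothetical helper: turn PhantomJS's system.args into named options.
// args[0] is the script name, so user arguments start at index 1.
function parseArgs(args) {
    if (args.length < 3) {
        return null; // caller should print usage and exit
    }
    return { url: args[1], outFile: args[2] };
}

// The PhantomJS wiring runs only when the `phantom` global exists,
// so the helper above can also be loaded in other JavaScript runtimes.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var fs = require('fs');
    var page = require('webpage').create();
    var opts = parseArgs(system.args);

    if (!opts) {
        console.log('Usage: phantomjs readhtml_args.js <url> <output-file>');
        phantom.exit(1);
    } else {
        page.open(opts.url, function (status) {
            if (status === 'success') {
                fs.write(opts.outFile, page.content, 'w');
                console.log('htmlfile ready');
                phantom.exit(0);
            } else {
                // Exit on failure too, otherwise the process never ends.
                console.log('Failed to load ' + opts.url);
                phantom.exit(1);
            }
        });
    }
}
```

From R, such a script could then be called with something like system(paste('phantomjs readhtml_args.js', shQuote(url_i3), shQuote(html_filename))), which would make the readLines / writeLines step unnecessary.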

The following PhantomJS script, "nextpage.js", clicks the "next page" button and returns the new URL:

var webPage = require('webpage');
var url ='http://www.hornbach.de/shop/Badarmaturen/S476/artikelliste.html';
var fs = require('fs');
var page = webPage.create();
var system = require('system');

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0';

page.open(url, function(status) {
if (status === 'success') {
page.evaluate(function() {
document.querySelector('a.right:nth-child(3)').click();
});
setTimeout(function() {
var new_url = page.url;
console.log('URL: ' + new_url);
phantom.exit();
}, 2000);
}
});
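
The fixed two-second setTimeout above is a guess at how long the navigation takes. A possible refinement, shown as a sketch and not as the author's method, is to poll until page.url actually changes. The helper waitFor below is hypothetical (written here for illustration), and the 10000 ms timeout and 250 ms interval are arbitrary assumptions.

```javascript
// Hypothetical polling helper: call `test` every `intervalMs` until it
// returns true (then run onReady) or `timeoutMs` elapses (then onTimeout).
function waitFor(test, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (test()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// The PhantomJS wiring runs only when the `phantom` global exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://www.hornbach.de/shop/Badarmaturen/S476/artikelliste.html';

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Failed to load ' + url);
            phantom.exit(1);
            return;
        }
        var oldUrl = page.url;
        page.evaluate(function () {
            document.querySelector('a.right:nth-child(3)').click();
        });
        waitFor(
            function () { return page.url !== oldUrl; },
            function () { console.log('URL: ' + page.url); phantom.exit(0); },
            function () { console.log('URL did not change in time'); phantom.exit(1); },
            10000, // assumed upper bound for the navigation, in ms
            250    // assumed polling interval, in ms
        );
    });
}
```

The success branch prints the same 'URL: ...' line as the script above, so the R code that extracts the URL from the last line of output would not need to change.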

All in all it's not really elegant, but lacking other input I'm closing this one, as it works without any error messages.

How to stop/abort/cancel a page load in PhantomJS?

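page.onResourceRequested = function(requestData, request) {
    // someregexforurl is a placeholder for the pattern of the URL you need
    var matchUrlNeeded = (/someregexforurl\/js/).test(requestData.url);
    if (matchUrlNeeded) {
        doStuffWithTheUrl(requestData.url); // placeholder for your own handler
        request.abort(); // cancel this request
        page.stop();     // stop() halts loading of the page
    } else {
        //console.log("NO MATCH : " + requestData.url);
        //request.abort(); // uncomment to cancel every non-matching request as well
    }
};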

Phantomjs load static content only in page.open

According to this issue you can call request.abort() inside page.onResourceRequested. The example given there, to stop all css being loaded, is:

page.onResourceRequested = function(requestData, request) {
if ((/http:\/\/.+?.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
console.log('The url of the request is matching. Aborting: ' + requestData['url']);
request.abort();
}
}
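
Note that the pattern in the quoted example only matches http:// URLs, and its unescaped dot before css matches any character. A slightly stricter check could look like the sketch below. The helper isStylesheetUrl is hypothetical, and treating "path ends in .css" as the definition of a stylesheet is an assumption that will miss stylesheets served from other paths.

```javascript
// Hypothetical helper: true when the URL's path ends in ".css",
// ignoring any query string or fragment.
function isStylesheetUrl(url) {
    var path = url.split('#')[0].split('?')[0];
    return /\.css$/i.test(path);
}

// The PhantomJS wiring runs only when the `phantom` global exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function (requestData, request) {
        if (isStylesheetUrl(requestData.url)) {
            console.log('Aborting stylesheet request: ' + requestData.url);
            request.abort();
        }
    };
}
```

The same handler shape works for other resource types by swapping the predicate, for example one that tests for image extensions.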

