Scraping a Dynamically Loading Website with PHP Curl

Scraping a dynamically loading website with php curl

You need a headless browser for this. You can use the PHP wrapper for PhantomJS; here is the link: http://jonnnnyw.github.io/php-phantomjs/. This will solve your problem. It has the following features (a short usage sketch follows the list):

  • Load webpages through the PhantomJS headless browser
  • View detailed response data including page content, headers, status code etc.
  • Handle redirects
  • View JavaScript console errors
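
For illustration, here is a minimal sketch of how the wrapper is typically used. The URL is just a placeholder, and the method names follow the library's documented basic usage, so double-check them against the docs for the version you install:

<?php
require 'vendor/autoload.php'; // assumes installation via Composer

use JonnyW\PhantomJs\Client;

// Get the wrapper's client (assumes the phantomjs binary is installed and reachable)
$client = Client::getInstance();

// Build a GET request for the page that loads its content via JavaScript
$request  = $client->getMessageFactory()->createRequest('http://example.com/dynamic-page', 'GET');
$response = $client->getMessageFactory()->createResponse();

// PhantomJS loads the page, runs its JavaScript, and fills the response
$client->send($request, $response);

if ($response->getStatus() === 200) {
    // Rendered HTML, including content injected by JavaScript
    echo $response->getContent();
}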

Hope this helps.

How to scrape any type of website

Continuing my own research into scraping sites, I was unable to find a perfect solution, but the most powerful solution I came up with is to use the PhantomJS module with Node.js. You can find this module here.

For the installation guide, follow the documentation. PhantomJS is used asynchronously in Node.js, so it is a lot easier to get the results, and it is easy to interact with it using Express on the server side and Ajax or Socket.io on the client side to enhance the functionality.

Below is the code I came up with:

const phantom = require('phantom');
const ev = require('events');
const event = new ev.EventEmitter();

var MAIN_URL, TOTAL_PAGES, TOTAL_JOBS, PAGE_DATA_COUNTER = 0, PAGE_COUNTER = 0,
    PAGE_JOBS_DETAILS = [], IND_JOB_DETAILS = [], JOB_NUMBER = 1, CURRENT_PAGE = 1,
    PAGE_WEIGHT_TIME, CLICK_NEXT_TIME, CURRENT_WEBSITE, CURR_WEBSITE_LINK,
    CURR_WEBSITE_NAME, CURR_WEBSITE_INDEX, PH_INSTANCE, PH_PAGE;

function InitScrap() {

    // Initiate the scraper: create the PhantomJS instance and a blank page
    this.init = async function (url) {
        MAIN_URL = url;
        PH_INSTANCE = await phantom.create();
        PH_PAGE = await PH_INSTANCE.createPage();
        console.log("Scraper initiated, please wait...");
        return "success";
    };

    // Load the basic page first, then wait for the dynamic content to render
    this.loadPage = async function (pageLoadWait) {
        var status = await PH_PAGE.open(MAIN_URL);
        if (status == "success") {
            console.log("Page loaded...");
            if (pageLoadWait !== undefined && pageLoadWait !== null && pageLoadWait !== false) {
                let p = new Promise(function (res, rej) {
                    setTimeout(async function () {
                        console.log("Page After 5 Seconds");
                        PH_PAGE.render("new.png");
                        TOTAL_PAGES = await PH_PAGE.evaluate(function () {
                            return document.getElementsByClassName("flatten pagination useIconFonts")[0]
                                .textContent.match(/\d+/g)[1];
                        });
                        TOTAL_JOBS = await PH_PAGE.evaluate(function () {
                            return document.getElementsByClassName("jobCount")[0]
                                .textContent.match(/\d+/g)[0];
                        });
                        res({ p: TOTAL_PAGES, j: TOTAL_JOBS, s: true });
                    }, pageLoadWait);
                });
                return await p;
            }
        }
    };

    // evaluatePage() and evaluateJobsDetails() are used below but omitted from this snippet
}

function ScrapData(opts) {
    var scrap = new InitScrap();

    scrap.init("https://www.google.com/").then(function (init_res) {
        if (init_res == "success") {
            scrap.loadPage(opts.pageLoadWait).then(function (load_res) {
                console.log(load_res);
                if (load_res.s === true) {
                    scrap.evaluatePage().then(function (ev_page_res) {
                        console.log("Page Title : " + ev_page_res);
                        scrap.evaluateJobsDetails().then(function (ev_jobs_res) {
                            console.log(ev_jobs_res);
                        });
                    });
                }
            });
        }
    });

    return scrap;
}

module.exports = { ScrapData };

PHP scraping dynamically loaded content

You can use Chrome's network monitor to log the source of the Ajax requests and then request those URLs from your web scraper, but this really is a "makeshift API" and will break if the site changes its JSON format. You can use the PHP function json_decode to decode the JSON:

  • http://php.net/manual/en/function.json-decode.php

In order to first retrieve the data, you will have to use file_get_contents (a short sketch follows the link below):

  • http://php.net/manual/en/function.file-get-contents.php
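
For example, a minimal sketch; the endpoint URL is a placeholder for whatever you find in the network monitor:

<?php
// Hypothetical JSON endpoint discovered via the browser's network monitor
$url = 'http://example.com/ajax/data.json';

// Simple GET request; returns false on failure
$json = file_get_contents($url);

if ($json !== false) {
    // Decode into an associative array (second argument = true)
    $data = json_decode($json, true);
    print_r($data);
}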

But this will only allow GET requests.
If you want more "advanced" options (like POST), you will have to look into cURL (see the sketch after the link below):

  • http://php.net/manual/en/book.curl.php
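
Here is a rough sketch of a POST with cURL plus json_decode; the endpoint and the POST fields are placeholders for whatever the site actually sends:

<?php
$ch = curl_init('http://example.com/ajax/search');   // hypothetical endpoint

curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the body instead of printing it
curl_setopt($ch, CURLOPT_POST, true);                 // send a POST request
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'page' => 1,
    'q'    => 'keyword',
]));

$json = curl_exec($ch);
curl_close($ch);

// Decode the JSON response into an associative array
$data = json_decode($json, true);
print_r($data);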

PHP cURL to get dynamic content

It seems to me it is either a timeout or a problem with your regexp.

Why not stick to file_get_contents like you tried in the first place?

$content = file_get_contents('http://www.new-fresh-proxies.blogspot.com.au');

preg_match_all('/(\d+\.\d+\.\d+\.\d+(:\d+)?)/', $content, $matches);

print_r($matches[1]);

This will print out a list of IPs:

Array
(
[0] => 1.204.168.15:6673
[1] => 1.234.45.130:80
[2] => 1.34.163.101:8080
[3] => 1.34.29.89:8080
[4] => 1.34.8.221:3128
....

Hope that helps.

php script to get a dynamically loaded schedule of a website

You cannot scrape directly from the website page, because it looks like the website is using Ajax (I guess) to load the data onto the page.
So what I did was monitor the network activity on the page using the Chrome Developer Tools, and I found this API URL:

http://port.hu/tvapi?channel_id=tvchannel-3&i_datetime_from=2017-02-05&i_datetime_to=2017-02-10

It returns JSON, and the developer did not secure the API, so there is no need to scrape the page at all; just load the JSON from the API directly.
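
For example, a minimal sketch that loads the schedule from that API; the dates are just the ones from the URL above, and the exact response structure depends on the API:

<?php
// Query parameters taken from the API URL above; the dates are examples
$params = http_build_query([
    'channel_id'      => 'tvchannel-3',
    'i_datetime_from' => '2017-02-05',
    'i_datetime_to'   => '2017-02-10',
]);

$json = file_get_contents('http://port.hu/tvapi?' . $params);

if ($json !== false) {
    // Decode into an associative array; inspect it to see how the schedule is structured
    $schedule = json_decode($json, true);
    print_r($schedule);
}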

Web Scrape a Website at every visit

I stumbled over PHP cURL, which does what I want.

Reference: http://php.net/manual/en/book.curl.php


