Scraping a dynamically loading website with PHP cURL
You need a headless browser for this. You can use the PHP wrapper for PhantomJS, here is the link: http://jonnnnyw.github.io/php-phantomjs/. This will solve your problem. It has the following features:
- Load webpages through the PhantomJS headless browser
- View detailed response data including page content, headers, status code etc.
- Handle redirects
- View JavaScript console errors
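A minimal usage sketch, following the library's documented quick-start (the target URL is a placeholder; this requires the php-phantomjs package installed via Composer plus a PhantomJS binary, so treat it as an outline rather than a drop-in script):

```php
use JonnyW\PhantomJs\Client;

// Get a client instance (spawns the PhantomJS process under the hood).
$client = Client::getInstance();

// Build a GET request for the page and an empty response to fill.
$request  = $client->getMessageFactory()->createRequest('http://example.com', 'GET');
$response = $client->getMessageFactory()->createResponse();

// Send the request through the headless browser.
$client->send($request, $response);

if ($response->getStatus() === 200) {
    echo $response->getContent();   // the rendered page, after JavaScript ran
}
```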
Hope this helps.
How to scrape any type of website
Continuing my own research into scraping sites, I was unable to find a perfect solution, but the most powerful approach I came up with is to use the PhantomJS module with Node.js. You can find the module here.
For an installation guide, follow this documentation. PhantomJS is used asynchronously in Node.js, which makes it a lot easier to get at the results, and it is easy to interact with it using Express on the server side and Ajax or Socket.io on the client side to extend the functionality.
Below is the code I came up with:
const phantom = require('phantom');
const ev = require('events');
const event = new ev.EventEmitter();

var MAIN_URL, TOTAL_PAGES, TOTAL_JOBS, PAGE_DATA_COUNTER = 0,
    PAGE_COUNTER = 0, PAGE_JOBS_DETAILS = [], IND_JOB_DETAILS = [],
    JOB_NUMBER = 1, CURRENT_PAGE = 1, PAGE_WEIGHT_TIME, CLICK_NEXT_TIME,
    CURRENT_WEBSITE, CURR_WEBSITE_LINK, CURR_WEBSITE_NAME,
    CURR_WEBSITE_INDEX, PH_INSTANCE, PH_PAGE;

function InitScrap() {

    // Initiate the scraper
    this.init = async function (url) {
        MAIN_URL = url;
        PH_INSTANCE = await phantom.create();
        PH_PAGE = await PH_INSTANCE.createPage();
        console.log("Scraper initiated, please wait...");
        return "success";
    };

    // Load the basic page first
    this.loadPage = async function (pageLoadWait) {
        var status = await PH_PAGE.open(MAIN_URL);
        if (status === "success") {
            console.log("Page loaded...");
            if (pageLoadWait !== undefined && pageLoadWait !== null && pageLoadWait !== false) {
                let p = new Promise(function (res, rej) {
                    setTimeout(async function () {
                        console.log("Page after wait");
                        PH_PAGE.render("new.png");
                        TOTAL_PAGES = await PH_PAGE.evaluate(function () {
                            return document.getElementsByClassName("flatten pagination useIconFonts")[0]
                                .textContent.match(/\d+/g)[1];
                        });
                        TOTAL_JOBS = await PH_PAGE.evaluate(function () {
                            return document.getElementsByClassName("jobCount")[0]
                                .textContent.match(/\d+/g)[0];
                        });
                        res({ p: TOTAL_PAGES, j: TOTAL_JOBS, s: true });
                    }, pageLoadWait);
                });
                return await p;
            }
        }
    };

    // evaluatePage() and evaluateJobsDetails(), used below, are defined
    // the same way (omitted here).
}

function ScrapData(opts) {
    var scrap = new InitScrap();
    scrap.init("https://www.google.com/").then(function (init_res) {
        if (init_res == "success") {
            scrap.loadPage(opts.pageLoadWait).then(function (load_res) {
                console.log(load_res);
                if (load_res.s === true) {
                    scrap.evaluatePage().then(function (ev_page_res) {
                        console.log("Page title: " + ev_page_res);
                        scrap.evaluateJobsDetails().then(function (ev_jobs_res) {
                            console.log(ev_jobs_res);
                        });
                    });
                }
            });
        }
    });
    return scrap;
}

module.exports = { ScrapData };
PHP scraping dynamically loaded content
You can use Chrome's network monitor to log the source of the AJAX requests and then request those URLs from your web scraper. This really is a makeshift API, though, and it will break if the site changes its JSON format. You can use the PHP function json_decode to decode the JSON:
- http://php.net/manual/en/function.json-decode.php
To retrieve the data in the first place, you will have to use file_get_contents:
- http://php.net/manual/en/function.file-get-contents.php
but this only allows GET requests.
If you want more "advanced" options (like POST), you will have to look into cURL:
- http://php.net/manual/en/book.curl.php
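A minimal sketch of the two steps above, where the endpoint URL and POST fields are made-up placeholders (take the real ones from the Network tab), and the sample JSON payload is invented purely to demonstrate the decode step:

```php
<?php
// Replicate the page's AJAX call with cURL (URL and fields are placeholders).
function fetch_json($url, array $postFields) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body as a string
    curl_setopt($ch, CURLOPT_POST, true);             // the site POSTs its filters
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    $body = curl_exec($ch);                           // false on failure
    curl_close($ch);
    return $body;
}

// Decoding works the same on any JSON body; a made-up sample response:
$json = '{"jobs":[{"title":"Developer"},{"title":"Designer"}],"total":2}';
$data = json_decode($json, true);   // true => associative arrays, not stdClass
echo $data['total'];                // prints 2
echo $data['jobs'][0]['title'];     // prints Developer
```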
PHP cURL to get dynamic content
It seems to me like either a timeout or a problem with your regexp. Why not stick to file_get_contents, like you tried in the first place?
$content = file_get_contents('http://www.new-fresh-proxies.blogspot.com.au');
preg_match_all('/(\d+\.\d+\.\d+\.\d+(:\d+)?)/', $content, $matches);
print_r($matches[1]);
This will print out a list of IPs:
Array
(
[0] => 1.204.168.15:6673
[1] => 1.234.45.130:80
[2] => 1.34.163.101:8080
[3] => 1.34.29.89:8080
[4] => 1.34.8.221:3128
....
Hope that helps.
php script to get a dynamically loaded schedule of a website
You cannot scrape the data directly from the website page, because it looks like the website is using AJAX (I guess) to load the data onto the page.
So what I did was monitor the network activity on the page using Chrome Developer Tools, and I found this API URL:
http://port.hu/tvapi?channel_id=tvchannel-3&i_datetime_from=2017-02-05&i_datetime_to=2017-02-10
It returns JSON, and the developer did not secure the API. So there is no need to scrape anymore; just load the JSON from the API directly.
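Loading that API in PHP is then a one-liner plus a decode. A sketch, assuming the endpoint still answers (the fallback payload is made up so the decode step is demonstrable offline; the real response shape may differ):

```php
<?php
// Load the schedule straight from the JSON API the page itself calls.
$url = 'http://port.hu/tvapi?channel_id=tvchannel-3'
     . '&i_datetime_from=2017-02-05&i_datetime_to=2017-02-10';

$ctx  = stream_context_create(['http' => ['timeout' => 5]]);
$json = @file_get_contents($url, false, $ctx);   // false if the endpoint is gone

if ($json === false || !is_array(json_decode($json, true))) {
    // Made-up fallback sample; the real response shape may differ.
    $json = '{"channel_id":"tvchannel-3","programs":[]}';
}
$schedule = json_decode($json, true);   // associative array of the schedule
</imports>
```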
Web Scrape a Website at every visit
I stumbled upon PHP cURL, which does what I want.
Reference: http://php.net/manual/en/book.curl.php
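For reference, a minimal cURL GET helper of the kind that answer points at; the URL in the usage comment is a placeholder:

```php
<?php
// Fetch a page with cURL on each visit; returns the body, or false on failure.
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // don't hang the visit
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// Usage (placeholder URL):
// $html = fetch_page('https://www.example.com/schedule');
```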