How to Get Content of Remote HTML Page

Use cURL to fetch the HTML from the remote URL.

$url = "http://www.example.com";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
$output = curl_exec($curl);
curl_close($curl);
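curl_exec() returns false on failure when CURLOPT_RETURNTRANSFER is set, so it is worth checking the result before parsing it. A minimal sketch of that check:

$url = "http://www.example.com";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($curl);

if ($output === false) {
    // the request failed; curl_error() explains why
    die('cURL error: ' . curl_error($curl));
}
curl_close($curl);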

Then use PHP's DOMDocument (the DOM extension) to parse the HTML.

For example, to fetch all <h1> tags from the source:

$DOM = new DOMDocument;
$DOM->loadHTML($output);

// get all H1 elements
$items = $DOM->getElementsByTagName('h1');

// display the text of each H1
for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . "<br/>";
}
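Real-world HTML is rarely perfectly valid, so loadHTML() will often emit warnings. A small sketch that collects them quietly with libxml_use_internal_errors() instead of letting PHP print them:

libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
$DOM = new DOMDocument;
$DOM->loadHTML($output);
libxml_clear_errors();            // discard the collected warnings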

Get HTML content of remote page with JS

First of all, your web server needs to return the Access-Control-Allow-Origin HTTP header (you can learn more about this under CORS). Access-Control-Allow-Origin: * allows all websites to access your web server. Then, you can use XMLHttpRequest:

function onReceive() {
    console.log(this.responseText);
}

const req = new XMLHttpRequest();
req.addEventListener("load", onReceive);
req.open("GET", "https://quarantine.country/coronavirus/cases/usa/");
req.send();

Edit: if you do not control the quarantine.country website, this is not possible without their cooperation or a web server of your own to proxy the request.
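If the server you do control runs PHP, a minimal sketch of what sending that header could look like (the wildcard * here is only an illustration; in practice restrict it to the origins you actually want to allow):

<?php
// send the CORS header before any output so the browser will let scripts read the response
header('Access-Control-Allow-Origin: *');
echo '<p>This response can now be read by XMLHttpRequest from any origin.</p>';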

Get Content of Remote HTML page

First, you can skip the cURL part. DOMDocument has the method loadHTMLFile(), which can load even remote HTML files. Just use:

$DOM = new DOMDocument();
$DOM->loadHTMLFile($url);
// If the remote page might not be valid against HTML standards,
// you might want to use the "silence operator" @:
@$DOM->loadHTMLFile($url);

If you want to select an element by its attribute value, use XPath:

$selector = new DOMXPath($DOM);
$element = $selector->query('//td[@headers="superHero"]')->item(0);
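item(0) returns null when nothing matched, so a short usage follow-up could look like this (superHero is just the example attribute value from above):

if ($element !== null) {
    echo $element->nodeValue; // text content of the matched cell
}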

Get content from remote HTML page

Find the script tags in the HTML source with DOMDocument, then pick out the variable declarations with a regex:

$DOM = new DOMDocument();
$DOM->loadHTML($output);

$res = [];
$scripts = $DOM->getElementsByTagName('script');
$lnt = $scripts->length;
for ($i = 0; $i < $lnt; $i++) {
    // match declarations like "var foo = 123;" inside each script block
    preg_match_all('/var\s+(\w+)\s*=\s*(\d+)\s*;/', $DOM->saveHTML($scripts->item($i)), $m);
    $res = array_merge($res, array_combine($m[1], $m[2]));
}
print_r($res);

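To illustrate the kind of markup this targets, a hypothetical page source and the result the loop above would produce:

$output = '<html><body><script>var total = 42; var visible = 1;</script></body></html>'; // hypothetical input
// running the loop above against this $output prints:
// Array ( [total] => 42 [visible] => 1 )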

PHP - display remote page's content in full

Since only Nikolay Ganovski offered a solution, I wrote code that turns a partial page into a full one by looking for incomplete css/img/form URLs and completing them. In case someone needs it, find the code below:

// finalizes a remote page by completing incomplete css/img/form URLs
// (path/file.css becomes http://somedomain.com/path/file.css, etc.)
// requires simple_html_dom (str_get_html)
function finalize_remote_page($content, $root_url)
{
    // ignore schemes, in case the URL provided by the user was http://domain.com
    // while the URL in the source is https://domain.com (or vice-versa)
    $root_url_without_scheme = preg_replace('/(?:https?:\/\/)?(?:www\.)?(.*)\/?$/i', '$1', $root_url);

    $content_object = str_get_html($content);
    if (is_object($content_object))
    {
        foreach ($content_object->find('link[rel=stylesheet]') as $entry) // find css
        {
            // ignore protocol-relative URLs like //domain.com and URLs already on the root domain
            if (substr($entry->href, 0, 2) != "//" && stristr($entry->href, $root_url_without_scheme) === FALSE)
            {
                $entry->href = $root_url.$entry->href;
            }
        }

        foreach ($content_object->find('img') as $entry) // find img
        {
            if (substr($entry->src, 0, 2) != "//" && stristr($entry->src, $root_url_without_scheme) === FALSE)
            {
                $entry->src = $root_url.$entry->src;
            }
        }

        foreach ($content_object->find('form') as $entry) // find form
        {
            if (substr($entry->action, 0, 2) != "//" && stristr($entry->action, $root_url_without_scheme) === FALSE)
            {
                $entry->action = $root_url.$entry->action;
            }
        }
    }

    return $content_object;
}
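A short usage sketch, assuming simple_html_dom is available and the page was fetched as shown in the earlier cURL example (somedomain.com is a placeholder):

require_once('simple_html_dom.php');

$root_url = 'http://somedomain.com/';       // hypothetical site
$content  = file_get_contents($root_url);   // or fetch it with cURL as above
$page     = finalize_remote_page($content, $root_url);

if (is_object($page)) {
    echo $page->save(); // dump the fixed-up DOM back out as HTML
}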

Using jQuery, how can I get specific text from a remote HTML file?

Do it the same way jQuery does with .load:

$.get('/article.html', function(data) {
    console.log( $("<div>").html(data).find(".copy").text() );
});

or even

$("#myEl").load("/article.html .copy");

How to get the content of a remote page with JavaScript?

The same-origin policy is going to get in your way.

1) Proxy through your server: browser -> your server -> their server -> your server -> browser (a small PHP sketch follows below).

2) Use Flash or Silverlight. The third party has to give you access. The bridge between JavaScript and Flash isn't great for large amounts of data, and there are bugs. Silverlight isn't as ubiquitous as Flash...

3) Use a <script> tag. This really isn't safe... It only works if the third-party content is valid JavaScript.
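For option 1, a minimal sketch of a same-origin proxy in PHP (proxy.php is a hypothetical name; in practice, whitelist the URLs you are willing to fetch rather than proxying arbitrary input):

<?php
// proxy.php - the browser requests this from your own domain,
// so the same-origin policy is not an issue on the client side
$url = 'http://www.example.com/'; // hard-coded target

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($curl);
curl_close($curl);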

Pull HTML content from remote website and display on page

Extracting a fragment of HTML from a website is a breeze with simple_html_dom; you can then do something like:

function pullRaspi_SDImageTable() {
$filename = '/tmp/downloads.html'; // Where you want to cache the result
$expiry = 600; // 10 minutes
$output = '';

if (!file_exists($filename) || time() - $expiry > filemtime($filename)) {
// There is no cache, so fetch the results from remote server
require_once('simple_html_dom.php');
$html = file_get_html('http://www.raspberrypi.org/downloads');
foreach($html->find('div.entry-content table.table') as $elem) {
$output .= (string)$elem;
}

// Store the cache
file_put_contents($filename, $output);
} else {
// Pull the content from the cache
$output = file_get_contents($filename);
}

return $output;
}

This will give you the table.table HTML.
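A short usage sketch, for example in the template where the table should appear:

echo pullRaspi_SDImageTable(); // prints the cached (or freshly fetched) table HTML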

Including remote HTML file

Try cURL:

http://php.net/manual/en/book.curl.php

Basic example:

$ch = curl_init("http://www.example.com/");
$fp = fopen("example_homepage.txt", "w");

curl_setopt($ch, CURLOPT_FILE, $fp);   // write the response straight to the file
curl_setopt($ch, CURLOPT_HEADER, 0);   // do not include response headers in the output

curl_exec($ch);
curl_close($ch);
fclose($fp);

Ref: http://php.net/manual/en/curl.examples-basic.php
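To show the downloaded page in your own output afterwards, a small follow-up sketch:

readfile("example_homepage.txt"); // stream the saved copy straight to the browser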


