How to Use Curl and PHP Simple HTML Dom Parser with Object

You're not creating the DOM object correctly. Do it like this:

// Create a DOM object
$dom = new simple_html_dom();
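// Load HTML from a string (this requires CURLOPT_RETURNTRANSFER to be set,
// otherwise curl_exec() prints the page and returns true)
$dom->load(curl_exec($ch));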

print_r( $dom );

Check the Manual for more details...

Edit

It seems to be a cURL settings problem. Please refer to the documentation to configure it correctly...

This is a function I usually use to download pages. Feel free to adjust it to your needs:

function dlPage($href) {

$curl = curl_init();
// Note: this disables certificate verification; leave it enabled for anything beyond quick tests
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
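$str = curl_exec($curl);
curl_close($curl);

// curl_exec() returns false on failure, so there is nothing to parse in that case
if ($str === false) {
return false;
}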

// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);

return $dom;
}

$url = 'http://www.example.com/';
$data = dlPage($url);
print_r($data);

PHP Simple HTML DOM and cURL Not Working

The curl function needed an additional setting, namely CURLOPT_FOLLOWLOCATION, and the function itself needs to return a value so that its results can be used. In the code below I return an object with both the response and the info, which lets you test the http_code before attempting to process the response data.
This uses the standard DOMDocument, but adapting it to simple_html_dom should be easy.

function curl_download( $url ) {

$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );/* NEW */
curl_setopt( $ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0" );
curl_setopt( $ch, CURLOPT_HEADER, 0 );
curl_setopt( $ch, CURLOPT_TIMEOUT, 10 );

$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);

return (object)array(
'response' => $output,
'info' => $info
);
}

$output = curl_download( 'http://www.digg.com' );
if( $output->info['http_code']==200 ){

libxml_use_internal_errors( true );

$dom=new DOMDocument;

$dom->preserveWhiteSpace = false;
$dom->validateOnParse = false;
$dom->standalone=true;
$dom->strictErrorChecking=false;
$dom->substituteEntities=true;
$dom->recover=true;
$dom->formatOutput=false;

$dom->loadHTML( $output->response );

libxml_clear_errors();

$xp=new DOMXPath( $dom );
$col=$xp->query('//div[@class="digg-story__kicker"]');
if( $col !== false && $col->length > 0 ){
foreach( $col as $node )echo $node->nodeValue;
}
} else {
echo '<pre>',print_r($output->info,true),'</pre>';
}

Updated answer to include the error mitigation code offered by libxml. Weirdly, though, the code as it was originally ran without issue locally before I added the libxml error handling code.

Without the CURLOPT_FOLLOWLOCATION set I get:

Array
(
[url] => http://www.digg.com
[content_type] => text/html
[http_code] => 301
[header_size] => 191
[request_size] => 79
[filetime] => -1
[ssl_verify_result] => 0
[redirect_count] => 0
[total_time] => 0.421
[namelookup_time] => 0.031
[connect_time] => 0.234
[pretransfer_time] => 0.234
[size_upload] => 0
[size_download] => 185
[speed_download] => 439
[speed_upload] => 0
[download_content_length] => 185
[upload_content_length] => 0
[starttransfer_time] => 0.421
[redirect_time] => 0
[certinfo] => Array
(
)
)

But with CURLOPT_FOLLOWLOCATION set to true, I get:

WE'VE SEEN BETTER ANIME TRIBUTE VIDEOS...<more>...RESIST THE URGE TO SUBTWEET A BAD APPLE

Parse output using PHP Simple HTML DOM parser

Try clear or close before the print_r (sorry, I forget which; just try one and then the other).

$html->clear();

$html->close();

If that doesn't work, and you know the element you are "find"ing is of a certain type, grab all elements of that type first and then search within them. That makes the search quicker. So if it's a <select>, find all of those first and then run the find on them, or skip find altogether and parse them yourself using foreach and strpos.

If that doesn't work either, try fetching your pages first with wget, and then parse them with Simple HTML DOM once you have saved them to your server.
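
The "narrow down first, then search inside" idea can be sketched as follows. This is only an illustration under assumptions: it uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, and the markup in $html is made up for the example.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<select name="size"><option>Small</option><option>Large</option></select>'
      . '<select name="colour"><option>Red</option></select>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$found = [];

// Step 1: grab every <select> once.
foreach ($dom->getElementsByTagName('select') as $select) {
    // Step 2: search only inside that <select>; the leading dot makes
    // the query relative to the context node passed as second argument.
    foreach ($xp->query('.//option', $select) as $option) {
        $found[$select->getAttribute('name')][] = $option->nodeValue;
    }
}

print_r($found);
```

Each inner query only walks the subtree of one <select>, which is the point of the advice above.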

Find div with class using PHP Simple HTML DOM Parser

The right code to get a div with class is:

$ret = $html->find('div.foo');
//OR
$ret = $html->find('div[class=foo]');

Basically, you can select elements as if you were using a CSS selector.

source: http://simplehtmldom.sourceforge.net/manual.htm

See the "How to find HTML elements?" section, "Advanced" tab.
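
For comparison, here is a minimal sketch of the same kind of lookup with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The sample markup is an assumption made up for the example. The XPath expression matches "foo" as a whole class token, which is roughly what the CSS selector div.foo means.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<div class="foo">first</div>'
      . '<div class="foo bar">second</div>'
      . '<div class="foobar">third</div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
// Pad the class attribute with spaces and look for " foo ", so that
// class="foo bar" matches but class="foobar" does not.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " foo ")]';

$texts = [];
foreach ($xp->query($query) as $div) {
    $texts[] = $div->nodeValue;
}

echo implode(', ', $texts), "\n";
```

A plain @class="foo" test would be simpler, but it would miss elements that carry more than one class.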

PHP/HTML DOM Parser - Fetch a Specific part from a text and then get another string

First, try to avoid simple_html_dom: in my experience it is the slowest parser around and not so simple. Take the time to learn how to use DOMDocument and DOMXPath (there are plenty of tutorials about XPath 1.0) to do the same kind of job. Once you have learned XPath for PHP, you can use it in a lot of other languages, since it is implemented almost everywhere.

The second step consists of extracting the JSON string and building a JSON object from it.

A general piece of advice: when the data in front of you is already in a structured format, working with that format is handier than a string-based approach.

$url = 'http://www.samplehost.com/samplepage.php';

// discard notices and warnings about badly formatted html
libxml_use_internal_errors(true);
$dom = new DOMDocument;
// or get the file content via curl and use $dom->loadHTML($content);
$dom->loadHTMLFile($url);

$xp = new DOMXPath($dom);
// '//' means everywhere in the DOM tree, 'script' is the target node,
// and [...] encloses conditions about this node:
// normalize-space is used here to trim leading spaces,
// the dot refers to the current node content
$qry = '//script[starts-with(normalize-space(.), "var colourVariantsInitialData")]';

// an xpath query returns a nodeList, to get the first (and unique here)
// item of the list, you need to use ->item(0)
$rawtxt = $xp->query($qry)->item(0)->nodeValue;

// extraction of the json string and creation of a json object
$jsonStart = strpos($rawtxt, '[');
$jsonEnd = strrpos($rawtxt, ']');

$collections = json_decode(substr($rawtxt, $jsonStart, $jsonEnd - $jsonStart + 1));

// Then you can easily extract what you want from the json object
echo "collection id: " . $collections[1]->ColVarId . "\n";

foreach ($collections[1]->SizeVariants as $item) {
printf("%-30s\t%s\n", $item->SizeName, $item->ProdSizePrices->SellPrice);
}
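
The extraction step can be tried in isolation. The sketch below assumes a made-up script body in $rawtxt (the real page content is not shown here) and adds guards for the cases where no brackets are found or the JSON does not decode.

```php
<?php
// Hypothetical script content, shaped like the variable assignment described above.
$rawtxt = 'var colourVariantsInitialData = [{"ColVarId":10},{"ColVarId":11}];';

// Locate the outermost square brackets of the JSON array.
$jsonStart = strpos($rawtxt, '[');
$jsonEnd   = strrpos($rawtxt, ']');

$collections = null;
if ($jsonStart !== false && $jsonEnd !== false && $jsonEnd > $jsonStart) {
    // Cut out "[...]" and decode it; json_decode() returns null on invalid JSON.
    $collections = json_decode(substr($rawtxt, $jsonStart, $jsonEnd - $jsonStart + 1));
}

if (!is_array($collections)) {
    echo "no usable JSON found\n";
} else {
    foreach ($collections as $collection) {
        echo "collection id: " . $collection->ColVarId . "\n";
    }
}
```

Checking the result with is_array() before indexing avoids errors when the page layout changes and the script no longer contains the expected array.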

simple html dom parser can't get correct value

Your getSslPage function returns a string (the html source code of the $url page).

Although the returned value is a string, you're treating it as an object by calling $html->find, which is why you get this error:

Fatal error: Call to a member function find() on a non-object

The Simple Html DOM Parser library has 2 functions to create a DOM object:

  • file_get_html - Create a DOM object from URL
  • str_get_html - Create a DOM object from string

Since you already have the HTML string, just change your code as follows:

$html = str_get_html($html);
$result = $html->find('.forumbox-header',0);
