How to Make a Simple Crawler in PHP

How do I make a simple crawler in PHP?

Meh. Don't parse HTML with regexes.

Here's a DOM version inspired by Tatu's:

<?php
function crawl_page($url, $depth = 5)
{
    // Remember visited URLs across recursive calls
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        // Turn relative hrefs into absolute URLs
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                // Glue the URL together manually from its parts
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= isset($parts['path']) ? dirname($parts['path']) : '';
                $href .= $path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}
crawl_page("http://hobodave.com", 2);

Edit: I fixed some bugs from Tatu's version (works with relative URLs now).

Edit: I added a new bit of functionality that prevents it from following the same URL twice.

Edit: Echoing output to STDOUT now, so you can redirect it to whatever file you want.

Edit: Fixed a bug pointed out by George in his answer. Relative URLs will no longer be appended to the end of the URL path; they now overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded, this is quite simply done using http_build_url(). Otherwise, I have to glue the URL together manually using parse_url(). Thanks again, George.
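For reuse outside the crawler, that fallback gluing logic can be pulled into a standalone helper. A minimal sketch, assuming the http PECL extension is not available; the resolve_url() name and exact behavior are mine, not part of the original answer:

<?php
// Hypothetical helper: resolve a relative href against a base URL
// using only parse_url(), mirroring the fallback branch above.
function resolve_url($base, $href)
{
    // Already absolute? Return it as-is.
    if (0 === strpos($href, 'http')) {
        return $href;
    }
    $parts = parse_url($base);
    $url = $parts['scheme'] . '://';
    if (isset($parts['user'], $parts['pass'])) {
        $url .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $url .= $parts['host'];
    if (isset($parts['port'])) {
        $url .= ':' . $parts['port'];
    }
    $dir = isset($parts['path']) ? dirname($parts['path']) : '';
    return $url . rtrim($dir, '/') . '/' . ltrim($href, '/');
}

// Example: resolve_url('http://example.com/blog/post.html', 'about.html')
// yields 'http://example.com/blog/about.html'.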

Using a PHP web-crawler to find certain words without certain elements

Do you want to find all paragraphs/text that contain your given word?

<?php
include('simple_html_dom.php');

$html = file_get_html('https://adityadees.blogspot.com/');

$strings_array = array();

// Search any (*) tag whose text contains "yang"
foreach ($html->find('*[plaintext*=yang]') as $element) {
    // Keep only elements without child nodes,
    // i.e. the leaves of the DOM tree
    if ($element->firstChild() == null) {
        // There are still duplicate strings, so add only unique values
        if (!in_array($element->innertext, $strings_array)) {
            $strings_array[] = $element->innertext;
        }
    }
}

echo '<pre>';
print_r($strings_array);
echo '</pre>';

?>

It isn't a final solution, but it's something to start with.
At least it finds the word yang 61 times, the same count as in the HTML source of the given page.
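If you'd rather not depend on simple_html_dom, the same search can be sketched with PHP's built-in DOM extension and XPath. This is an alternative approach, not the code above; the URL and search word are just carried over from the example:

<?php
// Alternative sketch using the built-in DOM extension:
// find text nodes containing the word "yang".
$dom = new DOMDocument();
@$dom->loadHTMLFile('https://adityadees.blogspot.com/');

$xpath = new DOMXPath($dom);
// Select text nodes whose content contains the search word
$nodes = $xpath->query('//text()[contains(., "yang")]');

$strings = array();
foreach ($nodes as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '' && !in_array($text, $strings)) {
        $strings[] = $text;
    }
}

echo '<pre>';
print_r($strings);
echo '</pre>';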

Web crawling using PHP

Maybe it's a good idea to increase the interval dynamically every time an exception occurs and then try again, something like:

// $listElements and $file_num are assumed to be initialised earlier
foreach ($listElements as $element) {
    echo "Running..";
    $article_url = $element->url;
    $article_page = new simple_html_dom();
    $interval = 0;
    $tries = 0;
    $success = false;

    // Retry with a growing delay until the page loads or we give up
    while (!$success && $tries < 5) {
        try {
            sleep($interval);
            $article_page->load_file($article_url);
            $success = true;
        } catch (Exception $e) {
            // Back off a little longer before the next attempt
            $interval += 20;
            $tries++;
        }
    }

    // Only write the file once the page actually loaded
    if ($success) {
        $filename = "raw_file".$file_num.".txt";
        $file = fopen("C:\\xampp\\htdocs\\files\\".$filename, "w");
        fwrite($file, $article_page);
        fclose($file);
        $file_num++;
    }
}
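If several scripts need this pattern, the retry loop can be factored out. A minimal sketch; the fetch_with_backoff() name and its callable interface are illustrative, not part of the original snippet:

<?php
// Hypothetical helper: run a callable, retrying with a linearly
// growing delay whenever it throws, up to $max_tries attempts.
function fetch_with_backoff(callable $fetch, $max_tries = 5, $step = 20)
{
    $interval = 0;
    for ($try = 1; $try <= $max_tries; $try++) {
        try {
            sleep($interval);
            return $fetch();
        } catch (Exception $e) {
            $interval += $step; // wait longer before the next attempt
        }
    }
    throw new RuntimeException("Gave up after $max_tries attempts");
}

// Usage sketch with simple_html_dom:
// $page = fetch_with_backoff(function () use ($article_url) {
//     $dom = new simple_html_dom();
//     $dom->load_file($article_url);
//     return $dom;
// });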

Basic web-crawling question: How to create a list of all pages on a website using php?

For the general approach, check out the answers to these questions:

  • How to write a crawler?
  • How to best develop web crawlers
  • Is there a way to use PHP to crawl links?

In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match_all() to find <a href=""> tags and parse the URL out of them (see this question for some typical approaches).

Once you've extracted the raw href attribute, you could use parse_url() to break it into components and figure out whether it's a URL you want to fetch. Remember also that URLs may be relative to the page you've fetched.
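As a rough sketch of that naive approach (the regex and the example URL are illustrative only; as noted below, a regex is a fragile way to parse HTML):

<?php
// Fetch the page and naively extract href values with a regex.
$content = file_get_contents('http://example.com/');

preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $content, $matches);

foreach ($matches[1] as $href) {
    // Break the URL into components to decide whether to fetch it
    $parts = parse_url($href);
    if (!isset($parts['host'])) {
        // No host: a relative URL, to be resolved against the base page
        continue;
    }
    echo $parts['host'], ' => ', $href, PHP_EOL;
}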

Though fast, a regex isn't the best way of parsing HTML. You could instead try the DOM classes to parse the HTML you fetch, for example:

$dom = new DOMDocument();
$dom->loadHTML($content);

$anchors = $dom->getElementsByTagName('a');

if ($anchors->length > 0) {
    foreach ($anchors as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $url = $anchor->getAttribute('href');

            // Now figure out whether to process this URL
            // and add it to a list of URLs to be fetched
        }
    }
}

Finally, rather than writing it yourself, see also this question for other resources you could use:

  • is there a good web crawler library available for PHP or Ruby?

Writing a PHP script that opens a website's pages and stores each page's content in a variable

Check out http://sourceforge.net/projects/php-crawler/

Or try this simple code that searches for the presence of the Google Analytics tracking code:

// Disable the time limit to keep the script running
set_time_limit(0);
// Domain to start crawling
$domain = "http://webdevwonders.com";
// Content to search for
$content = "google-analytics.com/ga.js";
// Tag in which to look for the content
$content_tag = "script";
// Name of the output file
$output_file = "analytics_domains.txt";
// Maximum number of urls to check
$max_urls_to_check = 100;
$rounds = 0;
// Array to hold all domains to check
$domain_stack = array();
// Maximum size of the domain stack
$max_size_domain_stack = 1000;
// Hash to hold all domains already checked
$checked_domains = array();

// Loop through the domains as long as domains are available in the stack
// and the maximum number of urls to check has not been reached
while ($domain != "" && $rounds < $max_urls_to_check) {
    $doc = new DOMDocument();

    // Get the source code of the domain
    @$doc->loadHTMLFile($domain);
    $found = false;

    // Loop through each found tag of the specified type in the DOM
    // and search for the specified content
    foreach ($doc->getElementsByTagName($content_tag) as $tag) {
        if (strpos($tag->nodeValue, $content) !== false) {
            $found = true;
            break;
        }
    }

    // Add the domain to the checked domains hash
    $checked_domains[$domain] = $found;

    // Loop through each "a" tag in the DOM and push its href domain
    // onto the domain stack if it is not an internal link
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (strpos($href, 'http://') !== false && strpos($href, $domain) === false) {
            $href_array = explode("/", $href);
            // Keep the domain stack below the predefined maximum
            // and only push domains that have not been checked yet
            if (count($domain_stack) < $max_size_domain_stack &&
                !isset($checked_domains["http://".$href_array[2]])) {
                array_push($domain_stack, "http://".$href_array[2]);
            }
        }
    }

    // Remove all duplicate urls from the stack
    $domain_stack = array_unique($domain_stack);
    // Take the next domain from the stack (an empty string ends the loop)
    $domain = isset($domain_stack[0]) ? $domain_stack[0] : "";
    // Remove the assigned domain from the stack and reindex it
    unset($domain_stack[0]);
    $domain_stack = array_values($domain_stack);
    $rounds++;
}

$found_domains = "";
// Collect all domains where the search string was found
foreach ($checked_domains as $key => $value) {
    if ($value) {
        $found_domains .= $key."\n";
    }
}

// Write the found domains to the specified output file
file_put_contents($output_file, $found_domains);

I found it here.

simple PHP web crawler working on some, certain types of, pages

I'm not sure what you're using to download URLs.

I'd recommend using this:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

I'm fairly sure Google uses 301 or 302 redirects for links in its search results, so your crawler needs to follow redirects. I assume this is the problem.

Using that class, you need to use the option: CURLOPT_FOLLOWLOCATION

See: http://php.net/manual/en/function.curl-setopt.php
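If you just want to see the option in action without pulling in the class, here is a minimal plain-cURL sketch (the URL is a placeholder):

<?php
// Minimal plain-cURL fetch that follows 301/302 redirects.
$ch = curl_init('http://example.com/some-redirecting-link');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow Location: headers
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);         // avoid redirect loops
$body = curl_exec($ch);

if ($body === false) {
    echo 'cURL error: ', curl_error($ch), PHP_EOL;
}
curl_close($ch);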

Further, if you are planning on scraping Google, you'll need a lot of sleeps and/or some good proxies. Google blocks automated queries. One way around this is to pay $100 for Google XML results via Google Custom Search.
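As a crude illustration of the kind of throttling that helps (the delay range, the $queries list, and fetch_results() are all hypothetical):

// Throttling sketch: pause a random 5-15 seconds between queries
foreach ($queries as $q) {   // $queries: hypothetical list of search URLs
    fetch_results($q);       // hypothetical fetch function
    sleep(rand(5, 15));
}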

PHP crawler for one special HTML element

Try this: you can use XPath to get your result.

$html = '<html>
<body>
    <div class="my"> One </div>
    <div class="my"> Two </div>
    <div class="my"> Three </div>
    <div class="other"> NO </div>
    <div class="other2"> NO </div>
</body>
</html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Select every div whose class attribute is exactly "my"
$tags = $xpath->query('//div[@class="my"]');
foreach ($tags as $tag) {
    $node_value = trim($tag->nodeValue);
    echo $node_value."<br/>";
}

