How do I make a simple crawler in PHP?
Meh. Don't parse HTML with regexes.
Here's a DOM version inspired by Tatu's:
<?php
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= dirname($parts['path'], 1) . $path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}
crawl_page("http://hobodave.com", 2);
Edit: I fixed some bugs from Tatu's version (works with relative URLs now).
Edit: I added a new bit of functionality that prevents it from following the same URL twice.
Edit: echoing output to STDOUT now so you can redirect it to whatever file you want
Edit: Fixed a bug pointed out by George in his answer. Relative URLs will no longer be appended to the end of the URL path, but will overwrite it. Thanks to George for this. Note that George's answer doesn't account for any of: https, user, pass, or port. If you have the http PECL extension loaded, this is quite simply done using http_build_url. Otherwise, I have to glue it together manually using parse_url. Thanks again, George.
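The parse_url fallback above can also be pulled out into a standalone helper. This is only a sketch: the name resolve_href is mine, and it simplifies by treating every relative href as root-relative (dropping the dirname step), so it is not a full RFC 3986 resolver.

```php
<?php
// Hypothetical helper: resolve an href against a base URL using parse_url(),
// covering scheme, user, pass, host, and port like the fallback branch above.
// Simplification: every relative href is treated as root-relative.
function resolve_href($base, $href)
{
    // Absolute URLs pass through untouched
    if (0 === strpos($href, 'http')) {
        return $href;
    }
    $path = '/' . ltrim($href, '/');
    $parts = parse_url($base);
    $url = $parts['scheme'] . '://';
    if (isset($parts['user'], $parts['pass'])) {
        $url .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $url .= $parts['host'];
    if (isset($parts['port'])) {
        $url .= ':' . $parts['port'];
    }
    return $url . $path;
}
```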
Using a PHP web-crawler to find certain words without certain elements
Do you want to find all paragraphs/text that contain your given word?
<?php
include('simple_html_dom.php');

$html = file_get_html('https://adityadees.blogspot.com/');
$strings_array = array();

// search any (*) tag with the text "yang" in it
foreach ($html->find('*[plaintext*=yang]') as $element) {
    // take only elements that have no child nodes, i.e. the last ones in the recursion
    if ($element->firstChild() == null) {
        // there are still duplicate strings, so add only unique values to the array
        if (!in_array($element->innertext, $strings_array)) {
            $strings_array[] = $element->innertext;
        }
    }
}

echo '<pre>';
print_r($strings_array);
echo '</pre>';
?>
It isn't a final solution, but it's something to start with.
At least it finds the word "yang" 61 times - the same count as in the HTML source of the given page.
Web crawling using PHP
It may be a good idea to increase the interval dynamically every time an exception occurs and then try again, something like:
foreach ($listElements as $element) {
    echo "Running..";
    $article_url = $element->url;
    $article_page = new simple_html_dom();
    $interval = 0;
    $tries = 0;
    $success = false;
    while (!$success && $tries < 5) {
        try {
            sleep($interval);
            $article_page->load_file($article_url);
            $success = true;
        } catch (Exception $e) {
            // back off a little longer before the next attempt
            $interval += 20;
            $tries++;
        }
    }
    // only write the file once the page actually loaded
    if ($success) {
        $filename = "raw_file" . $file_num . ".txt";
        $file = fopen("C:\\xampp\\htdocs\\files\\" . $filename, "w");
        fwrite($file, $article_page);
        fclose($file);
        $file_num++;
    }
}
Basic web-crawling question: How to create a list of all pages on a website using php?
For the general approach, check out the answers to these questions:
- How to write a crawler?
- How to best develop web crawlers
- Is there a way to use PHP to crawl links?
In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match() to find <a href="">
tags and parse the URL out of them (see this question for some typical approaches).
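A naive regex pass as described might look like the sketch below. The sample HTML string is a placeholder, and this approach is brittle by nature, so treat it as a quick starting point only.

```php
<?php
// Rough sketch: extract href values from anchor tags with a regex.
// Regexes cannot handle all real-world HTML, so this is intentionally naive.
$html = '<p><a href="/about">About</a> <a href="http://example.com/">Home</a></p>';

preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);

print_r($matches[1]); // the captured href values
```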
Once you've extracted the raw href attribute, you could use parse_url() to break it into its components and figure out whether it's a URL you want to fetch - remember also that the URLs may be relative to the page you've fetched.
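As a sketch of that check: parse_url() on a relative href returns no "host" component, which is one simple way to tell the two cases apart (the example URLs here are placeholders).

```php
<?php
// Inspect an extracted href with parse_url(). A missing "host" component
// usually means the URL is relative to the page it was found on.
$href = '/docs/intro?lang=en';
$parts = parse_url($href);

if (!isset($parts['host'])) {
    echo "relative URL, path: ", $parts['path'], PHP_EOL;
} else {
    echo "absolute URL on host: ", $parts['host'], PHP_EOL;
}
```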
Though fast, a regex isn't the best way of parsing HTML - you could also try the DOM classes to parse the HTML you fetch, for example:
$dom = new DOMDocument();
$dom->loadHTML($content);
$anchors = $dom->getElementsByTagName('a');
if ($anchors->length > 0) {
    foreach ($anchors as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $url = $anchor->getAttribute('href');
            // now figure out whether to process this
            // URL and add it to a list of URLs to be fetched
        }
    }
}
Finally, rather than write it yourself, see also this question for other resources you could use.
- is there a good web crawler library available for PHP or Ruby?
Writing a PHP script that open website's pages and stores page's content in variable
Check out http://sourceforge.net/projects/php-crawler/
Or try this simple code that searches for the presence of the Google Analytics tracking code:
// Disable the time limit to keep the script running
set_time_limit(0);

// Domain to start crawling
$domain = "http://webdevwonders.com";
// Content to search for
$content = "google-analytics.com/ga.js";
// Tag in which to look for the content
$content_tag = "script";
// Name of the output file
$output_file = "analytics_domains.txt";
// Maximum number of URLs to check
$max_urls_to_check = 100;
$rounds = 0;

// Array to hold all domains to check
$domain_stack = array();
// Maximum size of the domain stack
$max_size_domain_stack = 1000;
// Hash to hold all domains already checked
$checked_domains = array();

// Loop through the domains as long as domains are available in the stack
// and the maximum number of URLs to check has not been reached
while ($domain != "" && $rounds < $max_urls_to_check) {
    $doc = new DOMDocument();
    // Get the source code of the domain
    @$doc->loadHTMLFile($domain);

    $found = false;
    // Loop through each found tag of the specified type in the DOM
    // and search for the specified content
    foreach ($doc->getElementsByTagName($content_tag) as $tag) {
        if (strpos($tag->nodeValue, $content) !== false) {
            $found = true;
            break;
        }
    }

    // Add the domain to the checked-domains hash
    $checked_domains[$domain] = $found;

    // Loop through each "a" tag in the DOM and add its href domain
    // to the domain stack if it is not an internal link
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (strpos($href, 'http://') === 0 && strpos($href, $domain) === false) {
            $href_array = explode("/", $href);
            // Keep the domain stack at the predefined maximum size
            // and only push domains that have not been checked yet
            if (count($domain_stack) < $max_size_domain_stack &&
                !isset($checked_domains["http://" . $href_array[2]])) {
                array_push($domain_stack, "http://" . $href_array[2]);
            }
        }
    }

    // Remove all duplicate URLs from the stack
    $domain_stack = array_unique($domain_stack);
    // Take the next domain, or stop when the stack is empty
    $domain = isset($domain_stack[0]) ? $domain_stack[0] : "";
    // Remove the assigned domain from the domain stack
    unset($domain_stack[0]);
    // Reindex the domain stack
    $domain_stack = array_values($domain_stack);
    $rounds++;
}

$found_domains = "";
// Append every domain where the specified search string
// was found to the found-domains string
foreach ($checked_domains as $key => $value) {
    if ($value) {
        $found_domains .= $key . "\n";
    }
}

// Write the found-domains string to the specified output file
file_put_contents($output_file, $found_domains);
I found it here.
simple PHP web crawler working on some, certain types of, pages
I'm not sure what you're using to download URLs.
I'd recommend using this:
http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading
I'm fairly sure Google uses 301 or 302 redirects from links in the search results. So you need your crawler to follow redirects. I assume this is the problem.
Using that class, you need to use the option: CURLOPT_FOLLOWLOCATION
See: http://php.net/manual/en/function.curl-setopt.php
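If you'd rather not pull in that class, the same option works with plain cURL. A minimal sketch (the wrapper name fetch_following_redirects is mine, not part of any library):

```php
<?php
// Fetch a URL with cURL, following 301/302 redirects via
// CURLOPT_FOLLOWLOCATION, which is the option discussed above.
function fetch_following_redirects($url, $max_redirects = 5)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,           // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,           // follow Location: headers
        CURLOPT_MAXREDIRS      => $max_redirects, // avoid redirect loops
    ));
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}
```

Note that following redirects requires open_basedir to be unset on older PHP versions.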
Further, if you are planning on scraping Google, you'll need a lot of sleeps and/or some good proxies. Google blocks automated queries. One way around this is to pay $100 for Google XML results via Google Custom Search.
PHP crawler for one special HTML element
Try this: you can use XPath to get your result.
$html = '<html>
<body>
    <div class="my"> One </div>
    <div class="my"> Two </div>
    <div class="my"> Three </div>
    <div class="other"> NO </div>
    <div class="other2"> NO </div>
</body>
</html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$tags = $xpath->query('//div[@class="my"]');
foreach ($tags as $tag) {
    $node_value = trim($tag->nodeValue);
    echo $node_value . "<br/>";
}