How to Detect Search Engine Bots With PHP

How to detect search engine bots with PHP?

Here's a Search Engine Directory of Spider names

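Then use $_SERVER['HTTP_USER_AGENT'] to check whether the user agent matches one of those spiders.

// stripos() is case-insensitive; isset() avoids a warning when no user-agent header was sent
if (isset($_SERVER['HTTP_USER_AGENT']) && stripos($_SERVER['HTTP_USER_AGENT'], 'googlebot') !== false)
{
    // what to do
}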

How can one detect a crawler / spider using PHP?

According to Verifying Googlebot:

You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For example:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).

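You can do a reverse DNS lookup, then confirm it with a forward lookup:

function validateGoogleBotIP($ip) {
    // Reverse lookup, e.g. "crawl-66-249-66-1.googlebot.com"; returns false on malformed input
    $hostname = gethostbyaddr($ip);
    if (!$hostname || !preg_match('/\.google(bot)?\.com$/i', $hostname)) {
        return false;
    }
    // Forward lookup: the name must resolve back to the same IPv4 address
    return gethostbyname($hostname) === $ip;
}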

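// Fall back to an empty string when the client sent no user-agent header
$userAgent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (strpos($userAgent, 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}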

How to recognize bots with PHP?

You should filter by user-agent strings. You can find a list of about 300 common user-agents given by bots here: http://www.robotstxt.org/db.html. Checking each request against that list and skipping bot user-agents before you run your SQL statement should solve your problem for all practical purposes.
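A minimal sketch of that filtering step, assuming a small hand-picked signature list rather than the full robotstxt.org database. The function name isLikelyBot, the $signatures array, and the commented-out logVisit() call are hypothetical names used only for illustration, and treating a missing user-agent as a bot is an assumption you may not want.

```php
<?php
// Hypothetical helper: true when the user-agent contains any known bot signature.
function isLikelyBot(?string $userAgent, array $signatures): bool
{
    if ($userAgent === null || $userAgent === '') {
        // Assumption: a request with no user-agent is treated as a bot.
        return true;
    }
    foreach ($signatures as $signature) {
        // stripos() does a case-insensitive substring search.
        if (stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

// Illustrative subset only; a real list would come from a maintained source.
$signatures = ['googlebot', 'bingbot', 'slurp', 'baiduspider'];

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? null;
if (!isLikelyBot($userAgent, $signatures)) {
    // Run the SQL statement only for visitors that do not look like bots.
    // logVisit($userAgent); // hypothetical function, not defined here
}
```

Keeping the check in one function means the signature list can grow without touching the code that writes to the database.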

If you don't want the search engines to even reach the page, use a basic robots.txt file to block them.
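A basic robots.txt along those lines might look like the following. The /stats.php path is a placeholder for whichever page should stay uncrawled, and only well-behaved crawlers honor the file.

```text
User-agent: *
Disallow: /stats.php
```

The file has to be served from the root of the site (for example /robots.txt) for crawlers to find it.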

How to detect search engine visits on my site, like phpBB does?

You can go by either IP addresses or the 'User-Agent' string that the bot or web browser sends you.

When Googlebot (or most other well-behaved robots) visits your website, it sends a User-Agent header that identifies it, which PHP exposes as $_SERVER['HTTP_USER_AGENT']. Some examples are:

Googlebot/2.1 (+http://www.google.com/bot.html)

NutchCVS/0.8-dev (Nutch; http://lucene.apache.org/nutch/bot.html

Baiduspider+(+http://www.baidu.com/search/spider_jp.html)

Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/531.4 (KHTML, like Gecko)


You could then use PHP to examine those user-agent strings and determine if the user is a search engine or not. I use something like this often:

$searchengines = array(
    'Googlebot',
    'Slurp',
    'search.msn.com',
    'nutch',
    'simpy',
    'bot',
    'ASPSeek',
    'crawler',
    'msnbot',
    'Libwww-perl',
    'FAST',
    'Baidu',
);
// Lowercase the user-agent once; an empty string means none was sent
$userAgent = !empty($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
$is_se = false;
foreach ($searchengines as $searchengine) {
    if ($userAgent !== '' && strpos($userAgent, strtolower($searchengine)) !== false) {
        $is_se = true;
        break;
    }
}
if ($is_se) { print("It's a search engine!"); }

Remember that no detection method (Google Analytics, another statistics package, or otherwise) is going to be 100% accurate. Some web browsers let you set a custom user-agent string, and some misbehaving web crawlers may not send a user-agent string at all. This method is probably effective for 95%+ of crawlers/visitors, though.

Detect if a page is visited by a bot

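Well, after some digging on Google I found this.

// Body of a function; $bots is an array of bot name substrings defined elsewhere.
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? strtolower($_SERVER['HTTP_USER_AGENT']) : '';
foreach ($bots as $name => $bot)
{
    if (stripos($agent, $bot) !== false)
    {
        return true;
    }
}
// Return false only after every entry has been checked, not on the first mismatch.
return false;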

Thanks for the support Dale!!


