How to identify a web crawler?
There are two general ways to detect robots and I would call them "Polite/Passive" and "Aggressive". Basically, you have to give your web site a psychological disorder.
Polite
These are ways to politely tell crawlers that they shouldn't crawl your site and to limit how often they do. Politeness is ensured through a robots.txt file, in which you specify which bots (if any) should be allowed to crawl your website and how often it can be crawled. This assumes that the robot you're dealing with is polite.
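As an illustration, a robots.txt that keeps crawlers out of one directory, asks for a delay between requests (Crawl-delay is honored by some crawlers but not all; Googlebot ignores it), and bans one bot entirely could look like this (the paths and the bot name are placeholders):

```
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: BadBot
Disallow: /
```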
Aggressive
Another way to keep bots off your site is to get aggressive.
User Agent
Some aggressive behavior includes (as previously mentioned by other users) filtering on user-agent strings. This is probably the simplest, but also the least reliable, way to detect whether a visitor is a bot. A lot of bots tend to spoof user agents, and some do it for legitimate reasons (e.g. they only want to crawl mobile content), while others simply don't want to be identified as bots. Even worse, some bots spoof legitimate/polite bot agents, such as the user agents of Google, Microsoft, Lycos and other crawlers which are generally considered polite. Relying on the user agent can be helpful, but not by itself.
There are more aggressive ways to deal with robots that spoof user agents AND don't abide by your robots.txt file:
Bot Trap
I like to think of this as a "Venus Fly Trap," and it basically punishes any bot that wants to play tricks with you.
A bot trap is probably the most effective way to find bots that don't adhere to your robots.txt file, without actually impairing the usability of your website. Creating a bot trap ensures that only bots are captured, not real users. The basic way to do it is to set up a directory which you specifically mark as off limits in your robots.txt file, so any polite robot will not fall into the trap. The second thing you do is place a "hidden" link from your website to the bot trap directory (this ensures that real users will never go there, since real users never click invisible links). Finally, you ban any IP address that visits the bot trap directory.
Here are some instructions on how to achieve this:
Create a bot trap (or in your case: a PHP bot trap).
Note: of course, some bots are smart enough to read your robots.txt file, see all the directories which you've marked as "off limits" and STILL ignore your politeness settings (such as crawl rate and allowed bots). Those bots will probably not fall into your bot trap despite the fact that they are not polite.
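The ban-the-trap-visitors bookkeeping can be sketched as follows (the trap path, the in-memory storage, and the request shape are assumptions; a real implementation would persist the ban list and hook into your web server or firewall):

```javascript
// Bot-trap bookkeeping (a sketch): any IP that requests the trap
// directory is banned, and subsequent requests from it are blocked.
const TRAP_PATH = "/trap/"; // also listed as "Disallow: /trap/" in robots.txt
const bannedIps = new Set();

function handleRequest(path, ip) {
  if (path.startsWith(TRAP_PATH)) {
    bannedIps.add(ip); // only bots ignoring robots.txt end up here
    return "banned";
  }
  return bannedIps.has(ip) ? "blocked" : "ok";
}
```

The hidden link on your pages would point at the trap directory (e.g. an anchor styled `display:none`), so no real user ever triggers it.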
Violent
I think this is actually too aggressive for the general audience (and general use), so if there are any kids under the age of 18, then please take them to another room!
You can make the bot trap "violent" by simply not specifying a robots.txt file. In this situation ANY BOT that crawls the hidden links will probably end up in the bot trap and you can ban all bots, period!
The reason this is not recommended is that you may actually want some bots to crawl your website (such as Google, Microsoft or other bots for site indexing). Allowing your website to be politely crawled by the bots from Google, Microsoft, Lycos, etc. will ensure that your site gets indexed and it shows up when people search for it on their favorite search engine.
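If you want that middle road, a whitelist-style robots.txt that invites specific crawlers and disallows everyone else might look like this (an empty Disallow means "nothing is disallowed"; verify each crawler's exact token against that engine's documentation):

```
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
```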
Self Destructive
Yet another way to limit what bots can crawl on your website is to serve CAPTCHAs or other challenges which a bot cannot solve. This comes at the expense of your users, and I would think that anything which makes your website less usable (such as a CAPTCHA) is "self destructive." This, of course, will not actually block a bot from repeatedly trying to crawl your website; it will simply make your website very uninteresting to it. There are ways to "get around" CAPTCHAs, but they're difficult to implement, so I'm not going to delve into this too much.
Conclusion
For your purposes, probably the best way to deal with bots is to employ a combination of the above mentioned strategies:
- Filter user agents.
- Set up a bot trap (the violent one).
Catch all the bots that fall into the violent bot trap and simply blacklist their IPs (but don't block them outright). This way you still get the "benefits" of being crawled by bots, but you don't have to pay to inspect requests from IPs that were blacklisted for visiting your bot trap.
How to tell if a web request is coming from Google's crawler?
I have captured a Google crawler request in my ASP.NET application, and here's what the signature of the Google crawler looks like:
Requesting IP: 66.249.71.113
Client: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
My logs show many different IPs for the Google crawler in the 66.249.71.* range. All these IPs are geo-located in Mountain View, CA, USA.
A nice way to check whether the request is coming from the Google crawler is to verify that the user agent contains Googlebot and http://www.google.com/bot.html. As I said, many different IPs are observed for the same requesting client, so I would not recommend checking IPs; verifying the client identity (the user agent) is the more practical option.
Here's a sample code in C#:
if (Request.UserAgent != null &&
    (Request.UserAgent.ToLower().Contains("googlebot") ||
     Request.UserAgent.ToLower().Contains("google.com/bot.html")))
{
    // Yes, it's the Google bot.
}
else
{
    // No, it's something else.
}
It's important to note that any HTTP client can easily fake this.
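Because the user agent is spoofable, the check Google itself documents is DNS-based: do a reverse DNS lookup on the requesting IP, confirm the hostname ends in googlebot.com or google.com, then do a forward lookup on that hostname and confirm it resolves back to the same IP. The hostname-suffix part of that check can be sketched as follows (the DNS lookups themselves are left out; in Node you would use the dns module):

```javascript
// Partial sketch of Googlebot verification: checks only the hostname
// suffix. A full check would first reverse-DNS the IP to get the
// hostname, then forward-resolve it and confirm it matches the IP.
function isGoogleHostname(hostname) {
  // Match on a dot boundary so "fakegooglebot.com" does not pass.
  return /\.googlebot\.com$|\.google\.com$/i.test(hostname);
}
```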
How to identify web crawlers?
Here is a list of user agents: http://www.user-agents.org/ (found here: https://webmasters.stackexchange.com/questions/3264/where-can-i-find-a-list-of-search-engine-crawler-user-agents-and-their-domain-na)
If that list is too large, you could implement a Bloom filter (a memory-efficient data structure for "does this exist?" tests).
About whitelisting and good practices, this may also interest you: https://meta.stackexchange.com/questions/37231/why-does-the-stack-overflow-sitemap-xml-use-a-user-agent-whitelist-instead-of-a
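The Bloom filter idea can be sketched as follows (the bit-array size, hash count, and hash function are illustrative choices, not tuned values; remember a Bloom filter can return false positives but never false negatives):

```javascript
// Toy Bloom filter for membership tests on known crawler user agents.
class BloomFilter {
  constructor(m = 1024, k = 3) {
    this.m = m;               // number of bits
    this.k = k;               // number of hash functions
    this.bits = new Uint8Array(m);
  }
  // Seeded FNV-1a-style hash; good enough for a sketch.
  hash(str, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = (h * 16777619) >>> 0;
    }
    return (h >>> 0) % this.m;
  }
  add(str) {
    for (let s = 0; s < this.k; s++) this.bits[this.hash(str, s)] = 1;
  }
  mightContain(str) {
    for (let s = 0; s < this.k; s++) {
      if (!this.bits[this.hash(str, s)]) return false; // definitely absent
    }
    return true; // probably present
  }
}
```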
How to detect search engine bots with PHP?
Here's a Search Engine Directory of Spider names. Then you check $_SERVER['HTTP_USER_AGENT'] to see whether the agent is one of those spiders:
if (stripos($_SERVER['HTTP_USER_AGENT'], 'googlebot') !== false)
{
    // what to do
}
Identifying web crawlers
Request.Browser.Crawler is sadly out of date. You can, however, manually add detection of other user agents as bots. Use the browser element rather than browserCaps, which is deprecated as of .NET 2.0.
Example:
<browsers>
  <browser id="Googlebot" parentID="Mozilla">
    <identification>
      <userAgent match="^Googlebot(\-Image)?/(?'version'(?'major'\d+)(?'minor'\.\d+)).*" />
    </identification>
    <capabilities>
      <capability name="crawler" value="true" />
    </capabilities>
  </browser>
  <!-- ... additional browser definitions ... -->
</browsers>
This must be saved with a .browser extension under the App_Browsers directory in your application.
(List of Regexes to Match)
Detect Search Crawlers via JavaScript
This is the regex the Ruby agent_orange library uses to test whether a userAgent looks like a bot. You can narrow it down for specific bots by referencing the bot userAgent list here:
/bot|crawler|spider|crawling/i
For example, if you have some object util.browser, you can store what type of device a user is on:
util.browser = {
  bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
  mobile: ...,
  desktop: ...
};