Tell Bots Apart from Human Visitors for Stats

Tell bots apart from human visitors for stats?

Humans and bots do similar things, but bots also do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful: if a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody browsing around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling one should be fine, and even if it stays stale for the next 10 years, it will still be helpful.
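As a rough sketch of that auto-pull idea (the list URL, cache path, and substring-matching rule below are placeholders, not a specific feed or recommendation):

<?php
// Refresh a cached list of bot user-agent substrings once a day.
// The source URL is hypothetical; substitute any maintained list.
$cacheFile = '/tmp/bot-agents.txt';
if (!file_exists($cacheFile) || filemtime($cacheFile) < time() - 86400) {
    $list = @file_get_contents('https://example.com/bot-user-agents.txt');
    if ($list !== false) {
        file_put_contents($cacheFile, $list);
    }
}

// Substring-match the current visitor against the cached list.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$isBot = false;
if (is_readable($cacheFile)) {
    foreach (file($cacheFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $agent) {
        $agent = trim($agent);
        if ($agent !== '' && stripos($ua, $agent) !== false) {
            $isBot = true;
            break;
        }
    }
}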

Some have already mentioned JavaScript and image loading, but Googlebot will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link: link to a page in a way that a human user can never see. If that link gets followed, we've got a bot.
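A sketch of that invisible-link honeypot; the /trap.php path, the connection details, and the bot_ips table are all made-up names for illustration:

<!-- In every page: a link no human will see or click -->
<a href="/trap.php" style="display:none">&nbsp;</a>

<?php
// trap.php: anything requesting this URL gets flagged as a bot.
// Assumed table: bot_ips (ip VARCHAR(45) PRIMARY KEY, first_seen TIMESTAMP).
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$pdo->prepare('INSERT IGNORE INTO bot_ips (ip, first_seen) VALUES (?, NOW())')
    ->execute([$_SERVER['REMOTE_ADDR']]);
http_response_code(404); // give the crawler nothing worth indexing

Hiding the link with CSS is the whole trick: a human can't follow what they can't see, so any hit on the trap URL is machine traffic.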

Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, so we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS file into our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a table (probably in-memory) of loads by IP and do a "not contained in" match, but that should be a really solid tell.
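To illustrate, assuming a hypothetical css.php logging endpoint, a css_loads table, and made-up connection details. In robots.txt, exclude the dummy stylesheet URL:

User-agent: *
Disallow: /css.php?f=dummy

Link both stylesheets from your pages (href="/css.php?f=real" and href="/css.php?f=dummy") and serve them through the logger:

<?php
// css.php?f=real|dummy: serve the stylesheet and record who fetched it.
// Assumed table: css_loads (ip VARCHAR(45), file VARCHAR(20), seen TIMESTAMP).
$file = (($_GET['f'] ?? 'real') === 'dummy') ? 'dummy.css' : 'real.css';
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$pdo->prepare('INSERT INTO css_loads (ip, file, seen) VALUES (?, ?, NOW())')
    ->execute([$_SERVER['REMOTE_ADDR'], $file]);
header('Content-Type: text/css');
readfile($file);

The "not contained in" match is then a query along these lines:

SELECT DISTINCT ip FROM css_loads WHERE file = 'real.css'
AND ip NOT IN (SELECT ip FROM css_loads WHERE file = 'dummy.css');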

So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, and add anything that loads the "real" CSS but ignores the robots.txt-excluded CSS. Maybe add all the robots.txt downloaders as well. Filter by user-agent string as the last step, and consider using that filter to do a quick stats analysis and see how strongly the other methods are working for identifying things we know are bots.
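Tying it together, a minimal sketch of the final lookup, assuming the bot_ips table from above and an arbitrary 30-day freshness window (neither is part of the original answer):

<?php
// Treat an IP as a bot if any signal flagged it within the last 30 days.
function is_known_bot(PDO $pdo, string $ip): bool
{
    $stmt = $pdo->prepare(
        'SELECT 1 FROM bot_ips WHERE ip = ? AND first_seen > NOW() - INTERVAL 30 DAY'
    );
    $stmt->execute([$ip]);
    return (bool) $stmt->fetchColumn();
}

// In the stats pipeline: only count visitors that aren't flagged.
if (!is_known_bot($pdo, $_SERVER['REMOTE_ADDR'])) {
    // record this page view as human traffic
}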

Identify visit from robot

Read the User-Agent string:

if (preg_match('/robot|spider|crawler|curl|^$/i', $_SERVER['HTTP_USER_AGENT'] ?? ''))
{
    echo 'Is bot or spider or crawler or curl or not human';
}
else
{
    echo 'Is human';
}

You can find a list of about 300 common user-agents used by bots here: http://www.robotstxt.org/db.html

HTML5 Storage Against Crawlers and Bots

I used a fetch-and-render tool (linked below) to render the page as Googlebot, and the result is that Googlebot supports HTML5 Storage.

The code to test storage support: https://codepen.io/gab/pen/AxFoB

The pen uses this check to detect support:

/* Detect whether the browser supports web storage */
if (typeof(Storage) !== 'undefined') {
    $('#yay').fadeIn('slow');
} else {
    $('#ooh').fadeIn('slow');
}

The tool to fetch and render as a bot:
https://technicalseo.com/seo-tools/fetch-render/

The result of the render (screenshot omitted from this copy).

Detecting if your site is being accessed by a robot

Presenting different content to search engine crawlers and human visitors, known as cloaking, is risky and can be penalized by the search engine if detected.

That said, check out this SO answer, which has several links to well-maintained bot lists. You would have to parse the User-Agent string and compare it against such a list.


