Tell bots apart from human visitors for stats?
Humans and bots will do similar things, but bots will do things that humans don't. Let's try to identify those things. Before we look at behavior, let's accept RayQuang's comment as being useful. If a visitor has a bot's user-agent string, it's probably a bot. I can't imagine anybody going around with "Google Crawler" (or something similar) as a UA unless they're working on breaking something. I know you don't want to update a list manually, but auto-pulling that one should be good, and even if it stays stale for the next 10 years, it will still be helpful.
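As a sketch of that user-agent check (the list file name and one-substring-per-line format are assumptions for illustration; a list auto-pulled from a source such as robotstxt.org would need to be saved in that shape first):

```php
<?php
// Hypothetical helper: match the visitor's user-agent against a locally
// cached list of known bot UA substrings, one per line.
function isKnownBotUA(string $userAgent, string $listFile): bool
{
    $patterns = file($listFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($patterns === false) {
        return false; // no list available: fail open, treat as human
    }
    foreach ($patterns as $pattern) {
        // Case-insensitive substring match, e.g. "Googlebot" in the UA
        if (stripos($userAgent, $pattern) !== false) {
            return true;
        }
    }
    return false;
}
```

Because the list is just a flat file, refreshing it is a cron job, not a manual chore.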
Some have already mentioned JavaScript and image loading, but Google will do both. We must assume there are now several bots that will do both, so those are no longer human indicators. What bots will still uniquely do, however, is follow an "invisible" link. Link to a page in a way that is invisible to a human user. If that link gets followed, we've got a bot.
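One way to lay that trap (the file names here are invented for illustration): every page carries a link a human cannot see, and the trap page records whoever requests it.

```php
<?php
// trap.php (hypothetical filename): any request to this URL is treated as
// a bot, because the only link pointing to it is hidden from humans, e.g.:
//   <a href="/trap.php" style="display:none" aria-hidden="1">&nbsp;</a>

// Append the caller's IP address to a flat-file bot list.
function recordBotIp(string $ip, string $logFile): void
{
    file_put_contents($logFile, $ip . "\n", FILE_APPEND | LOCK_EX);
}

// Only record when actually serving an HTTP request.
if (isset($_SERVER['REMOTE_ADDR'])) {
    recordBotIp($_SERVER['REMOTE_ADDR'], __DIR__ . '/bot-ips.txt');
}
```

A flat file is enough for a sketch; the database table suggested later in this answer is the sturdier home for these IPs.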
Bots will often, though not always, respect robots.txt. Users don't care about robots.txt, and we can probably assume that anybody retrieving robots.txt is a bot. We can go one step further, though, and link a dummy CSS page from our pages that is excluded by robots.txt. If our normal CSS is loaded but our dummy CSS isn't, it's definitely a bot. You'll have to build a table (probably in-memory) of loads by IP and do a not-contained-in match, but that should be a really solid tell.
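The not-contained-in match could be sketched like this (the data shapes are assumptions: two lists of IPs harvested from the access log, one for requests to the real stylesheet and one for the dummy stylesheet that robots.txt disallows):

```php
<?php
// IPs that requested the real CSS but never the dummy CSS behaved like
// robots.txt-respecting bots: they styled the page yet skipped the
// disallowed resource that every human browser would also fetch.
function findLikelyBots(array $realCssIps, array $dummyCssIps): array
{
    return array_values(array_diff(array_unique($realCssIps), $dummyCssIps));
}
```

In production this would run over an in-memory table keyed by IP, as the answer suggests, rather than over raw arrays.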
So, to use all this: maintain a database table of bots by IP address, possibly with timestamp limitations. Add anything that follows your invisible link, and add anything that loads the "real" CSS but ignores the robots.txt CSS. Maybe add all the robots.txt downloaders as well. Filter the user-agent string as the last step, and consider using it for a quick stats analysis to see how well those methods are identifying things you already know are bots.
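Putting the signals together might look like the sketch below (the parameter names are invented; in practice each list would be a lookup against the bot table, and the regex is the same user-agent filter shown in the next answer):

```php
<?php
// Classify one visitor using the signals from this answer, in order:
// behavioral signals first, user-agent string as the last step.
function classifyVisitor(
    string $ip,
    string $userAgent,
    array $invisibleLinkIps,  // IPs that followed the hidden link
    array $cssMismatchIps,    // IPs that loaded real CSS but not the dummy CSS
    array $robotsTxtIps       // IPs that fetched robots.txt
): string {
    if (in_array($ip, $invisibleLinkIps, true)
        || in_array($ip, $cssMismatchIps, true)
        || in_array($ip, $robotsTxtIps, true)
        || preg_match('/robot|spider|crawler|curl|^$/i', $userAgent)
    ) {
        return 'bot';
    }
    return 'human';
}
```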
Identify a visit from a robot
// Read the user-agent string and check it for common bot markers
if (preg_match('/robot|spider|crawler|curl|^$/i', $_SERVER['HTTP_USER_AGENT'] ?? ''))
{
    echo 'Is bot or spider or crawler or curl or not human';
}
else
{
    echo 'Is human';
}
You can find a list of about 300 common user-agent strings used by bots here: http://www.robotstxt.org/db.html
HTML5 Storage Against Crawlers and Bots
I used this tool to render the page as Googlebot, and the result is that Googlebot supports HTML5 Storage:
The code to test storage support: https://codepen.io/gab/pen/AxFoB
That pen uses this code for detection:
/* Detect whether the browser can use web storage */
if (typeof Storage !== 'undefined') {
    $('#yay').fadeIn('slow');
} else {
    $('#ooh').fadeIn('slow');
}
The tool to fetch and render as bot:
https://technicalseo.com/seo-tools/fetch-render/
The result of the render (screenshot omitted here) showed the storage-supported message.
Detecting if your site is being accessed by a robot
Presenting different content to search engine crawlers than to human visitors (known as "cloaking") is risky and can be penalized by the search engine if it is detected.
That said, check out this SO answer with several links to well-maintained "bot lists". You would have to read the HTTP_USER_AGENT string and compare it against such a bot list.