How to Detect Fake Users (Crawlers) and cURL

There is no magic solution to prevent automatic crawling. Everything a human can do, a robot can do too. There are only ways to make the job harder, so hard that only strongly skilled geeks will try to get past them.

I was in this kind of trouble myself some years ago, and my first piece of advice is: if you have the time, be a crawler yourself (I assume a "crawler" is the guy who crawls your website); it is the best school for the subject. By crawling several websites, I learned about different kinds of protections, and by combining them I became effective.

Here are some examples of protections you may want to try.


Sessions per IP

If a user opens 50 new sessions each minute, you can suspect this user is a crawler that does not handle cookies. Of course, cURL manages cookies perfectly, but if you couple this check with a visit counter per session (explained later), or if your crawler is a newbie with cookie matters, it can be effective.

It is difficult to imagine that 50 people behind the same shared connection will hit your website simultaneously (it of course depends on your traffic; that is up to you). And if this happens, you can lock the pages of your website until a captcha is filled in.

Idea:

1) create two tables: one to store banned IPs and one to store IPs and sessions

create table if not exists sessions_per_ip (
    ip int unsigned,
    session_id varchar(32),
    creation timestamp default current_timestamp,
    primary key (ip, session_id)
);

create table if not exists banned_ips (
    ip int unsigned,
    creation timestamp default current_timestamp,
    primary key (ip)
);

2) at the beginning of your script, delete the entries that are too old from both tables

3) next, check whether your user's IP is banned (if so, set a flag to true)

4) if not, count how many sessions exist for his IP

5) if he has too many sessions, insert his IP into the banned table and set the flag

6) insert his IP into the sessions-per-IP table if it has not already been inserted

I wrote a code sample to illustrate the idea more concretely.

<?php

try
{
    // Some configuration (small values for demo)
    $max_sessions = 5;    // 5 simultaneous sessions allowed per IP
    $check_duration = 30; // 30 secs max lifetime of an IP in the sessions_per_ip table
    $lock_duration = 60;  // time to lock your website for this IP once max_sessions is reached

    // MySQL connection
    require_once("config.php");
    $dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
    $dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Delete old entries from both tables
    $query = "delete from sessions_per_ip where timestampdiff(second, creation, now()) > {$check_duration}";
    $dbh->exec($query);

    $query = "delete from banned_ips where timestampdiff(second, creation, now()) > {$lock_duration}";
    $dbh->exec($query);

    // Get useful info attached to our user...
    session_start();
    $ip = ip2long($_SERVER['REMOTE_ADDR']);
    $session_id = session_id();

    // Check if the IP is already banned
    $banned = false;
    $count = $dbh->query("select count(*) from banned_ips where ip = '{$ip}'")->fetchColumn();
    if ($count > 0)
    {
        $banned = true;
    }
    else
    {
        // Count entries in our db for this IP
        $query = "select count(*) from sessions_per_ip where ip = '{$ip}'";
        $count = $dbh->query($query)->fetchColumn();
        if ($count >= $max_sessions)
        {
            // Lock the website for this IP
            $query = "insert ignore into banned_ips ( ip ) values ( '{$ip}' )";
            $dbh->exec($query);
            $banned = true;
        }

        // Insert a new entry in our db if the user's session is not already recorded
        $query = "insert ignore into sessions_per_ip ( ip, session_id ) values ('{$ip}', '{$session_id}')";
        $dbh->exec($query);
    }

    // At this point, $banned tells you whether your user is banned or not.
    // The following code will allow us to test it...

    // We do not display anything yet because we'll play with sessions:
    // to make the demo more readable I prefer going step by step like this.
    ob_start();

    // Display your current sessions
    echo "Your current session keys are: <br/>";
    $query = "select session_id from sessions_per_ip where ip = '{$ip}'";
    foreach ($dbh->query($query) as $row) {
        echo "{$row['session_id']}<br/>";
    }

    // Display and handle a way to create new sessions
    echo str_repeat('<br/>', 2);
    echo '<a href="' . basename(__FILE__) . '?new=1">Create a new session / reload</a>';
    if (isset($_GET['new']))
    {
        session_regenerate_id();
        session_destroy();
        header("Location: " . basename(__FILE__));
        die();
    }

    // Display whether you're banned or not
    echo str_repeat('<br/>', 2);
    if ($banned)
    {
        echo '<span style="color:red;">You are banned: wait 60 secs to be unbanned... a captcha would be more friendly of course!</span>';
        echo '<br/>';
        echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
    }
    else
    {
        echo '<span style="color:blue;">You are not banned!</span>';
        echo '<br/>';
        echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
    }
    ob_end_flush();
}
catch (PDOException $e)
{
    // Uncomment to debug:
    // echo $e->getMessage();
}

?>

Visit Counter

If your user uses the same cookie while crawling your pages, you'll be able to use his session to block him. The idea is quite simple: is it plausible that a user visits 60 pages in 60 seconds?

Idea:

  1. Create an array in the user session; it will contain visit timestamps.
  2. Remove the visits older than X seconds from this array.
  3. Add a new entry for the current visit.
  4. Count the entries in this array.
  5. Ban your user if he has visited more than Y pages.

Sample code:

<?php

$visit_counter_pages = 5; // maximum number of pages to load
$visit_counter_secs = 10; // maximum amount of time before cleaning visits

session_start();

// Initialize an array for our visit counter
if (array_key_exists('visit_counter', $_SESSION) == false)
{
    $_SESSION['visit_counter'] = array();
}

// Clean old visits
foreach ($_SESSION['visit_counter'] as $key => $time)
{
    if ((time() - $time) > $visit_counter_secs) {
        unset($_SESSION['visit_counter'][$key]);
    }
}

// Add the current visit into our array
$_SESSION['visit_counter'][] = time();

// Check if the user has reached the limit of visited pages
$banned = false;
if (count($_SESSION['visit_counter']) > $visit_counter_pages)
{
    // Put the IP of our user in the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
$count = count($_SESSION['visit_counter']);
echo "You visited {$count} pages.";
echo str_repeat('<br/>', 2);

echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

$('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
});

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display whether you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned! Wait for a short while (10 secs in this demo)...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>

An image to download

When a crawler needs to do its dirty work, it is for a large amount of data in the shortest possible time. That's why crawlers don't download the images on your pages; it takes too much bandwidth and makes the crawling slower.

This idea (I think the most elegant and the easiest to implement) uses mod_rewrite to hide code in an image file (.jpg/.png/...). This image should be present on each page you want to protect: it could be your website logo, but you should choose a small image, because this image must not be cached.

Idea:

1/ Add these lines to your .htaccess:

RewriteEngine On
RewriteBase /tests/anticrawl/
RewriteRule ^logo\.jpg$ logo.php

2/ Create your logo.php containing the security check:

<?php

// start session and reset counter
session_start();
$_SESSION['no_logo_count'] = 0;

// forces image to reload next time
header("Cache-Control: no-store, no-cache, must-revalidate");

// displays image
header("Content-type: image/jpg");
readfile("logo.jpg");
die();

3/ Increment no_logo_count on each page you want to protect, and check whether it has reached your limit.

Sample code:

<?php

$no_logo_limit = 5; // number of allowed pages without loading the logo

// Start session and initialize
session_start();
if (array_key_exists('no_logo_count', $_SESSION) == false)
{
    $_SESSION['no_logo_count'] = 0;
}
else
{
    $_SESSION['no_logo_count']++;
}

// Check if the user has reached the limit of pages without downloading the image
$banned = false;
if ($_SESSION['no_logo_count'] >= $no_logo_limit)
{
    // Put the IP of our user in the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "You did not load the image {$_SESSION['no_logo_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<a id="reload" href="#">Reload</a>

<script type="text/javascript">

$('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
});

</script>

EOT;

echo str_repeat('<br/>', 2);

// Display "show image" link: note that we're requesting the rewritten .jpg file
echo <<< EOT

<div id="image_container">
<a id="image_load" href="#">Load image</a>
</div>
<br/>

<script type="text/javascript">

// In your implementation, you'll of course use <img src="logo.jpg" /> directly
$('#image_load').click(function(e) {
    e.preventDefault();
    $('#image_load').html('<img src="logo.jpg" />');
});

</script>

EOT;

// Display whether you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "load image" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>

Cookie check

You can create cookies on the JavaScript side to check whether your users interpret JavaScript (a crawler using cURL does not, for example).

The idea is quite simple; it is about the same as the image check.

  1. Set a counter in $_SESSION and increment it on each visit.
  2. If a cookie (set in JavaScript) exists, reset the counter to 0.
  3. If this counter reaches a limit, ban your user.

Code:

<?php

$no_cookie_limit = 5; // number of allowed pages without a passing cookie check

// Start session and initialize the counter
session_start();

if (array_key_exists('cookie_check_count', $_SESSION) == false)
{
    $_SESSION['cookie_check_count'] = 0;
}

// Check the cookie value (note: rename the cookie to something more discreet of course)
if ((array_key_exists('cookie_check', $_COOKIE) == false) || ($_COOKIE['cookie_check'] != 42))
{
    // Cookie does not exist or is incorrect...
    $_SESSION['cookie_check_count']++;
}
else
{
    // Cookie is properly set, so we reset the counter
    $_SESSION['cookie_check_count'] = 0;
}

// Check if the user has reached the limit of failed cookie checks
$banned = false;
if ($_SESSION['cookie_check_count'] >= $no_cookie_limit)
{
    // Put the IP of our user in the same "banned table" as earlier...
    $banned = true;
}

// At this point, $banned tells you whether your user is banned or not.
// The following code will allow us to test it...

echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';

// Display counter
echo "Cookie check failed {$_SESSION['cookie_check_count']} times.";
echo str_repeat('<br/>', 2);

// Display "reload" link
echo <<< EOT

<br/>
<a id="reload" href="#">Reload</a>
<br/>

<script type="text/javascript">

$('#reload').click(function(e) {
    e.preventDefault();
    window.location.reload();
});

</script>

EOT;

// Display "set cookie" link
echo <<< EOT

<br/>
<a id="cookie_link" href="#">Set cookie</a>
<br/>

<script type="text/javascript">

// In your implementation, you'll of course set the cookie in a $(document).ready()
$('#cookie_link').click(function(e) {
    e.preventDefault();
    var expires = new Date();
    expires.setTime(new Date().getTime() + 3600000);
    document.cookie = "cookie_check=42;expires=" + expires.toGMTString();
});

</script>
EOT;

// Display "unset cookie" link
echo <<< EOT

<br/>
<a id="unset_cookie" href="#">Unset cookie</a>
<br/>

<script type="text/javascript">

$('#unset_cookie').click(function(e) {
    e.preventDefault();
    document.cookie = "cookie_check=;expires=Thu, 01 Jan 1970 00:00:01 GMT";
});

</script>
EOT;

// Display whether you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
    echo '<span style="color:red;">You are banned: click on "Set cookie" and reload...</span>';
    echo '<br/>';
    echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
    echo '<span style="color:blue;">You are not banned!</span>';
    echo '<br/>';
    echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>

Protection against proxies

A few words about the different kinds of proxies we may find on the web:

  • A “normal” proxy passes along information about the user's connection (notably, his IP).
  • An anonymous proxy does not reveal the user's IP, but exposes information about the proxy usage in the headers.
  • A high-anonymity proxy does not reveal the user's IP, and does not send any header a regular browser would not send.

It is easy to find a proxy to connect to any website, but it is very hard to find high-anonymity proxies.

Some keys may appear in your $_SERVER variable specifically when your user is behind a proxy (exhaustive list taken from this question):

  • CLIENT_IP
  • FORWARDED
  • FORWARDED_FOR
  • FORWARDED_FOR_IP
  • HTTP_CLIENT_IP
  • HTTP_FORWARDED
  • HTTP_FORWARDED_FOR
  • HTTP_FORWARDED_FOR_IP
  • HTTP_PC_REMOTE_ADDR
  • HTTP_PROXY_CONNECTION
  • HTTP_VIA
  • HTTP_X_FORWARDED
  • HTTP_X_FORWARDED_FOR
  • HTTP_X_FORWARDED_FOR_IP
  • HTTP_X_IMFORWARDS
  • HTTP_XROXY_CONNECTION
  • VIA
  • X_FORWARDED
  • X_FORWARDED_FOR

You may apply a different behavior (lower limits, etc.) in your anti-crawl protections if you detect one of those keys in your $_SERVER variable.
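
For illustration, here is a minimal sketch of such a check (the is_behind_proxy helper is mine, and its key list is only a subset of the list above):

<?php

// Returns true if any typical proxy header is present in $_SERVER.
// Extend the list with the other keys listed above as needed.
function is_behind_proxy()
{
    $proxy_keys = array(
        'HTTP_VIA', 'HTTP_X_FORWARDED_FOR', 'HTTP_FORWARDED_FOR',
        'HTTP_CLIENT_IP', 'HTTP_FORWARDED', 'HTTP_PROXY_CONNECTION',
    );
    foreach ($proxy_keys as $key)
    {
        if (array_key_exists($key, $_SERVER))
        {
            return true;
        }
    }
    return false;
}

// Example: allow fewer simultaneous sessions per IP for proxy users
$max_sessions = is_behind_proxy() ? 2 : 5;

?>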


Conclusion

There are a lot of ways to detect abuse on your website, so you'll find a solution for sure. But you need to know precisely how your website is used, so that your protections are not aggressive toward your "normal" users.

How can one detect if a server/script is accessing their site through cURL/file_get_contents()? (excluding user-agents and IP addresses)

It is indeed a cookie that is set by JavaScript, followed by a redirect to the original image. The problem is that cURL/file_get_contents won't parse the HTML and set the cookie; only cookies set by the server end up in cURL's cookie jar.

This is the code you get before the redirect; it creates a cookie via JavaScript with no name, using location.href as the value:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<HEAD>
<TITLE>http://phim.xixam.com/thumb/giotdang.jpeg</TITLE>
<meta http-equiv="Refresh" content="0;url=http://phim.xixam.com/thumb/giotdang.jpeg">
</HEAD>
<script type="text/javascript">
window.onload = function checknow() {
var today = new Date();
var expires = 3600000*1*1;
var expires_date = new Date(today.getTime() + (expires));
var ua = navigator.userAgent.toLowerCase();
if ( ua.indexOf( "safari" ) != -1 ) { document.cookie = "location.href"; } else { document.cookie = "location.href;expires=" + expires_date.toGMTString(); }
}
</script>
<BODY>
</BODY></HTML>

But all is not lost: by pre-setting/forging the cookie, you can circumvent this security measure (one reason why using cookies for any kind of security is a bad idea).

cookie.txt

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

phim.xixam.com	FALSE	/thumb/	FALSE	1338867990	location.href

So the finished cURL script would look something like this:

<?php
function curl_get($url)
{
    if (!function_exists('curl_init')) {
        die('cURL must be installed!');
    }

    // Forge the cookie (the fields of a Netscape cookie file are tab-separated)
    $expire = time() + 3600000;
    $cookie = <<<COOKIE
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

phim.xixam.com	FALSE	/thumb/	FALSE	$expire	location.href

COOKIE;
    file_put_contents(dirname(__FILE__) . '/cookie.txt', $cookie);

    // Browser-masquerading cURL request
    $curl = curl_init();
    $header[0] = "Accept: text/xml,application/xml,application/json,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";

    curl_setopt($curl, CURLOPT_COOKIEJAR, dirname(__FILE__) . '/cookie.txt');
    curl_setopt($curl, CURLOPT_COOKIEFILE, dirname(__FILE__) . '/cookie.txt');
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    // Pass the referer check
    curl_setopt($curl, CURLOPT_REFERER, 'http://xixam.com/forum.php');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);

    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}

$image = curl_get('http://phim.xixam.com/thumb/giotdang.jpeg');

file_put_contents('test.jpg', $image);
?>

The only way to stop a crawler is to log all of your visitors' IPs in your database and increment a counter based on visits per IP. Then, once a week or so, look at the top hits by IP, do a reverse lookup of each IP, and check whether it belongs to a hosting provider; if so, block it at your firewall or in .htaccess. Other than that, you can't really stop requests to a publicly available resource, as any hurdle can be overcome.
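
Here is a rough sketch of that weekly review (the hits_per_ip table, its columns, and the 20-row limit are assumptions for the example):

<?php

// A rough sketch of the weekly review described above. It assumes you
// log each request into a hypothetical hits_per_ip (ip, hits) table.
require_once("config.php");
$dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);

// List the top 20 IPs by number of hits
foreach ($dbh->query("select ip, hits from hits_per_ip order by hits desc limit 20") as $row)
{
    $ip = long2ip($row['ip']);
    $hostname = gethostbyaddr($ip); // reverse DNS lookup

    // Review this output by hand: hostnames belonging to hosting
    // providers are candidates for a firewall or .htaccess block.
    echo "{$ip}\t{$row['hits']}\t{$hostname}\n";
}

?>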

Hope it helps.

How can one detect a crawler / spider using PHP?

According to Verifying Googlebot:

You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For example:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).

You can do a reverse DNS lookup:

function validateGoogleBotIP($ip) {
    $hostname = gethostbyaddr($ip); // e.g. "crawl-66-249-66-1.googlebot.com"

    if (!preg_match('/\.google(bot)?\.com$/i', $hostname)) {
        return false;
    }

    // Forward DNS lookup: the hostname must resolve back to the original IP
    return gethostbyname($hostname) === $ip;
}

if (strpos($_SERVER['HTTP_USER_AGENT'], 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY Google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}

How to detect search engine bots with PHP?

Here's a Search Engine Directory of Spider names

Then you check $_SERVER['HTTP_USER_AGENT'] to see whether the agent is one of those spiders:

if (strstr(strtolower($_SERVER['HTTP_USER_AGENT']), "googlebot"))
{
    // what to do
}

How to use JavaScript or/and PHP, to detect a website/page being stolen/cloned and then redirect reader back to my website

1. Best Solution - Early Detection

Depending on your main traffic source, it is possible to detect who is scraping you and block them based on their IP, headers, number of page views, and other data, using PHP and .htaccess.

I really like this answer on Stack Overflow, which discusses almost all the options available for early detection:

How to detect fake users (crawlers) and cURL

2. Plugins & Extensions for Open Source Content Management Systems

WordPress

If you are using the WordPress CMS, you can try plugins like Wordfence, which can detect and block fake Google crawlers, block based on the number of page views, etc.

Other CMS

If you can't find a similar solution for your CMS of choice, consider asking its community for help with creating one, as I believe many people could benefit from it.

3. Solution for already stolen content with JavaScript

Sometimes the easiest way to hide something in JS is to actually HIDE it by OBFUSCATING it and spreading it across multiple important files. For example, obfuscate some important file on your website without which the website just wouldn't work properly.

For example, put an obfuscated version of the code below somewhere in a JS file in the header. Obfuscate the code using any free online service, or download your own library from GitHub:

Non-Obfuscated:

w = 'mysite.com'; // Current URL e.g. 'mysite.com/category1/page2/'

function check_origin() {
    var check = 587;
    if (window.location.hostname != w) {
        window.location.href = w;
    }
    return check;
}

var check = check_origin();

Obfuscated example:

var _0x303e=["\x6D\x79\x73\x69\x74\x65\x2E\x63\x6F\x6D","\x68\x6F\x73\x74\x6E\x61\x6D\x65","\x6C\x6F\x63\x61\x74\x69\x6F\x6E","\x68\x72\x65\x66"];w= _0x303e[0];function check_origin(){var check=587;if(window[_0x303e[2]][_0x303e[1]]!= w){window[_0x303e[2]][_0x303e[3]]= w};return check}var check=check_origin()

Now put additional code in some footer JS file to verify that the code above wasn't modified in any way:

Non-Obfuscated example:

if (w !== 'mysite.com' || check == false || typeof check == 'undefined' || check !== 587) {
    window.location.href = 'mysite.com';
}

Obfuscated:

var _0x92bb=["\x6D\x79\x73\x69\x74\x65\x2E\x63\x6F\x6D","\x75\x6E\x64\x65\x66\x69\x6E\x65\x64","\x68\x72\x65\x66","\x6C\x6F\x63\x61\x74\x69\x6F\x6E"];if(w!== _0x92bb[0]|| check== false||  typeof check== _0x92bb[1]|| check!== 587){window[_0x92bb[3]][_0x92bb[2]]= _0x92bb[0]}

I used a free online service found via a Google search for the term "Free Online JS Obfuscator":

https://javascriptobfuscator.com/Javascript-Obfuscator.aspx

4. Fight thieves with available methods, e.g. request a ban from search engines (the Digital Millennium Copyright Act of 1998)

Here is a blog post that describes what to do when someone steals your content:

https://lorelle.wordpress.com/2006/04/10/what-do-you-do-when-someone-steals-your-content/

You can investigate who is doing it and report them to their partners, search engines, and advertisers, in order to disrupt their business.

Depending on their country of origin and yours, it may even be possible to sue them and win.

How to check if the request is from google, facebook, twitter and bing crawlers?

They usually identify themselves through the user agent, with something like:

  • Google crawler
  • Yahoo
  • Bing

or something similar. In PHP, you can read the user agent with:

$_SERVER['HTTP_USER_AGENT'];

However, you should be aware that a user agent can be spoofed.
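
As a minimal sketch of such a check (the substrings below are commonly seen crawler tokens; verify the exact strings against each vendor's documentation):

<?php

// Checks the user agent against a few common crawler tokens.
// Remember: the user agent can be spoofed, so combine this with a
// reverse DNS check (as shown earlier for Googlebot) when it matters.
$crawlers = array('googlebot', 'bingbot', 'facebookexternalhit', 'twitterbot');

$ua = strtolower($_SERVER['HTTP_USER_AGENT']);
$is_crawler = false;
foreach ($crawlers as $crawler)
{
    if (strpos($ua, $crawler) !== false)
    {
        $is_crawler = true;
        break;
    }
}

echo $is_crawler ? 'Known crawler' : 'Probably a regular user';

?>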


