How to Prevent Site Scraping

How to protect website data from web scraping?

There is really only one method that comes close to being foolproof against scrapers, and that is a CAPTCHA. But since it hurts the user experience, most websites avoid it.

Another option is loading data via AJAX. This helps against scrapers that are not built to render JavaScript, but a determined scraper can build one using Selenium WebDriver. AJAX-loaded content is also bad for SEO, in case Google rankings matter to you.

A more efficient approach is to track user behaviour and store that information in cookies; if something looks suspicious, serve the user a CAPTCHA. This is essentially how Google's reCAPTCHA works on many sites (a rough sketch of the idea follows below).

Check this link : https://blog.hartleybrody.com/prevent-scrapers/
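Here is a minimal sketch of that idea, assuming a Flask app; the thresholds and the /captcha route are invented for illustration, and the behaviour record lives in Flask's signed session cookie so clients cannot simply edit it.

    import time
    from flask import Flask, session, redirect, request

    app = Flask(__name__)
    app.secret_key = "change-me"  # signs the session cookie so clients cannot forge it

    # Illustrative thresholds: more than 30 requests in 60 seconds looks suspicious.
    WINDOW_SECONDS = 60
    MAX_REQUESTS = 30

    @app.before_request
    def track_behaviour():
        if request.path == "/captcha":           # don't loop on the challenge page itself
            return None
        now = time.time()
        window_start = session.get("window_start", now)
        count = session.get("count", 0) + 1
        if now - window_start > WINDOW_SECONDS:  # start a fresh counting window
            window_start, count = now, 1
        session["window_start"], session["count"] = window_start, count
        if count > MAX_REQUESTS:                 # too fast for a human: show a CAPTCHA
            return redirect("/captcha")
        return None

Note that a scraper which simply discards cookies will keep getting fresh windows, so in practice you would combine this with per-IP tracking as well.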

How to prevent someone from scraping my website data?

I think that being a web developer these days is terrifying, and that maybe there is a temptation to go into "overkill" when it comes to web security. As the other answers have mentioned, it is impossible to stop automated scraping entirely, and it shouldn't worry you if you follow these guidelines:

  • It is great that you are considering website security. Never change.

  • Never send anything from the server you don't want the user to see. If the user is not authorised to see it, don't send it. Don't "hide" important bits and pieces in jQuery.data() or data-attributes. Don't squirrel things away in obfuscated JavaScript. Don't use techniques to hide data on the page until the user logs in, etc, etc.

    Everything - everything - is visible if it leaves the server.

  • If you have content you want to protect from "content farm" scraping, use email-verified user registration (including some form of GOOD reCAPTCHA to confound most of the bots).

  • Protect your server!!! As best you can, make sure you don't leave any common vulnerabilities open. Read this -> http://owasp.org/index.php/Category:How_To <- Yes. All of it ;)

  • Prevent direct access to your files. The more traditional approach is defined('_SOMECONSTANT') or die('No peeking, hacker!'); at the top of your PHP document. If the file is not accessed through the proper channels, nothing important will be sent from the server.

    You can also meddle with your .htaccess rules, or go for a more heavyweight setup.

Are you perhaps worried about cross site scripting (XSS)?

If you are worried about data being intercepted when the user enters login information, you can implement double verification (as Facebook does) or use SSL.

It really all boils down to what your site will do. If it is a run-of-the-mill site, cover the basics in the bullet points and hope for the best ;) If it is something sensitive like a banking site... well... don't do a banking site just yet :P


Just as an aside: I never touch credit card numbers and such. Any website I develop politely hands payments off via an API to a company with insurance and fleets of staff dedicated to security (not just little old me and my shattered nerves).

Protection from screen scraping

So, one approach would be to obfuscate the content (rot13, or something) and then have some JavaScript in the page that does something like document.write(unobfuscate(obfuscated_page)). But this totally blows away search engines (probably!).

Of course this doesn’t actually stop someone who wants to steal your data either, but it does make it harder.
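Here is a rough sketch of the rot13 idea, assuming a Flask backend; the route, the markup, and the inline decoder are all invented for illustration. Scrapers (and search engines) that do not execute JavaScript only ever see the scrambled text.

    import codecs
    from flask import Flask

    app = Flask(__name__)

    # Tiny client-side rot13 decoder (the classic one-liner), shipped inline with the page.
    DECODER_JS = """
    <script>
      var el = document.getElementById('payload');
      el.innerHTML = el.innerHTML.replace(/[a-zA-Z]/g, function (c) {
        return String.fromCharCode((c <= 'Z' ? 90 : 122) >= (c = c.charCodeAt(0) + 13) ? c : c - 26);
      });
    </script>
    """

    @app.route("/protected")
    def protected():
        secret_html = "Price list: widget A $9.99, widget B $14.99"
        scrambled = codecs.encode(secret_html, "rot13")   # what non-JS scrapers will see
        return "<div id='payload'>" + scrambled + "</div>" + DECODER_JS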

Once the client has the data it is pretty much game over, so you need to look at something on the server side.

Given that search engines are basically screen scrapers, things are difficult. You need to look at what the difference is between the good screen scrapers and the bad screen scrapers, and of course you have normal human users as well. So this comes down to the problem of how, on the server, you can effectively classify a request as coming from a human, a good screen scraper, or a bad screen scraper.

So, the place to start would be looking at your log files and seeing whether there is some pattern that lets you effectively classify requests, and then, having found such a pattern, checking whether a bad screen scraper that knew about this classification could cloak itself to appear like a human or a good screen scraper.

Some ideas:

  • You may be able to identify the good screen scrapers by IP address(es).
  • You could potentially distinguish scraper from human by the number of concurrent connections, the total number of connections per time period, the access pattern, etc. (a rough classification sketch based on request rate follows this list).
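For instance, an offline classification pass over a combined-format access log might look like the sketch below; the threshold is made up, so tune it against what your real human traffic looks like.

    import re
    from collections import Counter

    # Matches the client IP and timestamp of a combined-format access log line.
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')
    THRESHOLD = 120   # requests per IP per minute; illustrative only

    def suspicious_ips(log_path):
        per_minute = Counter()
        with open(log_path) as log:
            for line in log:
                match = LOG_LINE.match(line)
                if not match:
                    continue
                ip, timestamp = match.groups()
                minute = timestamp[:17]        # "10/Oct/2000:13:55" - truncate the seconds
                per_minute[(ip, minute)] += 1
        return sorted({ip for (ip, _), n in per_minute.items() if n > THRESHOLD})

    if __name__ == "__main__":
        for ip in suspicious_ips("access.log"):
            print(ip)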

Obviously these aren't ideal or foolproof. Another tactic is to determine what measures you can take that are unobtrusive to humans but (maybe) annoying for scrapers. An example might be throttling the rate of requests. (This depends on the time-criticality of the requests: if they are scraping in real time, it would affect their end users.)

The other aspect is to look at serving these users better. Clearly they are scraping because they want the data. If you provide an easy way for them to obtain the data directly in a useful format, that will be easier for them than screen scraping, and access through such a channel can be regulated: give requesters a unique key and limit the number of requests per key to avoid overloading the server, charge per 1,000 requests, etc.
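A sketch of the per-key accounting, with invented quota numbers and an in-memory store (a real service would persist counters in something like Redis and issue keys through a signup flow):

    import time

    # Hypothetical per-key quota: 1000 requests per key per day, counted in memory.
    QUOTA = 1000
    DAY = 86400
    _counters = {}   # api_key -> (window_start_epoch, request_count)

    def allow_request(api_key):
        now = time.time()
        start, count = _counters.get(api_key, (now, 0))
        if now - start > DAY:                  # the daily window has rolled over
            start, count = now, 0
        count += 1
        _counters[api_key] = (start, count)
        return count <= QUOTA                  # False would translate to "429 Too Many Requests"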

Of course there are still people who will want to rip you off, and while there are probably other ways to disincentivise them, those start to become non-technical and require legal avenues to be pursued.

I'm being scraped, how can I prevent this?

There are plenty of techniques in the anti-scraping world, so I'll just categorize them. If you find something missing in my answer, please comment.

A. Server-side filtering based on web requests

1. Blocking suspicious IPs

Blocking suspicious IPs works well, but today most scraping is done through IP proxies, so in the long run it will not be effective. In your case you get requests from the same IP geolocation, so if you ban this IP the scrapers will surely switch to IP proxying, staying IP-independent and undetected.
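For completeness, a bare-bones sketch of the IP ban itself, assuming Flask and using made-up documentation-range addresses; as noted above, rotating proxies walk straight past it.

    import ipaddress
    from flask import Flask, request, abort

    app = Flask(__name__)

    # Made-up denylist entries: single IPs or whole networks.
    DENYLIST = [ipaddress.ip_network("203.0.113.0/24"),
                ipaddress.ip_network("198.51.100.7/32")]

    @app.before_request
    def drop_banned_ips():
        client = ipaddress.ip_address(request.remote_addr)
        if any(client in net for net in DENYLIST):
            abort(403)   # a scraper behind rotating proxies simply never hits this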

2. Using DNS-level filtering

A DNS-level firewall is another anti-scraping measure. In short, this means putting your web service behind a private domain name server (DNS) network that filters and blocks bad requests before they ever reach your server. This sophisticated measure is provided by some companies as part of managed website protection, and you can look at an example of such a service to dig deeper.

3. Have a custom script to track usage statistics and drop troublesome requests

As you've mentioned, you have detected the pattern a scraper follows when crawling your URLs. You can have a custom script that tracks the requested URLs and, based on that, switches on protection measures. For this you would hook a custom handler or script into IIS. A side effect might be that response times increase, slowing down your service. Keep in mind, too, that the pattern you detected might be changed by the scraper, which would defeat this measure.

4. Limit request frequency

You might limit the frequency of requests or the amount of downloadable data. The restrictions must be applied with the usability for a normal user in mind. Compared with a scraper's insistent requests, you can set your web service rules to drop or delay the unwanted activity. Yet if the scraper gets reconfigured to imitate common user behaviour (through well-known tools such as Selenium, Mechanize, or iMacros), this measure will stop working.
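A sketch of the drop-or-delay idea, assuming Flask; the window, the limit, and the one-second delay are illustrative only, and a real deployment would keep the counters in shared storage rather than process memory.

    import time
    from collections import defaultdict, deque
    from flask import Flask, request, abort

    app = Flask(__name__)

    WINDOW, LIMIT = 60.0, 60        # illustrative: at most 60 requests per IP per minute
    _recent = defaultdict(deque)    # ip -> timestamps of recent requests (process-local)

    @app.before_request
    def limit_request_frequency():
        now = time.time()
        hits = _recent[request.remote_addr]
        while hits and now - hits[0] > WINDOW:   # forget requests outside the window
            hits.popleft()
        hits.append(now)
        if len(hits) > LIMIT:
            abort(429)                           # drop: too many requests
        elif len(hits) > LIMIT // 2:
            time.sleep(1)                        # delay: annoying for bots, mild for humans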

5. Setting maximum session length

This measure is a reasonable one, but modern scrapers usually perform session authentication, so cutting off session time is not that effective.
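If you do want to cap session length anyway, a minimal Flask sketch (the 30-minute figure and the /login route are invented) might look like this:

    import time
    from flask import Flask, session, redirect, request

    app = Flask(__name__)
    app.secret_key = "change-me"

    MAX_SESSION_SECONDS = 30 * 60    # illustrative 30-minute cap on any one session

    @app.before_request
    def cap_session_length():
        if request.path == "/login":
            return None
        started = session.get("started")
        if started is None:
            session["started"] = time.time()
        elif time.time() - started > MAX_SESSION_SECONDS:
            session.clear()              # force the client to authenticate again
            return redirect("/login")
        return None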

B. Browser-based identification and prevention

1. Set CAPTCHAs for target pages

This is an old technique that, for the most part, does solve the scraping issue. Yet if your scraping opponent leverages one of the anti-CAPTCHA services, this protection will most likely be defeated.

2. Injecting JavaScript logic into the web service response

JavaScript code arrives at the client (the user's browser or a scraping server) before or along with the requested HTML content. This code computes and returns a certain value to the target server. Depending on the outcome of this test, the HTML may be malformed or not sent to the requester at all, locking out malicious scrapers. The logic can be placed in one or more JavaScript-loadable files, and it can be applied not only to the whole content but also to certain parts of the site's content (e.g. prices). To bypass this measure, scrapers need to turn to even more complex scraping logic (usually JavaScript-capable) that is highly customizable and therefore costly.
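A toy version of such a challenge, assuming Flask; the arithmetic is deliberately trivial where real services ship heavily obfuscated logic, and the route and cookie names are invented.

    import random
    from flask import Flask, request

    app = Flask(__name__)

    # The browser must execute this script to obtain a token; a scraper that never
    # runs JavaScript never gets past the challenge page.
    CHALLENGE_PAGE = """
    <script>
      var token = (%d * %d + 42).toString();   // the value computed on the client
      document.cookie = "js_token=" + token + "; path=/";
      location.reload();
    </script>
    """

    _issued = {}   # client IP -> expected answer (in-memory, for illustration only)

    @app.route("/content")
    def content():
        ip = request.remote_addr
        if request.cookies.get("js_token") == _issued.get(ip):
            return "<p>The real page content goes here.</p>"
        a, b = random.randint(2, 99), random.randint(2, 99)
        _issued[ip] = str(a * b + 42)
        return CHALLENGE_PAGE % (a, b)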

C. Content-based protection

1. Disguising important data as images

This method of content protection is widely used today. It does prevent scrapers from collecting data automatically. Its side effect is that data obfuscated as images is hidden from search engine indexing, which hurts the site's SEO, and if scrapers employ an OCR system this kind of protection can again be bypassed.
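A sketch of serving a sensitive value as an image, assuming Flask and Pillow; the route name and the price are invented, and the SEO, accessibility, and OCR caveats above still apply.

    import io
    from flask import Flask, send_file
    from PIL import Image, ImageDraw

    app = Flask(__name__)

    @app.route("/price.png")
    def price_image():
        # Render the sensitive value as pixels so it never appears as text in the HTML.
        img = Image.new("RGB", (120, 32), "white")
        ImageDraw.Draw(img).text((8, 8), "$14.99", fill="black")
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        buf.seek(0)
        return send_file(buf, mimetype="image/png")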

2. Frequent page structure change

This is quite an effective way to protect against scraping. It works best when you change not just element ids and classes but the entire hierarchy; the latter involves restructuring your styling, which imposes additional costs on you. The scraper side then has to adapt to the new structure each time if it wants to keep scraping content. There are not many side effects if your service can afford the maintenance.
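One way to make that churn cheap is to regenerate class names on every deployment. This is a made-up sketch; your stylesheet would need to be rewritten with the same mapping.

    import random
    import string

    LOGICAL_CLASSES = ["product", "title", "price"]

    def build_class_map(seed):
        rng = random.Random(seed)        # seed with the deployment id, for example
        return {name: "".join(rng.choices(string.ascii_lowercase, k=8))
                for name in LOGICAL_CLASSES}

    def render(template, class_map):
        # Swap every logical class name for this deployment's randomized one.
        for logical, randomized in class_map.items():
            template = template.replace('class="%s"' % logical, 'class="%s"' % randomized)
        return template

    html = '<div class="product"><span class="title">Widget</span><span class="price">$9.99</span></div>'
    print(render(html, build_class_map(seed="release-2024-06-01")))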

How can I prevent my asp.net site from being screen scraped?

It is possible to try to detect screen scrapers:

Use cookies and timing; this will make it harder for out-of-the-box screen scrapers. Also check for JavaScript support, since most scrapers do not have it, and check the browser metadata (the User-Agent and related headers) to verify it really is a web browser.

You can also count requests per minute: a user driving a browser can only make a small number of requests per minute, so server-side logic that detects too many requests per minute can presume that screen scraping is taking place and block the offending IP address for some period of time. If this starts to affect search crawlers, log the IPs that get blocked and allow-list them as needed.

You can also use http://www.copyscape.com/ to protect your content; it will at least tell you who is reusing your data.

See this question also:

Protection from screen scraping

Also take a look at

http://blockscraping.com/

Nice doc about screen scraping:

http://www.realtor.org/wps/wcm/connect/5f81390048be35a9b1bbff0c8bc1f2ed/scraping_sum_jun_04.pdf?MOD=AJPERES&CACHEID=5f81390048be35a9b1bbff0c8bc1f2ed

How to prevent screen scraping:

http://mvark.blogspot.com/2007/02/how-to-prevent-screen-scraping.html

How do I block web scraping without blocking well-behaved bots?

You can discover the IP addresses that Google and others are using by checking visitor IPs with whois (on the command line or on a web site). Then, once you've accumulated a stash of legitimate search engine addresses, allow them into your product list without the CAPTCHA.
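The whois lookup can also be automated. A common scriptable variant of the same idea is a reverse-DNS check with forward confirmation; the trusted suffixes below are the publicly documented ones for Googlebot and Bingbot, but treat the list as an assumption to verify.

    import socket

    # Reverse-DNS suffixes for Googlebot and Bingbot; verify before relying on them.
    TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")

    def is_legit_search_bot(ip):
        try:
            hostname = socket.gethostbyaddr(ip)[0]        # reverse DNS lookup
        except socket.herror:
            return False
        if not hostname.endswith(TRUSTED_SUFFIXES):
            return False
        try:
            return socket.gethostbyname(hostname) == ip   # forward-confirm the PTR record
        except socket.gaierror:
            return False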

Top techniques to avoid 'data scraping' from a website database

If the data is published, it's visible and accessible to everyone on the Internet. This includes the people you want to see it and the people you don't.

You can't have it both ways. You can make it so that data can only be visible with an account, and people will make accounts to slurp the data. You can make it so that the data can only be visible from approved IP addresses, and people will go through the steps to acquire approval before slurping it.

Yes, you can make it hard to get, but if you want it to be convenient for typical users you need to make it convenient for malicious ones as well.


