Anyone Know of a Good Python Based Web Crawler That I Could Use

Anyone know of a good Python based web crawler that I could use?

  • Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
  • Twill is a simple scripting language built on top of Mechanize.
  • BeautifulSoup + urllib2 also works quite nicely (see the sketch after this list).
  • Scrapy looks like an extremely promising project; it's new.
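
As a quick illustration of the BeautifulSoup + urllib2 approach, here is a minimal sketch (Python 2-era, matching the libraries named above; the URL is just a placeholder):

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 it would be: from bs4 import BeautifulSoup

# fetch a page and print every link found on it
html = urllib2.urlopen('http://www.example.com/').read()
soup = BeautifulSoup(html)
for a in soup.findAll('a'):
    print a.get('href')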

What are the best prebuilt libraries for doing Web Crawling in Python

Use Scrapy.

It is a Twisted-based web crawling framework. It is still under heavy development, but it already works. It has many goodies:

  • Built-in support for parsing HTML, XML, CSV, and JavaScript
  • A media pipeline for scraping items with images (or any other media) and downloading the media files as well
  • Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
  • A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
  • Interactive scraping shell console, very useful for developing and debugging
  • Web management console for monitoring and controlling your bot
  • Telnet console for low-level access to the Scrapy process

Example code to extract information about all torrent files added today on the mininova torrent site, using an XPath selector on the returned HTML:

# (imports omitted in the original; ScrapedItem, CrawlSpider, Rule,
# RegexLinkExtractor and HtmlXPathSelector come from the old Scrapy API
# this example was written against)

class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    # follow every link matching /tor/<number> and parse it with parse_torrent
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()

        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
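
For reference, a rough equivalent in the current Scrapy API is sketched below (class names and imports per modern Scrapy; the XPath expressions are copied verbatim from the example above and may no longer match the live site). It would be run with scrapy crawl mininova from inside a Scrapy project:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        # yield a plain dict instead of the old ScrapedItem subclass
        yield {
            'url': response.url,
            'name': response.xpath('//h1/text()').extract(),
            'description': response.xpath("//div[@id='description']").extract(),
            'size': response.xpath("//div[@id='info-left']/p[2]/text()[2]").extract(),
        }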

What's a good Web Crawler tool

HTTrack -- http://www.httrack.com/ -- is a very good website copier. It works pretty well; I have been using it for a long time.

Nutch is a web crawler (a crawler is the type of program you're looking for) -- http://lucene.apache.org/nutch/ -- which uses Lucene, a top-notch search library.

Python web crawler, print only links that contain a certain word in their path - Mechanize, Beautiful Soup etc

Yes, it's as easy as using a regex or a plain old Python string find() on link.url. (EDIT: you can also use 'kontakt' in link.url, as shshank does.)

for link in br.links(text_regex=re.compile('^((?!IMG).)*$')):
    if link.url.find('kontakt') >= 0:
        pass  # ...do stuff on urls containing 'kontakt'
    # or, to skip urls without it:
    if link.url.find('kontakt') < 0:
        continue

Obviously both of these (the string find() method or the in operator) can match anywhere in the string, which is a little sloppy.
What you want to do here is match only inside the URL tail.
You can check just the tail using find() on link.url.split('/')[-1],
or else on link.url.rsplit('/', 2)[1].
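
For example, a minimal sketch of tail-only matching (assuming br is a mechanize.Browser that has already opened the page):

for link in br.links():
    tail = link.url.split('/')[-1]   # last path segment, e.g. 'kontakt.html'
    if 'kontakt' in tail:
        print link.url               # only links whose tail contains 'kontakt'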

How to build a web crawler based on Scrapy to run forever?

Scrapy is a framework for spidering websites; as such, it is intended to support your criteria, but it isn't going to dance for you out of the box; you will probably have to get relatively familiar with the module for some tasks.

  1. Running forever is up to your application that calls Scrapy. You tell the spiders where to go and when to go there (see the sketch after this list).
  2. Giving priorities is the job of Scheduler middleware, which you'd have to create and plug into Scrapy. The documentation on this appears spotty and I've not looked at the code - in principle the function is there.
  3. Scrapy is inherently, fundamentally asynchronous, which may well be what you want: request B can be satisfied while request A is still outstanding. The underlying connection engine does not prevent you from bona fide multi-threading, but Scrapy doesn't provide threading services.

Scrapy is a library, not an application. There is a non-trivial amount of work (code) that a user of the module needs to write.
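
As a rough sketch of point 1, a spider can keep a crawl alive indefinitely simply by always yielding new requests. This uses the current scrapy.Spider / scrapy.Request API rather than the older one shown earlier on this page, and the seed URL is just a placeholder:

import scrapy

class ForeverSpider(scrapy.Spider):
    name = 'forever'
    start_urls = ['http://www.example.com/']  # placeholder seed

    def parse(self, response):
        # ... extract and yield items here ...

        # Keep the crawl alive by re-scheduling the page we just fetched.
        # dont_filter=True bypasses the duplicate filter; priority influences
        # the order in which the scheduler dispatches pending requests.
        yield scrapy.Request(response.url, callback=self.parse,
                             dont_filter=True, priority=10)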

Python web crawler with MySQL database

Yes, I know of a few:

Libraries:

  • https://github.com/djay/transmogrify.webcrawler
  • http://code.google.com/p/harvestman-crawler/
  • http://code.activestate.com/pypm/orchid/

Open source web crawler:

  • http://scrapy.org/

Tutorials:

  • http://www.example-code.com/python/pythonspider.asp

P.S. I don't know whether these use MySQL; Python projects more commonly use SQLite or PostgreSQL. If you want MySQL, you can take the libraries above, import the python-mysql module, and write the data yourself:

http://sourceforge.net/projects/mysql-python/
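
As a minimal sketch of the MySQL side (this assumes the MySQLdb module from the mysql-python project linked above, a local database named 'crawl', and a 'pages' table you have created beforehand; names and credentials are placeholders):

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='crawler', passwd='secret', db='crawl')
cur = conn.cursor()

# store one crawled page; the table and its columns are hypothetical
cur.execute("INSERT INTO pages (url, title) VALUES (%s, %s)",
            ('http://www.example.com/', 'Example Domain'))

conn.commit()
cur.close()
conn.close()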


