Anyone know of a good Python based web crawler that I could use?
- Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
- Twill is a simple scripting language built on top of Mechanize.
- BeautifulSoup + urllib2 also works quite nicely.
- Scrapy looks like an extremely promising project; it's new.
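The core step behind the BeautifulSoup + urllib2 combination is extracting links from fetched HTML. A minimal sketch of that step, using the stdlib html.parser as a stand-in for BeautifulSoup so it runs without third-party installs (the HTML is an inline sample rather than a live fetch):

```python
# Minimal link extraction, the core step of any crawler.
# Stdlib HTMLParser stands in for BeautifulSoup here; the HTML is
# an inline sample instead of a page fetched over the network.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href of every anchor tag encountered.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<html><body><a href="/today">today</a>'
            '<a href="/tor/123">torrent</a></body></html>')
# parser.links -> ['/today', '/tor/123']
```

A real crawler would resolve these relative links against the page URL and feed them back into its fetch queue.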
What are the best prebuilt libraries for doing Web Crawling in Python
Use Scrapy.
It is a Twisted-based web crawler framework, still under heavy development, but it already works. It has many goodies:
- Built-in support for parsing HTML, XML, CSV, and JavaScript
- A media pipeline for scraping items with images (or any other media) and downloading the media files as well
- Support for extending Scrapy by plugging in your own functionality using middlewares, extensions, and pipelines
- A wide range of built-in middlewares and extensions for handling compression, caching, cookies, authentication, user-agent spoofing, robots.txt, statistics, crawl depth restriction, etc.
- Interactive scraping shell console, very useful for developing and debugging
- Web management console for monitoring and controlling your bot
- Telnet console for low-level access to the Scrapy process
Example code to extract information about all torrent files added today on the Mininova torrent site, using an XPath selector on the returned HTML:
class Torrent(ScrapedItem):
    pass

class MininovaSpider(CrawlSpider):
    domain_name = 'mininova.org'
    start_urls = ['http://www.mininova.org/today']
    rules = [Rule(RegexLinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        x = HtmlXPathSelector(response)
        torrent = Torrent()
        torrent.url = response.url
        torrent.name = x.x("//h1/text()").extract()
        torrent.description = x.x("//div[@id='description']").extract()
        torrent.size = x.x("//div[@id='info-left']/p[2]/text()[2]").extract()
        return [torrent]
What's a good Web Crawler tool
HTTrack -- http://www.httrack.com/ -- is a very good website copier. It works pretty well; I have been using it for a long time.
Nutch -- http://lucene.apache.org/nutch/ -- is a web crawler (a crawler is the type of program you're looking for), and it uses the top-notch search library Lucene.
Python Web crawler, print only links that contain a certain word in their path - Mechanize, Beautiful Soup etc
Yes, it's as easy as using a regex or plain old Python string find() on link.url. (EDIT: you can also use 'kontakt' in link.url, as shshank does.)
for link in br.links(text_regex=re.compile('^((?!IMG).)*$')):
    if link.url.find('kontakt') >= 0:
        ...  # do stuff on urls containing 'kontakt'
    # or:
    if link.url.find('kontakt') < 0:
        continue  # skip urls without 'kontakt'
Obviously both of these (the string find() method or the in operator) can match anywhere in the string, which is a little sloppy. What you want to do here is match only inside the URL tail. You can check just the tail using find() on link.url.split('/')[-1], or else link.url.rsplit('/', 2)[1].
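The tail-only check described above can be sketched as a small helper (the keyword 'kontakt' and the sample URLs are illustrative):

```python
# Sketch: keep only URLs whose final path segment contains a keyword,
# so a match in a parent directory name does not count.
urls = [
    "http://example.com/kontakt.html",
    "http://example.com/about/kontakt",
    "http://example.com/kontaktlinsen/info.html",  # keyword only in a directory
]

def tail_contains(url, word):
    """Return True if the last path segment of url contains word."""
    return word in url.split('/')[-1]

matches = [u for u in urls if tail_contains(u, 'kontakt')]
# matches -> the first two URLs; the third is skipped because its
# tail is 'info.html'
```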
How to build a web crawler based on Scrapy to run forever?
Scrapy is a framework for spidering websites; as such, it is intended to support your criteria, but it isn't going to dance for you out of the box. You will probably have to get relatively familiar with the module for some tasks.
- Running forever is up to your application that calls Scrapy. You tell the spiders where to go and when to go there.
- Giving priorities is the job of Scheduler middleware which you'd have to create and plug into Scrapy. The documentation on this appears spotty and I've not looked at the code - in principle the function is there.
- Scrapy is inherently, fundamentally asynchronous, which may well be what you are desiring: request B can be satisfied while request A is still outstanding. The underlying connection engine does not prevent you from bona fide multi-threading, but Scrapy doesn't provide threading services.
Scrapy is a library, not an application. There is a non-trivial amount of work (code) that a user of the module needs to do.
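The prioritisation point above can be sketched independently of Scrapy with a priority queue driving the fetch order. The class and method names here are illustrative, not Scrapy API; this is roughly the job Scrapy's scheduler performs internally:

```python
import heapq

# Sketch of a prioritised crawl frontier. Lower number = higher
# priority; a counter breaks ties so insertion order stays stable.
class Frontier(object):
    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0

    def push(self, url, priority=10):
        # Deduplicate: a URL is queued at most once.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (priority, self._counter, url))
            self._counter += 1

    def pop(self):
        # Always hand back the highest-priority pending URL.
        return heapq.heappop(self._heap)[2]

frontier = Frontier()
frontier.push("http://example.com/low", priority=10)
frontier.push("http://example.com/high", priority=1)
frontier.push("http://example.com/high", priority=1)  # duplicate, ignored
order = [frontier.pop(), frontier.pop()]
# order -> ['http://example.com/high', 'http://example.com/low']
```

A "run forever" crawler is then just a loop that pops from the frontier, fetches, and pushes newly discovered links back in.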
Python web crawler with MySQL database
Yes, I know of some.
Libraries:
https://github.com/djay/transmogrify.webcrawler
http://code.google.com/p/harvestman-crawler/
http://code.activestate.com/pypm/orchid/
Open source web crawler:
http://scrapy.org/
Tutorials:
http://www.example-code.com/python/pythonspider.asp
P.S. I don't know whether they use MySQL, because Python projects normally use either SQLite or PostgreSQL. If you want, you can take the libraries I gave you, import the python-mysql module, and wire it up yourself :D
http://sourceforge.net/projects/mysql-python/
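The database side of such a crawler is the same pattern regardless of backend. A sketch of storing crawled pages, using the stdlib sqlite3 module so it is self-contained (with python-mysql you would import MySQLdb and call MySQLdb.connect(...) instead; the table and column names here are illustrative):

```python
import sqlite3

# In-memory database for the sketch; a real crawler would use a file
# or a MySQL connection here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT)")

def store_page(url, title):
    # INSERT OR REPLACE keeps re-crawled URLs from raising on the
    # primary key; MySQL's equivalent would be INSERT ... ON
    # DUPLICATE KEY UPDATE.
    conn.execute("INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
                 (url, title))
    conn.commit()

store_page("http://example.com/", "Example")
store_page("http://example.com/", "Example")  # re-crawl, no error
rows = conn.execute("SELECT url, title FROM pages").fetchall()
# rows -> [('http://example.com/', 'Example')]
```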