What Are Some Good Ruby-Based Web Crawlers

What are some good Ruby-based web crawlers?

I am building wombat, a Ruby DSL for crawling web pages and extracting content. Check it out on GitHub: https://github.com/felipecsl/wombat

It is still at an early stage, but the basic functionality already works. More features will be added soon.
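
As a very rough sketch of the kind of DSL it provides (recalled from the project's README, so check the repository for the exact, current syntax):

require 'wombat'

# Declaratively describe what to extract; property names like "headline" are arbitrary.
results = Wombat.crawl do
  base_url "https://www.github.com"
  path "/"

  headline xpath: "//h1"
end

puts results.inspect   # => a hash of the extracted properties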

Web crawler in ruby

If you just want to fetch pages' content, the simplest way is to use open-uri. It is part of the standard library, so no additional gems are required; just require 'open-uri' and open the URL: http://ruby-doc.org/stdlib-2.2.2/libdoc/open-uri/rdoc/OpenURI.html

To parse the content you can use Nokogiri or other gems, many of which also support XPath for extracting data. You can find other parsing libraries here on SO.
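
For example, fetching a page with open-uri and pulling out links with Nokogiri's XPath support looks roughly like this (example.com is just a placeholder):

require 'open-uri'
require 'nokogiri'

# Fetch the raw HTML (on Rubies older than 2.5, use the bare open instead of URI.open).
html = URI.open('http://example.com').read

# Parse it and print every link's href via XPath.
doc = Nokogiri::HTML(html)
doc.xpath('//a/@href').each { |href| puts href.value }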

Creating a Web Crawler in Ruby. How to tackle performance issues?

Look at using Typhoeus and Hydra. They'll make it easy to process the URLs
in parallel.

You don't need Mechanize unless you have to request special data from each
page. For a normal crawler you can grab the body with OpenURI and parse it
using Nokogiri, without Mechanize's overhead or added functionality. For your
purpose, substitute Typhoeus for OpenURI and let Hydra handle the thread
management.
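
A minimal sketch of that setup, assuming urls is your list of pages to fetch:

require 'typhoeus'
require 'nokogiri'

# Hydra runs the queued requests in parallel, up to max_concurrency at a time.
hydra = Typhoeus::Hydra.new(max_concurrency: 20)

urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true, timeout: 15)
  request.on_complete do |response|
    next unless response.success?
    doc = Nokogiri::HTML(response.body)
    # ... extract whatever you need from doc and save it ...
  end
  hydra.queue(request)
end

hydra.run   # blocks until every queued request has completed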

Remember, crawling 200k websites is going to saturate your bandwidth if you try
to do them all at once. That'll make your Rails site unavailable, so you need
to throttle your requests. And, that means you will have to do them over
several (or many) hours. Speed isn't as important as keeping your site online
here. I'd probably put the crawler on a separate machine from the Rails server
and let the database tie things together.

Create a table or file that contains the site URLs you are crawling. I'd
recommend the table so you can put together a form to edit/manage the URLs.
You'll want to track things like:

  • The last time a URL was crawled. (DateTime)
  • Whether you should crawl a particular URL. (Boolean or char(1))
  • The URL itself. (A string or varchar(1024) should be fine.) This should be a unique key.
  • Whether that URL is currently being crawled. (Boolean or char(1)) This is cleared at the start of a run for all records, then set and left when a spider goes to load that page.
  • A field showing what days it's OK to crawl that site.
  • A field showing what hours it's OK to crawl that site.

The last two are important. You don't want to crawl a little site that is
underpowered and kill its connection. That's a great way to get banned.
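
As a rough sketch, since you mentioned Rails, an ActiveRecord migration for that table might look like this (column names here are just suggestions):

# On Rails 5+ add the version suffix, e.g. ActiveRecord::Migration[5.2].
class CreateCrawlSites < ActiveRecord::Migration
  def change
    create_table :crawl_sites do |t|
      t.string   :url, limit: 1024, null: false   # unique key for the site
      t.boolean  :crawl_enabled,     default: true
      t.boolean  :crawl_in_progress, default: false
      t.datetime :last_crawled_at
      t.string   :allowed_days    # e.g. "Sat,Sun"
      t.string   :allowed_hours   # e.g. "01-05"
      t.timestamps
    end
    add_index :crawl_sites, :url, unique: true
  end
end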

Create another table holding the next URLs to check on a particular site,
gathered from the links you encounter while crawling. You'll want to come up
with a normalization routine that reduces a URL containing session data and
parameters to something you can use to test for uniqueness. In this new table
you'll want URLs to be unique so you don't get into a loop or keep adding the
same page with different parameters.
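
A minimal normalization sketch, assuming it's acceptable to drop query strings and fragments entirely when testing for uniqueness:

require 'uri'

def normalize_url(raw)
  uri = URI.parse(raw)
  uri.query    = nil                 # strip session IDs and other parameters
  uri.fragment = nil
  uri.host     = uri.host.downcase if uri.host
  uri.path     = '/' if uri.path.to_s.empty?
  uri.to_s
rescue URI::InvalidURIError
  nil                                # unparseable URLs are simply skipped
end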

You might want to pay attention to the actual landing URL retrieved after any
redirects instead of the requested URL, because redirects and DNS names can
vary inside a site and the people generating the content could be using
different host names. Similarly, you might want to look for meta refreshes in
the head block and follow them. They're a particularly irritating aspect of
the kind of crawler you want to write.
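
With Typhoeus you can read the landing URL off the response, and a quick Nokogiri check will catch most meta refreshes (sketch only; url here is one entry from your URL table, and the http-equiv attribute is matched case-sensitively):

request = Typhoeus::Request.new(url, followlocation: true)
request.on_complete do |response|
  landing_url = response.effective_url          # URL after any HTTP redirects
  doc = Nokogiri::HTML(response.body)

  if (meta = doc.at_css('meta[http-equiv="refresh"]'))
    # content is typically something like "5; url=http://example.com/next"
    target = meta['content'].to_s[/url\s*=\s*(\S+)/i, 1]
    # queue target (resolved against landing_url) instead of treating this page as final
  end
end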

As you encounter new URLs, check whether they are exiting URLs, i.e. ones that
would take you off that site if you followed them. If so, don't add them to
your URL table.
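
A simple host comparison is usually enough for that check (sketch; subdomain handling may need to be stricter or looser for your sites):

require 'uri'

def same_site?(link, site_host)
  host = URI.parse(link).host
  return true if host.nil?                      # relative link, stays on-site
  host.downcase.end_with?(site_host.downcase)
rescue URI::InvalidURIError
  false
end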

It's probably not going to help to write the database information to files,
because to locate the right file you'll probably need to do a database search
anyway. Just store what you need in a field and request it directly. 200K rows
is nothing in a database.

Pay attention to each site's "spider" rules (its robots.txt), and if a site
offers an API to get at the data, use it instead of crawling.
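
A very crude robots.txt check might look like this (sketch only; a real crawler should honour per-agent sections, Allow rules and Crawl-delay, or use a dedicated gem):

require 'open-uri'

def disallowed_paths(site_root)
  robots = URI.open(URI.join(site_root, '/robots.txt')).read
  robots.each_line
        .map(&:strip)
        .select { |line| line =~ /\ADisallow:/i }
        .map    { |line| line.split(':', 2).last.strip }
        .reject(&:empty?)
rescue OpenURI::HTTPError, SocketError
  []                                  # no robots.txt (or unreachable): no explicit rules
end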

Crawling a large site, handling timeouts

You should definitely split your parsing routine into stages, and save the intermediate data into the DB as you go.

My approach would be:

  1. Crawl Tier 1 to gather the categories. Save them into a temporary DB.
  2. Using the DB, crawl Tier 2 to gather the list of topics. Save them into the DB.
  3. Using the DB, crawl Tier 3 to fetch the actual contents. Save them into the DB, and skip/retry if an error occurs (see the sketch below).
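
The Tier 3 step with skip/retry could be sketched like this (the Topic model, its columns, and the retry limit are all hypothetical):

require 'open-uri'

# Topic is a hypothetical ActiveRecord model holding the Tier 2 results.
Topic.where(fetched: false).find_each do |topic|
  attempts = 0
  begin
    html = URI.open(topic.url, read_timeout: 30).read
    topic.update!(body: html, fetched: true)
  rescue OpenURI::HTTPError, Net::ReadTimeout, SocketError => e
    attempts += 1
    retry if attempts < 3                       # retry a couple of times...
    topic.update!(fetch_error: e.message)       # ...then skip and move on
  end
end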

My attempts at building the simplest web crawler w/Capybara are failing. What am I doing wrong?

selenium-webdriver recently released 3.0.0, which defaults to using geckodriver with Firefox (the browser Capybara defaults to), but some functionality is missing in that combination. Instead, I would recommend using it with Chrome and chromedriver for your use case. You will need to download the latest version of chromedriver and put it somewhere in your PATH. Then

require "capybara/dsl"
require "selenium-webdriver"

Capybara.register_driver :crawler_driver do |app|
  Capybara::Selenium::Driver.new(app, :browser => :chrome)
end
Capybara.default_driver = :crawler_driver

class Crawler
  include Capybara::DSL

  def initialize
    visit "http://www.google.com"
  end
end

crawler = Crawler.new

should do what you're trying to do. You're going to have issues as soon as you create another Crawler instance, though, since both instances will be using the same Capybara session and will conflict. If you're not going to create multiple instances, you're fine. If you are, you'll want to create a new Capybara::Session in each instance of Crawler and call all Capybara methods on that session object, rather than including Capybara::DSL into your object. That would look more like this:

class Crawler
  def initialize
    @session = Capybara::Session.new(:crawler_driver)
    @session.visit "http://www.google.com"
  end
end
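
With a per-instance session, multiple crawlers can then run side by side without stepping on each other:

crawler_a = Crawler.new
crawler_b = Crawler.new   # each instance has its own Capybara::Session, so no conflict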

How to write a crawler in ruby?

There are a couple of options, depending upon your use case.

  • Nokogiri. Here is the RailsCast that will get you started.
  • Mechanize is built on top of Nokogiri. See the Mechanize RailsCast.
  • Screen Scraping with ScrAPI and the ScrAPI RailsCast.
  • Hpricot.

I have used a combination of Nokogiri and Mechanize for a few of my projects and I think they are good options.
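
As a taste of Mechanize, fetching a page and walking its links is about this much code:

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com')

# Mechanize wraps the Nokogiri document, so links, forms, etc. come pre-parsed.
page.links.each { |link| puts link.href }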

Ruby alternative to Scrapy?

There's Mechanize, which is built on top of Nokogiri.

There's Nokogiri itself, which supports both XPath and CSS selectors.

Hpricot is another tool.

There's ScrAPI, which uses CSS selectors to extract information, but it performed slower than Nokogiri in my testing.

There's scRUBYt.

I'm sure there are others, but these are the ones that I came across.

If you don't find a single tool that solves your problem, check out web-spidering libraries like Anemone and combine one of them with the low-level scraping libraries listed above.
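
For instance, Anemone handles the spidering (link discovery, queueing, depth limits) and hands each page's parsed document to your code:

require 'anemone'

Anemone.crawl('http://example.com', depth_limit: 2) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
    # page.doc is a Nokogiri document; run your extraction code on it here
  end
end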

Or just go ahead and learn Python. It'll expand your karma in the programming world.


