Web Page Scraping Gems/Tools Available in Ruby

Web page scraping gems/tools available in Ruby

There are many scraping gems available in Ruby, such as Hpricot and Nokogiri. I recommend Nokogiri for scraping static web pages. If you are scraping dynamic web pages (those that involve button clicks, form submissions and so on), I recommend Mechanize, which uses Nokogiri internally.

What are some good Ruby-based web crawlers?

I am building Wombat, a Ruby DSL for crawling web pages and extracting content. Check it out on GitHub: https://github.com/felipecsl/wombat

It is still at an early stage, but the basic functionality already works. More features will be added soon.

How can I scrape, parse and crawl files in Ruby?

Your question focuses a lot on "low-level" details, such as parsing URLs. One key aspect of the "Ruby Way" is "Don't reinvent the wheel": leverage existing libraries. :)

My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.

Without knowing much about your constraints, I would look at storing your results in MongoDB. With that in mind, I did a quick search and found a nice tutorial, "Scraping a blog with Anemone and MongoDB".
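
As a rough illustration of the "store the results" step, here is a minimal sketch that appends each scraped page to a JSON Lines file using only Ruby's standard library. It is a stand-in for a real datastore such as MongoDB, not a replacement for it. The ScrapeStore class, its method names and the record fields are hypothetical and were chosen only for this example.

```ruby
require 'json'

# Hypothetical helper: keeps raw scrape results on disk, one JSON object per line,
# so later analyses can re-read them without crawling again.
class ScrapeStore
  def initialize(path)
    @path = path
  end

  # Append one record (a Hash) to the file.
  def save(record)
    File.open(@path, 'a') { |f| f.puts(JSON.generate(record)) }
  end

  # Read every stored record back as an Array of Hashes with string keys.
  def all
    return [] unless File.exist?(@path)

    File.readlines(@path, chomp: true)
        .reject(&:empty?)
        .map { |line| JSON.parse(line) }
  end
end
```

The idea would be to call save once per page as the crawler hands it over, for instance with a Hash holding the URL and whatever Nokogiri extracted, and to call all whenever you want to run a new analysis over everything collected so far.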

Node.js or Ruby for Scraping

You can use the capybara gem to scrape JavaScript-driven sites with Ruby.

This has the advantage of letting you drive real browsers such as Firefox, Chrome and IE through the selenium driver. Alternatively, you can use a headless browser such as WebKit (via capybara-webkit) or PhantomJS (via poltergeist).

When you use capybara, be sure to pick a JavaScript-enabled driver such as selenium or capybara-webkit. My driver of the day is poltergeist.

The capybara README has instructions for using it with remote sites.

Node vs. Ruby is a very open-ended question. I am suggesting Ruby here because that is my experience and preference. "Combining" them could mean many things; they can be used in concert, each playing to its strengths.

Web crawler in Rails to extract links and download files from web page

Have a look at Nokogiri as well.

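require 'nokogiri'
require 'open-uri'

base_url = 'http://www.thatwebsite.com/downloads'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open(base_url))

doc.css('a').each do |link|
  href = link['href']
  # only follow links whose target ends in .pdf (skips anchors without an href)
  next unless href&.match?(/\.pdf\z/i)

  begin
    # resolve relative links against the page URL
    pdf_url = URI.join(base_url, href).to_s
    # name each file after the last path segment so downloads do not overwrite each other
    File.open(File.basename(URI(pdf_url).path), 'wb') do |file|
      file.write(URI.open(pdf_url).read)
    end
  rescue StandardError => ex
    puts "Could not download #{href}: #{ex.message}"
  end
end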

You might want to do some better exception catching, but I think you get the idea :)
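
Links on real pages are often relative, and each download needs its own file name. The sketch below, which uses only Ruby's standard uri library, shows one way to turn raw href values into absolute PDF URLs paired with local file names. The pdf_targets helper and the URLs it is called with are hypothetical; they are not part of Nokogiri or open-uri.

```ruby
require 'uri'

# Hypothetical helper: given the page URL and the raw href values found on it,
# return [absolute_url, local_filename] pairs for links that point at PDF files.
def pdf_targets(page_url, hrefs)
  hrefs.compact.filter_map do |href|
    begin
      absolute = URI.join(page_url, href)
    rescue URI::Error
      next # skip hrefs that are not valid URIs
    end
    # check the path only, so a query string such as ?v=2 does not hide the extension
    next unless absolute.path.to_s.downcase.end_with?('.pdf')

    [absolute.to_s, File.basename(absolute.path)]
  end
end
```

With Nokogiri, the hrefs argument could come from something like doc.css('a').map { |a| a['href'] }, and each returned pair gives the URL to fetch and the name to save it under. Note that filter_map needs Ruby 2.7 or later.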

Scraping data from tables on multiple web pages in R (football players)

Here's how you can easily get all the data in all the tables on all the player pages...

First make a list of the URLs for all the players' pages...

require(RCurl); require(XML)
n <- length(letters) 
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
  print(i) # keep track of what the function is up to
  # get all html on each page of the a-z index pages
  inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
  # scrape URLs for each player from each index page
  lnk <- unname(xpathSApply(inx_page, "//a/@href"))
  # skip the first 63 and last 11 links as they are constant on each page
  lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
  # only keep links that go to players (exclude schools)
  lnk <- lnk[grep("players", lnk)]
  # now we have a list of all the URLs to all the players on that index page
  # but the URLs are incomplete, so let's complete them so we can use them from 
  # anywhere
  links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)

That gives a vector of some 67,000 URLs (which seems like a lot of players; can that be right?).

Second, scrape all the tables at each URL to get their data, like so:

# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# pre-allocate list
all_tables <- vector("list", length = length(links))
for(i in seq_along(links)){
  print(i)
  # error handling - skips to next URL if it gets an error
  result <- try(
    all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
  )
  if(inherits(result, "try-error")) next
}
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
# keep one name per URL (no unique()) so the names stay aligned with all_tables
player_names <- gsub(paste(toMatch, collapse = "|"), "", links)
# assign player names to list of tables
names(all_tables) <- player_names

The result looks like this (this is just a snippet of the output):

all_tables
$`neli-aasa`
$`neli-aasa`$defense
   Year School Conf Class Pos Solo Ast Tot Loss  Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007   Utah  MWC    FR  DL    2   1   3  0.0 0.0   0   0      0  0  0   0  0  0
2 *2010   Utah  MWC    SR  DL    4   4   8  2.5 1.5   0   0      0  1  0   0  0  0

$`neli-aasa`$kick_ret
   Year School Conf Class Pos Ret Yds  Avg TD Ret Yds Avg TD
1 *2007   Utah  MWC    FR  DL   0   0       0   0   0      0
2 *2010   Utah  MWC    SR  DL   2  24 12.0  0   0   0      0

$`neli-aasa`$receiving
   Year School Conf Class Pos Rec Yds  Avg TD Att Yds Avg TD Plays Yds  Avg TD
1 *2007   Utah  MWC    FR  DL   1  41 41.0  0   0   0      0     1  41 41.0  0
2 *2010   Utah  MWC    SR  DL   0   0       0   0   0      0     0   0       0

Finally, let's say we just want to look at the passing tables...

# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)

And we end up with a data frame that is ready for further analyses (also just a snippet)...

             Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
