Web page scraping gems/tools available in Ruby
Ruby has many scraping gems, such as Hpricot and Nokogiri. I recommend Nokogiri for scraping static web pages. If you are scraping dynamic web pages (ones that involve button clicks, form submissions, and so on), I recommend Mechanize, which uses Nokogiri internally.
What are some good Ruby-based web crawlers?
I am building Wombat, a Ruby DSL for crawling web pages and extracting content. Check it out on GitHub: https://github.com/felipecsl/wombat
It is still at an early stage but already covers the basics. More features will be added soon.
How can I scrape, parse and crawl files in Ruby?
Your question focuses a lot on "low-level" details such as parsing URLs. One key aspect of the "Ruby Way" is "Don't reinvent the wheel." Leverage existing libraries. :)
My recommendation? First, leverage a crawler such as spider or anemone. Second, use Nokogiri for HTML/XML parsing. Third, store the results. I recommend this because you might do different analyses later and you don't want to throw away the hard work of your spidering.
Without knowing much about your constraints, I would look at storing your results in MongoDB. With that in mind, I did a quick search and found a nice tutorial, "Scraping a blog with Anemone and MongoDB".
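The "store the results" step does not depend on any particular database. As a minimal sketch using only the standard library, the code below appends each crawled page to a JSON Lines file so that later analyses can re-read the raw HTML instead of crawling again. The file stands in for MongoDB here, and the PageStore class, its method names, and the example.com pages are all hypothetical.

```ruby
require 'json'
require 'time'
require 'tmpdir'

# Hypothetical store: one JSON document per line, one line per crawled page.
# A plain file stands in for MongoDB so the sketch runs without a database.
class PageStore
  def initialize(path)
    @path = path
  end

  # Append the raw HTML so later analyses can re-parse it without re-crawling.
  def save(url, html, fetched_at: Time.now.utc)
    record = { 'url' => url, 'fetched_at' => fetched_at.iso8601, 'html' => html }
    File.open(@path, 'a') { |file| file.puts(JSON.generate(record)) }
    record
  end

  # Return every stored page as a Hash, in the order it was saved.
  def pages
    return [] unless File.exist?(@path)

    File.foreach(@path).map { |line| JSON.parse(line) }
  end
end

# Usage with made-up pages; a real crawler would call save for each page it fetches.
Dir.mktmpdir do |dir|
  store = PageStore.new(File.join(dir, 'pages.jsonl'))
  store.save('http://example.com/', '<html><title>Home</title></html>')
  store.save('http://example.com/about', '<html><title>About</title></html>')

  # A later analysis reads the stored pages instead of hitting the site again.
  store.pages.each { |page| puts "#{page['url']} (#{page['html'].bytesize} bytes)" }
end
```

Keeping the raw HTML rather than only the extracted fields is what lets you change your mind about the analysis later; swapping the file for a database collection would only change the bodies of save and pages.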
Node.js or Ruby for Scraping
You can use the Capybara gem to scrape JavaScript sites with Ruby.
This has the advantage of being able to use actual browsers such as Firefox, Chrome and IE through the Selenium driver. Or you can use headless browsers such as WebKit (via capybara-webkit) or PhantomJS (via poltergeist).
When you use Capybara, just be sure to use a JavaScript-enabled driver, such as Selenium or capybara-webkit. My driver of the day is Poltergeist.
The Capybara README has instructions for using it with remote sites.
Node vs. Ruby is a very open-ended question. I am suggesting Ruby here because that is my experience and preference. "Combining" them could mean many things: they can be used in concert, each playing to its strengths.
Web crawler in Rails to extract links and download files from web page
Have a look at Nokogiri as well.
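require 'nokogiri'
require 'open-uri'

page_url = 'http://www.thatwebsite.com/downloads'
# URI.open comes from open-uri; plain open() no longer accepts URLs in Ruby 3
doc = Nokogiri::HTML(URI.open(page_url))

doc.css('a').each do |link|
  href = link['href']
  # only follow links whose target ends in .pdf
  next unless href.to_s.match?(/\.pdf\z/i)

  begin
    # resolve relative links against the page URL
    pdf_url = URI.join(page_url, href)
    # save each PDF under its own name instead of overwriting one file
    File.open(File.basename(pdf_url.path), 'wb') do |file|
      file.write(URI.open(pdf_url).read)
    end
  rescue => ex
    puts "Something went wrong with #{href}: #{ex.message}"
  end
end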
You might want to do some better exception catching, but I think you get the idea :)
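One way to tighten the error handling is to separate link selection from downloading, so that a malformed link is skipped on its own and cannot abort the run. The following is a minimal sketch using only the standard library; pdf_targets is a hypothetical helper name, and the hrefs are made-up values of the kind doc.css('a') might yield.

```ruby
require 'uri'

# Hypothetical helper: resolve each href against the page URL and return
# a Hash of absolute PDF URL => local file name.
def pdf_targets(page_url, hrefs)
  pairs = hrefs.filter_map do |href|
    next if href.nil? || href.strip.empty?

    uri = begin
      URI.join(page_url, href.strip)
    rescue URI::Error
      nil # malformed link: skip it rather than abort the whole run
    end
    next unless uri.is_a?(URI::HTTP) # URI::HTTPS is a subclass, so https passes too
    next unless uri.path.downcase.end_with?('.pdf')

    [uri.to_s, File.basename(uri.path)]
  end
  pairs.to_h
end

# Made-up hrefs for illustration.
hrefs = ['/files/report.pdf', 'guide.PDF?x=1', 'page.html', 'mailto:me@example.com', nil]
pdf_targets('http://www.thatwebsite.com/downloads', hrefs).each do |url, filename|
  puts "#{url} -> #{filename}"
end
```

The download loop can then rescue only the errors it expects for each URL (for example OpenURI::HTTPError for a missing file) instead of rescuing everything.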
Scraping data from tables on multiple web pages in R (football players)
Here's how you can easily get all the data in all the tables on all the player pages...
First make a list of the URLs for all the players' pages...
require(RCurl); require(XML)
n <- length(letters)
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
  print(i) # keep track of what the function is up to
  # get all html on each page of the a-z index pages
  inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
  # scrape URLs for each player from each index page
  lnk <- unname(xpathSApply(inx_page, "//a/@href"))
  # skip first 63 and last 10 links as they are constant on each page
  lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
  # only keep links that go to players (exclude schools)
  lnk <- lnk[grep("players", lnk)]
  # now we have a list of all the URLs to all the players on that index page
  # but the URLs are incomplete, so let's complete them so we can use them from
  # anywhere
  links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)
Now we have a vector of some 67,000 URLs (seems like a lot of players, can that be right?), so:
Second, scrape all the tables at each URL to get their data, like so:
# Go to each URL in the list and scrape all the data from the tables
# this will take some time... don't interrupt it!
# pre-allocate list
all_tables <- vector("list", length = length(links))
for(i in seq_along(links)){
  print(i)
  # error handling - skips to next URL if it gets an error
  result <- try(
    all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE)
  )
  if(inherits(result, "try-error")) next
}
# Put player names in the list so we know who the data belong to
# extract names from the URLs to their stats page...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique (gsub(paste(toMatch,collapse="|"), "", links))
# assign player names to list of tables
names(all_tables) <- player_names
The result looks like this (this is just a snippet of the output):
all_tables
$`neli-aasa`
$`neli-aasa`$defense
Year School Conf Class Pos Solo Ast Tot Loss Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007 Utah MWC FR DL 2 1 3 0.0 0.0 0 0 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 4 4 8 2.5 1.5 0 0 0 1 0 0 0 0
$`neli-aasa`$kick_ret
Year School Conf Class Pos Ret Yds Avg TD Ret Yds Avg TD
1 *2007 Utah MWC FR DL 0 0 0 0 0 0
2 *2010 Utah MWC SR DL 2 24 12.0 0 0 0 0
$`neli-aasa`$receiving
Year School Conf Class Pos Rec Yds Avg TD Att Yds Avg TD Plays Yds Avg TD
1 *2007 Utah MWC FR DL 1 41 41.0 0 0 0 0 1 41 41.0 0
2 *2010 Utah MWC SR DL 0 0 0 0 0 0 0 0 0
Finally, let's say we just want to look at the passing tables...
# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)
And we end up with a data frame that is ready for further analyses (also just a snippet)...
Year School Conf Class Pos Cmp Att Pct Yds Y/A AY/A TD Int Rate
james-aaron 1978 Air Force Ind QB 28 56 50.0 316 5.6 3.6 1 3 92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA JR QB 100 182 54.9 1135 6.2 6.0 5 3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA SR QB 77 148 52.0 828 5.6 4.3 4 6 99.8