How do I write a web scraper in Ruby?
Your best bet is Mechanize. It can follow links, submit forms, and do anything else you need from a web client. By the way, don't use regexes to parse HTML; use an HTML parser.
Scraping content from html page
Assuming the code you have is giving you what you want, I wouldn't try to rewrite anything significant. You can be more brief and idiomatic by replacing your #each methods with #map:
data = Nokogiri::HTML(html)
images = data.css('ul li img')
names = data.css('ul li a').map(&:text)
web scraping/export to CSV with Ruby
The problem is that you're calling CSV.open for every single item. Opening the file in "wb" mode truncates it, so each call overwrites the file with the newest item, and at the end you're left with only the last item in the CSV file.
Move the CSV.open call before page.css('.item').each and it should work.
require 'csv'

# Open the file once, then write one row per item inside the block.
CSV.open("file.csv", "wb") do |csv|
  page.css('.item').each do |item|
    name = item.at_css('a').text
    link = item.at_css('a')[:href]
    csv << [name, link]
  end
end
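The difference between the two placements of CSV.open can be checked without any scraping at all. The following is a minimal sketch using only the standard library; the method names and the sample rows are made up for illustration and stand in for the scraped name/link pairs.

```ruby
require 'csv'
require 'tmpdir'

# Reopens the file for every row; "wb" truncates it each time,
# so only the last row survives.
def write_reopening_each_time(path, rows)
  rows.each do |row|
    CSV.open(path, 'wb') { |csv| csv << row }
  end
end

# Opens the file once and writes every row inside the same block.
def write_opening_once(path, rows)
  CSV.open(path, 'wb') do |csv|
    rows.each { |row| csv << row }
  end
end

# Hypothetical sample rows standing in for the scraped name/link pairs.
rows = [['First', '/first'], ['Second', '/second'], ['Third', '/third']]

Dir.mktmpdir do |dir|
  broken = File.join(dir, 'broken.csv')
  fixed  = File.join(dir, 'fixed.csv')

  write_reopening_each_time(broken, rows)
  write_opening_once(fixed, rows)

  puts CSV.read(broken).size # rows left when reopening per item
  puts CSV.read(fixed).size  # rows left when opening once
end
```

Opening in append mode ("ab") inside the loop would also keep every row, but opening once is simpler and avoids reopening the file per item.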
How can I write a never-ending job in Rails (Web Scraping)?
Sidekiq is designed to run individual jobs which are "units of work" to your organization.
You can build your own loop and, inside that loop, create jobs for each page to scrape but the loop itself should not be a job.
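As a rough sketch of that shape, the loop below lives in ordinary Ruby code and only enqueues one small job per page. ScrapePageJob here is a hypothetical stand-in that records URLs so the example runs without Sidekiq or Redis; in a real app it would be a Sidekiq job class (including Sidekiq::Job, or Sidekiq::Worker on older versions) with an instance-level perform method. The max_passes and pause parameters are also assumptions added for illustration.

```ruby
# Stand-in for a Sidekiq job class (hypothetical). Here perform_async only
# records the URL instead of pushing a job to Redis.
class ScrapePageJob
  @enqueued = []

  class << self
    attr_reader :enqueued

    def perform_async(url)
      @enqueued << url
    end
  end
end

# The loop is not a job itself: each pass enqueues one unit of work per page.
# Pass max_passes: nil to keep looping indefinitely.
def run_scrape_loop(urls, max_passes: 1, pause: 0)
  passes = 0
  loop do
    urls.each { |url| ScrapePageJob.perform_async(url) }
    passes += 1
    break if max_passes && passes >= max_passes
    sleep pause
  end
  passes
end

urls = ['https://example.com/page/1', 'https://example.com/page/2']
run_scrape_loop(urls, max_passes: 2)
puts ScrapePageJob.enqueued.size
```

Keeping each job to a single page means a failure retries only that page, while the loop stays free to decide when the next pass starts.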
Web Scraping using Ruby - If statement
page.css('span.first_detail_cell').each do |line|
  if line.text.include?("Furnished")
    # do something here
  else
    beds << line.text
  end
end
Web Scraping with Nokogiri::HTML and Ruby - How to get output into an array?
Starting with the HTML:
html = '
<div class="compatible_vehicles">
  <div class="heading">
    <h3>Compatible Vehicles</h3>
  </div><!-- .heading -->
  <ul>
    <li>
      <p class="label">Type1</p>
      <p class="data">All</p>
    </li>
    <li>
      <p class="label">Type2</p>
      <p class="data">All</p>
    </li>
    <li>
      <p class="label">Type3</p>
      <p class="data">All</p>
    </li>
    <li>
      <p class="label">Type4</p>
      <p class="data">All</p>
    </li>
    <li>
      <p class="label">Type5</p>
      <p class="data">All</p>
    </li>
  </ul>
</div><!-- .compatible_vehicles -->
'
Parsing it with Nokogiri and looping over the <li> tags to get their <p> tag contents:
require 'nokogiri'

doc = Nokogiri::HTML(html)
data = doc.search('.compatible_vehicles li').map { |li|
  li.search('p').map { |p| p.text }
}
Returns an array of arrays:
=> [["Type1", "All"], ["Type2", "All"], ["Type3", "All"], ["Type4", "All"], ["Type5", "All"]]
From there you should be able to plug that into the examples for the CSV class and get it to work with no trouble.
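For instance, a minimal sketch of that step with the standard CSV library might look like the following. The pairs_to_csv helper and the header names are assumptions for illustration, and the sample data mirrors the array of arrays returned above.

```ruby
require 'csv'

# Turns an array of [label, value] pairs into CSV text with a header row.
# The header names are illustrative, not taken from the original page.
def pairs_to_csv(pairs, headers: %w[label data])
  CSV.generate do |csv|
    csv << headers
    pairs.each { |pair| csv << pair }
  end
end

data = [%w[Type1 All], %w[Type2 All], %w[Type3 All]]
puts pairs_to_csv(data)
```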
Now, compare your code for printing the fields to the screen with this:
data.map{ |a| a.join(' - ') }.join(', ')
=> "Type1 - All, Type2 - All, Type3 - All, Type4 - All, Type5 - All"
All I'd have to do is puts and it'd print correctly.
It's really important to think about returning useful data structures. In Ruby, hashes and arrays are very useful, because we can iterate over them and massage them into many forms. It'd be trivial, from the array of arrays, to create a hash:
Hash[data]
=> {"Type1"=>"All", "Type2"=>"All", "Type3"=>"All", "Type4"=>"All", "Type5"=>"All"}
Which would make it really easy to do lookups.
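A short sketch of such lookups, assuming the same label/value pairs as above; build_lookup is a hypothetical helper name, and Array#to_h gives the same result as Hash[data] for an array of two-element arrays.

```ruby
# Builds a lookup table from [label, value] pairs.
def build_lookup(pairs)
  pairs.to_h
end

data = [%w[Type1 All], %w[Type2 All], %w[Type3 All]]
lookup = build_lookup(data)

puts lookup['Type2']              # value stored for a known label
puts lookup.fetch('Type9', 'n/a') # fallback for a label that is missing
```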
How to scrape a web page with dynamic content added by JavaScript?
To get the lazily loaded content, scrape the pages that the JavaScript itself requests:
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=31&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=46&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=61&ajax=true
...
require 'nokogiri'
require 'open-uri'

number = 1
loop do
  url = "http://www.flipkart.com/mens-footwear/shoes" +
        "/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&" +
        "sid=osp%2Ccil%2Cnit%2Ce1f&start=#{number}&ajax=true"
  # Kernel#open no longer fetches URLs on Ruby 3+, so call URI.open explicitly.
  doc = Nokogiri::HTML(URI.open(url))
  # The code expects the product markup as text inside the #ajax element.
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)
  products = doc.css(".browse-product")
  break if products.empty?

  products.each do |item|
    title = item.at_css(".fk-display-block,.title").text.strip
    # at_css returns nil when nothing matches, so guard before calling text.
    price = item.at_css(".pu-final")&.text.to_s.strip
    link = item.at_xpath(".//a[@class='fk-display-block']/@href")
    image = item.at_xpath(".//div/a/img/@src")
    puts number
    puts "#{title} - #{price}"
    puts "http://www.flipkart.com#{link}"
    puts image
    puts "========================"
    number += 1
  end
end
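The listed URLs differ only in their start parameter, so they can also be generated rather than typed out. This is a small sketch using only the standard library; the page size of 15 is an assumption inferred from the 31, 46, 61 sequence, and the constant and method names are made up for illustration.

```ruby
require 'uri'

FLIPKART_LISTING_URL = 'http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr'

# Builds the AJAX URL for a given start offset, letting URI do the escaping.
def page_url(start)
  query = URI.encode_www_form(
    [['p[]', 'sort=popularity'], ['sid', 'osp,cil,nit,e1f'],
     ['start', start], ['ajax', true]]
  )
  "#{FLIPKART_LISTING_URL}?#{query}"
end

# First `count` offsets, assuming 15 items per page starting at 31.
def page_offsets(count, first: 31, step: 15)
  Array.new(count) { |i| first + i * step }
end

page_offsets(3).each { |start| puts page_url(start) }
```

Building the query from unescaped pairs keeps the parameters readable and leaves the percent-encoding to URI.encode_www_form.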