How to Write a Web Scraper in Ruby

How do I write a web scraper in Ruby?

Your best bet would be to use Mechanize. It can follow links, submit forms, and do anything else you will need from a web client. By the way, don't use regexes to parse HTML. Use an HTML parser.

Scraping content from an HTML page

Assuming the code you have is giving you what you want, I wouldn't try to rewrite anything significant. You can be more concise and idiomatic by replacing your #each calls with #map:

data = Nokogiri::HTML(html)
images = data.css('ul li img')
names = data.css('ul li a').map(&:text)

web scraping/export to CSV with Ruby

The problem is that you're calling CSV.open for every single item, so each call overwrites the file with the newer item. At the end, you're left with only the last item in the CSV file.

Move the CSV.open call before page.css('.item').each and it should work.

CSV.open("file.csv", "wb") do |csv|
  page.css('.item').each do |item|
    name = item.at_css('a').text
    link = item.at_css('a')[:href]
    csv << [name, link]
  end
end
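To see the difference for yourself, here is a minimal sketch that writes the same rows both ways and counts what ends up in each file. The rows, file names, and helper methods are made up for illustration; they stand in for the scraped page and are not part of the original code.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical rows standing in for the scraped name/link pairs.
def sample_rows
  [
    ['First item', '/items/1'],
    ['Second item', '/items/2'],
    ['Third item', '/items/3']
  ]
end

# Buggy pattern: reopening in "wb" mode truncates the file on every item.
def write_reopening_each_time(path, rows)
  rows.each do |row|
    CSV.open(path, 'wb') { |csv| csv << row }
  end
end

# Fixed pattern: open the file once and append every row inside the block.
def write_opening_once(path, rows)
  CSV.open(path, 'wb') do |csv|
    rows.each { |row| csv << row }
  end
end

# Writes both files into a temporary directory and returns their row counts.
def demo_row_counts
  Dir.mktmpdir do |dir|
    buggy = File.join(dir, 'buggy.csv')
    fixed = File.join(dir, 'fixed.csv')
    write_reopening_each_time(buggy, sample_rows)
    write_opening_once(fixed, sample_rows)
    [CSV.read(buggy).size, CSV.read(fixed).size]
  end
end

p demo_row_counts # => [1, 3]
```

The first count is 1 because only the last write survives; the second is 3 because the file is opened once and every row is appended to the same handle.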

How can I write a never-ending job in Rails (Web Scraping)?

Sidekiq is designed to run individual jobs which are "units of work" to your organization.

You can build your own loop and, inside that loop, create jobs for each page to scrape, but the loop itself should not be a job.
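As a rough sketch of that shape, the example below uses Ruby's built-in Queue and a couple of threads as a stand-in for Sidekiq, so it runs without any gems. With Sidekiq, the `queue << url` line would instead be a `perform_async` call on your own worker class. The method names, URLs, and the fixed number of passes are assumptions for illustration; a real loop would run indefinitely and pause between passes.

```ruby
# Hypothetical unit of work: "scrape" one page. Here it only records the URL.
def scrape_page(url, results)
  results << "scraped #{url}"
end

# The long-running loop lives outside the job system. Each pass enqueues one
# small job per page, and the workers pick the jobs up independently.
def run_scheduler(urls, passes:)
  queue = Queue.new
  results = Queue.new

  workers = Array.new(2) do
    Thread.new do
      # Queue#pop returns nil once the queue is closed and empty.
      while (url = queue.pop)
        scrape_page(url, results)
      end
    end
  end

  passes.times do
    urls.each { |url| queue << url } # one job per page
    # A real loop would sleep between passes; omitted so the sketch finishes.
  end

  queue.close
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

done = run_scheduler(%w[/page/1 /page/2 /page/3], passes: 2)
puts done.size # => 6
```

The point of the design is that each job stays small and retryable, while the scheduling loop is an ordinary process that only decides what to enqueue next.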

Web Scraping using Ruby - If statement

page.css('span.first_detail_cell').each do |line|
  if line.text.include?("Furnished")
    # do something here
  else
    beds << line.text
  end
end

Web Scraping with Nokogiri::HTML and Ruby - How to get output into an array?

Starting with the HTML:

html = '
<div class="compatible_vehicles">
<div class="heading">
<h3>Compatible Vehicles</h3>
</div><!-- .heading -->
<ul>
<li>
<p class="label">Type1</p>
<p class="data">All</p>
</li>
<li>
<p class="label">Type2</p>
<p class="data">All</p>
</li>
<li>
<p class="label">Type3</p>
<p class="data">All</p>
</li>
<li>
<p class="label">Type4</p>
<p class="data">All</p>
</li>
<li>
<p class="label">Type5</p>
<p class="data">All</p>
</li>
</ul>
</div><!-- .compatible_vehicles -->
'

Parsing it with Nokogiri and looping over the <li> tags to get their <p> tag contents:

require 'nokogiri'

doc = Nokogiri::HTML(html)
data = doc.search('.compatible_vehicles li').map { |li|
  li.search('p').map { |p| p.text }
}

Returns an array of arrays:

=> [["Type1", "All"], ["Type2", "All"], ["Type3", "All"], ["Type4", "All"], ["Type5", "All"]]

From there you should be able to plug that into the examples for the CSV class and get it to work with no trouble.
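For instance, a minimal sketch of that step might look like the following. The `rows_to_csv` helper and the `label`/`data` header names are my own assumptions (borrowed from the class names in the HTML), not something the original code defines.

```ruby
require 'csv'

# Turns an array of arrays into CSV text, one inner array per row.
# The header names are an assumption based on the <p> class names.
def rows_to_csv(rows, headers: %w[label data])
  CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << row }
  end
end

# Same shape as the array of arrays returned by the Nokogiri code.
rows = [%w[Type1 All], %w[Type2 All], %w[Type3 All]]

puts rows_to_csv(rows)
# label,data
# Type1,All
# Type2,All
# Type3,All
```

To write to disk instead of building a string, the same loop works inside a `CSV.open(path, "wb")` block.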

Now, compare your code for outputting the fields to the screen with this:

data.map{ |a| a.join(' - ') }.join(', ')
=> "Type1 - All, Type2 - All, Type3 - All, Type4 - All, Type5 - All"

All I'd have to do is pass that to puts and it'd print correctly.

It's really important to think about returning useful data structures. In Ruby, hashes and arrays are very useful, because we can iterate over them and massage them into many forms. It'd be trivial, from the array of arrays, to create a hash:

Hash[data]
=> {"Type1"=>"All", "Type2"=>"All", "Type3"=>"All", "Type4"=>"All", "Type5"=>"All"}

Which would make it really easy to do lookups.
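A short sketch of what those lookups could look like, using a shortened copy of the same array of arrays. The `vehicle_lookup` helper name and the "Type9" key are made up for illustration. `Array#to_h` is used here as an equivalent, more current spelling of `Hash[data]`.

```ruby
# Builds a label => value hash from an array of [label, value] pairs.
def vehicle_lookup(pairs)
  pairs.to_h
end

data = [%w[Type1 All], %w[Type2 All], %w[Type3 All]]
vehicles = vehicle_lookup(data)

puts vehicles['Type2']                  # => All
puts vehicles.key?('Type9')             # => false
puts vehicles.fetch('Type9', 'unknown') # => unknown
```

One thing to keep in mind: if two rows share the same label, the later one wins, so the hash form only suits data where the labels are unique.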

How to scrape a web page with dynamic content added by JavaScript?

To get the lazy-loaded content, scrape the following pages:

http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=31&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=46&ajax=true
http://www.flipkart.com/mens-footwear/shoes/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&sid=osp%2Ccil%2Cnit%2Ce1f&start=61&ajax=true
...

require 'nokogiri'
require 'open-uri'

number = 1
loop do
  url = "http://www.flipkart.com/mens-footwear/shoes" +
        "/casual-shoes/pr?p%5B%5D=sort%3Dpopularity&" +
        "sid=osp%2Ccil%2Cnit%2Ce1f&start=#{number}&ajax=true"

  # URI.open is needed on Ruby 3.0+, where a bare open(url) no longer fetches URLs.
  doc = Nokogiri::HTML(URI.open(url))
  # The product markup is embedded as text inside the #ajax element.
  doc = Nokogiri::HTML(doc.at_css('#ajax').text)

  products = doc.css(".browse-product")
  break if products.empty?

  products.each do |item|
    title = item.at_css(".fk-display-block,.title").text.strip
    # &. avoids a NoMethodError when the price element is missing.
    price = item.at_css(".pu-final")&.text.to_s.strip
    link = item.at_xpath(".//a[@class='fk-display-block']/@href")
    image = item.at_xpath(".//div/a/img/@src")

    puts number
    puts "#{title} - #{price}"
    puts "http://www.flipkart.com#{link}"
    puts image
    puts "========================"

    # The running item count doubles as the next page's start offset.
    number += 1
  end
end
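The control flow above (request a page at an offset, stop when it comes back empty, otherwise advance the offset by the number of items seen) can be tried offline with a stubbed fetcher. Everything in this sketch is hypothetical: `fake_fetch` stands in for the HTTP request plus Nokogiri parsing, and the page size of 15 is only inferred from the start=31, 46, 61 sequence in the URLs.

```ruby
# Hypothetical fetcher: returns the items at a 1-based offset, or an empty
# array once the offset runs past the end of the catalogue.
def fake_fetch(all_items, start, page_size)
  all_items[start - 1, page_size] || []
end

# Keeps requesting pages until one comes back empty, advancing the offset
# by however many items the previous page contained.
def scrape_all(all_items, page_size: 15)
  collected = []
  start = 1
  loop do
    batch = fake_fetch(all_items, start, page_size)
    break if batch.empty?

    collected.concat(batch)
    start += batch.size
  end
  collected
end

catalogue = (1..40).map { |i| "shoe-#{i}" }
puts scrape_all(catalogue).size # => 40
```

Stopping on the first empty page is what ends the loop, so a site that returns an error page or repeats its last page would need a different exit condition.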

