How to Get Content from a Website Using Ruby/Rails

How do I get content from a website using Ruby / Rails?

This isn't really a Rails question. It's something you'd do using Ruby, then possibly display using Rails, or Sinatra or Padrino - pick your poison.

There are several different HTTP clients you can use:

Open-URI ships with Ruby and is the easiest to use. Net::HTTP also ships with Ruby and is the standard toolbox, but it's lower-level, so you'd have to do more work. HTTPClient and Typhoeus (with Hydra) are capable of threading and offer both high-level and low-level interfaces.

I recommend using Nokogiri to parse the returned HTML. It's very full-featured and robust.

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.example.com')) # URI.open, not bare open, as of Ruby 3.0

puts doc.to_html

If you need to navigate through login screens or fill in forms before you get to the page you need to parse, then I'd recommend looking at Mechanize. It relies on Nokogiri internally so you can ask it for a Nokogiri document and parse away once Mechanize retrieves the desired URL.

If you need to deal with dynamic HTML, then look into the various Watir tools. They drive real web browsers, then let you access the content as seen by the browser after JavaScript has run.

Once you have the content or data you want, you can "repurpose" it into text inside a Rails page.

How to scrape data from another website using Rails 3

I'd recommend a combination of Nokogiri and Open-URI. Require both gems, then do something along the lines of doc = Nokogiri::HTML(URI.open(YOUR_URL)). Next, find the element you want to capture, using the developer tools in Chrome (or the equivalent) or something like SelectorGadget. Then you can use doc.at_css(SELECTOR) for a single element, or doc.search(SELECTOR) for multiple matches. Calling the text method on the result should get you the product description you're looking for. There's no need to save anything to the database (unless you want to). Hope that helps!

Get the HTML from a website with Ruby on Rails

You can use httparty to just get the data

Sample code (from example):

require 'httparty'
require 'pp'

class Google
  include HTTParty
  format :html
end

# google.com redirects to www.google.com, so this is a live test for redirection
pp Google.get('http://google.com')

puts '', '*'*70, ''

# check that SSL requests work correctly
pp Google.get('https://www.google.com')

Nokogiri really excels at parsing that data. Here's some example code from the Railscast:

require 'nokogiri'
require 'open-uri'

url = "http://www.walmart.com/search/search-ng.do?search_constraint=0&ic=48_0&search_query=batman&Find.x=0&Find.y=0&Find=Find"
doc = Nokogiri::HTML(URI.open(url))
puts doc.at_css("title").text
doc.css(".item").each do |item|
  title = item.at_css(".prodLink").text
  price = item.at_css(".PriceCompare .BodyS, .PriceXLBold").text[/\$[0-9\.]+/]
  puts "#{title} - #{price}"
  puts item.at_css(".prodLink")[:href]
end

Ruby code to search and get a string from HTML content

# assuming get() returns the page source as a String
key = get()[/commit\s+([a-f0-9]{10,})/i, 1]
puts key

The pattern matches the word "commit", then whitespace, then captures a run of ten or more hex digits; String#[] with a second argument of 1 returns that capture group, and the i flag makes the match case-insensitive.
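A self-contained demonstration, using a made-up string in place of get()'s return value:

```ruby
# Stand-in for get(): any String containing a commit hash (this one is invented).
html = 'Latest build: commit 9fceb02d0ae598e95dc970b74767f19372d61af8 by Jane'

# String#[] with a regex and a capture index returns just that group.
key = html[/commit\s+([a-f0-9]{10,})/i, 1]
puts key   # => 9fceb02d0ae598e95dc970b74767f19372d61af8
```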

How to get the HTML source of a webpage in Ruby

Use Net::HTTP:

require 'net/http'

source = Net::HTTP.get('stackoverflow.com', '/index.html')

# or, given a full URI (this form also handles HTTPS):
source = Net::HTTP.get(URI('https://stackoverflow.com/'))

