Extract Single String from HTML Using Ruby/Mechanize (And Nokogiri)

Radek. I'm going to show you how to fish.

When you call Mechanize::Page#parser, it gives you the Nokogiri document, so your "xpath" and "at_xpath" calls are invoking Nokogiri. The problem is in your XPaths. In general, start with the most general XPath you can get to work, and then narrow it down. So, for example, instead of this:

puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip

start with this:

puts post_page.parser.xpath('//table').to_html

This gets all the tables, anywhere in the document, and prints them as HTML. Examine the HTML to see which tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has CSS class "userdata", then try this:

puts post_page.parser.xpath("//table[@class='userdata']").to_html

Any time you get back an empty result, you goofed up the XPath, so fix it before proceeding. Once you're getting the table you want, try to get the rows:

puts post_page.parser.xpath("//table[@class='userdata']//tr").to_html

If that worked, then take off the "to_html" and you now have a Nokogiri NodeSet, which behaves like an array of nodes, each one a table row.

And that's how you do it.

How to extract text from script tag by using nokogiri and mechanize?

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.booking.com/hotel/us/solera-by-stay-alfred.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmcgV1c19ueYgBAZgBMbgBBMgBBNgBAegBAfgBAg;sid=695d6598485cb1a8fd9e39c5de3878ba;dcid=4;checkin=2015-10-20;checkout=2015-10-21;dist=0;group_adults=2;room1=A%2CA;sb_price_type=total;srfid=cf5d76283b73d34a1d7e0d61cad6974e38a94351X1;type=total;ucfs=1&')

# Escape the dots so they match literally, and stop at the closing quote
match = page.search("script").text.scan(/^booking\.env\.b_hotel_id = '[^']*'/)
puts match
puts match[0].split("'")[1]

Output:

booking.env.b_hotel_id = '1202411'
1202411
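
The same idea can be sketched with a capture group instead of split, which avoids the second step. This is only an illustration: the script text below is a stand-in for what the page's script tags might contain, and js_string_value is a hypothetical helper name.

```ruby
# Hypothetical script text, shaped like the line matched in the answer above.
script_text = <<~JS
  var booking = booking || {};
  booking.env.b_hotel_id = '1202411';
  booking.env.b_lang = 'en';
JS

# Hypothetical helper: returns the single-quoted value assigned to `name`,
# or nil when no such assignment starts a line.
def js_string_value(script_text, name)
  pattern = /^\s*#{Regexp.escape(name)}\s*=\s*'([^']*)'/
  found = script_text.match(pattern)
  found && found[1]
end

puts js_string_value(script_text, 'booking.env.b_hotel_id') # prints 1202411
```

Regexp.escape keeps the dots in the variable name from matching arbitrary characters.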

Pages that helped me figure this out:

http://robdodson.me/crawling-pages-with-mechanize-and-nokogiri/

Parsing javascript function elements with nokogiri

Regular expression - starting and ending with a character string

http://www.rubular.com

How to let Ruby Mechanize get a page which lives in a string

Mechanize uses Nokogiri to parse the HTML. If you already have the HTML in a string and don't need to fetch it over HTTP, you don't need Mechanize. All you are looking to do is parse the input HTML, right?

The following will let you do this:

require 'nokogiri' # the file name is lowercase
html = 'html here'
page = Nokogiri::HTML(html)

If you have the Mechanize gem installed you will already have Nokogiri.

Otherwise you can still create a new Mechanize page using:

require 'mechanize' # the file name is lowercase
html = 'html here'
a = Mechanize.new
page2 = Mechanize::Page.new(nil,{'content-type'=>'text/html'},html,nil,a)

How to scrape script tags with Nokogiri and Mechanize

Mechanize is overkill if all you are using it for is to retrieve a page. There are many HTTP client gems that'll easily do that, or use OpenURI which is part of Ruby's standard library.

These are the basics for retrieving the information. You'll need to figure out which particular script you want, but Nokogiri's tutorials will help with that:

require 'json'
require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where a bare open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('https://www.foodpantries.org/li/st._paul_the_apostle_11368'))

At this point Nokogiri has a DOM created of the page in memory.

Find the <script> node you want, and extract the text of the node:

js = doc.at('script[type="application/ld+json"]').text

at and search are the workhorses for parsing a page. There are CSS and XPath specific variants, but generally you can use the generic versions and Nokogiri will figure out which to use. All are documented on the same page as at and search and the tutorials.

JSON is smart and allows us to use a shorthand of JSON[...] to parse or generate a JSON string. In this case it's parsing a string back into a Ruby object, which in this instance is a hash:

JSON[js]
# => {"@context"=>"https://schema.org",
# "@type"=>"Organization",
# "url"=>"https://www.foodpantries.org/",
# "sameAs"=>[],
# "contactPoint"=>
# [{"@type"=>"ContactPoint",
# "contactType"=>"customer service",
# "url"=>"https://www.foodpantries.org/ar/about",
# "email"=>"webmaster@foodpantries.org"}]}

Accessing a particular key/value pair is simple, just as with any other hash:

foo = JSON[js]
foo['url'] # => "https://www.foodpantries.org/"

The page you're referring to has multiple scripts that match the selector I used, so you'll want to filter using a more exact selector, or iterate over the matches and pick the one you want. How to do that is well documented here on SO using CSS, XPath and by Nokogiri's documentation.

How to get links using mechanize and nokogiri ruby

You could go through and separate the data like "How to split a HTML document using Nokogiri?" but if you know what the tag is going to be you could just split it:

# html is the raw html string
html.split('<h4').map{|g| Nokogiri::HTML::DocumentFragment.parse(g).css('a') }

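# Alternatively, walk the container's children and start a new group at each h4
page = Nokogiri::HTML(html).css("#right_holder")
links = page.children.inject([]) do |link_hash, child|
  if child.name == 'h4'
    name = child.text
    link_hash << { :name => name, :content => String.new }
  end

  # Skip anything that appears before the first h4
  next link_hash if link_hash.empty?
  link_hash.last[:content] << child.to_xhtml
  link_hash
end

# Re-parse each group's markup and keep only its links, keyed by heading text
grouped_hsh = links.inject({}) do |hsh, link|
  hsh[link[:name]] = Nokogiri::HTML::DocumentFragment.parse(link[:content]).css('a')
  hsh
end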

# {"Some text"=>[#<Nokogiri::XML::Element:0x3ff4860d6c30,
# "Some more text"=>[#<Nokogiri::XML::Element:0x3ff486096c20...,
# "Some additional text"=>[#<Nokogiri::XML::Element:0x3ff486f2de78...}

Extract data from HTML Table with mechanize

A more succinct version, relying more on the black magic of XPath :)

require 'nokogiri'
require 'open-uri'

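# URI.open is needed on Ruby 3.0+, where a bare open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.alpineascents.com/8000m-peaks.asp'))
# Find the row whose <strong> text equals the first command-line argument, then take its 5th cell
last_td = doc.xpath("//tr[td[strong[text()='#{ARGV[0]}']]]/td[5]")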

puts last_td.text.gsub(/.*?;/, '').strip

How to extract text after br using Mechanize

It's easy, but you have to understand how a document is represented inside Nokogiri in the DOM.

There are tags, which are Element nodes, and the intervening text, which are Text nodes:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div class="location">
    Country
    <br>
    State
    <br>
    City
</div>
EOT

doc.at('.location br').next_sibling.text.strip # => "State"

Here's what Nokogiri says <br> is:

doc.at('.location br').class # => Nokogiri::XML::Element

And the following Text node:

doc.at('.location br').next_sibling.class # => Nokogiri::XML::Text

And how we access the content of the text node:

doc.at('.location br').next_sibling.text # => "\n    State\n    "

And again, looking at the <div> tag and its next sibling node:

doc.at('.location').class # => Nokogiri::XML::Element
doc.at('.location').next_sibling.class # => Nokogiri::XML::Text
doc.at('.location').next_sibling # => #<Nokogiri::XML::Text:0x3fcf58489c7c "\n">

By the way, you can access Mechanize's Nokogiri parser to play with the DOM using something like:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com')
doc = page.parser

doc.class # => Nokogiri::HTML::Document
doc.title # => "Example Domain"

"I can't do like this: doc.at('.location br br').next_sibling.text or doc.at('.location br').next_sibling.next_sibling.text"

The first assertion is correct. You can't use '.location br br' because you can't nest a tag inside a <br>, so br br is nonsense when writing a CSS selector for HTML.

The second assertion is wrong. You can use next_sibling.next_sibling but you have to be aware of the tags in the DOM. In your HTML example it doesn't return anything sensible:

doc.at('.location br').to_html # => "<br>"
doc.at('.location br').next_sibling.to_html # => "\n    State\n    "
doc.at('.location br').next_sibling.next_sibling.to_html # => "<br>"

And getting the text of <br> would return an empty string since <br> can't wrap text:

doc.at('br').text # => ""

So, you just didn't go far enough:

doc.at('.location br').next_sibling.next_sibling.next_sibling.text.strip # => "City"

But, if that's the intention with this DOM I'd do it more simply:

break_text = doc.search('.location br').map{ |br| br.next_sibling.text.strip }
# => ["State", "City"]

Loop over all the dd tags and extract specific information via Mechanize/Nokogiri

Here's an example of how you could parse the bold text and href attribute from the anchor elements you describe:

require 'nokogiri'
require 'open-uri'

url = 'http://openie.allenai.org/sentences/?rel=contains&arg2=antioxidant&title=Green%20tea'
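# URI.open is needed on Ruby 3.0+, where a bare open no longer accepts URLs
doc = Nokogiri::HTML(URI.open(url))

doc.xpath('//dd/*/a').each do |a|
  # Collapse runs of whitespace inside each bold phrase
  text = a.xpath('.//b').map { |b| b.text.gsub(/\s+/, ' ').strip }
  href = a['href']
  puts "OK: text=#{text.inspect}, href=#{href.inspect}"
end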

# OK: text=["Green tea", "many antioxidants"], href="http://www.talbottteas.com/category_s/55.htm"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.skin-care-experts.com/tag/best-skin-care/page/4"
# OK: text=["Green tea", "potent antioxidants"], href="http://www.specialitybrand.com/news/view/207.html"

In a nutshell, this solution uses XPath in two places:

  1. Initially to find every a element underneath each dd element.
  2. Then to find each b element inside the a elements from step 1.

The final trick is cleaning up the text within the "b" elements into something presentable. Of course, you might want it formatted differently.

How to parse only part of a string-value from an element using Nokogiri? RUBY, Mechanize

Given some HTML in a variable named html, you'd do something like this:

doc     = Nokogiri::HTML(html)
numbers = doc.xpath('//p[@title]').collect { |p| p[:title].gsub(/[^\d]/, '') }

Then you'll have the numbers in the numbers array. You'll have to adjust the XPath and regular expression to match your real data of course but the basic technique should be clear.

A bit of time with the Nokogiri documentation and tutorials might be fruitful.


